block/rbd: add .bdrv_reopen_prepare() stub

Message ID: cd07f4c1-6b18-ecfc-b37c-a5d9369a13bb@gmail.com
State: New

Commit Message

Sebastian Färber May 17, 2016, 10:03 a.m. UTC
Hi Kevin,

> A correct reopen implementation must consider all options and flags that
> .bdrv_open() looked at.
> 
> The options are okay, as both "filename" and "password-secret" aren't
> things that we want to allow a reopen to change. However, in the flags
> BDRV_O_NOCACHE makes a difference:
> 
>     if (flags & BDRV_O_NOCACHE) {
>         rados_conf_set(s->cluster, "rbd_cache", "false");
>     } else {
>         rados_conf_set(s->cluster, "rbd_cache", "true");
>     }
> 
> A reopen must either update the setting, or if it can't (e.g. because
> librbd doesn't support it) any attempt to change the flag must fail.

Thanks for the feedback.
As far as I can tell it's not possible to update the cache settings
without reconnecting. I've added a check in the following patch.
It would be great if someone who knows the internals of ceph/rbd
could have a look as well.

Sebastian

-- >8 --
Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub

Add support for reopen() by adding the .bdrv_reopen_prepare() stub

Signed-off-by: Sebastian Färber <sfaerber82@gmail.com>
---
 block/rbd.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

--
1.8.3.1

Comments

Jason Dillaman May 17, 2016, 2 p.m. UTC | #1
On Tue, May 17, 2016 at 6:03 AM, Sebastian Färber <sfaerber82@gmail.com> wrote:
> Hi Kevin,
>
>> A correct reopen implementation must consider all options and flags that
>> .bdrv_open() looked at.
>>
>> The options are okay, as both "filename" and "password-secret" aren't
>> things that we want to allow a reopen to change. However, in the flags
>> BDRV_O_NOCACHE makes a difference:
>>
>>     if (flags & BDRV_O_NOCACHE) {
>>         rados_conf_set(s->cluster, "rbd_cache", "false");
>>     } else {
>>         rados_conf_set(s->cluster, "rbd_cache", "true");
>>     }
>>
>> A reopen must either update the setting, or if it can't (e.g. because
>> librbd doesn't support it) any attempt to change the flag must fail.
>
> Thanks for the feedback.
> As far as I can tell it's not possible to update the cache settings
> without reconnecting. I've added a check in the following patch.
> It would be great if someone who knows the internals of ceph/rbd
> could have a look as well.

Correct, you cannot currently dynamically enable or disable the
cache.  You would need to close the image and re-open it to change the
cache settings.

> Sebastian
>
> -- >8 --
> Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub
>
> Add support for reopen() by adding the .bdrv_reopen_prepare() stub
>
> Signed-off-by: Sebastian Färber <sfaerber82@gmail.com>
> ---
>  block/rbd.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 5bc5b32..8ecf096 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -577,6 +577,19 @@ failed_opts:
>      return r;
>  }
>
> +/* Note that this will not re-establish a connection with the Ceph cluster
> +   - it is effectively a NOP.  */
> +static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
> +                                   BlockReopenQueue *queue, Error **errp)
> +{
> +    if (state->flags & BDRV_O_NOCACHE &&
> +        ((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {
> +        error_setg(errp, "Cannot turn off rbd_cache during reopen");
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
>  static void qemu_rbd_close(BlockDriverState *bs)
>  {
>      BDRVRBDState *s = bs->opaque;
> @@ -976,6 +989,7 @@ static BlockDriver bdrv_rbd = {
>      .instance_size      = sizeof(BDRVRBDState),
>      .bdrv_needs_filename = true,
>      .bdrv_file_open     = qemu_rbd_open,
> +    .bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
>      .bdrv_close         = qemu_rbd_close,
>      .bdrv_create        = qemu_rbd_create,
>      .bdrv_has_zero_init = bdrv_has_zero_init_1,
> --
> 1.8.3.1
>
>
Josh Durgin May 17, 2016, 6:48 p.m. UTC | #2
On 05/17/2016 03:03 AM, Sebastian Färber wrote:
> Hi Kevin,
>
>> A correct reopen implementation must consider all options and flags that
>> .bdrv_open() looked at.
>>
>> The options are okay, as both "filename" and "password-secret" aren't
>> things that we want to allow a reopen to change. However, in the flags
>> BDRV_O_NOCACHE makes a difference:
>>
>>      if (flags & BDRV_O_NOCACHE) {
>>          rados_conf_set(s->cluster, "rbd_cache", "false");
>>      } else {
>>          rados_conf_set(s->cluster, "rbd_cache", "true");
>>      }
>>
>> A reopen must either update the setting, or if it can't (e.g. because
>> librbd doesn't support it) any attempt to change the flag must fail.

Updating this setting on an open image won't do anything, but if you
rbd_close() and rbd_open() it again the setting will take effect.
rbd_close() will force a flush of any pending I/O in librbd and
free the memory for librbd's ImageCtx, which may or may not be desired
here.

> Thanks for the feedback.
> As far as I can tell it's not possible to update the cache settings
> without reconnecting. I've added a check in the following patch.
> It would be great if someone who knows the internals of ceph/rbd
> could have a look as well.

There's no need to reset the librados state, so connections to the
cluster can stick around. I'm a bit unclear on the bdrv_reopen_*
functions though - what is their intended use and semantics?

> Sebastian
>
> -- >8 --
> Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub
>
> Add support for reopen() by adding the .bdrv_reopen_prepare() stub
>
> Signed-off-by: Sebastian Färber <sfaerber82@gmail.com>
> ---
>   block/rbd.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 5bc5b32..8ecf096 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -577,6 +577,19 @@ failed_opts:
>       return r;
>   }
>
> +/* Note that this will not re-establish a connection with the Ceph cluster
> +   - it is effectively a NOP.  */
> +static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
> +                                   BlockReopenQueue *queue, Error **errp)
> +{
> +    if (state->flags & BDRV_O_NOCACHE &&
> +        ((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {
> +        error_setg(errp, "Cannot turn off rbd_cache during reopen");
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
>   static void qemu_rbd_close(BlockDriverState *bs)
>   {
>       BDRVRBDState *s = bs->opaque;
> @@ -976,6 +989,7 @@ static BlockDriver bdrv_rbd = {
>       .instance_size      = sizeof(BDRVRBDState),
>       .bdrv_needs_filename = true,
>       .bdrv_file_open     = qemu_rbd_open,
> +    .bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
>       .bdrv_close         = qemu_rbd_close,
>       .bdrv_create        = qemu_rbd_create,
>       .bdrv_has_zero_init = bdrv_has_zero_init_1,
> --
> 1.8.3.1
>
Jeff Cody May 18, 2016, 5:31 a.m. UTC | #3
On Tue, May 17, 2016 at 12:03:53PM +0200, Sebastian Färber wrote:
> Hi Kevin,
> 
> > A correct reopen implementation must consider all options and flags that
> > .bdrv_open() looked at.
> > 
> > The options are okay, as both "filename" and "password-secret" aren't
> > things that we want to allow a reopen to change. However, in the flags
> > BDRV_O_NOCACHE makes a difference:
> > 
> >     if (flags & BDRV_O_NOCACHE) {
> >         rados_conf_set(s->cluster, "rbd_cache", "false");
> >     } else {
> >         rados_conf_set(s->cluster, "rbd_cache", "true");
> >     }
> > 
> > A reopen must either update the setting, or if it can't (e.g. because
> > librbd doesn't support it) any attempt to change the flag must fail.
> 
> Thanks for the feedback.
> As far as I can tell it's not possible to update the cache settings
> without reconnecting. I've added a check in the following patch.
> It would be great if someone who knows the internals of ceph/rbd
> could have a look as well.
> 
> Sebastian
> 

Is there any reason you cannot create a new connection, switch over to it,
and then tear down the old one?

The reopen functions have three operations:

* _prepare()
  Parses flags, creates new connections / file descriptors, etc., but does
  not do anything that cannot be undone, or anything that makes the reopen
  operation "live".

* _commit()
  (If applicable) switches over to the new fd and makes the reopen
  operation complete (and cleans up anything that needs to be cleaned up
  from _prepare()).

* _abort()
  Cleanly undoes anything done in _prepare() - called if there is an error
  in _prepare(), or if the reopen operation is part of a larger transaction
  of multiple reopens, one of which failed.


For instance, it may be useful to look at how this is handled in the gluster
driver, and see if you can do something similar here for ceph. For gluster,
we create a new gluster connection in _prepare(), and in _commit() we close
the old fd and switch the driver instance's state over to the new fd.
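
Very roughly, that pattern mapped onto librbd might look like the sketch
below. To be clear, this is illustrative only - all the names in it are
hypothetical, error handling is minimal, and someone who knows ceph would
need to check the librbd details:

#include <rbd/librbd.h>

/* Sketch of the gluster-style prepare/commit/abort pattern on librbd.
 * Hypothetical names throughout; not part of the posted patch. */
struct rbd_reopen_state {
    rados_t cluster;        /* existing cluster handle, reused */
    rados_ioctx_t io_ctx;   /* existing pool context, reused */
    const char *name;       /* image name */
    rbd_image_t old_image;  /* handle currently in use */
    rbd_image_t new_image;  /* handle opened by prepare() */
};

/* prepare(): open a second handle; nothing irreversible happens here. */
static int reopen_prepare(struct rbd_reopen_state *st, int use_cache)
{
    /* The new cache setting goes on the shared cluster config and takes
     * effect at rbd_open() time. */
    rados_conf_set(st->cluster, "rbd_cache", use_cache ? "true" : "false");
    return rbd_open(st->io_ctx, st->name, &st->new_image, NULL);
}

/* commit(): switch over to the new handle and drop the old one. */
static void reopen_commit(struct rbd_reopen_state *st)
{
    rbd_close(st->old_image);      /* flushes pending I/O in librbd */
    st->old_image = st->new_image;
    st->new_image = NULL;
}

/* abort(): discard the handle prepare() opened; the old one stays valid. */
static void reopen_abort(struct rbd_reopen_state *st)
{
    if (st->new_image) {
        rbd_close(st->new_image);
        st->new_image = NULL;
    }
}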


> -- >8 --
> Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub
> 
> Add support for reopen() by adding the .bdrv_reopen_prepare() stub
> 
> Signed-off-by: Sebastian Färber <sfaerber82@gmail.com>
> ---
>  block/rbd.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 5bc5b32..8ecf096 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -577,6 +577,19 @@ failed_opts:
>      return r;
>  }
> 
> +/* Note that this will not re-establish a connection with the Ceph cluster
> +   - it is effectively a NOP.  */
> +static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
> +                                   BlockReopenQueue *queue, Error **errp)
> +{
> +    if (state->flags & BDRV_O_NOCACHE &&
> +        ((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {
> +        error_setg(errp, "Cannot turn off rbd_cache during reopen");
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
>  static void qemu_rbd_close(BlockDriverState *bs)
>  {
>      BDRVRBDState *s = bs->opaque;
> @@ -976,6 +989,7 @@ static BlockDriver bdrv_rbd = {
>      .instance_size      = sizeof(BDRVRBDState),
>      .bdrv_needs_filename = true,
>      .bdrv_file_open     = qemu_rbd_open,
> +    .bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
>      .bdrv_close         = qemu_rbd_close,
>      .bdrv_create        = qemu_rbd_create,
>      .bdrv_has_zero_init = bdrv_has_zero_init_1,
> --
> 1.8.3.1
>
Sebastian Färber May 18, 2016, 7:36 a.m. UTC | #4
> 
> There's no need to reset the librados state, so connections to the
> cluster can stick around. I'm a bit unclear on the bdrv_reopen_*
> functions though - what is their intended use and semantics?

My motivation for implementing this basic reopen support is getting
active block commit in qemu (driven by libvirt) to work for rbd.
For example:
- A VM is running and using an rbd device as its primary disk
- Create an external local qcow2 snapshot with the rbd image as its backing device
- All writes now end up in the local qcow2 file
- After a while, do an active block commit to merge the changes back into rbd

Without the patch the last step (active block commit) fails in qemu:
Block format 'rbd' used by device '' does not support reopening files

I think block commit wants to switch the rbd device to read-write to make
sure it can write back the changes, which is unnecessary in this case
because the device is already writable.
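
For reference, the error comes from the generic reopen code in block.c,
which rejects any driver that doesn't provide the callback. From memory
the check is roughly shaped like this (not a verbatim quote, details may
differ):

/* Approximate shape of the check in block.c's bdrv_reopen_prepare();
 * paraphrased from memory, not verbatim. */
if (drv->bdrv_reopen_prepare) {
    ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
} else {
    error_setg(errp, "Block format '%s' used by device '%s' "
               "does not support reopening files", drv->format_name,
               bdrv_get_device_name(reopen_state->bs));
    ret = -1;
}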

Unfortunately I don't think I can implement proper, full-blown reopen
support (like gluster's) for rbd - that would be way over my head ...

Sebastian
Kevin Wolf May 18, 2016, 8:19 a.m. UTC | #5
On 17.05.2016 at 20:48, Josh Durgin wrote:
> On 05/17/2016 03:03 AM, Sebastian Färber wrote:
> >Hi Kevin,
> >
> >>A correct reopen implementation must consider all options and flags that
> >>.bdrv_open() looked at.
> >>
> >>The options are okay, as both "filename" and "password-secret" aren't
> >>things that we want to allow a reopen to change. However, in the flags
> >>BDRV_O_NOCACHE makes a difference:
> >>
> >>     if (flags & BDRV_O_NOCACHE) {
> >>         rados_conf_set(s->cluster, "rbd_cache", "false");
> >>     } else {
> >>         rados_conf_set(s->cluster, "rbd_cache", "true");
> >>     }
> >>
> >>A reopen must either update the setting, or if it can't (e.g. because
> >>librbd doesn't support it) any attempt to change the flag must fail.
> 
> Updating this setting on an open image won't do anything, but if you
> rbd_close() and rbd_open() it again the setting will take effect.
> rbd_close() will force a flush of any pending I/O in librbd and
> free the memory for librbd's ImageCtx, which may or may not be desired
> here.

Calling rbd_close() first and then rbd_open() risks the rbd_open() failing
and leaving us with no usable image at all. Can we open a second instance
of the image first, and only close the first one if that succeeded?

We already flush all requests before calling this, so that part
shouldn't make a difference.

> >Thanks for the feedback.
> >As far as I can tell it's not possible to update the cache settings
> >without reconnecting. I've added a check in the following patch.
> >It would be great if someone who knows the internals of ceph/rbd
> >could have a look as well.
> 
> There's no need to reset the librados state, so connections to the
> cluster can stick around. I'm a bit unclear on the bdrv_reopen_*
> functions though - what is their intended use and semantics?

They change the options and flags that were specified in .bdrv_open().
The most important use case today is switching between read-only and
read-write, but changing the cache mode or any other option can be
requested as well.
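
For example, a caller that wants to make a node writable again asks for
something like the following (a simplified sketch from memory of the
current API, so the exact signatures may differ):

/* Simplified sketch: request a read-only -> read-write switch for bs. */
Error *local_err = NULL;
int new_flags = bdrv_get_flags(bs) | BDRV_O_RDWR;

if (bdrv_reopen(bs, new_flags, &local_err) < 0) {
    /* The driver's .bdrv_reopen_prepare() (or the lack of one)
     * rejected the change. */
    error_report_err(local_err);
}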

> >Sebastian
> >
> >-- >8 --
> >Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub
> >
> >Add support for reopen() by adding the .bdrv_reopen_prepare() stub
> >
> >Signed-off-by: Sebastian Färber <sfaerber82@gmail.com>
> >---
> >  block/rbd.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> >diff --git a/block/rbd.c b/block/rbd.c
> >index 5bc5b32..8ecf096 100644
> >--- a/block/rbd.c
> >+++ b/block/rbd.c
> >@@ -577,6 +577,19 @@ failed_opts:
> >      return r;
> >  }
> >
> >+/* Note that this will not re-establish a connection with the Ceph cluster
> >+   - it is effectively a NOP.  */
> >+static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
> >+                                   BlockReopenQueue *queue, Error **errp)
> >+{
> >+    if (state->flags & BDRV_O_NOCACHE &&
> >+        ((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {

This misses the other direction, where you try to turn on caching. If we
don't implement the real functionality, we should always error out if
the bit changes. The most readable check is probably:

(state->flags & BDRV_O_NOCACHE) != (state->bs->open_flags & BDRV_O_NOCACHE)

> >+        error_setg(errp, "Cannot turn off rbd_cache during reopen");
> >+        return -EINVAL;
> >+    }
> >+    return 0;
> >+}
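
So the whole stub would become something like this (untested sketch):

/* Untested sketch with the symmetric check applied. */
static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
                                   BlockReopenQueue *queue, Error **errp)
{
    /* librbd can't retoggle rbd_cache on an open image, so reject a
     * change of the cache bit in either direction. */
    if ((state->flags & BDRV_O_NOCACHE) !=
        (state->bs->open_flags & BDRV_O_NOCACHE)) {
        error_setg(errp, "Cannot change rbd_cache setting during reopen");
        return -EINVAL;
    }
    return 0;
}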

Kevin
Jason Dillaman May 18, 2016, 3:54 p.m. UTC | #6
On Wed, May 18, 2016 at 4:19 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Updating this setting on an open image won't do anything, but if you
>> rbd_close() and rbd_open() it again the setting will take effect.
>> rbd_close() will force a flush of any pending I/O in librbd and
>> free the memory for librbd's ImageCtx, which may or may not be desired
>> here.
>
> Calling rbd_close() first and then rbd_open() risks the rbd_open() failing
> and leaving us with no usable image at all. Can we open a second instance
> of the image first, and only close the first one if that succeeded?
>
> We already flush all requests before calling this, so that part
> shouldn't make a difference.


You can open the same image twice without issue.  It's perfectly fine
to rbd_open() with the new settings and upon success rbd_close() the
original image handle.
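
In other words, something shaped like the function below is safe. The
names are illustrative only and error handling is kept to the essentials:

#include <rbd/librbd.h>
#include <stdio.h>

/* Open a second handle to the same image, and only drop the original
 * handle once the new open has succeeded.  Illustrative sketch. */
static int switch_image_handle(rados_ioctx_t io_ctx, const char *name,
                               rbd_image_t *image)
{
    rbd_image_t new_image;
    int r = rbd_open(io_ctx, name, &new_image, NULL);  /* second open */
    if (r < 0) {
        fprintf(stderr, "reopen failed (%d), keeping old handle\n", r);
        return r;          /* original handle untouched and still usable */
    }
    rbd_close(*image);     /* old handle: flushed and released */
    *image = new_image;    /* switch over */
    return 0;
}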

Patch

diff --git a/block/rbd.c b/block/rbd.c
index 5bc5b32..8ecf096 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -577,6 +577,19 @@ failed_opts:
     return r;
 }

+/* Note that this will not re-establish a connection with the Ceph cluster
+   - it is effectively a NOP.  */
+static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
+                                   BlockReopenQueue *queue, Error **errp)
+{
+    if (state->flags & BDRV_O_NOCACHE &&
+        ((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {
+        error_setg(errp, "Cannot turn off rbd_cache during reopen");
+        return -EINVAL;
+    }
+    return 0;
+}
+
 static void qemu_rbd_close(BlockDriverState *bs)
 {
     BDRVRBDState *s = bs->opaque;
@@ -976,6 +989,7 @@ static BlockDriver bdrv_rbd = {
     .instance_size      = sizeof(BDRVRBDState),
     .bdrv_needs_filename = true,
     .bdrv_file_open     = qemu_rbd_open,
+    .bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
     .bdrv_close         = qemu_rbd_close,
     .bdrv_create        = qemu_rbd_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,