
[for-2.6,2/2] block/gluster: prevent data loss after i/o error

Message ID d9445c47de98ad9142a0e40598f7c8152ed833d1.1459913103.git.jcody@redhat.com
State New

Commit Message

Jeff Cody April 6, 2016, 3:29 a.m. UTC
Upon receiving an I/O error after an fsync, by default gluster will
dump its cache.  However, QEMU will retry the fsync, which is especially
useful when encountering errors such as ENOSPC when using the werror=stop
option.  When using caching with gluster, however, the last written data
will be lost upon encountering ENOSPC.  Using the cache xlator option of
'resync-failed-syncs-after-fsync' should cause gluster to retain the
cached data after a failed fsync, so that ENOSPC and other transient
errors are recoverable.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/gluster.c | 27 +++++++++++++++++++++++++++
 configure       |  8 ++++++++
 2 files changed, 35 insertions(+)

Comments

Kevin Wolf April 6, 2016, 11:02 a.m. UTC | #1
[ Adding some CCs ]

Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> Upon receiving an I/O error after an fsync, by default gluster will
> dump its cache.  However, QEMU will retry the fsync, which is especially
> useful when encountering errors such as ENOSPC when using the werror=stop
> option.  When using caching with gluster, however, the last written data
> will be lost upon encountering ENOSPC.  Using the cache xlator option of
> 'resync-failed-syncs-after-fsync' should cause gluster to retain the
> cached data after a failed fsync, so that ENOSPC and other transient
> errors are recoverable.
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  block/gluster.c | 27 +++++++++++++++++++++++++++
>  configure       |  8 ++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index 30a827e..b1cf71b 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
>          goto out;
>      }
>  
> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> +    /* Without this, if fsync fails for a recoverable reason (for instance,
> +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> +     * almost certain data loss.  Not all gluster versions support the
> +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> +     * discover during runtime if it is supported (this api returns success for
> +     * unknown key/value pairs) */

Honestly, this sucks. There is apparently no way to operate gluster so
we can safely recover after a failed fsync. "We hope everything is fine,
but depending on your gluster version, we may now corrupt your image"
isn't very good.

We need to consider very carefully if this is good enough to go on after
an error. I'm currently leaning towards "no". That is, we should only
enable this after Gluster provides us a way to make sure that the option
is really set.

> +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> +                                           "resync-failed-syncs-after-fsync",
> +                                           "on");
> +    if (ret < 0) {
> +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> +        ret = -errno;
> +        goto out;
> +    }
> +#endif

We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
In this case (as well as theoretically in the case that the option
didn't take effect - if only we could know about it), a failed
glfs_fsync_async() is fatal and we need to stop operating on the image,
i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
The guest will see a broken disk that fails all I/O requests, but that's
better than corrupting data.

Kevin
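
As a rough sketch of the handling described above (this is not code from the
patch; the function name, the reduced state struct and the 3.x-era
single-argument, synchronous glfs_fsync() call are assumptions made to keep
the example self-contained, while the real driver flushes asynchronously):

    #include <errno.h>
    #include <glusterfs/api/glfs.h>
    #include "block/block_int.h"  /* BlockDriverState; assumes a QEMU tree */

    /* Reduced driver state for the sketch; the real BDRVGlusterState
     * in block/gluster.c has more fields. */
    typedef struct BDRVGlusterSketchState {
        struct glfs *glfs;
        struct glfs_fd *fd;
    } BDRVGlusterSketchState;

    static int qemu_gluster_flush_sketch(BlockDriverState *bs)
    {
        BDRVGlusterSketchState *s = bs->opaque;

        if (glfs_fsync(s->fd) < 0) {
    #ifndef CONFIG_GLUSTERFS_XLATOR_OPT
            /* We could not ask gluster to keep failed writes cached, so a
             * failed fsync may already have dropped data.  Detach the
             * driver so that all further requests on this image fail
             * instead of risking silent corruption. */
            bs->drv = NULL;
    #endif
            return -errno;
        }
        return 0;
    }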
Ric Wheeler April 6, 2016, 11:19 a.m. UTC | #2
We had a thread discussing this not on the upstream list.

My summary of the thread is that I don't understand why gluster should drop 
cached data after a failed fsync() for any open file. For closed files, I think 
it might still happen but this is the same as any file system (and unlikely to 
be the case for qemu?).

I will note that Linux in general had (still has I think?) the behavior that 
once the process closes a file (or exits), we lose context to return an error 
to. From that point on, any failed IO from the page cache to the target disk 
will be dropped from cache. To hold things in the cache would lead it to fill 
with old data that is not really recoverable and we have no good way to know 
that the situation is repairable and how long that might take. Upstream kernel 
people have debated this, the behavior might be tweaked for certain types of errors.

Regards,

Ric


On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> [ Adding some CCs ]
>
> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
>> Upon receiving an I/O error after an fsync, by default gluster will
>> dump its cache.  However, QEMU will retry the fsync, which is especially
>> useful when encountering errors such as ENOSPC when using the werror=stop
>> option.  When using caching with gluster, however, the last written data
>> will be lost upon encountering ENOSPC.  Using the cache xlator option of
>> 'resync-failed-syncs-after-fsync' should cause gluster to retain the
>> cached data after a failed fsync, so that ENOSPC and other transient
>> errors are recoverable.
>>
>> Signed-off-by: Jeff Cody <jcody@redhat.com>
>> ---
>>   block/gluster.c | 27 +++++++++++++++++++++++++++
>>   configure       |  8 ++++++++
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/block/gluster.c b/block/gluster.c
>> index 30a827e..b1cf71b 100644
>> --- a/block/gluster.c
>> +++ b/block/gluster.c
>> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
>>           goto out;
>>       }
>>   
>> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
>> +    /* Without this, if fsync fails for a recoverable reason (for instance,
>> +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
>> +     * almost certain data loss.  Not all gluster versions support the
>> +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
>> +     * discover during runtime if it is supported (this api returns success for
>> +     * unknown key/value pairs) */
> Honestly, this sucks. There is apparently no way to operate gluster so
> we can safely recover after a failed fsync. "We hope everything is fine,
> but depending on your gluster version, we may now corrupt your image"
> isn't very good.
>
> We need to consider very carefully if this is good enough to go on after
> an error. I'm currently leaning towards "no". That is, we should only
> enable this after Gluster provides us a way to make sure that the option
> is really set.
>
>> +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
>> +                                           "resync-failed-syncs-after-fsync",
>> +                                           "on");
>> +    if (ret < 0) {
>> +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
>> +        ret = -errno;
>> +        goto out;
>> +    }
>> +#endif
> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> In this case (as well as theoretically in the case that the option
> didn't take effect - if only we could know about it), a failed
> glfs_fsync_async() is fatal and we need to stop operating on the image,
> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> The guest will see a broken disk that fails all I/O requests, but that's
> better than corrupting data.
>
> Kevin
Kevin Wolf April 6, 2016, 11:41 a.m. UTC | #3
Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> 
> We had a thread discussing this not on the upstream list.
> 
> My summary of the thread is that I don't understand why gluster
> should drop cached data after a failed fsync() for any open file.

It certainly shouldn't, but it does by default. :-)

Have a look at commit 3fcead2d in glusterfs.git, which at least
introduces an option to get usable behaviour:

    { .key = {"resync-failed-syncs-after-fsync"},
      .type = GF_OPTION_TYPE_BOOL,
      .default_value = "off",
      .description = "If sync of \"cached-writes issued before fsync\" "
                     "(to backend) fails, this option configures whether "
                     "to retry syncing them after fsync or forget them. "
                     "If set to on, cached-writes are retried "
                     "till a \"flush\" fop (or a successful sync) on sync "
                     "failures. "
                     "fsync itself is failed irrespective of the value of "
                     "this option. ",
    },

As you can see, the default is still to drop cached data, and this is
with the file still opened. qemu needs to make sure that this option is
set, and if Jeff's comment in the code below is right, there is no way
currently to make sure that the option isn't silently ignored.

Can we get some function that sets an option and fails if the option is
unknown? Or one that queries the state after setting an option, so we
can check whether we succeeded in switching to the mode we need?
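
As a concrete sketch of what such an interface might look like (note that
glfs_get_xlator_option() below does not exist in libgfapi; both it and the
wrapper function are purely hypothetical):

    #include <stdbool.h>
    #include <string.h>
    #include <glusterfs/api/glfs.h>

    /* HYPOTHETICAL: libgfapi does not provide this query function; it only
     * illustrates the kind of API requested above. */
    int glfs_get_xlator_option(glfs_t *fs, const char *xlator,
                               const char *key, char *value, size_t len);

    /* Set the write-behind option, then verify it actually took effect. */
    static bool wb_resync_option_verified(glfs_t *fs)
    {
        char val[8];

        if (glfs_set_xlator_option(fs, "*-write-behind",
                                   "resync-failed-syncs-after-fsync",
                                   "on") < 0) {
            return false;
        }
        return glfs_get_xlator_option(fs, "*-write-behind",
                                      "resync-failed-syncs-after-fsync",
                                      val, sizeof(val)) == 0 &&
               strcmp(val, "on") == 0;
    }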

> For closed files, I think it might still happen but this is the same
> as any file system (and unlikely to be the case for qemu?).

Our problem is only with open images. Dropping caches for files that
qemu doesn't use any more is fine as far as I'm concerned.

Note that our usage can involve cases where we reopen a file with
different flags, i.e. first open a second file descriptor, then close
the first one. The image was never completely closed here and we would
still want the cache to preserve our data in such cases.

> I will note that Linux in general had (still has I think?) the
> behavior that once the process closes a file (or exits), we lose
> context to return an error to. From that point on, any failed IO
> from the page cache to the target disk will be dropped from cache.
> To hold things in the cache would lead it to fill with old data that
> is not really recoverable and we have no good way to know that the
> situation is repairable and how long that might take. Upstream
> kernel people have debated this, the behavior might be tweaked for
> certain types of errors.

That's fine, we just don't want the next fsync() to signal success when
in reality the cache has thrown away our data. As soon as we close the
image, there is no next fsync(), so you can do whatever you like.

Kevin

> On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> >[ Adding some CCs ]
> >
> >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> >>Upon receiving an I/O error after an fsync, by default gluster will
> >>dump its cache.  However, QEMU will retry the fsync, which is especially
> >>useful when encountering errors such as ENOSPC when using the werror=stop
> >>option.  When using caching with gluster, however, the last written data
> >>will be lost upon encountering ENOSPC.  Using the cache xlator option of
> >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> >>cached data after a failed fsync, so that ENOSPC and other transient
> >>errors are recoverable.
> >>
> >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> >>---
> >>  block/gluster.c | 27 +++++++++++++++++++++++++++
> >>  configure       |  8 ++++++++
> >>  2 files changed, 35 insertions(+)
> >>
> >>diff --git a/block/gluster.c b/block/gluster.c
> >>index 30a827e..b1cf71b 100644
> >>--- a/block/gluster.c
> >>+++ b/block/gluster.c
> >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> >>          goto out;
> >>      }
> >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> >>+    /* Without this, if fsync fails for a recoverable reason (for instance,
> >>+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> >>+     * almost certain data loss.  Not all gluster versions support the
> >>+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> >>+     * discover during runtime if it is supported (this api returns success for
> >>+     * unknown key/value pairs) */
> >Honestly, this sucks. There is apparently no way to operate gluster so
> >we can safely recover after a failed fsync. "We hope everything is fine,
> >but depending on your gluster version, we may now corrupt your image"
> >isn't very good.
> >
> >We need to consider very carefully if this is good enough to go on after
> >an error. I'm currently leaning towards "no". That is, we should only
> >enable this after Gluster provides us a way to make sure that the option
> >is really set.
> >
> >>+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> >>+                                           "resync-failed-syncs-after-fsync",
> >>+                                           "on");
> >>+    if (ret < 0) {
> >>+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> >>+        ret = -errno;
> >>+        goto out;
> >>+    }
> >>+#endif
> >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> >In this case (as well as theoretically in the case that the option
> >didn't take effect - if only we could know about it), a failed
> >glfs_fsync_async() is fatal and we need to stop operating on the image,
> >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> >The guest will see a broken disk that fails all I/O requests, but that's
> >better than corrupting data.
> >
> >Kevin
>
Kevin Wolf April 6, 2016, 11:51 a.m. UTC | #4
Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
> Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> > 
> > We had a thread discussing this not on the upstream list.
> > 
> > My summary of the thread is that I don't understand why gluster
> > should drop cached data after a failed fsync() for any open file.
> 
> It certainly shouldn't, but it does by default. :-)
> 
> Have a look at commit 3fcead2d in glusterfs.git, which at least
> introduces an option to get usable behaviour:
> 
>     { .key = {"resync-failed-syncs-after-fsync"},
>       .type = GF_OPTION_TYPE_BOOL,
>       .default_value = "off",
>       .description = "If sync of \"cached-writes issued before fsync\" "
>                      "(to backend) fails, this option configures whether "
>                      "to retry syncing them after fsync or forget them. "
>                      "If set to on, cached-writes are retried "
>                      "till a \"flush\" fop (or a successful sync) on sync "
>                      "failures. "
>                      "fsync itself is failed irrespective of the value of "
>                      "this option. ",
>     },
> 
> As you can see, the default is still to drop cached data, and this is
> with the file still opened. qemu needs to make sure that this option is
> set, and if Jeff's comment in the code below is right, there is no way
> currently to make sure that the option isn't silently ignored.
> 
> Can we get some function that sets an option and fails if the option is
> unknown? Or one that queries the state after setting an option, so we
> can check whether we succeeded in switching to the mode we need?
> 
> > For closed files, I think it might still happen but this is the same
> > as any file system (and unlikely to be the case for qemu?).
> 
> Our problem is only with open images. Dropping caches for files that
> qemu doesn't use any more is fine as far as I'm concerned.
> 
> Note that our usage can involve cases where we reopen a file with
> different flags, i.e. first open a second file descriptor, then close
> the first one. The image was never completely closed here and we would
> still want the cache to preserve our data in such cases.

Hm, actually, maybe we should just call bdrv_flush() before reopening an
image, and if an error is returned, we abort the reopen. It's far from
being a hot path, so the overhead of a flush shouldn't matter, and it
seems we're taking an unnecessary risk without doing this.

Kevin

> > I will note that Linux in general had (still has I think?) the
> > behavior that once the process closes a file (or exits), we lose
> > context to return an error to. From that point on, any failed IO
> > from the page cache to the target disk will be dropped from cache.
> > To hold things in the cache would lead it to fill with old data that
> > is not really recoverable and we have no good way to know that the
> > situation is repairable and how long that might take. Upstream
> > kernel people have debated this, the behavior might be tweaked for
> > certain types of errors.
> 
> That's fine, we just don't want the next fsync() to signal success when
> in reality the cache has thrown away our data. As soon as we close the
> image, there is no next fsync(), so you can do whatever you like.
> 
> Kevin
> 
> > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > >[ Adding some CCs ]
> > >
> > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > >>Upon receiving an I/O error after an fsync, by default gluster will
> > >>dump its cache.  However, QEMU will retry the fsync, which is especially
> > >>useful when encountering errors such as ENOSPC when using the werror=stop
> > >>option.  When using caching with gluster, however, the last written data
> > >>will be lost upon encountering ENOSPC.  Using the cache xlator option of
> > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > >>cached data after a failed fsync, so that ENOSPC and other transient
> > >>errors are recoverable.
> > >>
> > >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> > >>---
> > >>  block/gluster.c | 27 +++++++++++++++++++++++++++
> > >>  configure       |  8 ++++++++
> > >>  2 files changed, 35 insertions(+)
> > >>
> > >>diff --git a/block/gluster.c b/block/gluster.c
> > >>index 30a827e..b1cf71b 100644
> > >>--- a/block/gluster.c
> > >>+++ b/block/gluster.c
> > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > >>          goto out;
> > >>      }
> > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > >>+    /* Without this, if fsync fails for a recoverable reason (for instance,
> > >>+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > >>+     * almost certain data loss.  Not all gluster versions support the
> > >>+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > >>+     * discover during runtime if it is supported (this api returns success for
> > >>+     * unknown key/value pairs) */
> > >Honestly, this sucks. There is apparently no way to operate gluster so
> > >we can safely recover after a failed fsync. "We hope everything is fine,
> > >but depending on your gluster version, we may now corrupt your image"
> > >isn't very good.
> > >
> > >We need to consider very carefully if this is good enough to go on after
> > >an error. I'm currently leaning towards "no". That is, we should only
> > >enable this after Gluster provides us a way to make sure that the option
> > >is really set.
> > >
> > >>+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > >>+                                           "resync-failed-syncs-after-fsync",
> > >>+                                           "on");
> > >>+    if (ret < 0) {
> > >>+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > >>+        ret = -errno;
> > >>+        goto out;
> > >>+    }
> > >>+#endif
> > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > >In this case (as well as theoretically in the case that the option
> > >didn't take effect - if only we could know about it), a failed
> > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > >The guest will see a broken disk that fails all I/O requests, but that's
> > >better than corrupting data.
> > >
> > >Kevin
> > 
>
Jeff Cody April 6, 2016, 12:44 p.m. UTC | #5
On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote:
> [ Adding some CCs ]
> 
> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > Upon receiving an I/O error after an fsync, by default gluster will
> > dump its cache.  However, QEMU will retry the fsync, which is especially
> > useful when encountering errors such as ENOSPC when using the werror=stop
> > option.  When using caching with gluster, however, the last written data
> > will be lost upon encountering ENOSPC.  Using the cache xlator option of
> > 'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > cached data after a failed fsync, so that ENOSPC and other transient
> > errors are recoverable.
> > 
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > ---
> >  block/gluster.c | 27 +++++++++++++++++++++++++++
> >  configure       |  8 ++++++++
> >  2 files changed, 35 insertions(+)
> > 
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 30a827e..b1cf71b 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> >          goto out;
> >      }
> >  
> > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > +    /* Without this, if fsync fails for a recoverable reason (for instance,
> > +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > +     * almost certain data loss.  Not all gluster versions support the
> > +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > +     * discover during runtime if it is supported (this api returns success for
> > +     * unknown key/value pairs) */
> 
> Honestly, this sucks. There is apparently no way to operate gluster so
> we can safely recover after a failed fsync. "We hope everything is fine,
> but depending on your gluster version, we may now corrupt your image"
> isn't very good.
> 
> We need to consider very carefully if this is good enough to go on after
> an error. I'm currently leaning towards "no". That is, we should only
> enable this after Gluster provides us a way to make sure that the option
> is really set.
> 
> > +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > +                                           "resync-failed-syncs-after-fsync",
> > +                                           "on");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > +        ret = -errno;
> > +        goto out;
> > +    }
> > +#endif
> 
> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> In this case (as well as theoretically in the case that the option
> didn't take effect - if only we could know about it), a failed
> glfs_fsync_async() is fatal and we need to stop operating on the image,
> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> The guest will see a broken disk that fails all I/O requests, but that's
> better than corrupting data.
>

Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will
also not have the gluster patch that removes the file descriptor
invalidation upon error (unless that was a relatively new
bug/feature).  But if that is the case, every operation on the file
descriptor in those versions will return an error.  I believe it is
also only rather old versions that lack glfs_set_xlator_option().
Jeff Cody April 6, 2016, 1:10 p.m. UTC | #6
On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
> > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> > > 
> > > We had a thread discussing this not on the upstream list.
> > > 
> > > My summary of the thread is that I don't understand why gluster
> > > should drop cached data after a failed fsync() for any open file.
> > 
> > It certainly shouldn't, but it does by default. :-)
> > 
> > Have a look at commit 3fcead2d in glusterfs.git, which at least
> > introduces an option to get usable behaviour:
> > 
> >     { .key = {"resync-failed-syncs-after-fsync"},
> >       .type = GF_OPTION_TYPE_BOOL,
> >       .default_value = "off",
> >       .description = "If sync of \"cached-writes issued before fsync\" "
> >                      "(to backend) fails, this option configures whether "
> >                      "to retry syncing them after fsync or forget them. "
> >                      "If set to on, cached-writes are retried "
> >                      "till a \"flush\" fop (or a successful sync) on sync "
> >                      "failures. "
> >                      "fsync itself is failed irrespective of the value of "
> >                      "this option. ",
> >     },
> > 
> > As you can see, the default is still to drop cached data, and this is
> > with the file still opened. qemu needs to make sure that this option is
> > set, and if Jeff's comment in the code below is right, there is no way
> > currently to make sure that the option isn't silently ignored.
> > 
> > Can we get some function that sets an option and fails if the option is
> > unknown? Or one that queries the state after setting an option, so we
> > can check whether we succeeded in switching to the mode we need?
> > 
> > > For closed files, I think it might still happen but this is the same
> > > as any file system (and unlikely to be the case for qemu?).
> > 
> > Our problem is only with open images. Dropping caches for files that
> > qemu doesn't use any more is fine as far as I'm concerned.
> > 
> > Note that our usage can involve cases where we reopen a file with
> > different flags, i.e. first open a second file descriptor, then close
> > the first one. The image was never completely closed here and we would
> > still want the cache to preserve our data in such cases.
> 
> Hm, actually, maybe we should just call bdrv_flush() before reopening an
> image, and if an error is returned, we abort the reopen. It's far from
> being a hot path, so the overhead of a flush shouldn't matter, and it
> seems we're taking an unnecessary risk without doing this.
>

[I seem to have been dropped from the cc]

Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
reop_s->glfs)?  Because otherwise, we already do this in
bdrv_reopen_prepare() on the original fd.  It happens right before the call
to drv->bdrv_reopen_prepare():


2020     ret = bdrv_flush(reopen_state->bs);
2021     if (ret) {
2022         error_setg_errno(errp, -ret, "Error flushing drive");
2023         goto error;
2024     }
2025 
2026     if (drv->bdrv_reopen_prepare) {
2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);

> 
> > > I will note that Linux in general had (still has I think?) the
> > > behavior that once the process closes a file (or exits), we lose
> > > context to return an error to. From that point on, any failed IO
> > > from the page cache to the target disk will be dropped from cache.
> > > To hold things in the cache would lead it to fill with old data that
> > > is not really recoverable and we have no good way to know that the
> > > situation is repairable and how long that might take. Upstream
> > > kernel people have debated this, the behavior might be tweaked for
> > > certain types of errors.
> > 
> > That's fine, we just don't want the next fsync() to signal success when
> > in reality the cache has thrown away our data. As soon as we close the
> > image, there is no next fsync(), so you can do whatever you like.
> > 
> > Kevin
> > 
> > > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > > >[ Adding some CCs ]
> > > >
> > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > >>Upon receiving an I/O error after an fsync, by default gluster will
> > > >>dump its cache.  However, QEMU will retry the fsync, which is especially
> > > >>useful when encountering errors such as ENOSPC when using the werror=stop
> > > >>option.  When using caching with gluster, however, the last written data
> > > >>will be lost upon encountering ENOSPC.  Using the cache xlator option of
> > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > >>cached data after a failed fsync, so that ENOSPC and other transient
> > > >>errors are recoverable.
> > > >>
> > > >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > >>---
> > > >>  block/gluster.c | 27 +++++++++++++++++++++++++++
> > > >>  configure       |  8 ++++++++
> > > >>  2 files changed, 35 insertions(+)
> > > >>
> > > >>diff --git a/block/gluster.c b/block/gluster.c
> > > >>index 30a827e..b1cf71b 100644
> > > >>--- a/block/gluster.c
> > > >>+++ b/block/gluster.c
> > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > > >>          goto out;
> > > >>      }
> > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > >>+    /* Without this, if fsync fails for a recoverable reason (for instance,
> > > >>+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > > >>+     * almost certain data loss.  Not all gluster versions support the
> > > >>+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > >>+     * discover during runtime if it is supported (this api returns success for
> > > >>+     * unknown key/value pairs) */
> > > >Honestly, this sucks. There is apparently no way to operate gluster so
> > > >we can safely recover after a failed fsync. "We hope everything is fine,
> > > >but depending on your gluster version, we may now corrupt your image"
> > > >isn't very good.
> > > >
> > > >We need to consider very carefully if this is good enough to go on after
> > > >an error. I'm currently leaning towards "no". That is, we should only
> > > >enable this after Gluster provides us a way to make sure that the option
> > > >is really set.
> > > >
> > > >>+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > >>+                                           "resync-failed-syncs-after-fsync",
> > > >>+                                           "on");
> > > >>+    if (ret < 0) {
> > > >>+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > >>+        ret = -errno;
> > > >>+        goto out;
> > > >>+    }
> > > >>+#endif
> > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > >In this case (as well as theoretically in the case that the option
> > > >didn't take effect - if only we could know about it), a failed
> > > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > >The guest will see a broken disk that fails all I/O requests, but that's
> > > >better than corrupting data.
> > > >
> > > >Kevin
> > > 
> > 
>
Kevin Wolf April 6, 2016, 1:20 p.m. UTC | #7
Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben:
> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
> > Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
> > > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> > > > 
> > > > We had a thread discussing this not on the upstream list.
> > > > 
> > > > My summary of the thread is that I don't understand why gluster
> > > > should drop cached data after a failed fsync() for any open file.
> > > 
> > > It certainly shouldn't, but it does by default. :-)
> > > 
> > > Have a look at commit 3fcead2d in glusterfs.git, which at least
> > > introduces an option to get usable behaviour:
> > > 
> > >     { .key = {"resync-failed-syncs-after-fsync"},
> > >       .type = GF_OPTION_TYPE_BOOL,
> > >       .default_value = "off",
> > >       .description = "If sync of \"cached-writes issued before fsync\" "
> > >                      "(to backend) fails, this option configures whether "
> > >                      "to retry syncing them after fsync or forget them. "
> > >                      "If set to on, cached-writes are retried "
> > >                      "till a \"flush\" fop (or a successful sync) on sync "
> > >                      "failures. "
> > >                      "fsync itself is failed irrespective of the value of "
> > >                      "this option. ",
> > >     },
> > > 
> > > As you can see, the default is still to drop cached data, and this is
> > > with the file still opened. qemu needs to make sure that this option is
> > > set, and if Jeff's comment in the code below is right, there is no way
> > > currently to make sure that the option isn't silently ignored.
> > > 
> > > Can we get some function that sets an option and fails if the option is
> > > unknown? Or one that queries the state after setting an option, so we
> > > can check whether we succeeded in switching to the mode we need?
> > > 
> > > > For closed files, I think it might still happen but this is the same
> > > > as any file system (and unlikely to be the case for qemu?).
> > > 
> > > Our problem is only with open images. Dropping caches for files that
> > > qemu doesn't use any more is fine as far as I'm concerned.
> > > 
> > > Note that our usage can involve cases where we reopen a file with
> > > different flags, i.e. first open a second file descriptor, then close
> > > the first one. The image was never completely closed here and we would
> > > still want the cache to preserve our data in such cases.
> > 
> > Hm, actually, maybe we should just call bdrv_flush() before reopening an
> > image, and if an error is returned, we abort the reopen. It's far from
> > being a hot path, so the overhead of a flush shouldn't matter, and it
> > seems we're taking an unnecessary risk without doing this.
> >
> 
> [I seemed to have been dropped from the cc]
> 
> Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
> reop_s->glfs)?  Because otherwise, we already do this in
> bdrv_reopen_prepare() on the original fd.  It happens right before the call
> to drv->bdrv_reopen_prepare():
> 
> 
> 2020     ret = bdrv_flush(reopen_state->bs);
> 2021     if (ret) {
> 2022         error_setg_errno(errp, -ret, "Error flushing drive");
> 2023         goto error;
> 2024     }
> 2025 
> 2026     if (drv->bdrv_reopen_prepare) {
> 2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);

Ah, thanks. Yes, this is what I meant. I expected it somewhere close to
the bdrv_drain_all() call, so I missed the call you quoted. So that's
good news, at least this part of the problem doesn't exist then. :-)

Kevin

> > 
> > > > I will note that Linux in general had (still has I think?) the
> > > > behavior that once the process closes a file (or exits), we lose
> > > > context to return an error to. From that point on, any failed IO
> > > > from the page cache to the target disk will be dropped from cache.
> > > > To hold things in the cache would lead it to fill with old data that
> > > > is not really recoverable and we have no good way to know that the
> > > > situation is repairable and how long that might take. Upstream
> > > > kernel people have debated this, the behavior might be tweaked for
> > > > certain types of errors.
> > > 
> > > That's fine, we just don't want the next fsync() to signal success when
> > > in reality the cache has thrown away our data. As soon as we close the
> > > image, there is no next fsync(), so you can do whatever you like.
> > > 
> > > Kevin
> > > 
> > > > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > > > >[ Adding some CCs ]
> > > > >
> > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > > >>Upon receiving an I/O error after an fsync, by default gluster will
> > > > >>dump its cache.  However, QEMU will retry the fsync, which is especially
> > > > >>useful when encountering errors such as ENOSPC when using the werror=stop
> > > > >>option.  When using caching with gluster, however, the last written data
> > > > >>will be lost upon encountering ENOSPC.  Using the cache xlator option of
> > > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > > >>cached data after a failed fsync, so that ENOSPC and other transient
> > > > >>errors are recoverable.
> > > > >>
> > > > >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > > >>---
> > > > >>  block/gluster.c | 27 +++++++++++++++++++++++++++
> > > > >>  configure       |  8 ++++++++
> > > > >>  2 files changed, 35 insertions(+)
> > > > >>
> > > > >>diff --git a/block/gluster.c b/block/gluster.c
> > > > >>index 30a827e..b1cf71b 100644
> > > > >>--- a/block/gluster.c
> > > > >>+++ b/block/gluster.c
> > > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > > > >>          goto out;
> > > > >>      }
> > > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > > >>+    /* Without this, if fsync fails for a recoverable reason (for instance,
> > > > >>+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > > > >>+     * almost certain data loss.  Not all gluster versions support the
> > > > >>+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > > >>+     * discover during runtime if it is supported (this api returns success for
> > > > >>+     * unknown key/value pairs) */
> > > > >Honestly, this sucks. There is apparently no way to operate gluster so
> > > > >we can safely recover after a failed fsync. "We hope everything is fine,
> > > > >but depending on your gluster version, we may now corrupt your image"
> > > > >isn't very good.
> > > > >
> > > > >We need to consider very carefully if this is good enough to go on after
> > > > >an error. I'm currently leaning towards "no". That is, we should only
> > > > >enable this after Gluster provides us a way to make sure that the option
> > > > >is really set.
> > > > >
> > > > >>+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > > >>+                                           "resync-failed-syncs-after-fsync",
> > > > >>+                                           "on");
> > > > >>+    if (ret < 0) {
> > > > >>+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > > >>+        ret = -errno;
> > > > >>+        goto out;
> > > > >>+    }
> > > > >>+#endif
> > > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > > >In this case (as well as theoretically in the case that the option
> > > > >didn't take effect - if only we could know about it), a failed
> > > > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > > >The guest will see a broken disk that fails all I/O requests, but that's
> > > > >better than corrupting data.
> > > > >
> > > > >Kevin
> > > > 
> > > 
> >
Niels de Vos April 6, 2016, 2:47 p.m. UTC | #8
On Wed, Apr 06, 2016 at 08:44:18AM -0400, Jeff Cody wrote:
> On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote:
> > [ Adding some CCs ]
> > 
> > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > Upon receiving an I/O error after an fsync, by default gluster will
> > > dump its cache.  However, QEMU will retry the fsync, which is especially
> > > useful when encountering errors such as ENOSPC when using the werror=stop
> > > option.  When using caching with gluster, however, the last written data
> > > will be lost upon encountering ENOSPC.  Using the cache xlator option of

Use "write-behind xlator" instead of "cache xlator". There are different
caches in Gluster.

> > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > cached data after a failed fsync, so that ENOSPC and other transient
> > > errors are recoverable.
> > > 
> > > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > ---
> > >  block/gluster.c | 27 +++++++++++++++++++++++++++
> > >  configure       |  8 ++++++++
> > >  2 files changed, 35 insertions(+)
> > > 
> > > diff --git a/block/gluster.c b/block/gluster.c
> > > index 30a827e..b1cf71b 100644
> > > --- a/block/gluster.c
> > > +++ b/block/gluster.c
> > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > >          goto out;
> > >      }
> > >  
> > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > +    /* Without this, if fsync fails for a recoverable reason (for instance,
> > > +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > > +     * almost certain data loss.  Not all gluster versions support the
> > > +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > +     * discover during runtime if it is supported (this api returns success for
> > > +     * unknown key/value pairs) */
> > 
> > Honestly, this sucks. There is apparently no way to operate gluster so
> > we can safely recover after a failed fsync. "We hope everything is fine,
> > but depending on your gluster version, we may now corrupt your image"
> > isn't very good.
> > 
> > We need to consider very carefully if this is good enough to go on after
> > an error. I'm currently leaning towards "no". That is, we should only
> > enable this after Gluster provides us a way to make sure that the option
> > is really set.

Unfortunately it is also possible to disable the write-behind xlator
entirely. This would cause setting the option to fail too :-/ At the moment
there is no real useful way to detect if write-behind is disabled (it is
enabled by default).

> > > +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > +                                           "resync-failed-syncs-after-fsync",
> > > +                                           "on");
> > > +    if (ret < 0) {
> > > +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > +        ret = -errno;
> > > +        goto out;
> > > +    }
> > > +#endif
> > 
> > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > In this case (as well as theoretically in the case that the option
> > didn't take effect - if only we could know about it), a failed
> > glfs_fsync_async() is fatal and we need to stop operating on the image,
> > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > The guest will see a broken disk that fails all I/O requests, but that's
> > better than corrupting data.
> >
> 
> Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will
> also not have the gluster patch that removes the file descriptor
> invalidation upon error (unless that was a relatively new
> bug/feature).  But if that is the case, every operation on the file
> descriptor in those versions will return error.  But it is also rather
> old versions that don't support glfs_set_xlator_option() I believe.

Indeed, glfs_set_xlator_option() was introduced with glusterfs-3.4.0. We
are currently on glusterfs-3.7, with the oldest supported version of
3.5. In ~2 months we hopefully have a 3.8 release and that will cause
the end-of-life of 3.5. 3.4 has been EOL for about a year now, hopefully
all our users have upgraded, but we know that some users will stay on
unsupported versions for a long time...

However, the "resync-failed-syncs-after-fsync" option was only
introduced recently, with glusterfs-3.7.9. You could detect this with
pkg-config glusterfs-api >= 7.3.7.9 if need be.

More details about the problem the option addresses are in the commit
message on http://review.gluster.org/13057 .

HTH,
Niels
Jeff Cody April 6, 2016, 4:05 p.m. UTC | #9
On Wed, Apr 06, 2016 at 04:47:32PM +0200, Niels de Vos wrote:
> On Wed, Apr 06, 2016 at 08:44:18AM -0400, Jeff Cody wrote:
> > On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote:
> > > [ Adding some CCs ]
> > > 
> > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > > Upon receiving an I/O error after an fsync, by default gluster will
> > > > dump its cache.  However, QEMU will retry the fsync, which is especially
> > > > useful when encountering errors such as ENOSPC when using the werror=stop
> > > > option.  When using caching with gluster, however, the last written data
> > > > will be lost upon encountering ENOSPC.  Using the cache xlator option of
> 
> Use "write-behind xlator" instead of "cache xlator". There are different
> caches in Gluster.
>

Thanks, I'll update the commit message.

> > > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > > cached data after a failed fsync, so that ENOSPC and other transient
> > > > errors are recoverable.
> > > > 
> > > > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > > ---
> > > >  block/gluster.c | 27 +++++++++++++++++++++++++++
> > > >  configure       |  8 ++++++++
> > > >  2 files changed, 35 insertions(+)
> > > > 
> > > > diff --git a/block/gluster.c b/block/gluster.c
> > > > index 30a827e..b1cf71b 100644
> > > > --- a/block/gluster.c
> > > > +++ b/block/gluster.c
> > > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> > > >          goto out;
> > > >      }
> > > >  
> > > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > > +    /* Without this, if fsync fails for a recoverable reason (for instance,
> > > > +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > > > +     * almost certain data loss.  Not all gluster versions support the
> > > > +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > > +     * discover during runtime if it is supported (this api returns success for
> > > > +     * unknown key/value pairs) */
> > > 
> > > Honestly, this sucks. There is apparently no way to operate gluster so
> > > we can safely recover after a failed fsync. "We hope everything is fine,
> > > but depending on your gluster version, we may now corrupt your image"
> > > isn't very good.
> > > 
> > > We need to consider very carefully if this is good enough to go on after
> > > an error. I'm currently leaning towards "no". That is, we should only
> > > enable this after Gluster provides us a way to make sure that the option
> > > is really set.
> 
> Unfortunately it is also possible to disable the write-behind xlator as
> well. This would cause setting the option to fail too :-/ At the moment
> there is no real useful way to detect if write-behind is disabled (it is
> enabled by default).
> 
> > > > +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > > +                                           "resync-failed-syncs-after-fsync",
> > > > +                                           "on");
> > > > +    if (ret < 0) {
> > > > +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > > +        ret = -errno;
> > > > +        goto out;
> > > > +    }
> > > > +#endif
> > > 
> > > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > In this case (as well as theoretically in the case that the option
> > > didn't take effect - if only we could know about it), a failed
> > > glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > The guest will see a broken disk that fails all I/O requests, but that's
> > > better than corrupting data.
> > >
> > 
> > Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will
> > also not have the gluster patch that removes the file descriptor
> > invalidation upon error (unless that was a relatively new
> > bug/feature).  But if that is the case, every operation on the file
> > descriptor in those versions will return error.  But it is also rather
> > old versions that don't support glfs_set_xlator_option() I believe.
> 
> Indeed, glfs_set_xlator_option() was introduced with glusterfs-3.4.0. We
> are currently on glusterfs-3.7, with the oldest supported version of
> 3.5. In ~2 months we hopefully have a 3.8 release and that will cause

Is it possible for 3.8 to validate key/value pairs in
glfs_set_xlator_option()?  Or is it something that by design is decoupled
enough that it isn't feasible?

This is the second instance I know of where a lack of error information from
the gluster api has made it difficult to determine if a feature is
supported (the other instance being SEEK_DATA/SEEK_HOLE).  It'd be nice if
error returns were more consistent in the 3.8 API, if possible.

> the end-of-life of 3.5. 3.4 has been EOL for about a year now, hopefully
> all our users have upgraded, but we know that some users will stay on
> unsupported versions for a long time...
> 
> However, the "resync-failed-syncs-after-fsync" option was only
> introduced recently, with glusterfs-3.7.9. You could detect this with
> pkg-config glusterfs-api >= 7.3.7.9 if need to be.
> 

Unfortunately, compile-time detection of the specific key isn't a great
option, since we are using gluster as a shared library. Backports for
various distributions may or may not include the
"resync-failed-syncs-after-fsync" key regardless of the version number, and
it isn't a new symbol that is detectable by the linker or at runtime.
Additionally, QEMU on a system with gluster version 3.7.8, for instance,
would have no way of knowing that "resync-failed-syncs-after-fsync" does
nothing.

The reason for the compile-time check for glfs_set_xlator_option() is
that it is a symbol that will be looked up when QEMU is linked (and
executed).  In contrast, if the user downgrades gluster on their system to
a version that doesn't have glfs_set_xlator_option(), QEMU won't run, so
the breakage will be visible.
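
As a rough sketch of the kind of compile-and-link probe such a configure
check relies on (this is only an illustration of the approach, not the
literal configure hunk from this patch; the program only needs to link and
is never run):

    /* Configure-style link test for glfs_set_xlator_option(). */
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        /* NULL arguments are fine: the test is never executed, it only
         * has to compile and resolve the symbol at link time. */
        glfs_set_xlator_option(NULL, NULL, NULL, NULL);
        return 0;
    }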

> More details about the problem the option addresses are in the commit
> message on http://review.gluster.org/13057 .
>

Thanks, that is useful!

Jeff
Pranith Kumar Karampuri April 7, 2016, 7:48 a.m. UTC | #10
+Raghavendra G, who implemented this option in write-behind, to this
upstream patch review discussion.

Pranith
On 04/06/2016 06:50 PM, Kevin Wolf wrote:
> Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben:
>> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
>>> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
>>>> Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
>>>>> We had a thread discussing this not on the upstream list.
>>>>>
>>>>> My summary of the thread is that I don't understand why gluster
>>>>> should drop cached data after a failed fsync() for any open file.
>>>> It certainly shouldn't, but it does by default. :-)
>>>>
>>>> Have a look at commit 3fcead2d in glusterfs.git, which at least
>>>> introduces an option to get usable behaviour:
>>>>
>>>>      { .key = {"resync-failed-syncs-after-fsync"},
>>>>        .type = GF_OPTION_TYPE_BOOL,
>>>>        .default_value = "off",
>>>>        .description = "If sync of \"cached-writes issued before fsync\" "
>>>>                       "(to backend) fails, this option configures whether "
>>>>                       "to retry syncing them after fsync or forget them. "
>>>>                       "If set to on, cached-writes are retried "
>>>>                       "till a \"flush\" fop (or a successful sync) on sync "
>>>>                       "failures. "
>>>>                       "fsync itself is failed irrespective of the value of "
>>>>                       "this option. ",
>>>>      },
>>>>
>>>> As you can see, the default is still to drop cached data, and this is
>>>> with the file still opened. qemu needs to make sure that this option is
>>>> set, and if Jeff's comment in the code below is right, there is no way
>>>> currently to make sure that the option isn't silently ignored.
>>>>
>>>> Can we get some function that sets an option and fails if the option is
>>>> unknown? Or one that queries the state after setting an option, so we
>>>> can check whether we succeeded in switching to the mode we need?
>>>>
>>>>> For closed files, I think it might still happen but this is the same
>>>>> as any file system (and unlikely to be the case for qemu?).
>>>> Our problem is only with open images. Dropping caches for files that
>>>> qemu doesn't use any more is fine as far as I'm concerned.
>>>>
>>>> Note that our usage can involve cases where we reopen a file with
>>>> different flags, i.e. first open a second file descriptor, then close
>>>> the first one. The image was never completely closed here and we would
>>>> still want the cache to preserve our data in such cases.
>>> Hm, actually, maybe we should just call bdrv_flush() before reopening an
>>> image, and if an error is returned, we abort the reopen. It's far from
>>> being a hot path, so the overhead of a flush shouldn't matter, and it
>>> seems we're taking an unnecessary risk without doing this.
>>>
>> [I seemed to have been dropped from the cc]
>>
>> Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
>> reop_s->glfs)?  Because otherwise, we already do this in
>> bdrv_reopen_prepare() on the original fd.  It happens right before the call
>> to drv->bdrv_reopen_prepare():
>>
>>
>> 2020     ret = bdrv_flush(reopen_state->bs);
>> 2021     if (ret) {
>> 2022         error_setg_errno(errp, -ret, "Error flushing drive");
>> 2023         goto error;
>> 2024     }
>> 2025
>> 2026     if (drv->bdrv_reopen_prepare) {
>> 2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
> Ah, thanks. Yes, this is what I meant. I expected it somewhere close to
> the bdrv_drain_all() call, so I missed the call you quoted. So that's
> good news, at least this part of the problem doesn't exist then. :-)
>
> Kevin
>
>>>>> I will note that Linux in general had (still has I think?) the
>>>>> behavior that once the process closes a file (or exits), we lose
>>>>> context to return an error to. From that point on, any failed IO
>>>>> from the page cache to the target disk will be dropped from cache.
>>>>> To hold things in the cache would lead it to fill with old data that
>>>>> is not really recoverable and we have no good way to know that the
>>>>> situation is repairable and how long that might take. Upstream
>>>>> kernel people have debated this, the behavior might be tweaked for
>>>>> certain types of errors.
>>>> That's fine, we just don't want the next fsync() to signal success when
>>>> in reality the cache has thrown away our data. As soon as we close the
>>>> image, there is no next fsync(), so you can do whatever you like.
>>>>
>>>> Kevin
>>>>
>>>>> On 04/06/2016 07:02 AM, Kevin Wolf wrote:
>>>>>> [ Adding some CCs ]
>>>>>>
>>>>>> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
>>>>>>> Upon receiving an I/O error after an fsync, by default gluster will
>>>>>>> dump its cache.  However, QEMU will retry the fsync, which is especially
>>>>>>> useful when encountering errors such as ENOSPC when using the werror=stop
>>>>>>> option.  When using caching with gluster, however, the last written data
>>>>>>> will be lost upon encountering ENOSPC.  Using the cache xlator option of
>>>>>>> 'resync-failed-syncs-after-fsync' should cause gluster to retain the
>>>>>>> cached data after a failed fsync, so that ENOSPC and other transient
>>>>>>> errors are recoverable.
>>>>>>>
>>>>>>> Signed-off-by: Jeff Cody <jcody@redhat.com>
>>>>>>> ---
>>>>>>>   block/gluster.c | 27 +++++++++++++++++++++++++++
>>>>>>>   configure       |  8 ++++++++
>>>>>>>   2 files changed, 35 insertions(+)
>>>>>>>
>>>>>>> diff --git a/block/gluster.c b/block/gluster.c
>>>>>>> index 30a827e..b1cf71b 100644
>>>>>>> --- a/block/gluster.c
>>>>>>> +++ b/block/gluster.c
>>>>>>> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
>>>>>>>           goto out;
>>>>>>>       }
>>>>>>> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
>>>>>>> +    /* Without this, if fsync fails for a recoverable reason (for instance,
>>>>>>> +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
>>>>>>> +     * almost certain data loss.  Not all gluster versions support the
>>>>>>> +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
>>>>>>> +     * discover during runtime if it is supported (this api returns success for
>>>>>>> +     * unknown key/value pairs) */
>>>>>> Honestly, this sucks. There is apparently no way to operate gluster so
>>>>>> we can safely recover after a failed fsync. "We hope everything is fine,
>>>>>> but depending on your gluster version, we may now corrupt your image"
>>>>>> isn't very good.
>>>>>>
>>>>>> We need to consider very carefully if this is good enough to go on after
>>>>>> an error. I'm currently leaning towards "no". That is, we should only
>>>>>> enable this after Gluster provides us a way to make sure that the option
>>>>>> is really set.
>>>>>>
>>>>>>> +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
>>>>>>> +                                           "resync-failed-syncs-after-fsync",
>>>>>>> +                                           "on");
>>>>>>> +    if (ret < 0) {
>>>>>>> +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
>>>>>>> +        ret = -errno;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +#endif
>>>>>> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
>>>>>> In this case (as well as theoretically in the case that the option
>>>>>> didn't take effect - if only we could know about it), a failed
>>>>>> glfs_fsync_async() is fatal and we need to stop operating on the image,
>>>>>> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
>>>>>> The guest will see a broken disk that fails all I/O requests, but that's
>>>>>> better than corrupting data.
>>>>>>
>>>>>> Kevin
Jeff Cody April 7, 2016, 4:17 p.m. UTC | #11
On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote:
> [ Adding some CCs ]
> 
> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > Upon receiving an I/O error after an fsync, by default gluster will
> > dump its cache.  However, QEMU will retry the fsync, which is especially
> > useful when encountering errors such as ENOSPC when using the werror=stop
> > option.  When using caching with gluster, however, the last written data
> > will be lost upon encountering ENOSPC.  Using the cache xlator option of
> > 'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > cached data after a failed fsync, so that ENOSPC and other transient
> > errors are recoverable.
> > 
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > ---
> >  block/gluster.c | 27 +++++++++++++++++++++++++++
> >  configure       |  8 ++++++++
> >  2 files changed, 35 insertions(+)
> > 
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 30a827e..b1cf71b 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
> >          goto out;
> >      }
> >  
> > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > +    /* Without this, if fsync fails for a recoverable reason (for instance,
> > +     * ENOSPC), gluster will dump its cache, preventing retries.  This means
> > +     * almost certain data loss.  Not all gluster versions support the
> > +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > +     * discover during runtime if it is supported (this api returns success for
> > +     * unknown key/value pairs) */
> 
> Honestly, this sucks. There is apparently no way to operate gluster so
> we can safely recover after a failed fsync. "We hope everything is fine,
> but depending on your gluster version, we may now corrupt your image"
> isn't very good.
> 
> We need to consider very carefully if this is good enough to go on after
> an error. I'm currently leaning towards "no". That is, we should only
> enable this after Gluster provides us a way to make sure that the option
> is really set.
> 

Thinking from the viewpoint of what is achievable for 2.6:

* This patch is an improvement over the current state.  It will at least
  provide protection on more recent versions of Gluster.

  The question is: are there Gluster versions that will actually result in
  inconsistent data if this resync-after-fsync option is not supported?

  There was another change in Gluster around v3.7.5 (can someone from
  Gluster confirm?) that removed Gluster's file descriptor invalidation
  upon any write I/O error.  If that version is correct, that means Gluster
  versions 3.7.5 - 3.7.8 are problematic on fsync failure with write-behind
  caching.

Given that, here are some alternatives that are feasible for 2.6:

A) We could decide that the lack of discoverability of this option means we
   treat all Gluster versions as not having the option.  On an fsync
   failure, we always switch over to read-only (a rough sketch of one way to
   do that in the gluster driver follows after these alternatives).  There
   will be data loss (perhaps even on some Gluster versions where it could
   otherwise have been avoided), but we will have consistency.

B) Another option--sub-optimal from a performance perspective--is to just
   always open Gluster in O_DIRECT with the 'strict-o-direct' performance
   key value set to true.
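
For illustration only, here is a rough sketch of what alternative (A) could
look like in block/gluster.c, following the bs->drv = NULL approach quoted
above (which fails all further I/O rather than only writes).  The function
name follows the driver's existing flush callback, but the synchronous
glfs_fsync() call merely stands in for the driver's real async fsync path,
and the error handling shown is an assumption, not the actual patch:

/* Sketch: after a failed flush, stop using the image entirely rather than
 * risk silent corruption.  The guest then sees a disk that fails all
 * further requests. */
static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
{
    BDRVGlusterState *s = bs->opaque;
    int ret;

    ret = glfs_fsync(s->fd);    /* stand-in for the async fsync path */
    if (ret < 0) {
        ret = -errno;
        /* We cannot know whether write-behind kept the failed writes, so
         * treat the image as unusable from now on. */
        bs->drv = NULL;
        return ret;
    }

    return 0;
}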


Beyond just 2.6:

There is a larger question for QEMU, per Ric's point on other filesystem
behavior on fsync failure (different thread), and given that the POSIX spec
leaves cache validity after an fsync I/O failure unspecified [1]: is QEMU
overall currently doing the correct thing on a bdrv_flush() failure?

That is, how certain are we that the other underlying protocol drivers
(including raw-posix) can safely retry a fsync?

Given the wording of the POSIX spec, should we move all images to read-only
upon a bdrv_flush() failure (unless the protocol driver explicitly tells us
otherwise)?  And, if _POSIX_SYNCHRONIZED_IO is not set, are we even safe
simply setting an image to r/o?

[1] From another thread, courtesy of Eric Blake:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html

Jeff

> > +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > +                                           "resync-failed-syncs-after-fsync",
> > +                                           "on");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > +        ret = -errno;
> > +        goto out;
> > +    }
> > +#endif
> 
> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> In this case (as well as theoretically in the case that the option
> didn't take effect - if only we could know about it), a failed
> glfs_fsync_async() is fatal and we need to stop operating on the image,
> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> The guest will see a broken disk that fails all I/O requests, but that's
> better than corrupting data.
> 
> Kevin
Raghavendra Gowdappa April 11, 2016, 4:26 a.m. UTC | #12
> +Raghavendra G who implemented this option in write-behind, to this
> upstream patch review discussion

Thanks Pranith. Kritika did inform us about the discussion. We were working on ways to find out solutions to the problems raised (and it was a long festive weekend in Bangalore).

Sorry for top-posting. I am trying to address two issues raised in this mail thread:
1. No way to identify whether setting the option succeeded in gfapi.
2. Why is retaining the cache after an fsync failure not the default behavior?

1. No way to identify whether setting the option succeeded in gfapi:

I've added Poornima and Raghavendra Talur who work on gfapi to assist on this.

2. Why is retaining the cache after an fsync failure not the default behavior?

It was mostly to not break two synchronized applications which rely on fsync failures to retry. Details of the discussion can be found below. The other reason was that we could not figure out what POSIX's take is on the state of an earlier write after an fsync failure (other filesystems, xfs and ext4, had non-uniform behavior). The question more specifically was "is it correct for a write issued before a failed fsync to succeed on the backend storage (i.e., persisting happens after the fsync failure)?". I've also added Vijay Bellur, who was involved in the earlier discussion, to the cc list.

Following is the discussion we had earlier with Kevin, Jeff Cody and others on the same topic. I am quoting it verbatim below.

<quote>

> > > > > I would actually argue that this setting would be right for everyone,
> > > > > not just qemu. Can you think of a case where keeping the data cached
> > > > > after a failed fsync would actively hurt any application? I can only
> > > > > think of cases where data is unnecessarily lost if data is dropped.
> > > > > 
> > > > 
> > > > I worry about use cases with concurrent writers to the same file and
> > > > how different applications would handle fsync() failures with our new
> > > > behavior.
> > > 
> > > Any specific scenario you're worried about?
> > > 
> > > > Keeping the known old behavior as the default will ensure that we do
> > > > not break anything once this is out. qemu/virt users with gluster are
> > > > accustomed to setting the virt group and hence no additional knobs
> > > > would need to be altered by them.
> > > 
> > > Not changing anything is a safe way to avoid regressions. But it's also
> > > a safe way to leave bugs unfixed. I would say that the current behaviour
> > > is at least borderline buggy and very hard for applications to handle
> > > correctly.
> > 
> > I tend to agree with Kevin on this. If we've an error handling strategy
> > that
> > is posix-compliant, I don't think there is need to add one more option.
> > Most
> > of the times options tend to be used in default values, which is equivalent
> > to not providing an option at all. However, before doing that its better we
> > think through whether it can affect any existing deployments adversely
> > (even
> > when they are not posix-compliant).
> > 
> 
> One pattern that I can think of -
> 
> Applications that operate on the same file from different clients through
> some locking or other co-ordination would have this pattern:
> 
> lock (file), write (file), fsync (file), unlock (file);
>  
> Now if the first fsync() fails, based on application logic the offset used
> for the failed write + fsync could be re-utilized by a co-ordinating
> application on another node to write out legitimate data. When control
> returns back to the application that received a failure, the subsequent
> write + fsync can cause data to be overwritten at the old offset along with
> new data being written at the new offset.

Yeah. I agree. Co-ordinated applications on different mounts will have issues, if they are working on the assumption that after fsync no older writes will hit the backend. Given that there seems to be a fair bit of confusion on status of retry of older writes after an fsync failure, we can expect some regressions. So, to summarize,

1. Keep the behaviour in patchset 11 of [1] as default. Both fsync and flush act as barriers and will make sure either
   a. older writes are synced to backend
   b. old writes are failed and never retried.

   after a failed fsync/flush.

2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a failed fsync and retry them till a flush. After a flush, no retries of failed syncs will be attempted. This behaviour will be introduced as an option.

3. Transient and non-transient errors will be treated similarly and failed syncs will be retried alike.

Does everyone agree on the above points? If yes, I'll modify [1] accordingly.

[1] http://review.gluster.org/#/c/12594/

</quote>
Raghavendra Talur May 9, 2016, 6:38 a.m. UTC | #13
On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <rgowdapp@redhat.com>
wrote:

>
> > +Raghavendra G who implemented this option in write-behind, to this
> > upstream patch review discussion
>
> Thanks Pranith. Kritika did inform us about the discussion. We were
> working on ways to find out solutions to the problems raised (and it was a
> long festive weekend in Bangalore).
>
> Sorry for top-posting. I am trying to address two issues raised in this
> mail thread:
> 1. No ways to identify whether setting of option succeeded in gfapi.
> 2. Why retaining cache after fsync failure is not default behavior?
>
> 1. No ways to identify whether setting of option succeeded in gfapi:
>
> I've added Poornima and Raghavendra Talur who work on gfapi to assist on
> this.
>

 There is currently no such feature in gfapi. We could think of two
possible solutions:

 a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm
which is known to have this write-behind option. In that case, if you use
glfs_set_xlator_option before glfs_init with the right key and value, there
is no way the volume set gets rejected. Note that this is an
installation-time dependency; we don't have libgfapi library versioning.
This solution is good enough if the logic in qemu is:

ret = glfs_set_xlator_option(fs, "*-write-behind",
                             "resync-failed-syncs-after-fsync", "on");
if (ret < 0) {
    /* exit because we don't have the required environment */
    return -errno;
}
/* proceed with work */


b. The second solution is to implement a case in the write-behind getxattr
FOP to handle such a request and reply back with the current setting. Look
at dht_get_real_filename for a similar feature. You would have to implement
logic similar to the following:

ret = glfs_getxattr(fs, "/",
                    "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                    &value, size);

if ((ret == -1 && errno == ENODATA) || value == 0) {
    // write-behind in the stack does not understand this option
    //   OR
    // we have write-behind with the required option set to off
    // -> work with the assumption that there are no retries
} else {
    // we do have the required option
}

One negative aspect of both solutions above is the loosely coupled nature
of the logic. If the behavior ever changes in a lower layer (which is gfapi
or write-behind in this case), the upper layer (qemu) will have to
a. have a dependency of the sort which defines both the minimum and maximum
version of the package required
b. interpret the return values with more if-else conditions to maintain
backward compatibility.

 We are thinking of other ways too, but the above are the current solutions
that come to mind.

Thanks,
Raghavendra Talur



> 2. Why retaining cache after fsync failure is not default behavior?
>
> It was mostly to not to break two synchronized applications, which rely on
> fsync failures to retry. Details of discussion can be found below. The
> other reason was we could not figure out what POSIX's take on the state of
> earlier write after fsync failure (Other filesystems xfs, ext4 had
> non-uniform behavior). The question more specifically was "is it correct
> for a write issued before a failed fsync to succeed on the backend storage
> (persisting happened after fsync failure)?". I've also added Vijay Bellur
> who was involved in the earlier discussion to cc list.
>
> Following is the discussion we had earlier with Kevin, Jeff Cody and
> others on the same topic. I am quoting it verbatim below.
>
> <quote>
>
> > > > > > I would actually argue that this setting would be right for
> everyone,
> > > > > > not just qemu. Can you think of a case where keeping the data
> cached
> > > > > > after a failed fsync would actively hurt any application? I can
> only
> > > > > > think of cases where data is unnecessarily lost if data is
> dropped.
> > > > > >
> > > > >
> > > > > I worry about use cases with concurrent writers to the same file
> and
> > > > > how different applications would handle fsync() failures with our
> new
> > > > > behavior.
> > > >
> > > > Any specific scenario you're worried about?
> > > >
> > > > > Keeping the known old behavior as the default will ensure that we
> do
> > > > > not break anything once this is out. qemu/virt users with gluster
> are
> > > > > accustomed to setting the virt group and hence no additional knobs
> > > > > would need to be altered by them.
> > > >
> > > > Not changing anything is a safe way to avoid regressions. But it's
> also
> > > > a safe way to leave bugs unfixed. I would say that the current
> behaviour
> > > > is at least borderline buggy and very hard for applications to handle
> > > > correctly.
> > >
> > > I tend to agree with Kevin on this. If we've an error handling strategy
> > > that
> > > is posix-compliant, I don't think there is need to add one more option.
> > > Most
> > > of the times options tend to be used in default values, which is
> equivalent
> > > to not providing an option at all. However, before doing that its
> better we
> > > think through whether it can affect any existing deployments adversely
> > > (even
> > > when they are not posix-compliant).
> > >
> >
> > One pattern that I can think of -
> >
> > Applications that operate on the same file from different clients through
> > some locking or other co-ordination would have this pattern:
> >
> > lock (file), write (file), fsync (file), unlock (file);
> >
> > Now if the first fsync() fails, based on application logic the offset
> used
> > for the failed write + fsync could be re-utilized by a co-ordinating
> > application on another node to write out legitimate data. When control
> > returns back to the application that received a failure, the subsequent
> > write + fsync can cause data to be overwritten at the old offset along
> with
> > new data being written at the new offset.
>
> Yeah. I agree. Co-ordinated applications on different mounts will have
> issues, if they are working on the assumption that after fsync no older
> writes will hit the backend. Given that there seems to be a fair bit of
> confusion on status of retry of older writes after an fsync failure, we can
> expect some regressions. So, to summarize,
>
> 1. Keep the behaviour in patchset 11 of [1] as default. Both fsync and
> flush act as barriers and will make sure either
>    a. older writes are synced to backend
>    b. old writes are failed and never retried.
>
>    after a failed fsync/flush.
>
> 2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a
> failed fsync and retry them till a flush. After a flush, no retries of
> failed syncs will be attempted. This behaviour will be introduced as an
> option.
>
> 3. Transient and non-transient errors will be treated similarly and failed
> syncs will be retried alike.
>
> Does everyone agree on the above points? If yes, I'll modify [1]
> accordingly.
>
> [1] http://review.gluster.org/#/c/12594/
>
> </quote>
>
>
Kevin Wolf May 9, 2016, 8:02 a.m. UTC | #14
Am 09.05.2016 um 08:38 hat Raghavendra Talur geschrieben:
> 
> 
> On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <rgowdapp@redhat.com>
> wrote:
> 
> 
>     > +Raghavendra G who implemented this option in write-behind, to this
>     > upstream patch review discussion
> 
>     Thanks Pranith. Kritika did inform us about the discussion. We were working
>     on ways to find out solutions to the problems raised (and it was a long
>     festive weekend in Bangalore).
> 
>     Sorry for top-posting. I am trying to address two issues raised in this
>     mail thread:
>     1. No ways to identify whether setting of option succeeded in gfapi.
>     2. Why retaining cache after fsync failure is not default behavior?
> 
>     1. No ways to identify whether setting of option succeeded in gfapi:
> 
>     I've added Poornima and Raghavendra Talur who work on gfapi to assist on
>     this.
> 
> 
>  There is currently no such feature in gfapi. We could think of two possible
> solutions:
> 
>  a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm which
> surely has this write behind option. In that case, if you use
> glfs_set_xlator_option before glfs_init with right key and value, there is no
> way the volume set gets rejected. Note that this is a installation time
> dependency, we don't have libgfapi library versioning. This solution is good
> enough, if the logic in qemu is 
> 
> ret = glfs_set_xlator_option;
> if (ret) {
>     exit because we don't have required environment.
> } 
> proceed with work;

This is not good enough. Basically the requirement is: Given a qemu
binary that is put on a machine with a random gluster version installed,
this qemu binary never corrupts a gluster image. Either it errors out or
it works correctly.

Just assuming that the right gluster version is installed is much too
weak, and even version number checks aren't very good when you consider
things like backports.

> b. Second solution is to implement a case in write_behind getxattr FOP to
> handle such request and reply back with the current setting. Look at
> dht_get_real_filename for similar feature. You will have to implement logic
> similar to something like below
> 
> ret = glfs_getxattr ( fs, "/", glusterfs.write-behind-
> resync-failed-syncs-after-fsync-status, &value, size);
> 
> if ( (ret = -1 && errno == ENODATA) || value == 0 ) {
>        // write behind in the stack does not understand this option
>        //  OR
>       //  we have write-behind with required option set to off
>     <work with the assumptions that there are not retries>
> } else {
>     // We do have the required option
> }

Yes, please this one.

Or a new glfs_setxattr_fail() that returns an error if the option didn't
take effect. But I guess a glfs_getxattr() function makes sense anyway.
In qemu, we try to allow querying everything that you can set.
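
If such a query existed, the qemu side could refuse to proceed when the
option is not in effect.  A minimal sketch of what that could look like in
qemu_gluster_open(), under the assumption that the hypothetical
"glusterfs.write-behind-resync-failed-syncs-after-fsync-status" xattr
proposed above gets implemented (glfs_getxattr() itself is an existing
gfapi call):

    char status[8] = "";
    ssize_t len;

    /* Sketch only: verify that write-behind really honoured the option set
     * earlier via glfs_set_xlator_option().  The xattr key is the one
     * proposed in this thread, not an existing gluster interface. */
    len = glfs_getxattr(s->glfs, "/",
                        "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                        status, sizeof(status) - 1);
    if (len < 0 || strncmp(status, "on", 2) != 0) {
        error_setg(errp, "Cannot verify that failed syncs are retained "
                   "after fsync; refusing to open the image");
        ret = -ENOTSUP;
        goto out;
    }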

> One negative aspect of both the solutions above is the loosely coupled nature
> of logic. If the behavior ever changes in lower layer(which is gfapi or
> write-behind in this case) the upper layer(qemu) will have to
> a. have a dependency of the sort which defines both the minimum version and
> maximum version of package required
> b. interpret the return values with more if-else conditions to maintain
> backward compatibility.

It's gluster's job to keep the compatibility with existing users of the
APIs. Just like qemu has to make sure that old QMP clients keep working
when we make changes or extensions to the protocol.

In other words: If the behaviour of a lower layer changes while the
upper layer only uses old APIs, that's clearly a bug in the lower layer.

Kevin
diff mbox

Patch

diff --git a/block/gluster.c b/block/gluster.c
index 30a827e..b1cf71b 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -330,6 +330,23 @@  static int qemu_gluster_open(BlockDriverState *bs,  QDict *options,
         goto out;
     }
 
+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
+    /* Without this, if fsync fails for a recoverable reason (for instance,
+     * ENOSPC), gluster will dump its cache, preventing retries.  This means
+     * almost certain data loss.  Not all gluster versions support the
+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
+     * discover during runtime if it is supported (this api returns success for
+     * unknown key/value pairs) */
+    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
+                                           "resync-failed-syncs-after-fsync",
+                                           "on");
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
+        ret = -errno;
+        goto out;
+    }
+#endif
+
     qemu_gluster_parse_flags(bdrv_flags, &open_flags);
 
     s->fd = glfs_open(s->glfs, gconf->image, open_flags);
@@ -386,6 +403,16 @@  static int qemu_gluster_reopen_prepare(BDRVReopenState *state,
         goto exit;
     }
 
+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
+    ret = glfs_set_xlator_option(reop_s->glfs, "*-write-behind",
+                                 "resync-failed-syncs-after-fsync", "on");
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
+        ret = -errno;
+        goto exit;
+    }
+#endif
+
     reop_s->fd = glfs_open(reop_s->glfs, gconf->image, open_flags);
     if (reop_s->fd == NULL) {
         /* reops->glfs will be cleaned up in _abort */
diff --git a/configure b/configure
index 5db29f0..3e921fe 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@  coroutine=""
 coroutine_pool=""
 seccomp=""
 glusterfs=""
+glusterfs_xlator_opt="no"
 glusterfs_discard="no"
 glusterfs_zerofill="no"
 archipelago="no"
@@ -3397,6 +3398,9 @@  if test "$glusterfs" != "no" ; then
     glusterfs="yes"
     glusterfs_cflags=`$pkg_config --cflags glusterfs-api`
     glusterfs_libs=`$pkg_config --libs glusterfs-api`
+    if $pkg_config --atleast-version=4 glusterfs-api; then
+      glusterfs_xlator_opt="yes"
+    fi
     if $pkg_config --atleast-version=5 glusterfs-api; then
       glusterfs_discard="yes"
     fi
@@ -5339,6 +5343,10 @@  if test "$glusterfs" = "yes" ; then
   echo "GLUSTERFS_LIBS=$glusterfs_libs" >> $config_host_mak
 fi
 
+if test "$glusterfs_xlator_opt" = "yes" ; then
+  echo "CONFIG_GLUSTERFS_XLATOR_OPT=y" >> $config_host_mak
+fi
+
 if test "$glusterfs_discard" = "yes" ; then
   echo "CONFIG_GLUSTERFS_DISCARD=y" >> $config_host_mak
 fi