
[13/14] raw-posix: Implement image locking

Message ID 1477928314-11184-14-git-send-email-famz@redhat.com
State New

Commit Message

Fam Zheng Oct. 31, 2016, 3:38 p.m. UTC
This implements open flag sensible image locking for local file
and host device protocol.

virtlockd in libvirt locks the first byte, so we start looking at the
file bytes from 1.

Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
four locking modes by combining two bits (BDRV_O_RDWR and
BDRV_O_SHARE_RW), and implemented by taking two locks:

Lock bytes:

* byte 1: I can't allow other processes to write to the image
* byte 2: I am writing to the image

Lock modes:

* shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
  byte 2. Test whether byte 1 is locked using an exclusive lock, and
  fail if so.

* exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
  whether byte 1 is locked using an exclusive lock, and fail if so. Then
  take shared lock on byte 1. I suppose this is racy, but we can
  probably tolerate that.

* reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything

* reader that can't tolerate writers (neither bit is set): Take shared
  lock on byte 1. Test whether byte 2 is locked, and fail if so.

The complication is in the transactional reopen.  To make the reopen
logic managable, and allow better reuse, the code is internally
organized with a table from old mode to the new one.
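
For illustration only, here is a rough standalone sketch of the four modes on
top of Linux OFD locks (helper names made up; the patch itself goes through the
qemu_lock_fd()/qemu_unlock_fd() helpers, reports proper errors and cleans up on
failure):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>

    static int lock_byte(int fd, off_t byte, short type) /* F_RDLCK/F_WRLCK/F_UNLCK */
    {
        struct flock fl = {
            .l_whence = SEEK_SET, .l_start = byte, .l_len = 1, .l_type = type,
        };
        return fcntl(fd, F_OFD_SETLK, &fl);   /* OFD locks: l_pid stays 0 */
    }

    /* Returns 0 on success, non-zero if the image is already in use in an
     * incompatible way (cleanup of partially taken locks omitted). */
    static int lock_image(int fd, bool rdwr, bool share_rw)
    {
        if (rdwr && share_rw) {          /* shared writer */
            return lock_byte(fd, 2, F_RDLCK) ||
                   lock_byte(fd, 1, F_WRLCK) ||   /* probe byte 1...        */
                   lock_byte(fd, 1, F_UNLCK);     /* ...and drop the probe  */
        } else if (rdwr) {               /* exclusive writer */
            return lock_byte(fd, 2, F_RDLCK) ||
                   lock_byte(fd, 1, F_WRLCK) ||   /* probe byte 1...        */
                   lock_byte(fd, 1, F_RDLCK);     /* ...then keep it shared */
        } else if (share_rw) {           /* reader that tolerates writers */
            return 0;                             /* nothing to lock        */
        } else {                         /* reader that can't tolerate writers */
            return lock_byte(fd, 1, F_RDLCK) ||
                   lock_byte(fd, 2, F_WRLCK) ||   /* probe byte 2 (needs an */
                   lock_byte(fd, 2, F_UNLCK);     /* fd open for writing)   */
        }
    }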

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 660 insertions(+), 50 deletions(-)

Comments

Eric Blake Oct. 31, 2016, 10:01 p.m. UTC | #1
On 10/31/2016 10:38 AM, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.

What happens if we try to use a raw file with less than 3 bytes? There's
not much to be locked in that case (especially if we round down to
sector sizes - the file is effectively empty) - but it's probably a
corner case you have to be aware of.

> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.
> 
> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.
> 
> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally

s/managable/manageable/

> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 

> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */

s/forbit/forbid/


>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

s/interfereing/interfering/


> +/**
> + * Transactionally moving between possible locking states is tricky and must be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertable. As a result, in such

s/revertable/revertible/

> + * cases we have to defer the downgraing to "commit", given that no revert will

s/downgraing/downgrading/

> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfulling above conditions easier.

s/fulfulling/fulfilling/


> @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>  
>      raw_parse_flags(state->flags, &rs->open_flags);
>  
> -    rs->fd = -1;
> -
> -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> -#ifdef O_NOATIME
> -    fcntl_flags |= O_NOATIME;
> -#endif
> -
> -#ifdef O_ASYNC
> -    /* Not all operating systems have O_ASYNC, and those that don't
> -     * will not let us track the state into rs->open_flags (typically
> -     * you achieve the same effect with an ioctl, for example I_SETSIG
> -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> -     */
> -    assert((s->open_flags & O_ASYNC) == 0);
> -#endif

It looks like you are doing some code motion (refactoring into a helper
function) mixed in with everything else; it might be worth splitting
that into a separate commit for ease of review.
Richard W.M. Jones Oct. 31, 2016, 10:39 p.m. UTC | #2
On Mon, Oct 31, 2016 at 05:01:44PM -0500, Eric Blake wrote:
> On 10/31/2016 10:38 AM, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> 
> What happens if we try to use a raw file with less than 3 bytes? There's
> not much to be locked in that case (especially if we round down to
> sector sizes - the file is effectively empty) - but it's probably a
> corner case you have to be aware of.

It would be nice if qemu could handle zero-length or even < 512 byte
drives, but we (libguestfs) gave up that battle a long time ago after
encountering several bugs in this area.  These days when you try to
add an empty file[1] through the libguestfs API, the implementation
replaces it with a 4096 byte raw-format file of zeroes.

More here:
https://github.com/libguestfs/libguestfs/blob/master/src/drives.c#L395

Rich.

[1] It has the name "/dev/null" for historical reasons, but is not
that actual device for the reasons given above.
Fam Zheng Nov. 1, 2016, 2:06 a.m. UTC | #3
On Mon, 10/31 17:01, Eric Blake wrote:
> On 10/31/2016 10:38 AM, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> 
> What happens if we try to use a raw file with less than 3 bytes? There's
> not much to be locked in that case (especially if we round down to
> sector sizes - the file is effectively empty) - but it's probably a
> corner case you have to be aware of.

Yes.  Not sure if it is complete (and always true), but patch 14 covers a 0 byte
test case and the lock seems to work the same as on a large file.

I looked at the kernel implementation of fcntl locks to see if it limits the
lock range to the file size, but I don't see any such limit.

I also manually tested with "touch /var/tmp/zerofile && qemu-io
/var/tmp/zerofile" and "lslocks"; it does report the locked bytes even though
the file is 0 bytes.

As another example, /dev/null is also lockable by this series, which is why I
had to add a patch to change it to null-co://.

So I think a case where a zero byte file cannot be locked, if any exists, is a
corner case.
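
For example, something along these lines (a minimal standalone sketch, not the
patch code) happily locks bytes 1-2 of a freshly created empty file, because
POSIX byte-range locks are not limited by the current file size:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/tmp/zerofile", O_RDWR | O_CREAT, 0644); /* 0 bytes */
        struct flock fl = {
            .l_whence = SEEK_SET,
            .l_start  = 1,            /* bytes 1..2, both past EOF */
            .l_len    = 2,
            .l_type   = F_WRLCK,
        };

        if (fd < 0 || fcntl(fd, F_OFD_SETLK, &fl) < 0) {
            perror("lock");
            return 1;
        }
        printf("locked bytes 1-2 of an empty file\n");
        pause();                      /* keep it held so lslocks can show it */
        return 0;
    }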

> 
> > 
> > Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> > four locking modes by combining two bits (BDRV_O_RDWR and
> > BDRV_O_SHARE_RW), and implemented by taking two locks:
> > 
> > Lock bytes:
> > 
> > * byte 1: I can't allow other processes to write to the image
> > * byte 2: I am writing to the image
> > 
> > Lock modes:
> > 
> > * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
> >   byte 2. Test whether byte 1 is locked using an exclusive lock, and
> >   fail if so.
> > 
> > * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
> >   whether byte 1 is locked using an exclusive lock, and fail if so. Then
> >   take shared lock on byte 1. I suppose this is racy, but we can
> >   probably tolerate that.
> > 
> > * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> > 
> > * reader that can't tolerate writers (neither bit is set): Take shared
> >   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> > 
> > The complication is in the transactional reopen.  To make the reopen
> > logic managable, and allow better reuse, the code is internally
> 
> s/managable/manageable/
> 
> > organized with a table from old mode to the new one.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 660 insertions(+), 50 deletions(-)
> > 
> 
> > +typedef enum {
> > +    /* Read only and accept other writers. */
> > +    RAW_L_READ_SHARE_RW,
> > +    /* Read only and try to forbid other writers. */
> > +    RAW_L_READ,
> > +    /* Read write and accept other writers. */
> > +    RAW_L_WRITE_SHARE_RW,
> > +    /* Read write and try to forbit other writers. */
> 
> s/forbit/forbid/
> 
> 
> >  
> > +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> > +{
> > +    int ret;
> > +    assert(fd >= 0);
> > +    /* Locking byte 1 avoids interfereing with virtlockd. */
> 
> s/interfereing/interfering/
> 
> 
> > +/**
> > + * Transactionally moving between possible locking states is tricky and must be
> > + * done carefully. That is mostly because downgrading an exclusive lock to
> > + * shared or unlocked is not guaranteed to be revertable. As a result, in such
> 
> s/revertable/revertible/
> 
> > + * cases we have to defer the downgraing to "commit", given that no revert will
> 
> s/downgraing/downgrading/
> 
> > + * happen after that point, and that downgrading a lock should never fail.
> > + *
> > + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> > + * exclusive lock) must happen in "prepare" because it may fail.
> > + *
> > + * Manage the operation matrix with this state transition table to make
> > + * fulfulling above conditions easier.
> 
> s/fulfulling/fulfilling/

Thanks, will fix these typos and misspells.

> 
> 
> > @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
> >  
> >      raw_parse_flags(state->flags, &rs->open_flags);
> >  
> > -    rs->fd = -1;
> > -
> > -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> > -#ifdef O_NOATIME
> > -    fcntl_flags |= O_NOATIME;
> > -#endif
> > -
> > -#ifdef O_ASYNC
> > -    /* Not all operating systems have O_ASYNC, and those that don't
> > -     * will not let us track the state into rs->open_flags (typically
> > -     * you achieve the same effect with an ioctl, for example I_SETSIG
> > -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> > -     */
> > -    assert((s->open_flags & O_ASYNC) == 0);
> > -#endif
> 
> It looks like you are doing some code motion (refactoring into a helper
> function) mixed in with everything else; it might be worth splitting
> that into a separate commit for ease of review.

Yes, good idea.

Fam
Max Reitz Dec. 2, 2016, 2:58 a.m. UTC | #4
On 31.10.2016 16:38, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.

Testing whether something is locked would be easier by using F_OFD_GETLK
instead of actually creating an exclusive lock and then releasing it.
The reason why I proposed exclusive locks for this in the first place
was because I had the order reversed (first take the exclusive lock for
testing, then take the permanent shared lock, then release the exclusive
lock).

(And so you don't overlook it: You should be using F_OFD_GETLK, see [2])
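
Something like this (a sketch against the raw fcntl interface, not the
qemu_lock_fd() helpers; helper name made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>

    /* "Would an exclusive lock on this byte conflict with anything?" --
     * without actually taking and releasing one. */
    static bool byte_is_locked(int fd, off_t byte)
    {
        struct flock fl = {
            .l_whence = SEEK_SET, .l_start = byte, .l_len = 1,
            .l_type   = F_WRLCK,
        };

        if (fcntl(fd, F_OFD_GETLK, &fl) < 0) {
            return true;    /* be conservative if the query itself fails */
        }
        /* On return, l_type is F_UNLCK if nothing conflicts; otherwise it
         * describes one of the conflicting locks. */
        return fl.l_type != F_UNLCK;
    }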

> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.

As I said in my response to Kevin, we can make this non-racy in two ways:

(1) Just take an exclusive lock on both bytes.
(2) First start as an exclusive reader. Then upgrade the lock to that of
an exclusive writer by taking an exclusive lock on byte 2 and then
downgrading it to a shared lock.

The latter can be optimized by taking exclusive locks on both bytes and
then downgrading both to shared locks. But at that point you could just
as well do (1).

> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7c62fc3..07ab117 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -131,8 +131,44 @@ do { \
>  
>  #define MAX_BLOCKSIZE	4096
>  
> +/* Posix file locking bytes. Libvirt takes byte 0, so start from byte 1. */
> +#define RAW_LOCK_BYTE_MIN 1
> +#define RAW_LOCK_BYTE_NO_OTHER_WRITER 1
> +#define RAW_LOCK_BYTE_WRITE     2
> +#define RAW_LOCK_BYTE_MAX 2
> +
> +/*
> + ** shared writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so.
> + *
> + ** exclusive writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so. Then take shared lock
> + *  on byte 1. I suppose this is racy, but we can probably tolerate that.
> + *
> + ** reader that can tolerate writers: Don't do anything
> + *
> + ** reader that can't tolerate writers: Take shared lock on byte 1. Test
> + *  whether byte 2 is locked, and fail if so.
> + */

I guess it would make sense to bring this comment into the same order as
the names are given in BDRVRawLockMode.

> +
> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */
> +    RAW_L_WRITE,
> +} BDRVRawLockMode;
> +
>  typedef struct BDRVRawState {
>      int fd;
> +    /* A dup of @fd to make manipulating lock easier, especially during reopen,
> +     * where this will accept BDRVRawReopenState.lock_fd. */
> +    int lock_fd;
> +    bool disable_lock;
> +    bool lock_on_invalidate;
>      int type;
>      int open_flags;
>      size_t buf_align;
> @@ -146,10 +182,13 @@ typedef struct BDRVRawState {
>      bool use_linux_aio:1;
>      bool has_fallocate;
>      bool needs_alignment;
> +    BDRVRawLockMode cur_lock_mode;
>  } BDRVRawState;
>  
>  typedef struct BDRVRawReopenState {
>      int fd;
> +    /* A dup of @fd used for acquiring lock. */
> +    int lock_fd;
>      int open_flags;
>  } BDRVRawReopenState;
>  
> @@ -368,6 +407,77 @@ static void raw_parse_flags(int bdrv_flags, int *open_flags)
>      }
>  }
>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

This comment should be superseded by the one above the definition of
RAW_LOCK_BYTE_*.

> +    switch (mode) {
> +    case RAW_L_READ_SHARE_RW:
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_MIN,
> +                             RAW_LOCK_BYTE_MAX - RAW_LOCK_BYTE_MIN + 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock fd");

Why not use -ret instead of errno?

> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_READ:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte exclusively");

I think we should drop errno in favor of a more useful hint like "(maybe
already opened R/W by a different process?)", but that seems a bit long
to include...

> +            goto fail;
> +        }
> +        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +        break;
> +    case RAW_L_WRITE_SHARE_RW:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock share byte");
> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_WRITE:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to downgrade share byte");
> +            goto fail;
> +        }
> +        break;
> +    default:
> +        abort();
> +    }
> +    return 0;
> +fail:
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +    return -errno;

qemu_unlock_fd() may overwrite errno. Not sure if that is intended.
Also, just returning ret would be fine.
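
I.e., a sketch of the fail path above:

    fail:
        /* ret holds what the failing qemu_lock_fd()/qemu_unlock_fd() call
         * returned; the unlock calls below may clobber errno, so don't
         * recompute -errno here */
        qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
        return ret;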

> +}
> +
>  static void raw_parse_filename(const char *filename, QDict *options,
>                                 Error **errp)
>  {
> @@ -393,10 +503,115 @@ static QemuOptsList raw_runtime_opts = {
>              .type = QEMU_OPT_STRING,
>              .help = "host AIO implementation (threads, native)",
>          },
> +        {
> +            .name = "disable-lock",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "don't lock the file",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static BDRVRawLockMode raw_get_lock_mode(int flags)
> +{
> +    switch (flags & (BDRV_O_RDWR | BDRV_O_SHARE_RW)) {
> +    case BDRV_O_RDWR:
> +        return RAW_L_WRITE;
> +    case BDRV_O_RDWR | BDRV_O_SHARE_RW:
> +        return RAW_L_WRITE_SHARE_RW;
> +    case BDRV_O_SHARE_RW:
> +        return RAW_L_READ_SHARE_RW;
> +    case 0:
> +        return RAW_L_READ;
> +    default:
> +        abort();
> +    }
> +}
> +
> +static int raw_dup_flags(int fd, int old_flags, int new_flags,
> +                         const char *filename, Error **errp)
> +{
> +    int ret = -1;
> +    int fcntl_flags = O_APPEND | O_NONBLOCK;
> +#ifdef O_NOATIME
> +    fcntl_flags |= O_NOATIME;
> +#endif
> +
> +#ifdef O_ASYNC
> +    /* Not all operating systems have O_ASYNC, and those that don't
> +     * will not let us track the state into rs->open_flags (typically
> +     * you achieve the same effect with an ioctl, for example I_SETSIG
> +     * on Solaris). But we do not use O_ASYNC, so that's fine.
> +     */
> +    assert((old_flags & O_ASYNC) == 0);
> +#endif
> +
> +    if ((new_flags & ~fcntl_flags) == (old_flags & ~fcntl_flags)) {
> +        /* dup the original fd */
> +        ret = qemu_dup(fd);
> +        if (ret >= 0) {
> +            if (fcntl_setfl(ret, new_flags)) {
> +                int new_fd = ret;
> +                ret = -errno;
> +                qemu_close(new_fd);
> +            }
> +        }
> +    }
> +
> +    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
> +    if (ret < 0) {
> +        const char *normalized_filename = filename;
> +        ret = raw_normalize_devicepath(&normalized_filename);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Could not normalize device path");
> +        } else {
> +            assert(!(new_flags & O_CREAT));
> +            ret = qemu_open(normalized_filename, new_flags);
> +            if (ret == -1) {
> +                error_setg_errno(errp, errno, "Could not open file with new flags");
> +                ret = -errno;

I personally prefer it the other way round (ret = -errno;
error_setg_errno(errp, -ret, ...);), but since error_setg_errno() now saves
errno, it's not wrong this way.

> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +static int raw_lock_image(BlockDriverState *bs, int bdrv_flags, Error **errp)
> +{
> +    int ret;
> +    BDRVRawState *s = bs->opaque;
> +    BDRVRawLockMode lock_mode;
> +
> +    if (bdrv_flags & BDRV_O_INACTIVE) {
> +        s->disable_lock = true;
> +        s->lock_on_invalidate = true;
> +    }

Ah, I see, that's interesting (regarding my comments on when and how an
inactive BDS should be shared).

Anyway, using s->disable_lock both for the user-supplied option and for
internal stuff seems wrong. It's not so bad here, but it seems wrong in
raw_invalidate_cache().

> +    if (!s->disable_lock) {
> +        lock_mode = raw_get_lock_mode(bdrv_flags);
> +        if (!(bdrv_flags & BDRV_O_RDWR) && access(bs->filename, W_OK) != 0) {
> +            s->disable_lock = true;
> +        }

What is this code for (the above three lines)? Why do you disable the
lock if the image cannot be written to and it's opened read-only?

I suppose it's because of [1], but: What if it's just this process that
cannot write to it but others (e.g. invoked by root) can? Then this is
not a no-op.

> +    }
> +    if (!s->disable_lock && lock_mode != RAW_L_READ_SHARE_RW) {
> +        int lock_flags = s->open_flags;
> +        if (!(bdrv_flags & BDRV_O_SHARE_RW)) {
> +            lock_flags |= O_RDWR;
> +        }

This will result in the next function call trying to open an image R/W,
even if the original image was opened in read-only mode. That doesn't
seem quite right to me.

[1] So I guess above you are setting disable_lock to true so that
raw_dup_flags() will at least succeed.

In any case, it seems wrong to me to open the image R/W if the user has
requested read-only, even if it succeeds and even if we are not actually
writing to it.

[2] I guess this is required because otherwise we can't place exclusive
locks? Would using F_OFD_GETLK instead of trying to place an exclusive
lock fix this issue? (A quick test on my part says: Yes, it would;
F_OFD_GETLK ignores permissions and just seems to care about whether
there already is a lock.)
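
Roughly like this (a sketch, not my actual test program); the descriptor is
opened O_RDONLY and the query for a write-lock conflict still works:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);    /* read-only access suffices */
        struct flock fl = {
            .l_whence = SEEK_SET, .l_start = 2, .l_len = 1,
            .l_type   = F_WRLCK,             /* ask about a write-lock conflict */
        };

        if (fd < 0 || fcntl(fd, F_OFD_GETLK, &fl) < 0) {
            perror("F_OFD_GETLK");
            return 1;
        }
        printf("byte 2 is %s\n", fl.l_type == F_UNLCK ? "free" : "locked");
        return 0;
    }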

> +        ret = raw_dup_flags(s->fd, s->open_flags, lock_flags, bs->filename,

I'd propose using bs->exact_filename instead of bs->filename, because
the former is guaranteed not to be a json: filename.

While for raw-posix you are guaranteed never to have a json: filename
in bs->filename, it's still a bit clearer.

> +                                   errp);

The indentation is off.

> +        if (ret < 0) {
> +            return ret;
> +        }
> +        s->lock_fd = ret;
> +        ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
> +        if (ret) {
> +            return ret;
> +        }
> +        s->cur_lock_mode = lock_mode;
> +    }
> +    return 0;
> +}
> +
>  static int raw_open_common(BlockDriverState *bs, QDict *options,
>                             int bdrv_flags, int open_flags, Error **errp)
>  {
> @@ -440,6 +655,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      raw_parse_flags(bdrv_flags, &s->open_flags);
>  
>      s->fd = -1;
> +    s->lock_fd = -1;
>      fd = qemu_open(filename, s->open_flags, 0644);
>      if (fd < 0) {
>          ret = -errno;
> @@ -451,6 +667,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>      s->fd = fd;
>  
> +    s->disable_lock = qemu_opt_get_bool(opts, "disable-lock", false);
> +
> +    if (!s->disable_lock) {
> +        ret = raw_lock_image(bs, bdrv_flags, errp);
> +        if (ret) {
> +            goto fail;
> +        }
> +    }
> +
>  #ifdef CONFIG_LINUX_AIO
>       /* Currently Linux does AIO only for files opened with O_DIRECT */
>      if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) {

Pausing review here because it's already 4 am...

(I'll try to look at it later today, after I've had some sleep.)

> @@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,

[...]

> @@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

And continuing here.

>      return 0;
>  }
>  
> +static int raw_inactivate(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    int r = 0;
> +
> +    if (s->cur_lock_mode != RAW_L_READ_SHARE_RW) {
> +        r = raw_lock_fd(s->lock_fd, RAW_L_READ_SHARE_RW, NULL);
> +    }
> +    return r;
> +}
> +
> +static void raw_invalidate_cache(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVRawState *s = bs->opaque;
> +
> +    if (s->lock_on_invalidate) {
> +        s->disable_lock = false;

As I hinted at above, this looks wrong. If the user has specified that
this image should be unlocked, it should be unlocked.

Max

> +        raw_lock_image(bs, bdrv_get_flags(bs), errp);
> +    }
> +}
> +
>  static QemuOptsList raw_create_opts = {
>      .name = "raw-create-opts",
>      .head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
> @@ -1885,7 +2494,8 @@ BlockDriver bdrv_file = {
>      .bdrv_get_info = raw_get_info,
>      .bdrv_get_allocated_file_size
>                          = raw_get_allocated_file_size,
> -
> +    .bdrv_inactivate = raw_inactivate,
> +    .bdrv_invalidate_cache = raw_invalidate_cache,
>      .create_opts = &raw_create_opts,
>  };
>  
>
Max Reitz Dec. 2, 2016, 4:13 p.m. UTC | #5
On 31.10.2016 16:38, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.
> 
> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.
> 
> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7c62fc3..07ab117 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c

Resuming review!

> @@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
>      return raw_open_common(bs, options, flags, 0, errp);
>  }
>  
> +typedef enum {
> +    RAW_LT_PREPARE,
> +    RAW_LT_COMMIT,
> +    RAW_LT_ABORT
> +} RawLockTransOp;
> +
> +typedef int (*RawReopenFunc)(RawLockTransOp op,
> +                             int old_lock_fd, int new_lock_fd,
> +                             BDRVRawLockMode old_lock,
> +                             BDRVRawLockMode new_lock,
> +                             Error **errp);
> +
> +static int raw_lt_nop(RawLockTransOp op,
> +                            int old_lock_fd, int new_lock_fd,
> +                            BDRVRawLockMode old_lock,
> +                            BDRVRawLockMode new_lock,
> +                            Error **errp)

Indentation is off.

> +{
> +    assert(old_lock == new_lock || new_lock == RAW_L_READ_SHARE_RW);
> +    return 0;
> +}
> +

Note and question before everything (after I had read basically
everything...): A small test shows me that locks are apparently shared
between dup'ed file descriptors, i.e.:

fd = open("foo");

lock(fd);
dup(fd);
unlock(fd);
close(fd);

new_fd = open("foo");
lock(new_fd); // fails

It works the other way around, too:

fd = open("foo");

lock(fd);
duped_fd = dup(fd);
unlock(duped_fd);
close(duped_fd);

new_fd = open("foo");
lock(new_fd); // fails


(The first lock() call has to use a shared lock and the second an
exclusive lock so that it works within a single process.)

This is pretty horrible because we don't know how old_lock_fd and
new_lock_fd are related. old_lock_fd was created somewhere through
raw_dup_flags() from s->fd; we have no idea whether that was actually
dup'ed or reopened. Then, rs->fd is created from s->fd, again through
raw_dup_flags(). Finally, new_lock_fd is created from rs->fd, once more
through said raw_dup_flags().

In the end, we have no idea whether new_lock_fd is the end of a long
dup() chain from old_lock_fd or whether it was reopened somewhere in
between (well, we know that it has to be reopened when changing from R/W
to RO or the other way around, but we have no idea when keeping R/W or
RO (because dup() or fcntl() may have simply failed)).

In my opinion, this makes reviewing the code pretty difficult because
it's hard to keep in mind what cases may happen everywhere.

Would it be better to always fully re-open() the lock_fd instead of
duplicating it? Then we would at least know that old_lock_fd and
new_lock_fd never share the same lock configuration.

If you don't want to do that, there should be at least comments in the
RawReopenFuncs explaining whether new_lock_fd may share locks with
old_lock_fd or whether that's impossible.
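
(For reference, a compilable sketch of the sharing behavior, not my actual test
and with error handling omitted: a lock taken through one descriptor stays in
effect for its dup() even after the original descriptor is closed, and then
conflicts with a lock attempt through a newly opened fd.)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        struct flock shared = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 1,
                                .l_type = F_RDLCK };
        struct flock excl   = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 1,
                                .l_type = F_WRLCK };

        int fd = open("foo", O_RDWR | O_CREAT, 0644);
        fcntl(fd, F_OFD_SETLK, &shared);    /* lock through fd */
        int duped = dup(fd);
        close(fd);                          /* the lock stays: it belongs to the
                                             * open file description, which
                                             * duped still refers to */

        int new_fd = open("foo", O_RDWR);
        if (fcntl(new_fd, F_OFD_SETLK, &excl) < 0) {
            perror("lock(new_fd)");         /* EAGAIN: conflicts with the shared
                                             * lock still held via duped */
        }
        close(duped);
        close(new_fd);
        return 0;
    }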

> +static int raw_lt_from_unlock(RawLockTransOp op,
> +                              int old_lock_fd, int new_lock_fd,
> +                              BDRVRawLockMode old_lock,
> +                              BDRVRawLockMode new_lock,
> +                              Error **errp)
> +{
> +    assert(old_lock != new_lock);
> +    assert(old_lock == RAW_L_READ_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        return raw_lock_fd(new_lock_fd, new_lock, errp);
> +        break;

That break is a tad superfluous.

> +    case RAW_LT_COMMIT:

Maybe a "fall through" comment or something here? I think Coverity might
complain without.

> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +
> +    return 0;
> +}
> +
> +static int raw_lt_read_to_write_share_rw(RawLockTransOp op,
> +                                         int old_lock_fd, int new_lock_fd,
> +                                         BDRVRawLockMode old_lock,
> +                                         BDRVRawLockMode new_lock,
> +                                         Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_READ);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            /* This is very unlikely, but catch it anyway. */
> +            error_setg_errno(errp, errno, "Failed to downgrade new fd (share byte)");

I'd put a "break;" even into the last error block.

However, maybe a "break;" is not correct: When a reopening operation
fails, its abort function does not get called (as far as I can see). So
if we fail somewhere here, the old lock won't be restored. I think you
should either manually invoke the abort branch or in
raw_reopen_handle_lock() just always invoke the function again with
RAW_LT_ABORT if RAW_LT_PREPARE failed.

> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_report("Failed to restore lock on old fd (share byte)");
> +        }
> +        break;
> +    }
> +    return ret ? -errno : 0;

What's wrong with "return ret;"?

> +}
> +
> +static int raw_lt_read_to_write(RawLockTransOp op,
> +                                int old_lock_fd, int new_lock_fd,
> +                                BDRVRawLockMode old_lock,
> +                                BDRVRawLockMode new_lock,
> +                                Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_READ);
> +    assert(new_lock == RAW_L_WRITE);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
> +            break;
> +        }

I'd suggest upgrading RAW_LOCK_BYTE_WRITE, too, to avoid race conditions.

(By the way, I don't think we can use F_OFD_GETLK here, we have to
upgrade (and then optionally downgrade) the locks because both locks
have to be held at the same time. Then again, this isn't so bad because
new_lock_fd has write access anyway.)

> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to restore old fd (share byte) b");

block/raw-posix.c:892:81: warning: trailing " b" at the end of string
[-Wpedantic]

> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }

This is exactly the same as the qemu_lock_fd() invocation above it
(without the trailing " b" though).

> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_report("Failed to restore lock on old fd (share byte)");
> +        }
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_share_rw_to_read(RawLockTransOp op,
> +                                         int old_lock_fd, int new_lock_fd,
> +                                         BDRVRawLockMode old_lock,
> +                                         BDRVRawLockMode new_lock,
> +                                         Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE_SHARE_RW);
> +    assert(new_lock == RAW_L_READ);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        /* Make sure there are no other writers. */
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (write byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_ABORT:
> +        break;

How about restoring the shared lock on RAW_LOCK_BYTE_WRITE in old_lock_fd?

> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_share_rw_to_write(RawLockTransOp op,
> +                                          int old_lock_fd, int new_lock_fd,
> +                                          BDRVRawLockMode old_lock,
> +                                          BDRVRawLockMode new_lock,
> +                                          Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE_SHARE_RW);
> +    assert(new_lock == RAW_L_WRITE);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        /* Make sure there are no other writers. */
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to downgrade old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        break;

As I said above, as far as I can see, the abort branch is not invoked if
the prepare branch failed. However, it should be, and then it's not
correct for this to be empty: old_lock_fd may still have an exclusive
lock on RAW_LOCK_BYTE_WRITE, this should be downgraded to a shared lock.

> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_to_read(RawLockTransOp op,
> +                                int old_lock_fd, int new_lock_fd,
> +                                BDRVRawLockMode old_lock,
> +                                BDRVRawLockMode new_lock,
> +                                Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_READ);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (write byte)");
> +            break;
> +        }

old_lock_fd is closed automatically, which should take care of this.
Better safe than sorry, though, of course.

(But there are a couple of places above where you are relying on
old_lock_fd being closed automatically, and also on new_lock_fd being
closed automatically on abort.)

> +        break;
> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_to_write_share_rw(RawLockTransOp op,
> +                                          int old_lock_fd, int new_lock_fd,
> +                                          BDRVRawLockMode old_lock,
> +                                          BDRVRawLockMode new_lock,
> +                                          Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:

You're not locking the write byte of new_lock here. Maybe it gets
inherited from old_lock_fd automatically, but we don't know whether
new_lock_fd is a dup() of old_lock_fd or whether it is not.

> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +/**
> + * Transactionally moving between possible locking states is tricky and must be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertable. As a result, in such
> + * cases we have to defer the downgraing to "commit", given that no revert will
> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfulling above conditions easier.
> + */
> +static const struct RawReopenFuncRecord {
> +    BDRVRawLockMode old_lock;
> +    BDRVRawLockMode new_lock;
> +    RawReopenFunc func;
> +    bool need_lock_fd;
> +    bool close_old_lock_fd;
> +} reopen_functions[] = {
> +
> +    {RAW_L_READ_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, false},
> +    {RAW_L_READ_SHARE_RW, RAW_L_READ, raw_lt_from_unlock, true},
> +    {RAW_L_READ_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_from_unlock, true},
> +    {RAW_L_READ_SHARE_RW, RAW_L_WRITE, raw_lt_from_unlock, true},
> +
> +    {RAW_L_READ, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_READ, RAW_L_READ, raw_lt_nop, false, false},
> +    {RAW_L_READ, RAW_L_WRITE_SHARE_RW, raw_lt_read_to_write_share_rw, true},
> +    {RAW_L_READ, RAW_L_WRITE, raw_lt_read_to_write, true},
> +
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_READ, raw_lt_write_share_rw_to_read, true},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_nop, false, false},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE, raw_lt_write_share_rw_to_write, true},
> +
> +    {RAW_L_WRITE, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_WRITE, RAW_L_READ, raw_lt_write_to_read, true},
> +    {RAW_L_WRITE, RAW_L_WRITE_SHARE_RW, raw_lt_write_to_write_share_rw, true},
> +    {RAW_L_WRITE, RAW_L_WRITE, raw_lt_nop, false, false},
> +};
> +
> +static int raw_reopen_handle_lock(BDRVReopenState *state,
> +                                  RawLockTransOp op,
> +                                  Error **errp)
> +{
> +    BDRVRawReopenState *rs = state->opaque;
> +    BDRVRawState *s = state->bs->opaque;
> +    BDRVRawLockMode old_lock, new_lock;
> +    const struct RawReopenFuncRecord *rec;
> +    int ret;
> +
> +    old_lock = s->cur_lock_mode;
> +    if (qdict_get_try_bool(state->options, "disable-lock", false)) {

Putting a disable_lock into BDRVRawReopenState, setting that in prepare
and then copying it over to s->disable_lock on commit seems better to me.

> +        new_lock = RAW_L_READ_SHARE_RW;
> +    } else {
> +        new_lock = raw_get_lock_mode(state->flags);
> +    }
> +    qdict_del(state->options, "disable-lock");

So you're reading it on prepare, then deleting it, and on commit you're
just getting new_lock from raw_get_lock_mode() because you already
deleted it from state->options? That seems wrong to me.

> +
> +    for (rec = &reopen_functions[0];
> +         rec < &reopen_functions[ARRAY_SIZE(reopen_functions)];
> +         rec++) {
> +        if (rec->old_lock == old_lock && rec->new_lock == new_lock) {
> +            break;
> +        }
> +    }
> +    assert(rec != &reopen_functions[ARRAY_SIZE(reopen_functions)]);
> +
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        if (rec->need_lock_fd) {
> +            int lock_flags = rs->open_flags;
> +            if (!(state->flags & BDRV_O_SHARE_RW)) {
> +                lock_flags |= O_RDWR;
> +            }

Again, I think we can and should do without this.

> +            ret = raw_dup_flags(rs->fd, s->open_flags, lock_flags,
> +                                state->bs->filename, errp);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            rs->lock_fd = ret;
> +        } else {
> +            rs->lock_fd = -1;
> +        }
> +        return rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

As I said above somewhere, it might make sense to fall through to
RAW_LT_ABORT if this returns an error.
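
E.g. something like (sketch):

    ret = rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
    if (ret) {
        /* undo whatever the prepare branch managed to do; NULL errp because
         * we already have an error to report */
        rec->func(RAW_LT_ABORT, s->lock_fd, rs->lock_fd, old_lock, new_lock,
                  NULL);
    }
    return ret;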

> +    case RAW_LT_COMMIT:
> +        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

Since you're ignoring the return value, you can also pass &error_abort
(I know it doesn't matter because this function is invoked with errp ==
&error_abort, but it looks cleaner).

> +        if ((rec->need_lock_fd || rec->close_old_lock_fd) && s->lock_fd >= 0) {
> +            qemu_close(s->lock_fd);

I'd propose setting it to -1 afterwards (relevant when
!rec->need_lock_fd && rec->close_old_lock_fd).

> +        }
> +        if (rec->need_lock_fd) {
> +            s->lock_fd = rs->lock_fd;
> +        }
> +        s->cur_lock_mode = new_lock;
> +        break;
> +    case RAW_LT_ABORT:
> +        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

Again, could pass &error_abort instead of errp.

> +        if (rec->need_lock_fd) {
> +            if (rs->lock_fd >= 0) {

Actually, rs->lock_fd >= 0 should be equivalent to rec->need_lock_fd.

> +                qemu_close(rs->lock_fd);
> +                rs->lock_fd = -1;
> +            }
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
>  static int raw_reopen_prepare(BDRVReopenState *state,
>                                BlockReopenQueue *queue, Error **errp)
>  {
> @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>  
>      raw_parse_flags(state->flags, &rs->open_flags);
>  
> -    rs->fd = -1;
> -
> -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> -#ifdef O_NOATIME
> -    fcntl_flags |= O_NOATIME;
> -#endif
> -
> -#ifdef O_ASYNC
> -    /* Not all operating systems have O_ASYNC, and those that don't
> -     * will not let us track the state into rs->open_flags (typically
> -     * you achieve the same effect with an ioctl, for example I_SETSIG
> -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> -     */
> -    assert((s->open_flags & O_ASYNC) == 0);
> -#endif
> -
> -    if ((rs->open_flags & ~fcntl_flags) == (s->open_flags & ~fcntl_flags)) {
> -        /* dup the original fd */
> -        rs->fd = qemu_dup(s->fd);
> -        if (rs->fd >= 0) {
> -            ret = fcntl_setfl(rs->fd, rs->open_flags);
> -            if (ret) {
> -                qemu_close(rs->fd);
> -                rs->fd = -1;
> -            }
> -        }
> +    ret = raw_dup_flags(s->fd, s->open_flags, rs->open_flags,
> +                        state->bs->filename, errp);
> +    if (ret < 0) {
> +        return ret;
>      }
>  
> -    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
> -    if (rs->fd == -1) {
> -        const char *normalized_filename = state->bs->filename;
> -        ret = raw_normalize_devicepath(&normalized_filename);
> -        if (ret < 0) {
> -            error_setg_errno(errp, -ret, "Could not normalize device path");
> -        } else {
> -            assert(!(rs->open_flags & O_CREAT));
> -            rs->fd = qemu_open(normalized_filename, rs->open_flags);
> -            if (rs->fd == -1) {
> -                error_setg_errno(errp, errno, "Could not reopen file");
> -                ret = -1;
> -            }
> -        }
> -    }
> +    rs->fd = ret;
>  
>      /* Fail already reopen_prepare() if we can't get a working O_DIRECT
>       * alignment with the new fd. */
> -    if (rs->fd != -1) {
> -        raw_probe_alignment(state->bs, rs->fd, &local_err);
> -        if (local_err) {
> -            qemu_close(rs->fd);
> -            rs->fd = -1;
> -            error_propagate(errp, local_err);
> -            ret = -EINVAL;
> -        }
> +    raw_probe_alignment(state->bs, rs->fd, &local_err);
> +    if (local_err) {
> +        qemu_close(rs->fd);
> +        rs->fd = -1;
> +        error_propagate(errp, local_err);
> +        return -EINVAL;
>      }
> +    ret = raw_reopen_handle_lock(state, RAW_LT_PREPARE, errp);

Shouldn't this have the same clean-up routine on error as above, i.e.
close the new FD?

>  
>      return ret;
>  }
> @@ -626,6 +1206,8 @@ static void raw_reopen_commit(BDRVReopenState *state)
>  
>      s->open_flags = rs->open_flags;
>  
> +    raw_reopen_handle_lock(state, RAW_LT_COMMIT, &error_abort);
> +
>      qemu_close(s->fd);
>      s->fd = rs->fd;
>  
> @@ -643,6 +1225,8 @@ static void raw_reopen_abort(BDRVReopenState *state)
>          return;
>      }
>  
> +    raw_reopen_handle_lock(state, RAW_LT_ABORT, &error_abort);
> +
>      if (rs->fd >= 0) {
>          qemu_close(rs->fd);
>          rs->fd = -1;
> @@ -1332,6 +1916,10 @@ static void raw_close(BlockDriverState *bs)
>          qemu_close(s->fd);
>          s->fd = -1;
>      }
> +    if (s->lock_fd >= 0) {
> +        qemu_close(s->lock_fd);
> +        s->lock_fd = -1;
> +    }
>  }
>  
>  static int raw_truncate(BlockDriverState *bs, int64_t offset)
> @@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

*Resumed review complete* :-)

Max
Fam Zheng Jan. 18, 2017, 10:48 a.m. UTC | #6
On Fri, 12/02 03:58, Max Reitz wrote:
> On 31.10.2016 16:38, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> > 
> > Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> > four locking modes by combining two bits (BDRV_O_RDWR and
> > BDRV_O_SHARE_RW), and implemented by taking two locks:
> > 
> > Lock bytes:
> > 
> > * byte 1: I can't allow other processes to write to the image
> > * byte 2: I am writing to the image
> > 
> > Lock modes:
> > 
> > * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
> >   byte 2. Test whether byte 1 is locked using an exclusive lock, and
> >   fail if so.
> 
> Testing whether something is locked would be easier by using F_OFD_GETLK
> instead of actually creating an exclusive lock and then releasing it.

My attempt to do this shows it doesn't work: fcntl forces the tested lock type
to read lock for some unknown reason, so it cannot be used to test other shared
lockers.
Max Reitz Jan. 18, 2017, 1:02 p.m. UTC | #7
On 18.01.2017 11:48, Fam Zheng wrote:
> On Fri, 12/02 03:58, Max Reitz wrote:
>> On 31.10.2016 16:38, Fam Zheng wrote:
>>> This implements open flag sensible image locking for local file
>>> and host device protocol.
>>>
>>> virtlockd in libvirt locks the first byte, so we start looking at the
>>> file bytes from 1.
>>>
>>> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
>>> four locking modes by combining two bits (BDRV_O_RDWR and
>>> BDRV_O_SHARE_RW), and implemented by taking two locks:
>>>
>>> Lock bytes:
>>>
>>> * byte 1: I can't allow other processes to write to the image
>>> * byte 2: I am writing to the image
>>>
>>> Lock modes:
>>>
>>> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>>>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>>>   fail if so.
>>
>> Testing whether something is locked would be easier by using F_OFD_GETLK
>> instead of actually creating an exclusive lock and then releasing it.
> 
> My attempt to do this shows it doesn't work: fcntl forces the tested lock type
> to read lock for some unknown reason, so it cannot be used to test other shared
> lockers.

Do you have a reproducer? The attached quick and dirty test case works
for me (compile test.c to test and test2.c to test2), producing the
following output (when running ./test):

set rd lock: 0, Success
get lock: 0, Success; read lock in place
set wr lock: -1, Resource temporarily unavailable
unset lock: 0, Resource temporarily unavailable
===
unset lock: 0, Success
get lock: 0, Success; unlocked
set wr lock: 0, Success
unset lock: 0, Success
===
set wr lock: 0, Success
get lock: 0, Success; write lock in place
set wr lock: -1, Resource temporarily unavailable
unset lock: 0, Resource temporarily unavailable

So the l_type field is correctly set after every F_OFD_GETLK query call.

Max
Fam Zheng Jan. 18, 2017, 1:19 p.m. UTC | #8
On Wed, 01/18 14:02, Max Reitz wrote:
> >> Testing whether something is locked would be easier by using F_OFD_GETLK
> >> instead of actually creating an exclusive lock and then releasing it.
> > 
> > My attempt to do this shows it doesn't work: fcntl forces the tested lock type
> > to read lock for some unknown reason, so it cannot be used to test other shared
> > lockers.
> 
> Do you have a reproducer? The attached quick and dirty test case works
> for me (compile test.c to test and test2.c to test2), producing the
> following output (when running ./test):
> 
> set rd lock: 0, Success
> get lock: 0, Success; read lock in place
> set wr lock: -1, Resource temporarily unavailable
> unset lock: 0, Resource temporarily unavailable
> ===
> unset lock: 0, Success
> get lock: 0, Success; unlocked
> set wr lock: 0, Success
> unset lock: 0, Success
> ===
> set wr lock: 0, Success
> get lock: 0, Success; write lock in place
> set wr lock: -1, Resource temporarily unavailable
> unset lock: 0, Resource temporarily unavailable

You are right, now I see why I was confused with the F_OFD_GETLK interface.

Fam

> 
> So the l_type field is correctly set after every F_OFD_GETLK query call.
> 
> Max
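
A minimal sketch of the kind of F_OFD_GETLK query discussed above (not the attached test.c/test2.c; it assumes Linux open file description locks and is only meant to show that l_type reports the conflicting lock type):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Ask the kernel whether an exclusive lock on byte @start would conflict,
 * and print the type of any conflicting lock.  l_pid must be 0 for
 * F_OFD_GETLK, which the designated initializer takes care of. */
static void query_byte(int fd, off_t start)
{
    struct flock fl = {
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = 1,
        .l_type   = F_WRLCK,    /* "would a write lock succeed?" */
    };

    if (fcntl(fd, F_OFD_GETLK, &fl) == -1) {
        perror("F_OFD_GETLK");
        return;
    }
    /* On return, l_type is F_UNLCK if nothing conflicts, otherwise the
     * type of a lock held by some other open file description. */
    printf("byte %lld: %s\n", (long long)start,
           fl.l_type == F_UNLCK ? "unlocked" :
           fl.l_type == F_RDLCK ? "read lock in place"
                                : "write lock in place");
}

Run against a second open() of the same file (a separate open, not a dup, so it is a distinct open file description), this reports locks taken through the first descriptor.
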
diff mbox

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7c62fc3..07ab117 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -131,8 +131,44 @@  do { \
 
 #define MAX_BLOCKSIZE	4096
 
+/* POSIX file locking bytes. Libvirt takes byte 0, so start from byte 1. */
+#define RAW_LOCK_BYTE_MIN 1
+#define RAW_LOCK_BYTE_NO_OTHER_WRITER 1
+#define RAW_LOCK_BYTE_WRITE     2
+#define RAW_LOCK_BYTE_MAX 2
+
+/*
+ ** shared writer: Take shared lock on byte 2. Test whether byte 1 is
+ *  locked using an exclusive lock, and fail if so.
+ *
+ ** exclusive writer: Take shared lock on byte 2. Test whether byte 1 is
+ *  locked using an exclusive lock, and fail if so. Then take shared lock
+ *  on byte 1. I suppose this is racy, but we can probably tolerate that.
+ *
+ ** reader that can tolerate writers: Don't do anything
+ *
+ ** reader that can't tolerate writers: Take shared lock on byte 1. Test
+ *  whether byte 2 is locked, and fail if so.
+ */
+
+typedef enum {
+    /* Read only and accept other writers. */
+    RAW_L_READ_SHARE_RW,
+    /* Read only and try to forbid other writers. */
+    RAW_L_READ,
+    /* Read write and accept other writers. */
+    RAW_L_WRITE_SHARE_RW,
+    /* Read write and try to forbid other writers. */
+    RAW_L_WRITE,
+} BDRVRawLockMode;
+
 typedef struct BDRVRawState {
     int fd;
+    /* A dup of @fd to make lock manipulation easier, especially during
+     * reopen, where this will take over BDRVRawReopenState.lock_fd. */
+    int lock_fd;
+    bool disable_lock;
+    bool lock_on_invalidate;
     int type;
     int open_flags;
     size_t buf_align;
@@ -146,10 +182,13 @@  typedef struct BDRVRawState {
     bool use_linux_aio:1;
     bool has_fallocate;
     bool needs_alignment;
+    BDRVRawLockMode cur_lock_mode;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
     int fd;
+    /* A dup of @fd used for acquiring locks. */
+    int lock_fd;
     int open_flags;
 } BDRVRawReopenState;
 
@@ -368,6 +407,77 @@  static void raw_parse_flags(int bdrv_flags, int *open_flags)
     }
 }
 
+static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
+{
+    int ret;
+    assert(fd >= 0);
+    /* Starting at byte 1 avoids interfering with virtlockd (byte 0). */
+    switch (mode) {
+    case RAW_L_READ_SHARE_RW:
+        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_MIN,
+                             RAW_LOCK_BYTE_MAX - RAW_LOCK_BYTE_MIN + 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock fd");
+            goto fail;
+        }
+        break;
+    case RAW_L_READ:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte exclusively");
+            goto fail;
+        }
+        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
+        break;
+    case RAW_L_WRITE_SHARE_RW:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
+            goto fail;
+        }
+        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock share byte");
+            goto fail;
+        }
+        break;
+    case RAW_L_WRITE:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to downgrade share byte");
+            goto fail;
+        }
+        break;
+    default:
+        abort();
+    }
+    return 0;
+fail:
+    qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+    qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
+    return -errno;
+}
+
 static void raw_parse_filename(const char *filename, QDict *options,
                                Error **errp)
 {
@@ -393,10 +503,115 @@  static QemuOptsList raw_runtime_opts = {
             .type = QEMU_OPT_STRING,
             .help = "host AIO implementation (threads, native)",
         },
+        {
+            .name = "disable-lock",
+            .type = QEMU_OPT_BOOL,
+            .help = "don't lock the file",
+        },
         { /* end of list */ }
     },
 };
 
+static BDRVRawLockMode raw_get_lock_mode(int flags)
+{
+    switch (flags & (BDRV_O_RDWR | BDRV_O_SHARE_RW)) {
+    case BDRV_O_RDWR:
+        return RAW_L_WRITE;
+    case BDRV_O_RDWR | BDRV_O_SHARE_RW:
+        return RAW_L_WRITE_SHARE_RW;
+    case BDRV_O_SHARE_RW:
+        return RAW_L_READ_SHARE_RW;
+    case 0:
+        return RAW_L_READ;
+    default:
+        abort();
+    }
+}
+
+static int raw_dup_flags(int fd, int old_flags, int new_flags,
+                         const char *filename, Error **errp)
+{
+    int ret = -1;
+    int fcntl_flags = O_APPEND | O_NONBLOCK;
+#ifdef O_NOATIME
+    fcntl_flags |= O_NOATIME;
+#endif
+
+#ifdef O_ASYNC
+    /* Not all operating systems have O_ASYNC, and those that don't
+     * will not let us track the state in the new flags (typically
+     * you achieve the same effect with an ioctl, for example I_SETSIG
+     * on Solaris). But we do not use O_ASYNC, so that's fine.
+     */
+    assert((old_flags & O_ASYNC) == 0);
+#endif
+
+    if ((new_flags & ~fcntl_flags) == (old_flags & ~fcntl_flags)) {
+        /* dup the original fd */
+        ret = qemu_dup(fd);
+        if (ret >= 0) {
+            if (fcntl_setfl(ret, new_flags)) {
+                int new_fd = ret;
+                ret = -errno;
+                qemu_close(new_fd);
+            }
+        }
+    }
+
+    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
+    if (ret < 0) {
+        const char *normalized_filename = filename;
+        ret = raw_normalize_devicepath(&normalized_filename);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Could not normalize device path");
+        } else {
+            assert(!(new_flags & O_CREAT));
+            ret = qemu_open(normalized_filename, new_flags);
+            if (ret == -1) {
+                error_setg_errno(errp, errno, "Could not open file with new flags");
+                ret = -errno;
+            }
+        }
+    }
+    return ret;
+}
+
+static int raw_lock_image(BlockDriverState *bs, int bdrv_flags, Error **errp)
+{
+    int ret;
+    BDRVRawState *s = bs->opaque;
+    BDRVRawLockMode lock_mode;
+
+    if (bdrv_flags & BDRV_O_INACTIVE) {
+        s->disable_lock = true;
+        s->lock_on_invalidate = true;
+    }
+    if (!s->disable_lock) {
+        lock_mode = raw_get_lock_mode(bdrv_flags);
+        if (!(bdrv_flags & BDRV_O_RDWR) && access(bs->filename, W_OK) != 0) {
+            s->disable_lock = true;
+        }
+    }
+    if (!s->disable_lock && lock_mode != RAW_L_READ_SHARE_RW) {
+        int lock_flags = s->open_flags;
+        if (!(bdrv_flags & BDRV_O_SHARE_RW)) {
+            lock_flags |= O_RDWR;
+        }
+        ret = raw_dup_flags(s->fd, s->open_flags, lock_flags, bs->filename,
+                                   errp);
+        if (ret < 0) {
+            return ret;
+        }
+        s->lock_fd = ret;
+        ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
+        if (ret) {
+            return ret;
+        }
+        s->cur_lock_mode = lock_mode;
+    }
+    return 0;
+}
+
 static int raw_open_common(BlockDriverState *bs, QDict *options,
                            int bdrv_flags, int open_flags, Error **errp)
 {
@@ -440,6 +655,7 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
     raw_parse_flags(bdrv_flags, &s->open_flags);
 
     s->fd = -1;
+    s->lock_fd = -1;
     fd = qemu_open(filename, s->open_flags, 0644);
     if (fd < 0) {
         ret = -errno;
@@ -451,6 +667,15 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
     s->fd = fd;
 
+    s->disable_lock = qemu_opt_get_bool(opts, "disable-lock", false);
+
+    if (!s->disable_lock) {
+        ret = raw_lock_image(bs, bdrv_flags, errp);
+        if (ret) {
+            goto fail;
+        }
+    }
+
 #ifdef CONFIG_LINUX_AIO
      /* Currently Linux does AIO only for files opened with O_DIRECT */
     if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) {
@@ -538,6 +763,398 @@  static int raw_open(BlockDriverState *bs, QDict *options, int flags,
     return raw_open_common(bs, options, flags, 0, errp);
 }
 
+typedef enum {
+    RAW_LT_PREPARE,
+    RAW_LT_COMMIT,
+    RAW_LT_ABORT
+} RawLockTransOp;
+
+typedef int (*RawReopenFunc)(RawLockTransOp op,
+                             int old_lock_fd, int new_lock_fd,
+                             BDRVRawLockMode old_lock,
+                             BDRVRawLockMode new_lock,
+                             Error **errp);
+
+static int raw_lt_nop(RawLockTransOp op,
+                            int old_lock_fd, int new_lock_fd,
+                            BDRVRawLockMode old_lock,
+                            BDRVRawLockMode new_lock,
+                            Error **errp)
+{
+    assert(old_lock == new_lock || new_lock == RAW_L_READ_SHARE_RW);
+    return 0;
+}
+
+static int raw_lt_from_unlock(RawLockTransOp op,
+                              int old_lock_fd, int new_lock_fd,
+                              BDRVRawLockMode old_lock,
+                              BDRVRawLockMode new_lock,
+                              Error **errp)
+{
+    assert(old_lock != new_lock);
+    assert(old_lock == RAW_L_READ_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        return raw_lock_fd(new_lock_fd, new_lock, errp);
+        break;
+    case RAW_LT_COMMIT:
+    case RAW_LT_ABORT:
+        break;
+    }
+
+    return 0;
+}
+
+static int raw_lt_read_to_write_share_rw(RawLockTransOp op,
+                                         int old_lock_fd, int new_lock_fd,
+                                         BDRVRawLockMode old_lock,
+                                         BDRVRawLockMode new_lock,
+                                         Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_READ);
+    assert(new_lock == RAW_L_WRITE_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            /* This is very unlikely, but catch it anyway. */
+            error_setg_errno(errp, errno, "Failed to downgrade new fd (share byte)");
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_report("Failed to restore lock on old fd (share byte)");
+        }
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_read_to_write(RawLockTransOp op,
+                                int old_lock_fd, int new_lock_fd,
+                                BDRVRawLockMode old_lock,
+                                BDRVRawLockMode new_lock,
+                                Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_READ);
+    assert(new_lock == RAW_L_WRITE);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to restore old fd (share byte) b");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_report("Failed to restore lock on old fd (share byte)");
+        }
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_share_rw_to_read(RawLockTransOp op,
+                                         int old_lock_fd, int new_lock_fd,
+                                         BDRVRawLockMode old_lock,
+                                         BDRVRawLockMode new_lock,
+                                         Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE_SHARE_RW);
+    assert(new_lock == RAW_L_READ);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        /* Make sure there are no other writers. */
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_share_rw_to_write(RawLockTransOp op,
+                                          int old_lock_fd, int new_lock_fd,
+                                          BDRVRawLockMode old_lock,
+                                          BDRVRawLockMode new_lock,
+                                          Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE_SHARE_RW);
+    assert(new_lock == RAW_L_WRITE);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        /* Make sure there are no other writers. */
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to downgrade old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_to_read(RawLockTransOp op,
+                                int old_lock_fd, int new_lock_fd,
+                                BDRVRawLockMode old_lock,
+                                BDRVRawLockMode new_lock,
+                                Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE);
+    assert(new_lock == RAW_L_READ);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_to_write_share_rw(RawLockTransOp op,
+                                          int old_lock_fd, int new_lock_fd,
+                                          BDRVRawLockMode old_lock,
+                                          BDRVRawLockMode new_lock,
+                                          Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE);
+    assert(new_lock == RAW_L_WRITE_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+/**
+ * Transactionally moving between possible locking states is tricky and must be
+ * done carefully. That is mostly because downgrading an exclusive lock to
+ * shared or unlocked is not guaranteed to be revertible. As a result, in such
+ * cases we have to defer the downgrading to "commit", given that no revert will
+ * happen after that point, and that downgrading a lock should never fail.
+ *
+ * On the other hand, upgrading a lock (e.g. from unlocked or shared to an
+ * exclusive lock) must happen in "prepare" because it may fail.
+ *
+ * Manage the operation matrix with this state transition table to make
+ * fulfilling the above conditions easier.
+ */
+static const struct RawReopenFuncRecord {
+    BDRVRawLockMode old_lock;
+    BDRVRawLockMode new_lock;
+    RawReopenFunc func;
+    bool need_lock_fd;
+    bool close_old_lock_fd;
+} reopen_functions[] = {
+
+    {RAW_L_READ_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, false},
+    {RAW_L_READ_SHARE_RW, RAW_L_READ, raw_lt_from_unlock, true},
+    {RAW_L_READ_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_from_unlock, true},
+    {RAW_L_READ_SHARE_RW, RAW_L_WRITE, raw_lt_from_unlock, true},
+
+    {RAW_L_READ, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_READ, RAW_L_READ, raw_lt_nop, false, false},
+    {RAW_L_READ, RAW_L_WRITE_SHARE_RW, raw_lt_read_to_write_share_rw, true},
+    {RAW_L_READ, RAW_L_WRITE, raw_lt_read_to_write, true},
+
+    {RAW_L_WRITE_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_READ, raw_lt_write_share_rw_to_read, true},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_nop, false, false},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE, raw_lt_write_share_rw_to_write, true},
+
+    {RAW_L_WRITE, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_WRITE, RAW_L_READ, raw_lt_write_to_read, true},
+    {RAW_L_WRITE, RAW_L_WRITE_SHARE_RW, raw_lt_write_to_write_share_rw, true},
+    {RAW_L_WRITE, RAW_L_WRITE, raw_lt_nop, false, false},
+};
+
+static int raw_reopen_handle_lock(BDRVReopenState *state,
+                                  RawLockTransOp op,
+                                  Error **errp)
+{
+    BDRVRawReopenState *rs = state->opaque;
+    BDRVRawState *s = state->bs->opaque;
+    BDRVRawLockMode old_lock, new_lock;
+    const struct RawReopenFuncRecord *rec;
+    int ret;
+
+    old_lock = s->cur_lock_mode;
+    if (qdict_get_try_bool(state->options, "disable-lock", false)) {
+        new_lock = RAW_L_READ_SHARE_RW;
+    } else {
+        new_lock = raw_get_lock_mode(state->flags);
+    }
+    qdict_del(state->options, "disable-lock");
+
+    for (rec = &reopen_functions[0];
+         rec < &reopen_functions[ARRAY_SIZE(reopen_functions)];
+         rec++) {
+        if (rec->old_lock == old_lock && rec->new_lock == new_lock) {
+            break;
+        }
+    }
+    assert(rec != &reopen_functions[ARRAY_SIZE(reopen_functions)]);
+
+    switch (op) {
+    case RAW_LT_PREPARE:
+        if (rec->need_lock_fd) {
+            int lock_flags = rs->open_flags;
+            if (!(state->flags & BDRV_O_SHARE_RW)) {
+                lock_flags |= O_RDWR;
+            }
+            ret = raw_dup_flags(rs->fd, s->open_flags, lock_flags,
+                                state->bs->filename, errp);
+            if (ret < 0) {
+                return ret;
+            }
+            rs->lock_fd = ret;
+        } else {
+            rs->lock_fd = -1;
+        }
+        return rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+    case RAW_LT_COMMIT:
+        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+        if ((rec->need_lock_fd || rec->close_old_lock_fd) && s->lock_fd >= 0) {
+            qemu_close(s->lock_fd);
+        }
+        if (rec->need_lock_fd) {
+            s->lock_fd = rs->lock_fd;
+        }
+        s->cur_lock_mode = new_lock;
+        break;
+    case RAW_LT_ABORT:
+        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+        if (rec->need_lock_fd) {
+            if (rs->lock_fd >= 0) {
+                qemu_close(rs->lock_fd);
+                rs->lock_fd = -1;
+            }
+        }
+        break;
+    }
+    return 0;
+}
+
 static int raw_reopen_prepare(BDRVReopenState *state,
                               BlockReopenQueue *queue, Error **errp)
 {
@@ -560,61 +1177,24 @@  static int raw_reopen_prepare(BDRVReopenState *state,
 
     raw_parse_flags(state->flags, &rs->open_flags);
 
-    rs->fd = -1;
-
-    int fcntl_flags = O_APPEND | O_NONBLOCK;
-#ifdef O_NOATIME
-    fcntl_flags |= O_NOATIME;
-#endif
-
-#ifdef O_ASYNC
-    /* Not all operating systems have O_ASYNC, and those that don't
-     * will not let us track the state into rs->open_flags (typically
-     * you achieve the same effect with an ioctl, for example I_SETSIG
-     * on Solaris). But we do not use O_ASYNC, so that's fine.
-     */
-    assert((s->open_flags & O_ASYNC) == 0);
-#endif
-
-    if ((rs->open_flags & ~fcntl_flags) == (s->open_flags & ~fcntl_flags)) {
-        /* dup the original fd */
-        rs->fd = qemu_dup(s->fd);
-        if (rs->fd >= 0) {
-            ret = fcntl_setfl(rs->fd, rs->open_flags);
-            if (ret) {
-                qemu_close(rs->fd);
-                rs->fd = -1;
-            }
-        }
+    ret = raw_dup_flags(s->fd, s->open_flags, rs->open_flags,
+                        state->bs->filename, errp);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
-    if (rs->fd == -1) {
-        const char *normalized_filename = state->bs->filename;
-        ret = raw_normalize_devicepath(&normalized_filename);
-        if (ret < 0) {
-            error_setg_errno(errp, -ret, "Could not normalize device path");
-        } else {
-            assert(!(rs->open_flags & O_CREAT));
-            rs->fd = qemu_open(normalized_filename, rs->open_flags);
-            if (rs->fd == -1) {
-                error_setg_errno(errp, errno, "Could not reopen file");
-                ret = -1;
-            }
-        }
-    }
+    rs->fd = ret;
 
     /* Fail already reopen_prepare() if we can't get a working O_DIRECT
      * alignment with the new fd. */
-    if (rs->fd != -1) {
-        raw_probe_alignment(state->bs, rs->fd, &local_err);
-        if (local_err) {
-            qemu_close(rs->fd);
-            rs->fd = -1;
-            error_propagate(errp, local_err);
-            ret = -EINVAL;
-        }
+    raw_probe_alignment(state->bs, rs->fd, &local_err);
+    if (local_err) {
+        qemu_close(rs->fd);
+        rs->fd = -1;
+        error_propagate(errp, local_err);
+        return -EINVAL;
     }
+    ret = raw_reopen_handle_lock(state, RAW_LT_PREPARE, errp);
 
     return ret;
 }
@@ -626,6 +1206,8 @@  static void raw_reopen_commit(BDRVReopenState *state)
 
     s->open_flags = rs->open_flags;
 
+    raw_reopen_handle_lock(state, RAW_LT_COMMIT, &error_abort);
+
     qemu_close(s->fd);
     s->fd = rs->fd;
 
@@ -643,6 +1225,8 @@  static void raw_reopen_abort(BDRVReopenState *state)
         return;
     }
 
+    raw_reopen_handle_lock(state, RAW_LT_ABORT, &error_abort);
+
     if (rs->fd >= 0) {
         qemu_close(rs->fd);
         rs->fd = -1;
@@ -1332,6 +1916,10 @@  static void raw_close(BlockDriverState *bs)
         qemu_close(s->fd);
         s->fd = -1;
     }
+    if (s->lock_fd >= 0) {
+        qemu_close(s->lock_fd);
+        s->lock_fd = -1;
+    }
 }
 
 static int raw_truncate(BlockDriverState *bs, int64_t offset)
@@ -1832,6 +2420,27 @@  static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     return 0;
 }
 
+static int raw_inactivate(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    int r = 0;
+
+    if (s->cur_lock_mode != RAW_L_READ_SHARE_RW) {
+        r = raw_lock_fd(s->lock_fd, RAW_L_READ_SHARE_RW, NULL);
+    }
+    return r;
+}
+
+static void raw_invalidate_cache(BlockDriverState *bs, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (s->lock_on_invalidate) {
+        s->disable_lock = false;
+        raw_lock_image(bs, bdrv_get_flags(bs), errp);
+    }
+}
+
 static QemuOptsList raw_create_opts = {
     .name = "raw-create-opts",
     .head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
@@ -1885,7 +2494,8 @@  BlockDriver bdrv_file = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
-
+    .bdrv_inactivate = raw_inactivate,
+    .bdrv_invalidate_cache = raw_invalidate_cache,
     .create_opts = &raw_create_opts,
 };