diff mbox

[VERSION,2] Disk image shared and exclusive locks.

Message ID 20091207141652.GA28705@amd.home.annexia.org
State New
Headers show

Commit Message

Richard W.M. Jones Dec. 7, 2009, 2:16 p.m. UTC
Here's a second go at this patch.  The only change is that we add a
second parameter (backinglock=none|shared|exclusive) which can be used
to control locking on the first level (only) of backing disk.

Dan Berrange convinced me there was a case to retain a distinct shared
locking type.

For any qemu instance which is going to use the "commit" command, I
would advise using: lock=exclusive,backinglock=exclusive.

For qemu instances which won't be using commit, I would advise:
lock=exclusive,backinglock=shared.

For qemu instances which are using GFS-like shared filesystems, I
would advise: lock=shared.

Rich.

Comments

Anthony Liguori Dec. 7, 2009, 3:06 p.m. UTC | #1
Richard W.M. Jones wrote:
> Here's a second go at this patch.  The only change is that we add a
> second parameter (backinglock=none|shared|exclusive) which can be used
> to control locking on the first level (only) of backing disk.
>
> Dan Berrange convinced me there was a case to retain a distinct shared
> locking type.
>
> For any qemu instance which is going to use the "commit" command, I
> would advise using: lock=exclusive,backinglock=exclusive.
>
> For qemu instances which won't be using commit, I would advise:
> lock=exclusive,backinglock=shared.
>   

ETOOCOMPLEX

So the situation we're trying to support is:

1) something grabs exclusive (but why would it?)
2) we want to still be able to run things as shared

The alternative would be that we only provide lock=on|off.  For 
GPFS-like shared filesystems (:-P) we would always use lock=off.  It 
wouldn't protect us from a non-cluster aware writer.

I assume libguestfs is the user of exclusive.  However, wouldn't it be 
reasonable for libguestfs to want to be able to mount the GPFS volume 
and manipulate it?  Why wouldn't libguestfs want to also use lock=shared 
| lock=off in this case?

I think this is a knob you have to expose to your users.  I think that 
also means that shared/exclusive is not so useful.

Regards,

Anthony Liguori
Paolo Bonzini Dec. 8, 2009, 8:48 a.m. UTC | #2
On 12/07/2009 04:06 PM, Anthony Liguori wrote:
> For GPFS-like shared filesystems (:-P) we would always use lock=off.  It
> wouldn't protect us from a non-cluster aware writer.

Management tools can always do this the right way---even with 
partition-grained locking if needed.

Paolo
Kevin Wolf Dec. 8, 2009, 10 a.m. UTC | #3
Am 07.12.2009 15:16, schrieb Richard W.M. Jones:
> Here's a second go at this patch.  The only change is that we add a
> second parameter (backinglock=none|shared|exclusive) which can be used
> to control locking on the first level (only) of backing disk.

And what about the backing file of the backing file? ;-) This approach
feels wrong somehow. We need to infer the locking for backing file from
the COW file (none => none; shared/exclusive => shared; commit acquires
an exclusive lock temporarily, but still can't prevent corruption with
turned off VMs)

You probably should add the locking flags to bdrv_open calls in
qemu-img, too.

Kevin
Richard W.M. Jones Dec. 8, 2009, 10:25 a.m. UTC | #4
On Tue, Dec 08, 2009 at 11:00:17AM +0100, Kevin Wolf wrote:
> Am 07.12.2009 15:16, schrieb Richard W.M. Jones:
> > Here's a second go at this patch.  The only change is that we add a
> > second parameter (backinglock=none|shared|exclusive) which can be used
> > to control locking on the first level (only) of backing disk.
> 
> And what about the backing file of the backing file? ;-) This approach
> feels wrong somehow.

I was under the impression that the grandparent backing file is never
written to (eg. by commit or whatever), and so no locking would be
required.  That's why the backinglock doesn't get inherited
recursively, and I checked that too.  However maybe since it is being
read, we should acquire shared locks for grandparent and above.

> We need to infer the locking for backing file from the COW file
> (none => none; shared/exclusive => shared; commit acquires an
> exclusive lock temporarily, but still can't prevent corruption with
> turned off VMs)
>
> You probably should add the locking flags to bdrv_open calls in
> qemu-img, too.

OK.

Rich.
diff mbox

Patch

diff --git a/block.c b/block.c
index 853f025..f80d6a8 100644
--- a/block.c
+++ b/block.c
@@ -448,7 +448,8 @@  int bdrv_open2(BlockDriverState *bs, const char *filename, int flags,
     try_rw = !bs->read_only || bs->is_temporary;
     if (!(flags & BDRV_O_FILE))
         open_flags = (try_rw ? BDRV_O_RDWR : 0) |
-            (flags & (BDRV_O_CACHE_MASK|BDRV_O_NATIVE_AIO));
+            (flags & (BDRV_O_CACHE_MASK | BDRV_O_NATIVE_AIO |
+                      BDRV_O_LOCK_MASK | BDRV_O_BACKLOCK_MASK));
     else
         open_flags = flags & ~(BDRV_O_FILE | BDRV_O_SNAPSHOT);
     if (use_bdrv_whitelist && !bdrv_is_whitelisted(drv))
@@ -479,14 +480,19 @@  int bdrv_open2(BlockDriverState *bs, const char *filename, int flags,
     if (bs->backing_file[0] != '\0') {
         /* if there is a backing file, use it */
         BlockDriver *back_drv = NULL;
+        int back_drv_open_flags = open_flags;
         bs->backing_hd = bdrv_new("");
         /* pass on read_only property to the backing_hd */
         bs->backing_hd->read_only = bs->read_only;
+        /* lock for the first level backing disk only */
+        back_drv_open_flags &= ~(BDRV_O_BACKLOCK_MASK | BDRV_O_LOCK_MASK);
+        back_drv_open_flags |=
+          (open_flags & BDRV_O_BACKLOCK_MASK) >> BDRV_O_BACKLOCK_TO_LOCK_SHIFT;
         path_combine(backing_filename, sizeof(backing_filename),
                      filename, bs->backing_file);
         if (bs->backing_format[0] != '\0')
             back_drv = bdrv_find_format(bs->backing_format);
-        ret = bdrv_open2(bs->backing_hd, backing_filename, open_flags,
+        ret = bdrv_open2(bs->backing_hd, backing_filename, back_drv_open_flags,
                          back_drv);
         if (ret < 0) {
             bdrv_close(bs);
diff --git a/block.h b/block.h
index 4a8b628..3005eca 100644
--- a/block.h
+++ b/block.h
@@ -38,8 +38,15 @@  typedef struct QEMUSnapshotInfo {
 #define BDRV_O_NOCACHE     0x0020 /* do not use the host page cache */
 #define BDRV_O_CACHE_WB    0x0040 /* use write-back caching */
 #define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
+#define BDRV_O_LOCK_SHARED        0x0100 /* fail unless we can lock shared */
+#define BDRV_O_LOCK_EXCLUSIVE     0x0200 /* fail unless we can lock exclusive */
+#define BDRV_O_BACKLOCK_SHARED    0x0400 /* same for backing disk */
+#define BDRV_O_BACKLOCK_EXCLUSIVE 0x0800 /* "" */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB)
+#define BDRV_O_LOCK_MASK   (BDRV_O_LOCK_SHARED | BDRV_O_LOCK_EXCLUSIVE)
+#define BDRV_O_BACKLOCK_MASK (BDRV_O_BACKLOCK_SHARED|BDRV_O_BACKLOCK_EXCLUSIVE)
+#define BDRV_O_BACKLOCK_TO_LOCK_SHIFT 2
 
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1 << BDRV_SECTOR_BITS)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 5a6a22b..8b11b89 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -133,6 +133,7 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
 {
     BDRVRawState *s = bs->opaque;
     int fd, ret;
+    struct flock lk;
 
     s->lseek_err_cnt = 0;
 
@@ -163,6 +164,21 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     s->fd = fd;
     s->aligned_buf = NULL;
 
+    if (bdrv_flags & BDRV_O_LOCK_MASK) {
+        if (bdrv_flags & BDRV_O_LOCK_SHARED)
+            lk.l_type = F_RDLCK;
+        else /* bdrv_flags & BDRV_O_LOCK_EXCLUSIVE */
+            lk.l_type = F_WRLCK;
+        lk.l_whence = SEEK_SET;
+        lk.l_start = 0;
+        lk.l_len = 0; /* means lock the whole file */
+        fprintf (stderr, "acquiring %s lock on %s ...\n",
+                 bdrv_flags & BDRV_O_LOCK_SHARED ? "shared" : "exclusive",
+                 filename);
+        if (fcntl (fd, F_SETLK, &lk) == -1)
+            goto out_close;
+    }
+
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
         s->aligned_buf = qemu_blockalign(bs, ALIGNED_BUFFER_SIZE);
         if (s->aligned_buf == NULL) {
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 72acad5..9d0cfc7 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -78,6 +78,8 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int access_flags, create_flags;
     DWORD overlapped;
+    DWORD lock_flags;
+    OVERLAPPED ov;
 
     s->type = FTYPE_FILE;
 
@@ -106,6 +108,21 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
             return -EACCES;
         return -1;
     }
+
+    if (flags & BDRV_O_LOCK_MASK) {
+        lock_flags = LOCKFILE_FAIL_IMMEDIATELY;
+        if (flags & BDRV_O_LOCK_EXCLUSIVE)
+            lock_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+
+        memset(&ov, 0, sizeof(ov));
+        ov.Offset = 0;
+        ov.OffsetHigh = 0;
+
+        if (!LockFileEx(s->hfile, lock_flags, 0, 1, 0, &ov))
+            /* For compatibility with the POSIX lock failure ... */
+            return -EAGAIN;
+    }
+
     return 0;
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index 92b5363..62641b4 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -77,6 +77,14 @@  QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "lock",
+            .type = QEMU_OPT_STRING,
+            .help = "lock disk image (exclusive, shared, none)",
+        },{
+            .name = "backinglock",
+            .type = QEMU_OPT_STRING,
+            .help = "lock backing disk image (exclusive, shared, none)",
         },
         { /* end if list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 1b5781a..a458f3c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -104,6 +104,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native]\n"
+    "       [,lock=exclusive|shared|none][,backinglock=exclusive|shared|none]\n"
     "                use 'file' as a drive image\n")
 DEF("set", HAS_ARG, QEMU_OPTION_set,
     "-set group.id.arg=value\n"
@@ -146,6 +147,19 @@  an untrusted format header.
 This option specifies the serial number to assign to the device.
 @item addr=@var{addr}
 Specify the controller's PCI address (if=virtio only).
+@item lock=@var{mode}
+Acquire a lock on the disk image (@var{file}).
+Available modes are: exclusive, shared, none.
+The default is "none", meaning we don't try to acquire a lock.  To
+avoid multiple virtual machines trying to write to a disk at the
+same time (which can cause disk corruption), use lock=exclusive.
+@item backinglock=@var{mode}
+Acquire a lock on the backing disk.
+Available modes are: exclusive, shared, none.
+The default is "none", meaning we don't try to acquire a lock.  For
+disk formats that don't have backing disks, this option is ignored.
+In the case where multiple levels of backing disk are used, this
+only applies to the first ("parent") backing disk.
 @end table
 
 By default, writethrough caching is used for all block device.  This means that
diff --git a/vl.c b/vl.c
index 09a0ec5..29fff4d 100644
--- a/vl.c
+++ b/vl.c
@@ -2030,6 +2030,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     const char *devaddr;
     DriveInfo *dinfo;
     int snapshot = 0;
+    int lock_flags = 0;
 
     *fatal_error = 1;
 
@@ -2220,6 +2221,32 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         }
     }
 
+    if ((buf = qemu_opt_get(opts, "lock")) != NULL) {
+        if (!strcmp(buf, "none"))
+            /* nothing */;
+        else if (!strcmp(buf, "shared"))
+            lock_flags |= BDRV_O_LOCK_SHARED;
+        else if (!strcmp(buf, "exclusive"))
+            lock_flags |= BDRV_O_LOCK_EXCLUSIVE;
+        else {
+           fprintf(stderr, "qemu: invalid lock option\n");
+           return NULL;
+        }
+    }
+
+    if ((buf = qemu_opt_get(opts, "backinglock")) != NULL) {
+        if (!strcmp(buf, "none"))
+            /* nothing */;
+        else if (!strcmp(buf, "shared"))
+            lock_flags |= BDRV_O_BACKLOCK_SHARED;
+        else if (!strcmp(buf, "exclusive"))
+            lock_flags |= BDRV_O_BACKLOCK_EXCLUSIVE;
+        else {
+           fprintf(stderr, "qemu: invalid lock option\n");
+           return NULL;
+        }
+    }
+
     /* compute bus and unit according index */
 
     if (index != -1) {
@@ -2364,6 +2391,8 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         (void)bdrv_set_read_only(dinfo->bdrv, 1);
     }
 
+    bdrv_flags |= lock_flags;
+
     if (bdrv_open2(dinfo->bdrv, file, bdrv_flags, drv) < 0) {
         fprintf(stderr, "qemu: could not open disk image %s: %s\n",
                         file, strerror(errno));