diff mbox

[VERSION,3] Disk image exclusive and shared locks.

Message ID 20091215164238.GA24410@amd.home.annexia.org
State New
Headers show

Commit Message

Richard W.M. Jones Dec. 15, 2009, 4:42 p.m. UTC
This is v3 of the lock patch, previously discussed here:

http://lists.gnu.org/archive/html/qemu-devel/2009-12/threads.html#00461

In this version I've reverted back to the simpler interface.  There is
now only one "lock" option, which can be lock=exclusive|shared|none.

At Kevin Wolf's suggestion,
lock=exclusive|shared => all backing disks are locked shared
lock=none => no locks are acquired on any front or back disks

In order to mitigate the problem with locks during live migration,
I've added a lock command to the monitor, which currently allows you
to acquire (but not revoke) a lock.  (Revocation could be added fairly
easily too.)  This should allow the management tool to start the qemu
destination process without locking, and lock it once migration is
complete.

To keep the implementation very simple, the "commit" doesn't try to
acquire any extra locks.  The "commit" command should probably never
be used when a backing file could be shared.

In general I've attempted to keep the patch simple, but with room for
future extension.

Rich.

Comments

Anthony Liguori Dec. 15, 2009, 6:02 p.m. UTC | #1
Richard W.M. Jones wrote:
> This is v3 of the lock patch, previously discussed here:
>
> http://lists.gnu.org/archive/html/qemu-devel/2009-12/threads.html#00461
>
> In this version I've reverted back to the simpler interface.  There is
> now only one "lock" option, which can be lock=exclusive|shared|none.
>
> At Kevin Wolf's suggestion,
> lock=exclusive|shared => all backing disks are locked shared
> lock=none => no locks are acquired on any front or back disks
>   

I don't quite understand why we need exclusive|shared as opposed to just 
'on'.  Can you enumerate the use-cases associated with exclusive and shared?

> In order to mitigate the problem with locks during live migration,
> I've added a lock command to the monitor, which currently allows you
> to acquire (but not revoke) a lock.  (Revocation could be added fairly
> easily too.)  This should allow the management tool to start the qemu
> destination process without locking, and lock it once migration is
> complete.
>   

I really dislike this as an interface.  I think we need to make a 
decision about whether we delay open until after migration has 
completed.  I think unless there's a really compelling argument against 
it, this is probably what we should do.

As it stands, we cannot make lock=!none the default if it takes an extra 
monitor command to allow for live migration.  I think if we're going to 
introduce this functionality, we probably should be enabling it by default.

Regards,

Anthony Liguori
Richard W.M. Jones Dec. 15, 2009, 6:09 p.m. UTC | #2
On Tue, Dec 15, 2009 at 12:02:04PM -0600, Anthony Liguori wrote:
> Richard W.M. Jones wrote:
>> This is v3 of the lock patch, previously discussed here:
>>
>> http://lists.gnu.org/archive/html/qemu-devel/2009-12/threads.html#00461
>>
>> In this version I've reverted back to the simpler interface.  There is
>> now only one "lock" option, which can be lock=exclusive|shared|none.
>>
>> At Kevin Wolf's suggestion,
>> lock=exclusive|shared => all backing disks are locked shared
>> lock=none => no locks are acquired on any front or back disks
>>   
>
> I don't quite understand why we need exclusive|shared as opposed to just  
> 'on'.  Can you enumerate the use-cases associated with exclusive and 
> shared?

Thanks for looking at this.

The use case is "cluster filesystem with an admin tool that must be
run exclusively".  Cluster nodes open the block device for write, but
with a shared lock.  The admin tool needs exclusive access (no nodes
must be writing), so it tries to open the device with lock=exclusive.
This only succeeds if the normal cluster nodes have backed off.

However, the patch would be a bit simpler if we just had lock=on|off
at the moment, and it wouldn't stop us from adding the shared case in
future.  (For my needs I don't care about the shared case).

>> In order to mitigate the problem with locks during live migration,
>> I've added a lock command to the monitor, which currently allows you
>> to acquire (but not revoke) a lock.  (Revocation could be added fairly
>> easily too.)  This should allow the management tool to start the qemu
>> destination process without locking, and lock it once migration is
>> complete.
>>   
>
> I really dislike this as an interface.  I think we need to make a  
> decision about whether we delay open until after migration has  
> completed.  I think unless there's a really compelling argument against  
> it, this is probably what we should do.

I'm guessing this needs some quite major surgery to qemu so that block
device opening is delayed in the case of migration.  I can take a look
at this.

> As it stands, we cannot make lock=!none the default if it takes an extra  
> monitor command to allow for live migration.  I think if we're going to  
> introduce this functionality, we probably should be enabling it by 
> default.

Xen has a similar feature, and it is enabled by default.

Rich.
Jamie Lokier Dec. 15, 2009, 6:33 p.m. UTC | #3
Richard W.M. Jones wrote:
> This is v3 of the lock patch, previously discussed here:
> 
> http://lists.gnu.org/archive/html/qemu-devel/2009-12/threads.html#00461
> 
> In this version I've reverted back to the simpler interface.  There is
> now only one "lock" option, which can be lock=exclusive|shared|none.
> 
> At Kevin Wolf's suggestion,
> lock=exclusive|shared => all backing disks are locked shared
> lock=none => no locks are acquired on any front or back disks

What do you recommend when backing disks are shared?

> To keep the implementation very simple, the "commit" doesn't try to
> acquire any extra locks.  The "commit" command should probably never
> be used when a backing file could be shared.

Shared backing disks aren't safe after "commit" anyway.  Other VMs may
not be running at the time "commit" renders their image corrupt, so
locks don't offer adequate protection against the backing disk being changed.

One strategy that would offer a bit more protection would be: backing
disks opened read-only, re-opened as writable at the time of "commit",
and (where the format supports it) have a generation number stored in
them which is incremented prior to the first write after writable
open.  The generation number would be stored in the referring delta
image, which would complain if it found the backing file did not have
a matching generation.  This would at least alert the user to
inconsistencies, and the exclusive lock arising from re-opening as
writable would block "commit" if there were actively running VMs.

A different strategy would be to simply have a user-settable flag in
backing VM images meaning "shared therefore commit not allowed".

You might think the user could do that by setting the permissions to
read-only, but root ignores file permissions.  (That's why we need a
"ro" option too).

-- Jamie
Anthony Liguori Dec. 15, 2009, 6:45 p.m. UTC | #4
Richard W.M. Jones wrote:
> Thanks for looking at this.
>   

Thanks for following up :-)

> The use case is "cluster filesystem with an admin tool that must be
> run exclusively".  Cluster nodes open the block device for write, but
> with a shared lock.  The admin tool needs exclusive access (no nodes
> must be writing), so it tries to open the device with lock=exclusive.
> This only succeeds if the normal cluster nodes have backed off.
>   

This seems like a reasonable uncommon scenario and my concern is that it 
really hurts the interface.  lock=on|off I think qualifies as "The 
obvious use is the correct one" whereas I think 
lock=exclusive|shared|none would qualify as "Read the implementation and 
you'll get it right."

And that's mainly due to the number of corner cases that have to be 
considered in order for exclusive to degrade into shared.

> However, the patch would be a bit simpler if we just had lock=on|off
> at the moment, and it wouldn't stop us from adding the shared case in
> future.  (For my needs I don't care about the shared case).
>   

Let's stick with lock=on|off.  If anyone comes along and actually wants 
shared|exclusive, we can revisit the thread :-)


>> I really dislike this as an interface.  I think we need to make a  
>> decision about whether we delay open until after migration has  
>> completed.  I think unless there's a really compelling argument against  
>> it, this is probably what we should do.
>>     
>
> I'm guessing this needs some quite major surgery to qemu so that block
> device opening is delayed in the case of migration.  I can take a look
> at this.
>   

I think it's actually easier than I expected it to be.

We really just need to delay lock acquisition until it's really needed 
(read/write op).  For probing, we can get away without having a lock 
held since it's read only and we only use it to decide which bdrv to use.

So what I'd expect is that we would to acquire a lock until we actually 
ran the vm.  We would basically have a bdrv_lock_all() that could fail 
if some lock couldn't be acquired.  We would execute that right before 
running a VM (which means it happens after incoming migration completes).

After we completed migration (and did an fsck), we would call 
bdrv_unlock_all() to drop the lock.  Before we did 'cont' again, we 
would issue bdrv_lock_all().

As long as we do a bdrv_unlock_all() before writing the last byte of the 
migration stream, it's not at all racy.  We also don't have to close the 
fd on the source which means that we maintain the same level of robustness.


>> As it stands, we cannot make lock=!none the default if it takes an extra  
>> monitor command to allow for live migration.  I think if we're going to  
>> introduce this functionality, we probably should be enabling it by 
>> default.
>>     
>
> Xen has a similar feature, and it is enabled by default.
>   

And I really, really dislike it :-)  There are a lot of corner cases 
with requiring the use of '!' to the point where we ended up just using 
it for everything in our management tools.  IIRC, they don't use 
advisory locking and instead try to be clever with something similar to 
lsof.

Regards,

Anthony Liguori
Jamie Lokier Dec. 15, 2009, 11:26 p.m. UTC | #5
Jamie Lokier wrote:
> > At Kevin Wolf's suggestion,
> > lock=exclusive|shared => all backing disks are locked shared
> > lock=none => no locks are acquired on any front or back disks
> 
> What do you recommend when backing disks are shared?

Please ignore that part of my reply.  It was meant to be deleted after
I'd read the parent propely :-)

-- Jamie
Kevin Wolf Dec. 16, 2009, 10:37 a.m. UTC | #6
Am 15.12.2009 19:33, schrieb Jamie Lokier:
> Shared backing disks aren't safe after "commit" anyway.  Other VMs may
> not be running at the time "commit" renders their image corrupt, so
> locks don't offer adequate protection against the backing disk being changed.
> 
> One strategy that would offer a bit more protection would be: backing
> disks opened read-only, re-opened as writable at the time of "commit",
> and (where the format supports it) have a generation number stored in
> them which is incremented prior to the first write after writable
> open.  The generation number would be stored in the referring delta
> image, which would complain if it found the backing file did not have
> a matching generation.  This would at least alert the user to
> inconsistencies, and the exclusive lock arising from re-opening as
> writable would block "commit" if there were actively running VMs.
> 
> A different strategy would be to simply have a user-settable flag in
> backing VM images meaning "shared therefore commit not allowed".

Probably both suggestions are doable in qcow2 with an extended header.
However, raw backing file are not uncommon and you'll have a hard time
adding something there.

Also I'm not sure if they are really helpful. Who would really set the
user-settable flag after all? The generation number works automatically,
but it only can recognize the damage afterwards when the image is
already corrupted.

> You might think the user could do that by setting the permissions to
> read-only, but root ignores file permissions.  (That's why we need a
> "ro" option too).

We do have readonly=on|off.

Kevin
Christoph Hellwig Dec. 17, 2009, 10:53 a.m. UTC | #7
FYI I don't think this is all too useful.  Just about no one actually
uses file locking APIs on block devices.  If you want to provide
protection against mounting the image on the host or scribbling over it
using mkfs you need to open the block device node with O_EXCL on Linux
as that's the mechanisms most tools and the filesystem code us for
exclusion.

If you're primarily interested in protection against other qemu
instances you can add you code on top, but that seems like a rather
marginal use case.
Richard W.M. Jones Dec. 17, 2009, 11:06 a.m. UTC | #8
On Thu, Dec 17, 2009 at 11:53:45AM +0100, Christoph Hellwig wrote:
> If you're primarily interested in protection against other qemu
> instances you can add you code on top, but that seems like a rather
> marginal use case.

It's a pretty serious case for people accessing live virtual machines
with the libguestfs tools.  I'd like to make it impossible (or at
least harder) for them to corrupt their disk images.

Rich.
Jamie Lokier Dec. 17, 2009, 1:26 p.m. UTC | #9
Kevin Wolf wrote:
> > You might think the user could do that by setting the permissions to
> > read-only, but root ignores file permissions.  (That's why we need a
> > "ro" option too).
> 
> We do have readonly=on|off.

Sure, but if you have to do that for safe behaviour when running qemu
as root, and you don't need it when running qemu as a user because you
get into the habit of depending on file permissions, that's asking for
an accident to happen.

I know this, because I have accidentally opened read-only images
writable when putting "sudo" at the start of a qemu command to make
something completely unrelated work (networking).

Imho, the open-writable-if-permissions-allow-else-fallback-to-readable
behaviour should either be abolished entirely (not such a bad idea),
or be made to behave consistently no matter what user is used to run qemu.

-- Jamie
Jamie Lokier Dec. 17, 2009, 3:38 p.m. UTC | #10
Christoph Hellwig wrote:
> If you want to provide
> protection against mounting the image on the host or scribbling over it
> using mkfs you need to open the block device node with O_EXCL on Linux
> as that's the mechanisms most tools and the filesystem code us for
> exclusion.

Due to that I do suggest opening block devices with O_EXCL when
exclusive locking is requested.

-- Jamie
diff mbox

Patch

diff --git a/block.c b/block.c
index 3f3496e..0e69ba8 100644
--- a/block.c
+++ b/block.c
@@ -449,7 +449,7 @@  int bdrv_open2(BlockDriverState *bs, const char *filename, int flags,
     try_rw = !bs->read_only || bs->is_temporary;
     if (!(flags & BDRV_O_FILE))
         open_flags = (try_rw ? BDRV_O_RDWR : 0) |
-            (flags & (BDRV_O_CACHE_MASK|BDRV_O_NATIVE_AIO));
+            (flags & (BDRV_O_CACHE_MASK|BDRV_O_NATIVE_AIO|BDRV_O_LOCK_MASK));
     else
         open_flags = flags & ~(BDRV_O_FILE | BDRV_O_SNAPSHOT);
     if (use_bdrv_whitelist && !bdrv_is_whitelisted(drv))
@@ -480,14 +480,19 @@  int bdrv_open2(BlockDriverState *bs, const char *filename, int flags,
     if (bs->backing_file[0] != '\0') {
         /* if there is a backing file, use it */
         BlockDriver *back_drv = NULL;
+        int back_drv_open_flags = open_flags;
         bs->backing_hd = bdrv_new("");
         /* pass on read_only property to the backing_hd */
         bs->backing_hd->read_only = bs->read_only;
+        /* if front disk is locked, lock backing disk shared */
+        back_drv_open_flags &= ~BDRV_O_LOCK_MASK;
+        if (open_flags & BDRV_O_LOCK_MASK)
+            back_drv_open_flags |= BDRV_O_LOCK_SHARED;
         path_combine(backing_filename, sizeof(backing_filename),
                      filename, bs->backing_file);
         if (bs->backing_format[0] != '\0')
             back_drv = bdrv_find_format(bs->backing_format);
-        ret = bdrv_open2(bs->backing_hd, backing_filename, open_flags,
+        ret = bdrv_open2(bs->backing_hd, backing_filename, back_drv_open_flags,
                          back_drv);
         if (ret < 0) {
             bdrv_close(bs);
@@ -1388,6 +1393,16 @@  int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     return drv->bdrv_get_info(bs, bdi);
 }
 
+int bdrv_acquire_lock(BlockDriverState *bs, int lock_flags)
+{
+    BlockDriver *drv = bs->drv;
+    if (!drv)
+        return -ENOMEDIUM;
+    if (!drv->bdrv_acquire_lock)
+        return -ENOTSUP;
+    return drv->bdrv_acquire_lock(bs, lock_flags);
+}
+
 int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
                       int64_t pos, int size)
 {
diff --git a/block.h b/block.h
index fa51ddf..bd15bbe 100644
--- a/block.h
+++ b/block.h
@@ -39,8 +39,11 @@  typedef struct QEMUSnapshotInfo {
 #define BDRV_O_NOCACHE     0x0020 /* do not use the host page cache */
 #define BDRV_O_CACHE_WB    0x0040 /* use write-back caching */
 #define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
+#define BDRV_O_LOCK_SHARED 0x0100 /* fail unless we can lock shared */
+#define BDRV_O_LOCK_EXCLUSIVE 0x0200 /* fail unless we can lock exclusive */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB)
+#define BDRV_O_LOCK_MASK   (BDRV_O_LOCK_SHARED | BDRV_O_LOCK_EXCLUSIVE)
 
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1 << BDRV_SECTOR_BITS)
@@ -170,6 +173,7 @@  const char *bdrv_get_device_name(BlockDriverState *bs);
 int bdrv_write_compressed(BlockDriverState *bs, int64_t sector_num,
                           const uint8_t *buf, int nb_sectors);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
+int bdrv_acquire_lock(BlockDriverState *bs, int lock_flags);
 
 const char *bdrv_get_encrypted_filename(BlockDriverState *bs);
 void bdrv_get_backing_filename(BlockDriverState *bs,
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 5a6a22b..4c3326c 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -128,6 +128,24 @@  static int64_t raw_getlength(BlockDriverState *bs);
 static int cdrom_reopen(BlockDriverState *bs);
 #endif
 
+static int raw_acquire_lock(BlockDriverState *bs, int lock_flags)
+{
+    BDRVRawState *s = bs->opaque;
+    struct flock lk;
+
+    if (lock_flags & BDRV_O_LOCK_SHARED)
+        lk.l_type = F_RDLCK;
+    else /* bdrv_flags & BDRV_O_LOCK_EXCLUSIVE */
+        lk.l_type = F_WRLCK;
+    lk.l_whence = SEEK_SET;
+    lk.l_start = 0;
+    lk.l_len = 0;               /* means lock the whole file */
+
+    if (fcntl (s->fd, F_SETLK, &lk) == -1)
+        return -errno;
+    return 0;
+}
+
 static int raw_open_common(BlockDriverState *bs, const char *filename,
                            int bdrv_flags, int open_flags)
 {
@@ -163,6 +181,11 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     s->fd = fd;
     s->aligned_buf = NULL;
 
+    if (bdrv_flags & BDRV_O_LOCK_MASK) {
+        if (raw_acquire_lock(bs, bdrv_flags & BDRV_O_LOCK_MASK) < 0)
+            goto out_close;
+    }
+
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
         s->aligned_buf = qemu_blockalign(bs, ALIGNED_BUFFER_SIZE);
         if (s->aligned_buf == NULL) {
@@ -768,6 +791,8 @@  static BlockDriver bdrv_raw = {
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
 
+    .bdrv_acquire_lock = raw_acquire_lock,
+
     .create_options = raw_create_options,
 };
 
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 72acad5..6f89f03 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -73,6 +73,27 @@  static int set_sparse(int fd)
 				 NULL, 0, NULL, 0, &returned, NULL);
 }
 
+static int raw_acquire_lock(BlockDriverState *bs, int lock_flags)
+{
+    BDRVRawState *s = bs->opaque;
+    DWORD flags;
+    OVERLAPPED ov;
+
+    flags = LOCKFILE_FAIL_IMMEDIATELY;
+    if (lock_flags & BDRV_O_LOCK_EXCLUSIVE)
+        flags |= LOCKFILE_EXCLUSIVE_LOCK;
+
+    memset(&ov, 0, sizeof(ov));
+    ov.Offset = 0;
+    ov.OffsetHigh = 0;
+
+    if (!LockFileEx(s->hfile, flags, 0, 1, 0, &ov))
+        /* For compatibility with the POSIX lock failure ... */
+        return -EAGAIN;
+
+    return 0;
+}
+
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
 {
     BDRVRawState *s = bs->opaque;
@@ -106,6 +127,15 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
             return -EACCES;
         return -1;
     }
+
+    if (flags & BDRV_O_LOCK_MASK) {
+        int err;
+
+        err = raw_acquire_lock(s, flags & BDRV_O_LOCK_MASK);
+        if (err < 0)
+            return err;
+    }
+
     return 0;
 }
 
@@ -253,6 +283,7 @@  static BlockDriver bdrv_raw = {
     .bdrv_write		= raw_write,
     .bdrv_truncate	= raw_truncate,
     .bdrv_getlength	= raw_getlength,
+    .bdrv_acquire_lock	= raw_acquire_lock,
 
     .create_options = raw_create_options,
 };
diff --git a/block_int.h b/block_int.h
index 9a3b2e0..729d540 100644
--- a/block_int.h
+++ b/block_int.h
@@ -92,6 +92,7 @@  struct BlockDriver {
     int (*bdrv_snapshot_list)(BlockDriverState *bs,
                               QEMUSnapshotInfo **psn_info);
     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
+    int (*bdrv_acquire_lock)(BlockDriverState *bs, int lock_flags);
 
     int (*bdrv_save_vmstate)(BlockDriverState *bs, const uint8_t *buf,
                              int64_t pos, int size);
diff --git a/monitor.c b/monitor.c
index d97d529..d00d5d4 100644
--- a/monitor.c
+++ b/monitor.c
@@ -890,6 +890,38 @@  static void do_block_set_passwd(Monitor *mon, const QDict *qdict,
     }
 }
 
+static void do_block_lock(Monitor *mon, const QDict *qdict,
+                          QObject **ret_data)
+{
+    BlockDriverState *bs;
+    const char *p;
+    int lock_flags;
+
+    bs = bdrv_find(qdict_get_str(qdict, "device"));
+    if (!bs) {
+        qemu_error_new(QERR_DEVICE_NOT_FOUND, qdict_get_str(qdict, "device"));
+        return;
+    }
+
+    p = qdict_get_str(qdict, "mode");
+    if (!p) {
+        qemu_error_new(QERR_MISSING_PARAMETER, "mode");
+        return;
+    }
+    if (!strcmp(p, "exclusive"))
+        lock_flags = BDRV_O_LOCK_EXCLUSIVE;
+    else if (!strcmp (p, "shared"))
+        lock_flags = BDRV_O_LOCK_SHARED;
+    else {
+        qemu_error_new(QERR_INVALID_PARAMETER, p);
+        return;
+    }
+
+    if (bdrv_acquire_lock(bs, lock_flags) < 0) {
+        qemu_error_new(QERR_LOCK_ACQUIRE_FAILED);
+    }
+}
+
 static void do_change_block(Monitor *mon, const char *device,
                             const char *filename, const char *fmt)
 {
diff --git a/qemu-config.c b/qemu-config.c
index c3203c8..df0d3fb 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -78,6 +78,10 @@  QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "lock",
+            .type = QEMU_OPT_STRING,
+            .help = "lock disk image (exclusive, shared, none)",
         },
         { /* end if list */ }
     },
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index c788c73..9474905 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -1063,6 +1063,25 @@  STEXI
 Set the encrypted device @var{device} password to @var{password}
 ETEXI
 
+    {
+        .name       = "lock",
+        .args_type  = "device:B,mode:s",
+        .params     = "device [exclusive|shared]",
+        .help       = "acquire a lock on an existing device",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_block_lock,
+    },
+
+STEXI
+@item lock @var{device} @var{mode}
+Acquire a lock on @var{device}.  The @var{mode} string should be
+@var{exclusive} to acquire an exclusive lock, or @var{shared} to
+acquire a shared lock.
+
+This does not attempt to lock the backing disk for formats like
+qcow2 that can have backing storage.
+ETEXI
+
 STEXI
 @end table
 ETEXI
diff --git a/qemu-options.hx b/qemu-options.hx
index b8cc375..efc5f19 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -104,6 +104,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native]\n"
+    "       [,lock=exclusive|shared|none]\n"
     "                use 'file' as a drive image\n")
 DEF("set", HAS_ARG, QEMU_OPTION_set,
     "-set group.id.arg=value\n"
@@ -149,6 +150,12 @@  an untrusted format header.
 This option specifies the serial number to assign to the device.
 @item addr=@var{addr}
 Specify the controller's PCI address (if=virtio only).
+@item lock=@var{mode}
+Acquire a lock on the disk image (@var{file}).
+Available modes are: exclusive, shared, none.
+The default is "none", meaning we don't try to acquire a lock.  To
+avoid multiple virtual machines trying to write to a disk at the
+same time (which can cause disk corruption), use lock=exclusive.
 @end table
 
 By default, writethrough caching is used for all block device.  This means that
diff --git a/qerror.c b/qerror.c
index 5f8fc5d..8b74a58 100644
--- a/qerror.c
+++ b/qerror.c
@@ -120,6 +120,10 @@  static const QErrorStringTable qerror_table[] = {
         .error_fmt = QERR_VNC_SERVER_FAILED,
         .desc      = "Could not start VNC server on %(target)",
     },
+    {
+        .error_fmt = QERR_LOCK_ACQUIRE_FAILED,
+        .desc      = "Could not lock device with requested mode",
+    },
     {}
 };
 
diff --git a/qerror.h b/qerror.h
index 9e220d6..540bf8d 100644
--- a/qerror.h
+++ b/qerror.h
@@ -100,4 +100,7 @@  QError *qobject_to_qerror(const QObject *obj);
 #define QERR_VNC_SERVER_FAILED \
     "{ 'class': 'VNCServerFailed', 'data': { 'target': %s } }"
 
+#define QERR_LOCK_ACQUIRE_FAILED \
+    "{ 'class': 'LockAcquireFailed', 'data': {} }"
+
 #endif /* QERROR_H */
diff --git a/vl.c b/vl.c
index c0d98f5..b114518 100644
--- a/vl.c
+++ b/vl.c
@@ -2110,6 +2110,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     const char *devaddr;
     DriveInfo *dinfo;
     int snapshot = 0;
+    int lock_flags = 0;
 
     *fatal_error = 1;
 
@@ -2300,6 +2301,19 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         }
     }
 
+    if ((buf = qemu_opt_get(opts, "lock")) != NULL) {
+        if (!strcmp(buf, "none"))
+            /* nothing */;
+        else if (!strcmp(buf, "shared"))
+            lock_flags |= BDRV_O_LOCK_SHARED;
+        else if (!strcmp(buf, "exclusive"))
+            lock_flags |= BDRV_O_LOCK_EXCLUSIVE;
+        else {
+           fprintf(stderr, "qemu: invalid lock option\n");
+           return NULL;
+        }
+    }
+
     /* compute bus and unit according index */
 
     if (index != -1) {
@@ -2444,6 +2458,8 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         (void)bdrv_set_read_only(dinfo->bdrv, 1);
     }
 
+    bdrv_flags |= lock_flags;
+
     if (bdrv_open2(dinfo->bdrv, file, bdrv_flags, drv) < 0) {
         fprintf(stderr, "qemu: could not open disk image %s: %s\n",
                         file, strerror(errno));