
Disk image shared and exclusive locks.

Message ID 20091204165301.GA4167@amd.home.annexia.org
State New

Commit Message

Richard W.M. Jones Dec. 4, 2009, 4:53 p.m. UTC
[from the commit message ...]

Allow qemu to acquire shared and exclusive locks on disk images.
This is done by extending the -drive option with an additional,
optional parameter:

  -drive [...],lock=none
  -drive [...],lock=shared
  -drive [...],lock=exclusive

lock=none is the default, and it means that we don't try to acquire
any sort of lock.

lock=shared tries to acquire a shared lock on the disk image.
Multiple instances of qemu may all hold this sort of lock.

lock=exclusive tries to acquire an exclusive lock on the disk
image.  An exclusive lock excludes all other shared and exclusive
locks.

If acquisition of a lock fails, opening the image fails.

The implementation of locks only works for raw POSIX and Win32
files.  However many of the other block types are implemented
in terms of these drivers, so they "inherit" locking too.  Other
drivers are read-only, so don't require locking.  Below we note
only the cases where locking is *not* implemented:

  cloop - directly open()s the file, no locking implemented
  cow - same as cloop
  curl - protocol probably doesn't support locking
  nbd - same as curl
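
As a rough sketch of the mechanism described above (not the literal patch
code; the helper name is made up), the POSIX side amounts to an fcntl()
whole-file lock taken when the image is opened:

  #include <fcntl.h>
  #include <unistd.h>
  #include <string.h>
  #include <errno.h>

  /* Try to lock the whole file: l_start = l_len = 0 means the lock covers
   * every offset, not just the first byte.  F_SETLK (rather than F_SETLKW)
   * fails immediately instead of blocking, so that opening the image fails
   * if the lock cannot be acquired.  Returns 0 or -errno. */
  static int raw_lock_fd(int fd, int exclusive)
  {
      struct flock lk;

      memset(&lk, 0, sizeof(lk));
      lk.l_type   = exclusive ? F_WRLCK : F_RDLCK;
      lk.l_whence = SEEK_SET;
      lk.l_start  = 0;
      lk.l_len    = 0;

      if (fcntl(fd, F_SETLK, &lk) == -1)
          return -errno;
      return 0;
  }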

---

Rich.

Comments

Anthony Liguori Dec. 4, 2009, 5:15 p.m. UTC | #1
Richard W.M. Jones wrote:
> [from the commit message ...]
>
> Allow qemu to acquire shared and exclusive locks on disk images.
> This is done by extending the -drive option with an additional,
> optional parameter:
>
>   -drive [...],lock=none
>   -drive [...],lock=shared
>   -drive [...],lock=exclusive
>
> lock=none is the default, and it means that we don't try to acquire
> any sort of lock.
>
> lock=shared tries to acquire a shared lock on the disk image.
> Multiple instances of qemu may all hold this sort of lock.
>
> lock=exclusive tries to acquire an exclusive lock on the disk
> image.  An exclusive lock excludes all other shared and exclusive
> locks.
>
> If acquisition of a lock fails, opening the image fails.
>
> The implementation of locks only works for raw POSIX and Win32
> files.  However many of the other block types are implemented
> in terms of these drivers, so they "inherit" locking too.  Other
> drivers are read-only, so don't require locking.  Below we note
> only the cases where locking is *not* implemented:
>
>   cloop - directly open()s the file, no locking implemented
>   cow - same as cloop
>   curl - protocol probably doesn't support locking
>   nbd - same as curl
>   

The problem with something like this is that it gives a false sense of 
security.

You can still run into problems if you have an application accessing the 
image that doesn't attempt to acquire the lock.

Regards,

Anthony Liguori
Richard W.M. Jones Dec. 4, 2009, 9:57 p.m. UTC | #2
On Fri, Dec 04, 2009 at 11:15:12AM -0600, Anthony Liguori wrote:
> Richard W.M. Jones wrote:
>> [from the commit message ...]
>>
>> Allow qemu to acquire shared and exclusive locks on disk images.
>> This is done by extending the -drive option with an additional,
>> optional parameter:
>>
>>   -drive [...],lock=none
>>   -drive [...],lock=shared
>>   -drive [...],lock=exclusive
>>
>> lock=none is the default, and it means that we don't try to acquire
>> any sort of lock.
>>
>> lock=shared tries to acquire a shared lock on the disk image.
>> Multiple instances of qemu may all hold this sort of lock.
>>
>> lock=exclusive tries to acquire an exclusive lock on the disk
>> image.  An exclusive lock excludes all other shared and exclusive
>> locks.
>>
>> If acquisition of a lock fails, opening the image fails.
>>
>> The implementation of locks only works for raw POSIX and Win32
>> files.  However many of the other block types are implemented
>> in terms of these drivers, so they "inherit" locking too.  Other
>> drivers are read-only, so don't require locking.  Below we note
>> only the cases where locking is *not* implemented:
>>
>>   cloop - directly open()s the file, no locking implemented
>>   cow - same as cloop
>>   curl - protocol probably doesn't support locking
>>   nbd - same as curl
>>   
>
> The problem with something like this is that it gives a false sense of  
> security.
>
> You can still run into problems if you have an application accessing the  
> image that doesn't attempt to acquire the lock.

The idea would be for management tools (eg. libvirt) to add the lock
parameter for all virtual machines that they manage.

I don't see how this is worse than what we have now -- ie. no
possibility of locking at all and a very real risk of disk corruption.

Anyway, we could make the locking mandatory.  I believe that the
locking under Win32 (using LockFileEx) is already mandatory.  With
mandatory locks it would be pretty watertight in the single node case.
Alternatively how about making lock=exclusive the default?
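
For reference, the Win32 side would look roughly like this (a sketch only,
not the patch code).  LockFileEx region locks are enforced against reads
and writes through other handles, which is what makes them effectively
mandatory:

  #include <windows.h>

  /* Lock the whole file, starting at offset 0 for the maximum length.
   * LOCKFILE_FAIL_IMMEDIATELY makes the call fail rather than block if
   * another process already holds a conflicting lock. */
  static BOOL win32_lock_file(HANDLE h, BOOL exclusive)
  {
      OVERLAPPED ov;
      DWORD flags = LOCKFILE_FAIL_IMMEDIATELY;

      if (exclusive)
          flags |= LOCKFILE_EXCLUSIVE_LOCK;

      ZeroMemory(&ov, sizeof(ov));          /* region starts at offset 0 */
      return LockFileEx(h, flags, 0, 0xffffffff, 0xffffffff, &ov);
  }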

Rich.
Anthony Liguori Dec. 4, 2009, 10:29 p.m. UTC | #3
Richard W.M. Jones wrote:
> The idea would be for management tools (eg. libvirt) to add the lock
> parameter for all virtual machines that they manage.
>
> I don't see how this is worse than what we have now -- ie. no
> possibility of locking at all and a very real risk of disk corruption.
>
> Anyway, we could make the locking mandatory.  I believe that the
> locking under Win32 (using LockFileEx) is already mandatory.  With
> mandatory locks it would be pretty watertight in the single node case.
> Alternatively how about making lock=exclusive the default?
>   

Well disk sharing isn't actually bad as long as it's raw or a physical 
device.

For qcow2, it's very complicated by backing files because we really need 
to express the concept of a read-write lock.

That is, as long as the guests are opening the file read only, you 
should allow many readers.  However, you should prevent anyone from 
opening with write permission.  Likewise, writes should prevent future 
reads.

Regards,

Anthony Liguori
Avi Kivity Dec. 5, 2009, 5:31 p.m. UTC | #4
On 12/05/2009 12:29 AM, Anthony Liguori wrote:
>
> Well disk sharing isn't actually bad as long as it's raw or a physical 
> device.
>
> For qcow2, it's very complicated by backing files because we really 
> need to express the concept of a read-write lock.
>
> That is, as long as the guests are opening the file read only, you 
> should allow many readers.  However, you should prevent anyone from 
> opening with write permission.  Likewise, writes should prevent future 
> reads.
>

Well shared/exclusive or read/write locks support exactly that.

For non-raw storage we should take the locks unconditionally, since 
there is no useful way to share such images opened for write access.
Anthony Liguori Dec. 5, 2009, 5:47 p.m. UTC | #5
Avi Kivity wrote:
> On 12/05/2009 12:29 AM, Anthony Liguori wrote:
>>
>> Well disk sharing isn't actually bad as long as it's raw or a 
>> physical device.
>>
>> For qcow2, it's very complicated by backing files because we really 
>> need to express the concept of a read-write lock.
>>
>> That is, as long as the guests are opening the file read only, you 
>> should allow many readers.  However, you should prevent anyone from 
>> opening with write permission.  Likewise, writes should prevent 
>> future reads.
>>
>
> Well shared/exclusive or read/write locks support exactly that.
>
> For non-raw storage we should take the locks unconditionally, since 
> there is no useful way to share such images opened for write access.
>

I think I made my point poorly.  Consider the following:

qemu-img create -f raw base.img 10G
qemu-img create -f qcow2 -b base.img cow1.img
qemu-img create -f qcow2 -b base.img cow2.img

qemu -drive file=cow1.img,lock=exclusive
qemu -drive file=cow2.img,lock=exclusive

With the current patch, the second command will fail and it's impossible 
to invoke correctly.  That's because flags are passed down to backing 
devices directly.  You really need to be much smarter here in how you 
handle locking.

Exposing exclusive/shared is probably the wrong interface.  I think the 
better thing to do would be to expose a boolean and then automatically 
choose the proper lock type.  This may require some deeper surgery to
get right as we currently don't handle CD-ROMs intelligently (we try to
open with write permission first).

Regards,

Anthony Liguori
Avi Kivity Dec. 5, 2009, 5:55 p.m. UTC | #6
On 12/05/2009 07:47 PM, Anthony Liguori wrote:
> Avi Kivity wrote:
>> On 12/05/2009 12:29 AM, Anthony Liguori wrote:
>>>
>>> Well disk sharing isn't actually bad as long as it's raw or a 
>>> physical device.
>>>
>>> For qcow2, it's very complicated by backing files because we really 
>>> need to express the concept of a read-write lock.
>>>
>>> That is, as long as the guests are opening the file read only, you 
>>> should allow many readers.  However, you should prevent anyone from 
>>> opening with write permission.  Likewise, writes should prevent 
>>> future reads.
>>>
>>
>> Well shared/exclusive or read/write locks support exactly that.
>>
>> For non-raw storage we should take the locks unconditionally, since 
>> there is no useful way to share such images opened for write access.
>>
>
> I think I made my point poorly.  Consider the following:
>
> qemu-img create -f raw base.img 10G
> qemu-img create -f qcow2 -b base.img cow1.img
> qemu-img create -f qcow2 -b base.img cow2.img
>
> qemu -drive file=cow1.img,lock=exclusive
> qemu -drive file=cow2.img,lock=exclusive
>
> With the current patch, the second command will fail and it's 
> impossible to invoke correctly.  That's because flags are passed down 
> to backing devices directly.  You really need to be much smarter here 
> in how you handle locking.

Oh, that's just an implementation bug.  Obviously there's no need to 
open a file for exclusive access if it will only be accessed for read 
(when merging a snapshot we'd need to upgrade the lock to exclusive 
while that takes place).  I thought you objected to the whole concept.
Anthony Liguori Dec. 5, 2009, 5:59 p.m. UTC | #7
Avi Kivity wrote:
>> I think I made my point poorly.  Consider the following:
>>
>> qemu-img create -f raw base.img 10G
>> qemu-img create -f qcow2 -b base.img cow1.img
>> qemu-img create -f qcow2 -b base.img cow2.img
>>
>> qemu -drive file=cow1.img,lock=exclusive
>> qemu -drive file=cow2.img,lock=exclusive
>>
>> With the current patch, the second command will fail and it's 
>> impossible to invoke correctly.  That's because flags are passed down 
>> to backing devices directly.  You really need to be much smarter here 
>> in how you handle locking.
>
> Oh, that's just an implementation bug.  Obviously there's no need to 
> open a file for exclusive access if it will only be accessed for read 
> (when merging a snapshot we'd need to upgrade the lock to exclusive 
> while that takes place).  I thought you objected to the whole concept.

I'm not sure whether it's best to enable it by default because, as I 
said earlier, I'm not comfortable with the lack of correctness wrt 
advisory vs. mandatory locking.  Only qemu can implement this level of 
locking though so it's definitely something we ought to support.

However, from a UI and implementation perspective, I think we need 
significantly different semantics.  You either want locking or you 
don't.  I don't think the selection of none|shared|exclusive really 
makes sense.

Regards,

Anthony Liguori
Daniel P. Berrangé Dec. 7, 2009, 9:38 a.m. UTC | #8
On Fri, Dec 04, 2009 at 11:15:12AM -0600, Anthony Liguori wrote:
> Richard W.M. Jones wrote:
> >[from the commit message ...]
> >
> >Allow qemu to acquire shared and exclusive locks on disk images.
> >This is done by extending the -drive option with an additional,
> >optional parameter:
> >
> >  -drive [...],lock=none
> >  -drive [...],lock=shared
> >  -drive [...],lock=exclusive
> >
> >lock=none is the default, and it means that we don't try to acquire
> >any sort of lock.
> >
> >lock=shared tries to acquire a shared lock on the disk image.
> >Multiple instances of qemu may all hold this sort of lock.
> >
> >lock=exclusive tries to acquire an exclusive lock on the disk
> >image.  An exclusive lock excludes all other shared and exclusive
> >locks.
> >
> >If acquisition of a lock fails, opening the image fails.
> >
> >The implementation of locks only works for raw POSIX and Win32
> >files.  However many of the other block types are implemented
> >in terms of these drivers, so they "inherit" locking too.  Other
> >drivers are read-only, so don't require locking.  Below we note
> >only the cases where locking is *not* implemented:
> >
> >  cloop - directly open()s the file, no locking implemented
> >  cow - same as cloop
> >  curl - protocol probably doesn't support locking
> >  nbd - same as curl
> >  
> 
> The problem with something like this is that it gives a false sense of 
> security.
> 
> You can still run into problems if you have an application accessing the 
> image that doesn't attempt to acquire the lock.

The same is true of any UNIX application that uses POSIX locking, but
that doesn't mean they should not bother to use locking. POSIX advisory
locking relies on applications to play nicely together. If management
applications using QEMU take advantage of this functionality then it will
provide a useful level of protection within a host, and (depending on the
type of disk storage) may also provide protection between hosts.

Regards,
Daniel
Jamie Lokier Dec. 7, 2009, 10:31 a.m. UTC | #9
Anthony Liguori wrote:
> I'm not sure whether it's best to enable it by default because, as I 
> said earlier, I'm not comfortable with the lack of correctness wrt 
> advisory vs. mandatory locking.

In my experience, disk images are accessed in one of five ways:

    qemu-img
    qemu
    qemu-nbd
    mount -o loop
    cp/rsync

If all but the last implement qemu's advisory locking, that's comforting.

> Only qemu can implement this level of locking though so it's
> definitely something we ought to support.

That's not quite true.  I have management scripts which call qemu-img
to determine the chain of backing images, and then can lock the chain.
(They determine the chain anyway, to provide reliable behaviour with
image names containing unusual characters and the old monitor, by
creating links with sane names in /tmp to the real files.)

But it's a lot nicer if qemu does it.

> However, from a UI and implementation perspective, I think we need 
> significantly different semantics.  You either want locking or you 
> don't.  I don't think the selection of none|shared|exclusive really 
> makes sense.

Sometimes shared access to a raw image (partitioned or whole disk
filesystem) is ok, and sometimes it is not ok.  Only the user knows
the difference, because only the user knows if the guests they are
running use distinct partitions in the same raw image, or cooperative
access to a shared image.

But does it make sense to request a shared lock in that case?  Not
really.  If you have a group of guests correctly sharing an image, you
still want to prevent running the same group a second time - and a
shared lock wouldn't do that, because each group would be requesting
shared locks.

So the distinction read/write makes more sense.  Can anyone think of a
situation where a shared lock on an image opened for writing is useful?

-- Jamie
Chris Webb Dec. 7, 2009, 10:39 a.m. UTC | #10
Hi. There's a connected discussion on the sheepdog list about locking, and I
have a patch there which could complement this one quite well.

Sheepdog is a distributed, replicated block store being developed
(primarily) for Qemu. Images have a mandatory exclusive locking requirement,
enforced by the cluster manager. Without this, the replication scheme
breaks down and you can end up with inconsistent copies of the block
image.

The initial release of sheepdog took these locks in the block driver
bdrv_open() and bdrv_close() hooks. They also added a bdrv_closeall() and
ensured it was called in all the usual qemu exit paths to avoid stray locks.
(The rarer case of crashing hosts or crashing qemus will have to be handled
externally, and is 'to do'.)

The problem was that this prevented live migration, because both ends wanted
to open the image at once, even though only one would be using it at a time.
To get around this, we introduced bdrv_claim() and bdrv_release() block
driver hooks, which can be used to claim and release the lock on vm start
and stop. Since at most one end of a live migration is running at any one
time, this is sufficient to allow migration for Sheepdog devices. It would
also allow live migration of virtual machines on any other shared block
devices with mandatory exclusive locking, such as the locks introduced by
your patch.
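
Roughly, the hook shape described above would look something like this
(illustrative only, inferred from the description rather than taken from
the actual sheepdog patch):

  typedef struct BlockDriverState BlockDriverState;

  /* bdrv_open()/bdrv_close() still open and close the image as usual;
   * claim/release take and drop the (mandatory) lock at vm start and stop,
   * so both ends of a live migration can hold the image open while only
   * the running end holds the lock. */
  typedef struct BlockDriverLockOps {
      int  (*bdrv_claim)(BlockDriverState *bs);    /* called on vm start */
      void (*bdrv_release)(BlockDriverState *bs);  /* called on vm stop or exit */
  } BlockDriverLockOps;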

My patch in the sheepdog qemu branch is here:

  http://sheepdog.git.sourceforge.net/git/gitweb.cgi?p=sheepdog/qemu-kvm;a=commitdiff;h=fd9d7c739831cf04e5a913a89bdca680ba028ada

I was intending to rebase it against current qemu head for more detailed
review over the next couple of weeks after it's had a little more testing,
but given that your patch now provides a second use-case for it outside of
sheepdog, this seems like a timely moment to ask for any more general
feedback about our proposed approach.

Cheers,

Chris.
Kevin Wolf Dec. 7, 2009, 10:42 a.m. UTC | #11
On 07.12.2009 11:31, Jamie Lokier wrote:
> So the distinction read/write makes more sense.  Can anyone think of a
> situation where a shared lock on an image opened for writing is useful?

I think there are people using shared writable images with cluster file
systems.

Kevin
Daniel P. Berrangé Dec. 7, 2009, 10:45 a.m. UTC | #12
On Mon, Dec 07, 2009 at 10:31:28AM +0000, Jamie Lokier wrote:
> Anthony Liguori wrote:
> > I'm not sure whether it's best to enable it by default because, as I 
> > said earlier, I'm not comfortable with the lack of correctness wrt 
> > advisory vs. mandatory locking.
> 
> In my experience, disk images are accessed in one of five ways:
> 
>     qemu-img
>     qemu
>     qemu-nbd
>     mount -o loop
>     cp/rsync
> 
> If all but the last implement qemu's advisory locking, that's comforting.
> 
> > Only qemu can implement this level of locking though so it's
> > definitely something we ought to support.
> 
> That's not quite true.  I have management scripts which call qemu-img
> to determine the chain of backing images, and then can lock the chain.
> (They determine the chain anyway, to provide reliable behaviour with
> image names containing unusual characters and the old monitor, by
> creating links with sane names in /tmp to the real files.)
> 
> But it's a lot nicer if qemu does it.
> 
> > However, from a UI and implementation perspective, I think we need 
> > significantly different semantics.  You either want locking or you 
> > don't.  I don't think the selection of none|shared|exclusive really 
> > makes sense.
> 
> Sometimes shared access to a raw image (partitioned or whole disk
> filesystem) is ok, and sometimes it is not ok.  Only the user knows
> the difference, because only the user knows if the guests they are
> running use distinct partitions in the same raw image, or cooperative
> access to a shared image.
> 
> But does it make sense to request a shared lock in that case?  Not
> really.  If you have a group of guests correctly sharing an image, you
> still want to prevent running the same group a second time - and a
> shared lock wouldn't do that, because each group would be requesting
> shared locks.
> 
> So the distinction read/write makes more sense.  Can anyone think of a
> situation where a shared lock on an image opened for writing is useful?

Isn't this what Richard has already done ? The patch implements 'shared'
as a 'F_RDLCK' lock and 'exclusive' as 'F_WRLCK':

+        if (bdrv_flags & BDRV_O_LOCK_SHARED)
+            lk.l_type = F_RDLCK;
+        else /* bdrv_flags & BDRV_O_LOCK_EXCLUSIVE */
+            lk.l_type = F_WRLCK;

It seems we're just debating different terminology for the same thing.
Indeed the fcntl()  man page uses read/write and shared/exclusive
interchangeably too

       The  l_type  field  can  be  used  to  place  a  read
       (F_RDLCK) or a write (F_WRLCK) lock on a  file.   Any
       number  of  processes  may  hold  a read lock (shared
       lock) on a file region, but only one process may hold
       a  write  lock  (exclusive  lock).  An exclusive lock
       excludes all other locks, both shared and  exclusive.

Regards,
Daniel
Avi Kivity Dec. 7, 2009, 10:48 a.m. UTC | #13
On 12/07/2009 12:42 PM, Kevin Wolf wrote:
> I think there are people using shared writable images with cluster file
> systems.
>    

Hopefully not with qcow2.
Kevin Wolf Dec. 7, 2009, 10:56 a.m. UTC | #14
On 07.12.2009 11:48, Avi Kivity wrote:
> On 12/07/2009 12:42 PM, Kevin Wolf wrote:
>> I think there are people using shared writable images with cluster file
>> systems.
>>    
> 
> Hopefully not with qcow2.

Right, it doesn't make sense with all of the image formats. But it does
with some of them (basically, raw and host_*).

Kevin
Richard W.M. Jones Dec. 7, 2009, 10:58 a.m. UTC | #15
On Sat, Dec 05, 2009 at 11:47:11AM -0600, Anthony Liguori wrote:
> I think I made my point poorly.  Consider the following:
>
> qemu-img create -f raw base.img 10G
> qemu-img create -f qcow2 -b base.img cow1.img
> qemu-img create -f qcow2 -b base.img cow2.img
>
> qemu -drive file=cow1.img,lock=exclusive
> qemu -drive file=cow2.img,lock=exclusive
>
> With the current patch, the second command will fail and it's impossible  
> to invoke correctly.  That's because flags are passed down to backing  
> devices directly.

Isn't it correct that the second command fails?  Although the base
image is mostly used read-only, if the user does the "commit" command
in the monitor then the backing image is overwritten.  (That's my
understanding of the documentation anyway - I've never actually used
this feature of qcow2).

Also if we only acquire the lock during the commit operation then
we'll end up with disk corruption.  So we have to acquire the
exclusive lock the whole time, for the image and the backing store.

(That's my interpretation anyway -- it could very likely be wrong)

Rich.
Richard W.M. Jones Dec. 7, 2009, 11:04 a.m. UTC | #16
On Mon, Dec 07, 2009 at 10:31:28AM +0000, Jamie Lokier wrote:
> Anthony Liguori wrote:
> > I'm not sure whether it's best to enable it by default because, as I 
> > said earlier, I'm not comfortable with the lack of correctness wrt 
> > advisory vs. mandatory locking.
> 
> In my experience, disk images are accessed in one of five ways:
> 
>     qemu-img
>     qemu
>     qemu-nbd
>     mount -o loop
>     cp/rsync
> 
> If all but the last implement qemu's advisory locking, that's comforting.

It's not directly relevant to what you said above, but I'd just like
to interject my use case:

We'd like to prevent libguestfs from corrupting disk images.

If libguestfs is accessing a disk image read-only, then we would not
acquire any lock (ie. lock=none).  This gives reasonable results,
enough to grab a file or do 'virt-df' even against a running guest, in
all but the most pathological circumstances.  The only danger is that
libguestfs gets an error -- it can't corrupt the disk image of the
running VM.

If libguestfs is accessing a disk image read-write, then we would try
to acquire lock=exclusive.  At the same time, libvirt would be
modified so it sets lock=exclusive on all disk images, so libguestfs
would fail to acquire the lock when trying to write to a disk image
that is in use, and this would prevent disk corruption.
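
For illustration (hypothetical image name), the two cases would look
something like:

  qemu [...] -drive file=guest.img,lock=none       # read-only inspection, e.g. virt-df
  qemu [...] -drive file=guest.img,lock=exclusive  # read-write; fails if the guest's
                                                   # qemu already holds the lock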

Rich.
Jamie Lokier Dec. 7, 2009, 11:19 a.m. UTC | #17
Daniel P. Berrange wrote:
> > Sometimes shared access to a raw image (partitioned or whole disk
> > filesystem) is ok, and sometimes it is not ok.  Only the user knows
> > the difference, because only the user knows if the guests they are
> > running use distinct partitions in the same raw image, or cooperative
> > access to a shared image.
> > 
> > But does it make sense to request a shared lock in that case?  Not
> > really.  If you have a group of guests correctly sharing an image, you
> > still want to prevent running the same group a second time - and a
> > shared lock wouldn't do that, because each group would be requesting
> > shared locks.
> > 
> > So the distinction read/write makes more sense.  Can anyone think of a
> > situation where a shared lock on an image opened for writing is useful?
> 
> Isn't this what Richard has already done ? The patch implements 'shared'
> as a 'F_RDLCK' lock and 'exclusive' as 'F_WRLCK':

No, the question is whether it makes sense to provide a 'shared'
option on the command line, or simply to always map:

     image opened read only => F_RDLCK
     image opened writable => F_WRLCK

and provide only a single command line option: 'lock'.
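
In other words, something along these lines (a sketch assuming a single
boolean 'lock' option and a read-only open flag; not code from the patch):

  #include <fcntl.h>

  /* With a single on/off option, the lock type is derived from how the
   * image is opened rather than chosen by the user. */
  static short lock_type_for_open(int lock_requested, int read_only)
  {
      if (!lock_requested)
          return F_UNLCK;                   /* no locking at all */
      return read_only ? F_RDLCK : F_WRLCK;
  }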

-- Jamie
Jamie Lokier Dec. 7, 2009, 11:28 a.m. UTC | #18
Kevin Wolf wrote:
> On 07.12.2009 11:31, Jamie Lokier wrote:
> > So the distinction read/write makes more sense.  Can anyone think of a
> > situation where a shared lock on an image opened for writing is useful?
> 
> I think there are people using shared writable images with cluster file
> systems.

Yes, they are.  Please read the following again:

> > Sometimes shared access to a raw image (partitioned or whole disk
> > filesystem) is ok, and sometimes it is not ok.  Only the user knows
> > the difference, because only the user knows if the guests they are
> > running use distinct partitions in the same raw image, or cooperative
> > access to a shared image.
> >
> > But does it make sense to request a shared lock in that case?  Not
> > really.  If you have a group of guests correctly sharing an image, you
> > still want to prevent running the same group a second time - and a
> > shared lock wouldn't do that, because each group would be requesting
> > shared locks.

If you run each guest in the disk-sharing cluster with 'lock=shared',
it reflects that they are sharing - but that doesn't actually prevent
mistakes, does it?  If you run any of those guests a second time, it
won't prevent that.  So what's the point in the shared lock?

Same when you have guests sharing a partitioned disk, with each guest
accessing a different partition.  E.g. a Windows/Linux dual boot, and
you decide to run both guest OSes at the same time as they should be
fine.  You'd have to use 'lock=shared' - and when you do that, it
doesn't protect you against running them again by accident.

In all those cases you need locking at a higher level to be effective
at stopping the wrong guests from being invoked simultaneously.

So - can anyone think of a situation where being able to ask for a
shared lock on an image opened for writing is useful?

-- Jamie
Daniel P. Berrangé Dec. 7, 2009, 11:30 a.m. UTC | #19
On Mon, Dec 07, 2009 at 11:19:54AM +0000, Jamie Lokier wrote:
> Daniel P. Berrange wrote:
> > > Sometimes shared access to a raw image (partitioned or whole disk
> > > filesystem) is ok, and sometimes it is not ok.  Only the user knows
> > > the difference, because only the user knows if the guests they are
> > > running use distinct partitions in the same raw image, or cooperative
> > > access to a shared image.
> > > 
> > > But does it make sense to request a shared lock in that case?  Not
> > > really.  If you have a group of guests correctly sharing an image, you
> > > still want to prevent running the same group a second time - and a
> > > shared lock wouldn't do that, because each group would be requesting
> > > shared locks.
> > > 
> > > So the distinction read/write makes more sense.  Can anyone think of a
> > > situation where a shared lock on an image opened for writing is useful?
> > 
> > Isn't this what Richard has already done ? The patch implements 'shared'
> > as a 'F_RDLCK' lock and 'exclusive' as 'F_WRLCK':
> 
> No, the question is whether it makes sense to provide a 'shared'
> option on the command line, or simply to always map:
> 
>      image opened read only => F_RDLCK
>      image opened writable => F_WRLCK
> 
> and provide only a single command line option: 'lock'.

That doesn't work in the case of setting up a clustered filesystem
shared between guests. That requires that the disk be opened writable,
but with a shared (F_RDLOCK) lock.


Daniel
Richard W.M. Jones Dec. 7, 2009, 11:31 a.m. UTC | #20
On Mon, Dec 07, 2009 at 11:30:14AM +0000, Daniel P. Berrange wrote:
> On Mon, Dec 07, 2009 at 11:19:54AM +0000, Jamie Lokier wrote:
> > Daniel P. Berrange wrote:
> > > > Sometimes shared access to a raw image (partitioned or whole disk
> > > > filesystem) is ok, and sometimes it is not ok.  Only the user knows
> > > > the difference, because only the user knows if the guests they are
> > > > running use distinct partitions in the same raw image, or cooperative
> > > > access to a shared image.
> > > > 
> > > > But does it make sense to request a shared lock in that case?  Not
> > > > really.  If you have a group of guests correctly sharing an image, you
> > > > still want to prevent running the same group a second time - and a
> > > > shared lock wouldn't do that, because each group would be requesting
> > > > shared locks.
> > > > 
> > > > So the distinction read/write makes more sense.  Can anyone think of a
> > > > situation where a shared lock on an image opened for writing is useful?
> > > 
> > > Isn't this what Richard has already done ? The patch implements 'shared'
> > > as a 'F_RDLCK' lock and 'exclusive' as 'F_WRLCK':
> > 
> > No, the question is whether it makes sense to provide a 'shared'
> > option on the command line, or simply to always map:
> > 
> >      image opened read only => F_RDLCK
> >      image opened writable => F_WRLCK
> > 
> > and provide only a single command line option: 'lock'.
> 
> That doesn't work in the case of setting up a clustered filesystem
> shared between guests. That requires that the disk be opened writable,
> but with a shared (F_RDLOCK) lock.

I think Jamie's point is that you might as well use no locking at all
in this configuration.  It's hard to see what lock=shared is
protecting you against.

Rich.
Jamie Lokier Dec. 7, 2009, 11:35 a.m. UTC | #21
Richard W.M. Jones wrote:
> On Sat, Dec 05, 2009 at 11:47:11AM -0600, Anthony Liguori wrote:
> > I think I made my point poorly.  Consider the following:
> >
> > qemu-img create -f raw base.img 10G
> > qemu-img create -f qcow2 -b base.img cow1.img
> > qemu-img create -f qcow2 -b base.img cow2.img
> >
> > qemu -drive file=cow1.img,lock=exclusive
> > qemu -drive file=cow2.img,lock=exclusive
> >
> > With the current patch, the second command will fail and it's impossible  
> > to invoke correctly.  That's because flags are passed down to backing  
> > devices directly.
> 
> Isn't it correct that the second command fails?  Although the base
> image is mostly used read-only, if the user does the "commit" command
> in the monitor then the backing image is overwritten.  (That's my
> understanding of the documentation anyway - I've never actually used
> this feature of qcow2).
> 
> Also if we only acquire the lock during the commit operation then
> we'll end up with disk corruption.  So we have to acquire the
> exclusive lock the whole time, for the image and the backing store.

It's ok to acquire the shared lock for the backing store, and then try
to upgrade it to exclusive when you do 'commit', then back to shared
afterwards.

Rather than disk corruption, what'll happen with both running is
'commit' will fail to get an exclusive lock because of the other
instance already having a shared lock, so you have to allow for the
possibility of 'commit' failing.  There won't be corruption anyway.

It's advisable to have an option to request exclusive lock for the
backing file at startup, so that you can commit to 'commit' succeeding.
Especially with -snapshot.

With -snapshot, often you mustn't take an exclusive lock on the
*grandparent* backing file, so the option to do exclusive lock up the
chain should not go all the way back.

-- Jamie
Jamie Lokier Dec. 7, 2009, 11:38 a.m. UTC | #22
Richard W.M. Jones wrote:
> > That doesn't work in the case of setting up a clustered filesystem
> > shared between guests. That requires that the disk be opened writable,
> > but with a shared (F_RDLOCK) lock.
> 
> I think Jamie's point is that you might as well use no locking at all
> in this configuration.  It's hard to see what lock=shared is
> protecting you against.

Exactly.  Shared locks are only useful if you have an exclusive locker
to protect against, and in the case of a shared-disk cluster, there
isn't one.  Except, perhaps, some admin tools when the cluster is down.

That said, with shared partitioned disk images, it could be useful to
exclusively lock the individual partitions.  I think by then we're
involving a tool outside qemu, so it can take the lock itself.

That's a reason why qemu's F_WRLCK should cover the whole range of
file offsets, not just, say, the first byte.

-- Jamie
Daniel P. Berrangé Dec. 7, 2009, 11:49 a.m. UTC | #23
On Mon, Dec 07, 2009 at 11:31:47AM +0000, Richard W.M. Jones wrote:
> On Mon, Dec 07, 2009 at 11:30:14AM +0000, Daniel P. Berrange wrote:
> > On Mon, Dec 07, 2009 at 11:19:54AM +0000, Jamie Lokier wrote:
> > > 
> > > No, the question is whether it makes sense to provide a 'shared'
> > > option on the command line, or simply to always map:
> > > 
> > >      image opened read only => F_RDLCK
> > >      image opened writable => F_WRLCK
> > > 
> > > and provide only a single command line option: 'lock'.
> > 
> > That doesn't work in the case of setting up a clustered filesystem
> > shared between guests. That requires that the disk be opened writable,
> > but with a shared (F_RDLOCK) lock.
> 
> I think Jamie's point is that you might as well use no locking at all
> in this configuration.  It's hard to see what lock=shared is
> protecting you against.

This is saying that there is no need to protect 'shared writers'
from 'exclusive writers', which is not true.

Consider you have a cluster of VMs managed by libvirt with a shared
writable disk running the GFS filesystem, and you've not done any
locking on the disk file.

Now an admin comes along with libguestfs and attempts to access
the disk containing the GFS volume. libguestfs isn't part of the 
cluster but that doesn't matter because you can happily access a GFS
filesystem in standalone mode provided it is not in use by any other
nodes. 

We need to stop libguestfs opening the disk in exclusive-write mode
if other QEMU VMs are using it in shared-write mode.

If the QEMU VMs with the shared-writable disks have been using F_RDLOCK,
then this would have prevented libguestfs opening the disk for write with
F_WRLOCK, since the F_RDLOCK blocks all F_WRLOCK attempts.

We really do have 3 combinations of locking / access mode here

 - read-only  + F_RDLOCK
 - read-write + F_RDLOCK
 - read-write + F_WRLOCK

So a single 'lock' flag is not sufficient, we need the shared/exclusive
semantics of the original patch.

Daniel
Kevin Wolf Dec. 7, 2009, 11:51 a.m. UTC | #24
On 07.12.2009 12:28, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 07.12.2009 11:31, Jamie Lokier wrote:
>>> So the distinction read/write makes more sense.  Can anyone think of a
>>> situation where a shared lock on an image opened for writing is useful?
>>
>> I think there are people using shared writable images with cluster file
>> systems.
> 
> Yes, they are.  Please read the following again:
> 
>>> Sometimes shared access to a raw image (partitioned or whole disk
>>> filesystem) is ok, and sometimes it is not ok.  Only the user knows
>>> the difference, because only the user knows if the guests they are
>>> running use distinct partitions in the same raw image, or cooperative
>>> access to a shared image.
>>>
>>> But does it make sense to request a shared lock in that case?  Not
>>> really.  If you have a group of guests correctly sharing an image, you
>>> still want to prevent running the same group a second time - and a
>>> shared lock wouldn't do that, because each group would be requesting
>>> shared locks.
> 
> If you run each guest in the disk-sharing cluster with 'lock=shared',
> it reflects that they are sharing - but that doesn't actually prevent
> mistakes, does it?  If you run any of those guests a second time, it
> won't prevent that.  So what's the point in the shared lock?

Sorry, I apparently misunderstood.  I was thinking about shared vs.
exclusive whereas you are talking about shared vs. none (vs. something
completely different). I think you're right there.

Kevin
Richard W.M. Jones Dec. 7, 2009, 11:59 a.m. UTC | #25
On Mon, Dec 07, 2009 at 11:49:32AM +0000, Daniel P. Berrange wrote:
> On Mon, Dec 07, 2009 at 11:31:47AM +0000, Richard W.M. Jones wrote:
> > On Mon, Dec 07, 2009 at 11:30:14AM +0000, Daniel P. Berrange wrote:
> > > On Mon, Dec 07, 2009 at 11:19:54AM +0000, Jamie Lokier wrote:
> > > > 
> > > > No, the question is whether it makes sense to provide a 'shared'
> > > > option on the command line, or simply to always map:
> > > > 
> > > >      image opened read only => F_RDLCK
> > > >      image opened writable => F_WRLCK
> > > > 
> > > > and provide only a single command line option: 'lock'.
> > > 
> > > That doesn't work in the case of setting up a clustered filesystem
> > > shared between guests. That requires that the disk be opened writable,
> > > but with a shared (F_RDLOCK) lock.
> > 
> > I think Jamie's point is that you might as well use no locking at all
> > in this configuration.  It's hard to see what lock=shared is
> > protecting you against.
> 
> This is saying that there is no need to protect 'shared writers'
> from 'exclusive writers', which is not true.
[example snipped]

OK so the case for lock=shared is where you have a shared clustered
resource, *and* an uber-admin-tool which may require exclusive access
to the resource.  I now understand, and that seems to make sense
(albeit a rather rare case).

It has to be said that the only reason I implemented the shared mode in
the original patch was that both POSIX and Win32 (LockFileEx) have the
distinction between shared and exclusive; in other words, 'coz it was
easy :-)

Rich.
Daniel P. Berrangé Dec. 7, 2009, 12:06 p.m. UTC | #26
On Mon, Dec 07, 2009 at 11:28:34AM +0000, Jamie Lokier wrote:
> Kevin Wolf wrote:
> > On 07.12.2009 11:31, Jamie Lokier wrote:
> > > So the distinction read/write makes more sense.  Can anyone think of a
> > > situation where a shared lock on an image opened for writing is useful?
> > 
> > I think there are people using shared writable images with cluster file
> > systems.
> 
> Yes, they are.  Please read the following again:
> 
> > > Sometimes shared access to a raw image (partitioned or whole disk
> > > filesystem) is ok, and sometimes it is not ok.  Only the user knows
> > > the difference, because only the user knows if the guests they are
> > > running use distinct partitions in the same raw image, or cooperative
> > > access to a shared image.
> > >
> > > But does it make sense to request a shared lock in that case?  Not
> > > really.  If you have a group of guests correctly sharing an image, you
> > > still want to prevent running the same group a second time - and a
> > > shared lock wouldn't do that, because each group would be requesting
> > > shared locks.
> 
> If you run each guest in the disk-sharing cluster with 'lock=shared',
> it reflects that they are sharing - but that doesn't actually prevent
> mistakes, does it?  If you run any of those guests a second time, it
> won't prevent that.  So what's the point in the shared lock?

In the clustered scenario, your guest will likely have two disks: its
exclusive root disk and the shared cluster disk. The lock on the former
will ensure you can't run the guest itself twice. The lock on the latter
will prevent other apps thinking they've got exclusive access to the
cluster disk.

Regards,
Daniel
Anthony Liguori Dec. 7, 2009, 1:32 p.m. UTC | #27
Chris Webb wrote:
> Hi. There's a connected discussion on the sheepdog list about locking, and I
> have a patch there which could complement this one quite well.
>
> Sheepdog is a distributed, replicated block store being developed
> (primarily) for Qemu. Images have a mandatory exclusive locking requirement,
> enforced by the cluster manager. Without this, the replication scheme
> breaks down and you can end up with inconsistent copies of the block
> image.
>
> The initial release of sheepdog took these locks in the block driver
> bdrv_open() and bdrv_close() hooks. They also added a bdrv_closeall() and
> ensured it was called in all the usual qemu exit paths to avoid stray locks.
> (The rarer case of crashing hosts or crashing qemus will have to be handled
> externally, and is 'to do'.)
>
> The problem was that this prevented live migration, because both ends wanted
> to open the image at once, even though only one would be using it at a time.
>   
Yeah, this is a bigger problem I think.  Technically speaking, when 
using NFS as the backing filesystem, we really should not open the 
destination end before we close the source end to keep the caches fully 
coherent.

I've resisted this because I'm concerned that if we delay the opening of 
the file on the destination, it could fail.  That's a very late failure 
and that makes me uncomfortable as just a work around for NFS.

But considering this locking situation, I think it is not a bad idea now.

Regards,

Anthony Liguori
Chris Webb Dec. 7, 2009, 1:38 p.m. UTC | #28
Anthony Liguori <anthony@codemonkey.ws> writes:

> I've resisted this because I'm concerned that if we delay the
> opening of the file on the destination, it could fail.  That's a
> very late failure and that makes me uncomfortable as just a work
> around for NFS.

I don't know much about NFS's semantics with cache-coherency, but I take it
there isn't some sort of synchronisation operation that works on a file
descriptor and could be done instead as the vm starts and stops to avoid
having to delay the open itself? I agree with you that failing this late
could make for some rather nasty behaviour compared to the current version.

Cheers,

Chris.
Anthony Liguori Dec. 7, 2009, 1:39 p.m. UTC | #29
Richard W.M. Jones wrote:
> On Sat, Dec 05, 2009 at 11:47:11AM -0600, Anthony Liguori wrote:
>   
>> I think I made my point poorly.  Consider the following:
>>
>> qemu-img create -f raw base.img 10G
>> qemu-img create -f qcow2 -b base.img cow1.img
>> qemu-img create -f qcow2 -b base.img cow2.img
>>
>> qemu -drive file=cow1.img,lock=exclusive
>> qemu -drive file=cow2.img,lock=exclusive
>>
>> With the current patch, the second command will fail and it's impossible  
>> to invoke correctly.  That's because flags are passed down to backing  
>> devices directly.
>>     
>
> Isn't it correct that the second command fails?  Although the base
> image is mostly used read-only, if the user does the "commit" command
> in the monitor then the backing image is overwritten.  (That's my
> understanding of the documentation anyway - I've never actually used
> this feature of qcow2).
>   

That's a use-case that needs to be thought through.  Maybe commit should 
attempt to force the lower level to acquire a write lock first?

The above use-case is extremely common so it's something that's worth 
supporting.

> Also if we only acquire the lock during the commit operation then
> we'll end up with disk corruption.

Why do we end up with disk corruption?

If someone else has the backing file open with a read lock, we won't be 
able to acquire the write lock.  Once we've acquired the write lock, no 
one can acquire the read lock.  Once we've released the write lock 
(after an fsync of  course), we won't touch it again.

Regards,

Anthony Liguori
Anthony Liguori Dec. 7, 2009, 1:43 p.m. UTC | #30
Daniel P. Berrange wrote:
> That doesn't work in the case of setting up a clustered filesystem
> shared between guests. That requires that the disk be opened writable,
> but with a shared (F_RDLOCK) lock.
>   

If you'd like data corruption :-)

qemu -drive file=/dev/hda,lock=shared   # clustered file system
qemu -drive file=/dev/hda,lock=shared   # clustered file system

qemu -drive file=/dev/hda,lock=shared -snapshot  # i have no awareness 
of the last two.

The 3rd invocation is totally fine.  /dev/hda is shared (but the 
expectation is read-only).

Better to stick with on/off.  That gives much easier to understand 
semantics.

Regards,

Anthony Liguori
Anthony Liguori Dec. 7, 2009, 1:47 p.m. UTC | #31
Chris Webb wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
>
>   
>> I've resisted this because I'm concerned that if we delay the
>> opening of the file on the destination, it could fail.  That's a
>> very late failure and that makes me uncomfortable as just a work
>> around for NFS.
>>     
>
> I don't know much about NFS's semantics with cache-coherency, but I take it
> there isn't some sort of synchronisation operation that works on a file
> descriptor and could be done instead as the vm starts and stops to avoid
> having to delay the open itself? I agree with you that failing this late
> could make for some rather nasty behaviour compared to the current version.
>   

I think this is a case of spec vs. in practice.  By the NFS spec, 
coherence is only guaranteed close-to-open.  Let me poke around 
and see if I can find anyone with better info.

Regards,

Anthony Liguori
> Cheers,
>
> Chris.
>
Daniel P. Berrangé Dec. 7, 2009, 2:01 p.m. UTC | #32
On Mon, Dec 07, 2009 at 07:43:53AM -0600, Anthony Liguori wrote:
> Daniel P. Berrange wrote:
> >That doesn't work in the case of setting up a clustered filesystem
> >shared between guests. That requires that the disk be opened writable,
> >but with a shared (F_RDLOCK) lock.
> >  
> 
> If you'd like data corruption :-)
> 
> qemu -drive file=/dev/hda,lock=shared   # clustered file system
> qemu -drive file=/dev/hda,lock=shared   # clustered file system
> 
> qemu -drive file=/dev/hda,lock=shared -snapshot  # i have no awareness 
> of the last two.
> 
> The 3rd invocation is totally fine.  /dev/hda is shared (but the 
> expectation is read-only).

> Better to stick with on/off.  That gives much easier to understand 
> semantics.

The 3rd invocation is not changing the backing file, so it is not causing
data corruption on the master file. Sure the 3rd OS instance will probably
crash & get confused because its backing file is changing under its feet
but as long as 'commit' isn't used, it is still safe. 'commit' should 
require that the disk have been locked in exclusive  mode from the time
it started. It is not safe to merely let it upgrade the lock from
shared to exclusive at time of committing because of this exact scenario.

Daniel
Richard W.M. Jones Dec. 7, 2009, 2:08 p.m. UTC | #33
On Mon, Dec 07, 2009 at 07:39:11AM -0600, Anthony Liguori wrote:
> Richard W.M. Jones wrote:
>> Also if we only acquire the lock during the commit operation then
>> we'll end up with disk corruption.
>
> Why do we end up with disk corruption?

Forget about locking for a minute, I don't think this is safe
currently.  If you have two VMs set up like:

  qemu-img create -b backing.img foo.img
  qemu-img create -b backing.img bar.img

  qemu -drive file=foo.img     # VM1
  qemu -drive file=bar.img     # VM2

If VM1 does a commit to the backing image, then VM2 may be caching (in
its kernel memory) bits of the old backing image, and will
subsequently fetch bits of the new backing image, so it'll see a
mixture of old and new data.  How is VM2 supposed to cope with this?
It sounds like massive disk corruption to me ...

Rich.
Anthony Liguori Dec. 7, 2009, 2:15 p.m. UTC | #34
Daniel P. Berrange wrote:
>> Better to stick with on/off.  That gives much easier to understand 
>> semantics.
>>     
>
> The 3rd invocation is not changing the backing file, so it is not causing
> data corruption on the master file. Sure the 3rd OS instance will probably
> crash & get confused because its backing file is changing under its feet
> but as long as 'commit' isn't used, it is still safe.

The 3rd instance is getting data corruption.  The corruption is 
transient because we're using -snapshot but if you change the scenario 
to use a persistent cow overlay, then the corruption is permanent.

The semantics of shared/exclusive are non-obvious to anything but 
someone who understands the implementation.

>  'commit' should 
> require that the disk have been locked in exclusive  mode from the time
> it started. It is not safe to merely let it upgrade the lock from
> shared to exclusive at time of committing because of this exact scenario.
>   
I don't understand why it isn't safe to temporarily acquire a write lock 
with commit.  Can you give an example of how it would cause corruption?

Regards,

Anthony Liguori

> Daniel
>
Anthony Liguori Dec. 7, 2009, 2:22 p.m. UTC | #35
Richard W.M. Jones wrote:
> On Mon, Dec 07, 2009 at 07:39:11AM -0600, Anthony Liguori wrote:
>   
>> Richard W.M. Jones wrote:
>>     
>>> Also if we only acquire the lock during the commit operation then
>>> we'll end up with disk corruption.
>>>       
>> Why do we end up with disk corruption?
>>     
>
> Forget about locking for a minute, I don't think this is safe
> currently.  If you have two VMs set up like:
>
>   qemu-img create -b backing.img foo.img
>   qemu-img create -b backing.img bar.img
>
>   qemu -drive file=foo.img     # VM1
>   qemu -drive file=bar.img     # VM2
>
> If VM1 does a commit to the backing image, then VM2 may be caching (in
> its kernel memory) bits of the old backing image, and will
> subsequently fetch bits of the new backing image, so it'll see a
> mixture of old and new data.  How is VM2 supposed to cope with this?
> It sounds like massive disk corruption to me ...
>   

Yes, this will cause corruption.  Implementing locking in the fashion 
I've previously described will prevent 'commit' from being run (since 
you can't upgrade the lock since someone else is holding a read-lock).

Regards,

Anthony Liguori
Daniel P. Berrangé Dec. 7, 2009, 2:25 p.m. UTC | #36
On Mon, Dec 07, 2009 at 07:32:16AM -0600, Anthony Liguori wrote:
> Chris Webb wrote:
> >Hi. There's a connected discussion on the sheepdog list about locking, and 
> >I
> >have a patch there which could complement this one quite well.
> >
> >Sheepdog is a distributed, replicated block store being developed
> >(primarily) for Qemu. Images have a mandatory exclusive locking 
> >requirement,
> >enforced by the cluster manager. Without this, the replication scheme
> >breaks down and you can end up with inconsistent copies of the block
> >image.
> >
> >The initial release of sheepdog took these locks in the block driver
> >bdrv_open() and bdrv_close() hooks. They also added a bdrv_closeall() and
> >ensured it was called in all the usual qemu exit paths to avoid stray 
> >locks.
> >(The rarer case of crashing hosts or crashing qemus will have to be handled
> >externally, and is 'to do'.)
> >
> >The problem was that this prevented live migration, because both ends 
> >wanted
> >to open the image at once, even though only one would be using it at a 
> >time.
> >  
> Yeah, this is a bigger problem I think.  Technically speaking, when 
> using NFS as the backing filesystem, we really should not open the 
> destination end before we close the source end to keep the caches fully 
> coherent.
> 
> I've resisted this because I'm concerned that if we delay the opening of 
> the file on the destination, it could fail.  That's a very late failure 
> and that makes me uncomfortable as just a work around for NFS.

The only other alternative would be for the destination to open the disks,
but not immediately acquire the locks. In the final stage of migration have
the source release its lock & signal to the dest that it can now acquire
the lock. The assumption being that the lock acquisition is far less likely
to fail than the open(), so we focus on making sure we can properly handle
open() failure.

Daniel
Daniel P. Berrangé Dec. 7, 2009, 2:28 p.m. UTC | #37
On Mon, Dec 07, 2009 at 08:15:51AM -0600, Anthony Liguori wrote:
> Daniel P. Berrange wrote:
> >>Better to stick with on/off.  That gives much easier to understand 
> >>semantics.
> >>    
> >
> >The 3rd invocation is not changing the backing file, so it is not causing
> >data corruption on the master file. Sure the 3rd OS instance will probably
> >crash & get confused because its backing file is changing under its feet
> >but as long as 'commit' isn't used, it is still safe.
> 
> The 3rd instance is getting data corruption.  The corruption is 
> transient because we're using -snapshot but if you change the scenario 
> to use a persistent cow overlay, then the corruption is permanent.
> 
> The semantics of shared/exclusive are non-obvious to anything but 
> someone who understands the implementation.
> 
> > 'commit' should 
> >require that the disk have been locked in exclusive  mode from the time
> >it started. It is not safe to merely let it upgrade the lock from
> >shared to exclusive at time of committing because of this exact scenario.
> >  
> I don't understand why it isn't safe to temporarily acquire a write lock 
> with commit.  Can you give an example of how it would cause corruption?

It is safe if you assume that no one else has tried to modify the disk
since you opened it, otherwise you'd be committing changes against a
base state which no longer exists.  The cluster scenario you outlined
in the parent message was the one I was thinking of.

Regards,
Daniel
Richard W.M. Jones Dec. 7, 2009, 2:31 p.m. UTC | #38
On Mon, Dec 07, 2009 at 08:22:24AM -0600, Anthony Liguori wrote:
> Richard W.M. Jones wrote:
>> On Mon, Dec 07, 2009 at 07:39:11AM -0600, Anthony Liguori wrote:
>>   
>>> Richard W.M. Jones wrote:
>>>     
>>>> Also if we only acquire the lock during the commit operation then
>>>> we'll end up with disk corruption.
>>>>       
>>> Why do we end up with disk corruption?
>>>     
>>
>> Forget about locking for a minute, I don't think this is safe
>> currently.  If you have two VMs set up like:
>>
>>   qemu-img create -b backing.img foo.img
>>   qemu-img create -b backing.img bar.img
>>
>>   qemu -drive file=foo.img     # VM1
>>   qemu -drive file=bar.img     # VM2
>>
>> If VM1 does a commit to the backing image, then VM2 may be caching (in
>> its kernel memory) bits of the old backing image, and will
>> subsequently fetch bits of the new backing image, so it'll see a
>> mixture of old and new data.  How is VM2 supposed to cope with this?
>> It sounds like massive disk corruption to me ...
>>   
>
> Yes, this will cause corruption.  Implementing locking in the fashion  
> I've previously described will prevent 'commit' from being run (since  
> you can't upgrade the lock since someone else is holding a read-lock).

So to be clear, the use case is that all the other VMs must be shut
down, then the VM which wants to commit will upgrade its lock and
commit, and then all the other VMs will restart?  I agree this should
avoid corruption, although it sounds like something which is fairly
unlikely to be done in practice.

Rich.
Paolo Bonzini Dec. 7, 2009, 2:35 p.m. UTC | #39
> Now an admin comes along with libguestfs and attempts to access
> the disk containing the GFS volume. libguestfs isn't part of the
> cluster but that doesn't matter because you can happily access a GFS
> filesystem in standalone mode provided it is not in use by any other
> nodes.
>
> We need to stop libguestfs opening the disk in exclusive-write mode
> if other QEMU VMs are using it in shared-write mode.

I think it's simpler to use no locking in QEMU and let the cluster 
management tool do the locking.  The tool would be the one to invoke 
QEMU, of course.

> If QEMU with the shared-writable disks is been using F_RDLOCK, then
> this would have prevent libguestfs opening the disk for write with
> F_WRLOCK, since the F_RDLOCK blocks all F_WRLOCK attempts.
>
> We really do have 3 combinations of locking / access mode here
>
>   - read-only  + F_RDLOCK
>   - read-write + F_RDLOCK
>   - read-write + F_WRLOCK
>
> So a single 'lock' flag is not sufficient, we need the shared/exclusive
> semantics of the original patch.

I'd still prefer having a DWIM lock option that chooses the right kind 
of locking depending on what you are doing.

You could have lock={no,yes,shared,exclusive} as follows

image type          lock=no   yes        shared     exclusive
raw, opened RO      no        F_RDLOCK   F_RDLOCK   F_WRLOCK
raw, opened RW      no        F_WRLOCK   F_RDLOCK   F_WRLOCK
qcow2, opened RO    no        F_RDLOCK   F_RDLOCK   F_WRLOCK
  backing image     no        F_RDLOCK   F_RDLOCK   F_WRLOCK
qcow2, opened RW    no        F_WRLOCK   error      F_WRLOCK
  backing image     no        F_RDLOCK   error      F_WRLOCK

However, I think lock=shared would be the wrong thing to do in the 
cluster case.  It's better if the management tool locks the partitions 
for read-write instead, and this clearly does not belong in QEMU.  It 
would be a step in the wrong direction, and one that will be there 
forever in QEMU.

I think only lock={no,yes} is the best option to start with.
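
As a rough illustration, the lock=yes column in the table above could
map to fcntl lock types along these lines (dwim_lock_type() and its
arguments are made up for this sketch, they are not existing QEMU code):

#include <fcntl.h>

/* Sketch only: pick the fcntl lock type for lock=yes.  A backing file
 * is only ever read, even when the guest's image is writable, so a
 * shared lock is enough for it. */
static int dwim_lock_type(int writable, int is_backing_file)
{
    if (!writable || is_backing_file)
        return F_RDLCK;
    return F_WRLCK;
}

The exclusive column would force F_WRLCK everywhere; shared would force
F_RDLCK, except for the read-write qcow2 cases marked as errors.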

Paolo
Paolo Bonzini Dec. 7, 2009, 2:38 p.m. UTC | #40
On 12/07/2009 02:39 PM, Anthony Liguori wrote:
>
> That's a use-case that needs to be thought through.  Maybe commit should
> attempt to force the lower level to acquire a write lock first?

Yes.  This is an argument in favor of a "DWIM" lock option instead of
giving the shared/exclusive choice to the user: you don't say what kind
of operation you want to do, and commit can upgrade the lock in the future.

If the management tools want to force shared/exclusive, they can do the
locking themselves and tell QEMU not to lock.
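
As a rough sketch of what that external locking could look like
(lock_image_for_guest() is a made-up helper, not part of any existing
tool; it assumes the tool keeps the returned fd, and therefore the
fcntl lock, open for the lifetime of the guest it spawns with lock=none):

#include <fcntl.h>
#include <unistd.h>

static int lock_image_for_guest(const char *path, int writable)
{
    struct flock lk = {
        .l_type   = writable ? F_WRLCK : F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,              /* whole file */
    };
    int fd = open(path, writable ? O_RDWR : O_RDONLY);

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLK, &lk) == -1) {   /* image already in use */
        close(fd);
        return -1;
    }
    return fd;  /* caller spawns "qemu ...,lock=none" and keeps fd open */
}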

Paolo
Anthony Liguori Dec. 7, 2009, 2:53 p.m. UTC | #41
Daniel P. Berrange wrote:
> It is safe if you assume that no one else has tried to modify the disk
> since you opened it, otherwise you'd be committing changes against a
> base state which no longer exists.

1) first user opens cow1.qcow, acquires F_WRLCK
2) first user opens base.qcow, acquires F_RDLCK
3) second user opens cow2.qcow, acquires F_WRLCK
4) second user opens base.qcow, acquires F_RDLCK

5) second user attempts to commit cow2.qcow to base.qcow, so it calls
F_SETLK with F_WRLCK to upgrade its F_RDLCK on base.qcow to F_WRLCK
6) F_SETLK fails because another process is holding F_RDLCK, and an
error is printed to the user

Now assume that there is only one user and the commit succeeded; then
we'd drop back down to F_RDLCK (after an fsync), which means that
afterwards people could open up the image again.
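
A rough sketch of that upgrade/downgrade around commit might look like
this (set_backing_lock() is illustrative, not part of the posted patch):

#include <fcntl.h>

/* Convert the lock we hold on the backing file's whole range.
 * F_SETLK is non-blocking: if another process holds a conflicting
 * lock it fails (EAGAIN/EACCES) and our existing lock is kept. */
static int set_backing_lock(int fd, int type)
{
    struct flock lk = {
        .l_type   = type,           /* F_RDLCK or F_WRLCK */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,              /* whole file */
    };
    return fcntl(fd, F_SETLK, &lk);
}

/* commit path, in outline:
 *
 *   if (set_backing_lock(backing_fd, F_WRLCK) == -1)
 *       refuse the commit: someone else still has the image open
 *   write the COW clusters into the backing file and fsync() it
 *   set_backing_lock(backing_fd, F_RDLCK);    (drop back down)
 */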

Regards,

Anthony Liguori
Anthony Liguori Dec. 7, 2009, 2:55 p.m. UTC | #42
Richard W.M. Jones wrote:
> So to be clear, the use case is that all the other VMs must be shut
> down, then the VM which wants to commit will upgrade its lock and
> commit, and then all the other VMs will restart?  I agree this should
> avoid corruption, although it sounds like something which is fairly
> unlikely to be done in practice.
>   

Commit is an extremely uncommon use-case.  There's really no correct way
to use it.  That said, assuming people use it, here would be a
reasonable flow:

1) user starts up with base.img -> cow.img
2) user makes some changes, realizes they would be good for the golden
master, so issues a commit to base.img
3) user does not shut down this machine
4) user tries to start another VM with base.img -> cow2.img

This is a valid scenario, and temporarily promoting to a write lock
during commit allows it without introducing a window for data corruption.

Regards,

Anthony Liguori
Chris Webb Dec. 7, 2009, 2:58 p.m. UTC | #43
"Daniel P. Berrange" <berrange@redhat.com> writes:

> The only other alternative would be for the destination to open the disks,
> but not immediately acquire the locks. In the final stage of migration have
> the source release its lock & signal to the dest that it can now acquire
> the lock. The assumption being that the lock acquisition is far less likely
> to fail than the open(), so we focus on making sure we can properly handle
> open() failure.

This is what we've implemented as a prototype for sheepdog, with bdrv_claim
(happens on vm_start) and bdrv_release (happens on exit/vm_stop) hooks
available to block drivers.
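
For reference, a guess at how such hooks might be declared (the names
bdrv_claim and bdrv_release come from this message; the signatures are
an assumption, not the actual prototype):

typedef struct BlockDriverState BlockDriverState;

typedef struct BlockDriver {
    /* ... existing callbacks ... */
    int  (*bdrv_claim)(BlockDriverState *bs);    /* on vm_start: take the lock */
    void (*bdrv_release)(BlockDriverState *bs);  /* on exit/vm_stop: drop it */
} BlockDriver;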

Cheers,

Chris.
Kevin Wolf Dec. 8, 2009, 9:40 a.m. UTC | #44
Am 07.12.2009 15:53, schrieb Anthony Liguori:
> Daniel P. Berrange wrote:
>> It is safe if you assume that no one else has tried to modify the disk
>> since you opened it, otherwise you'd be committing changes against a
>> base state which no longer exists.
> 
> 1) first user opens cow1.qcow, acquires F_WRLCK
> 2) first user opens base.qcow, acquires F_RDLCK
> 3) second user opens cow2.qcow, acquires F_WRLCK
> 4) second user opens base.qcow, acquires F_RDLCK
> 
> 5) second user attempts to commit cow2.qcow to base.qcow

This is broken regardless of locking (except for very... special uses
that I don't even want to think about; I'd call them criminal).  Better
not to commit to a backing file that is used by other COW images.

Kevin
Kevin Wolf Dec. 8, 2009, 9:48 a.m. UTC | #45
Am 07.12.2009 15:31, schrieb Richard W.M. Jones:
> On Mon, Dec 07, 2009 at 08:22:24AM -0600, Anthony Liguori wrote:
>> Richard W.M. Jones wrote:
>>> On Mon, Dec 07, 2009 at 07:39:11AM -0600, Anthony Liguori wrote:
>>>   
>>>> Richard W.M. Jones wrote:
>>>>     
>>>>> Also if we only acquire the lock during the commit operation then
>>>>> we'll end up with disk corruption.
>>>>>       
>>>> Why do we end up with disk corruption?
>>>>     
>>>
>>> Forget about locking for a minute, I don't think this is safe
>>> currently.  If you have two VMs set up like:
>>>
>>>   qemu-img create -b backing.img foo.img
>>>   qemu-img create -b backing.img bar.img
>>>
>>>   qemu -drive file=foo.img     # VM1
>>>   qemu -drive file=bar.img     # VM2
>>>
>>> If VM1 does a commit to the backing image, then VM2 may be caching (in
>>> its kernel memory) bits of the old backing image, and will
>>> subsequently fetch bits of the new backing image, so it'll see a
>>> mixture of old and new data.  How is VM2 supposed to cope with this?
>>> It sounds like massive disk corruption to me ...
>>>   
>>
>> Yes, this will cause corruption.  Implementing locking in the fashion
>> I've previously described will prevent 'commit' from being run (since
>> you can't upgrade the lock while someone else is holding a read-lock).
> 
> So to be clear, the use case is that all the other VMs must be shut
> down, then the VM which wants to commit will upgrade its lock and
> commit, and then all the other VMs will restart?  I agree this should
> avoid corruption, although it sounds like something which is fairly
> unlikely to be done in practice.

I can't see how the file system of VM2 could possibly survive if VM1
commits its changes, even if VM2 (or both VMs) is shut down while
we're corrupting the base image.

Basically, you must not commit to a backing file unless your COW file is
the only user of that backing file.

Kevin
Richard W.M. Jones Dec. 8, 2009, 10:16 a.m. UTC | #46
On Tue, Dec 08, 2009 at 10:48:17AM +0100, Kevin Wolf wrote:
> Am 07.12.2009 15:31, schrieb Richard W.M. Jones:
> > So to be clear, the use case is that all the other VMs must be shut
> > down, then the VM which wants to commit will upgrade its lock and
> > commit, and then all the other VMs will restart?  I agree this should
> > avoid corruption, although it sounds like something which is fairly
> > unlikely to be done in practice.
> 
> I can't see how the file system of VM2 could possibly survive if VM1
> commits its changes. Even if VM2 or even both VMs are shut down while
> we're corrupting the base image.

Yes, I see, you're correct - even if the VMs are shut down, committing
to the backing file will result in corruption.

Rich.
diff mbox

Patch

diff --git a/block.c b/block.c
index 853f025..653ffeb 100644
--- a/block.c
+++ b/block.c
@@ -448,7 +448,7 @@  int bdrv_open2(BlockDriverState *bs, const char *filename, int flags,
     try_rw = !bs->read_only || bs->is_temporary;
     if (!(flags & BDRV_O_FILE))
         open_flags = (try_rw ? BDRV_O_RDWR : 0) |
-            (flags & (BDRV_O_CACHE_MASK|BDRV_O_NATIVE_AIO));
+            (flags & (BDRV_O_CACHE_MASK|BDRV_O_NATIVE_AIO|BDRV_O_LOCK_MASK));
     else
         open_flags = flags & ~(BDRV_O_FILE | BDRV_O_SNAPSHOT);
     if (use_bdrv_whitelist && !bdrv_is_whitelisted(drv))
diff --git a/block.h b/block.h
index 4a8b628..5458619 100644
--- a/block.h
+++ b/block.h
@@ -38,8 +38,11 @@  typedef struct QEMUSnapshotInfo {
 #define BDRV_O_NOCACHE     0x0020 /* do not use the host page cache */
 #define BDRV_O_CACHE_WB    0x0040 /* use write-back caching */
 #define BDRV_O_NATIVE_AIO  0x0080 /* use native AIO instead of the thread pool */
+#define BDRV_O_LOCK_SHARED 0x0100 /* fail unless we can lock shared */
+#define BDRV_O_LOCK_EXCLUSIVE 0x0200 /* fail unless we can lock exclusive */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB)
+#define BDRV_O_LOCK_MASK   (BDRV_O_LOCK_SHARED | BDRV_O_LOCK_EXCLUSIVE)
 
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1 << BDRV_SECTOR_BITS)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 5a6a22b..e6c141b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -133,6 +133,7 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
 {
     BDRVRawState *s = bs->opaque;
     int fd, ret;
+    struct flock lk;
 
     s->lseek_err_cnt = 0;
 
@@ -163,6 +164,18 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     s->fd = fd;
     s->aligned_buf = NULL;
 
+    if (bdrv_flags & BDRV_O_LOCK_MASK) {
+        if (bdrv_flags & BDRV_O_LOCK_SHARED)
+            lk.l_type = F_RDLCK;
+        else /* bdrv_flags & BDRV_O_LOCK_EXCLUSIVE */
+            lk.l_type = F_WRLCK;
+        lk.l_whence = SEEK_SET;
+        lk.l_start = 0;
+        lk.l_len = 0; /* means lock the whole file */
+        if (fcntl (fd, F_SETLK, &lk) == -1)
+            goto out_close;
+    }
+
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
         s->aligned_buf = qemu_blockalign(bs, ALIGNED_BUFFER_SIZE);
         if (s->aligned_buf == NULL) {
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 72acad5..9d0cfc7 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -78,6 +78,8 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int access_flags, create_flags;
     DWORD overlapped;
+    DWORD lock_flags;
+    OVERLAPPED ov;
 
     s->type = FTYPE_FILE;
 
@@ -106,6 +108,21 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
             return -EACCES;
         return -1;
     }
+
+    if (flags & BDRV_O_LOCK_MASK) {
+        lock_flags = LOCKFILE_FAIL_IMMEDIATELY;
+        if (flags & BDRV_O_LOCK_EXCLUSIVE)
+            lock_flags |= LOCKFILE_EXCLUSIVE_LOCK;
+
+        memset(&ov, 0, sizeof(ov));
+        ov.Offset = 0;
+        ov.OffsetHigh = 0;
+
+        if (!LockFileEx(s->hfile, lock_flags, 0, 1, 0, &ov))
+            /* For compatibility with the POSIX lock failure ... */
+            return -EAGAIN;
+    }
+
     return 0;
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index 92b5363..106ebcf 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -77,6 +77,10 @@  QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "lock",
+            .type = QEMU_OPT_STRING,
+            .help = "lock disk image (exclusive, shared, none)",
         },
         { /* end if list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 1b5781a..c43eefb 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -104,6 +104,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native]\n"
+    "       [,lock=exclusive|shared|none]\n"
     "                use 'file' as a drive image\n")
 DEF("set", HAS_ARG, QEMU_OPTION_set,
     "-set group.id.arg=value\n"
@@ -146,6 +147,12 @@  an untrusted format header.
 This option specifies the serial number to assign to the device.
 @item addr=@var{addr}
 Specify the controller's PCI address (if=virtio only).
+@item lock=@var{mode}
+Acquire a lock on the disk image (@var{file}).
+Available modes are: exclusive, shared, none.
+The default is "none", meaning we don't try to acquire a lock.  To
+avoid multiple virtual machines trying to write to a disk at the
+same time (which can cause disk corruption), use lock=exclusive.
 @end table
 
 By default, writethrough caching is used for all block device.  This means that
diff --git a/roms/seabios b/roms/seabios
index 42bc394..4d9d400 160000
--- a/roms/seabios
+++ b/roms/seabios
@@ -1 +1 @@ 
-Subproject commit 42bc3940d93911e382f5e72289f043d1faa9083e
+Subproject commit 4d9d400031417528d4fbe05f845f3759237e48d3
diff --git a/vl.c b/vl.c
index 96ab020..69e8cb7 100644
--- a/vl.c
+++ b/vl.c
@@ -2030,6 +2030,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     const char *devaddr;
     DriveInfo *dinfo;
     int snapshot = 0;
+    int lockmode = 0;
 
     *fatal_error = 1;
 
@@ -2220,6 +2221,19 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         }
     }
 
+    if ((buf = qemu_opt_get(opts, "lock")) != NULL) {
+        if (!strcmp(buf, "none"))
+            lockmode = 0;
+        else if (!strcmp(buf, "shared"))
+            lockmode = BDRV_O_LOCK_SHARED;
+        else if (!strcmp(buf, "exclusive"))
+            lockmode = BDRV_O_LOCK_EXCLUSIVE;
+        else {
+           fprintf(stderr, "qemu: invalid lock option\n");
+           return NULL;
+        }
+    }
+
     /* compute bus and unit according index */
 
     if (index != -1) {
@@ -2364,6 +2378,8 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
         (void)bdrv_set_read_only(dinfo->bdrv, 1);
     }
 
+    bdrv_flags |= lockmode;
+
     if (bdrv_open2(dinfo->bdrv, file, bdrv_flags, drv) < 0) {
         fprintf(stderr, "qemu: could not open disk image %s: %s\n",
                         file, strerror(errno));