Patchwork [2/2] Add flush=off parameter to -drive

login
register
mail settings
Submitter Alexander Graf
Date May 10, 2010, 9:51 p.m.
Message ID <1273528310-7051-3-git-send-email-agraf@suse.de>
Download mbox | patch
Permalink /patch/52238/
State New
Headers show

Comments

Alexander Graf - May 10, 2010, 9:51 p.m.
Usually the guest can tell the host to flush data to disk. In some cases we
don't want to flush though, but try to keep everything in cache.

So let's add a new parameter to -drive that allows us to set the flushing
behavior to "on" or "off", defaulting to enabling the guest to flush.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 block/raw-posix.c |   13 +++++++++++++
 qemu-config.c     |    3 +++
 qemu-options.hx   |    3 +++
 vl.c              |    3 +++
 4 files changed, 22 insertions(+), 0 deletions(-)
Kevin Wolf - May 11, 2010, 8:36 a.m.
Am 10.05.2010 23:51, schrieb Alexander Graf:
> Usually the guest can tell the host to flush data to disk. In some cases we
> don't want to flush though, but try to keep everything in cache.
> 
> So let's add a new parameter to -drive that allows us to set the flushing
> behavior to "on" or "off", defaulting to enabling the guest to flush.
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>

What about another cache=... value instead of adding more options? I'm
quite sure you'll only ever need this with writeback caching. So we
could have cache=none|writethrough|writeback|wb-noflush or something
like that.

> ---
>  block/raw-posix.c |   13 +++++++++++++

This is obviously wrong. If you want to introduce new behaviour to the
block layer, you must do it consistently and not just for one block
driver. So these changes should be made to the generic functions in
block.c instead.

Kevin
Christoph Hellwig - May 11, 2010, 10:55 a.m.
On Tue, May 11, 2010 at 10:36:24AM +0200, Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
> > Usually the guest can tell the host to flush data to disk. In some cases we
> > don't want to flush though, but try to keep everything in cache.
> > 
> > So let's add a new parameter to -drive that allows us to set the flushing
> > behavior to "on" or "off", defaulting to enabling the guest to flush.
> > 
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> 
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.

The cache option really isn't too useful.  There's a matrix of 3x2
possible I/O modes for the posix backend, and right now we only expose
two of them.  I think we really should expose all of them, split into
two options:

                                      caching

                    | normal             | O_DIRECT           |
            --------+--------------------+--------------------+
            none    | -                  | -                  |
integrity   O_DSYNC | cache=writeback    | -                  |
            fsync   | cache=writethrough | cache=none         |
            --------+--------------------+--------------------+

While we have to keep the cache= option for legacy purposes we're
much better to have two new options for the I/O mode and integrity
to express everything we need.  Especially given that I'm looking
into allowing the guest to tweak the volatile write cache setting
of the emulated disk, which will allow us to turn on/off the O_SYNC
on the flight.  In the current scheme that would introduce a
O_DSYNC|O_DIRECT mode that's not currently accesible through the
command line/monitor.

Markus was looking into separating the block device state in host/
guest portions lately, and splitting cache= would help with this.
Driver cache enabled or not (fsync means enabled, O_DSYNC disabled,
and none probably disabled given that we don't care), while O_DIRECT
vs normal cached I/O is host state.
Paul Brook - May 11, 2010, 12:15 p.m.
> > What about another cache=... value instead of adding more options? I'm
> > quite sure you'll only ever need this with writeback caching. So we
> > could have cache=none|writethrough|writeback|wb-noflush or something
> > like that.

I agree.
 
> The cache option really isn't too useful.  There's a matrix of 3x2
> possible I/O modes for the posix backend, and right now we only expose
> two of them.  I think we really should expose all of them, split into
> two options:
> 
>                                       caching
> 
>                     | normal             | O_DIRECT           |
> 
>             --------+--------------------+--------------------+
>             none    | -                  | -                  |
> integrity   O_DSYNC | cache=writeback    | -                  |
>             fsync   | cache=writethrough | cache=none         |
>             --------+--------------------+--------------------+

I think this table is misleading, and from a user perspective there are only 4 
interesting combinations:

cache=none:
  No host caching. Reads and writes both go directly to underlying storage. 
Useful to avoid double-caching.

cache=writethrough
  Reads are cached. Writes go directly to underlying storage.  Useful for 
broken guests that aren't aware of drive caches.

cache=writeback
  Reads and writes are cached. Guest flushes are honoured. Note that this may 
include a guest visible knob that allows the write cache to be disabled. If 
the guest requests the cache be disabled then this should act the same as 
cache=writethrough. Closest to how real hardware works.

cache=always (or a more scary name like cache=lie to defend against idiots)
  Reads and writes are cached. Guest flushes are ignored.  Useful for dumb 
guests in non-critical environments.

IMO the other boxes in your table (disable cache but require guest flushes) 
don't make any sense.

Paul
Anthony Liguori - May 11, 2010, 12:43 p.m.
On 05/11/2010 07:15 AM, Paul Brook wrote:
>>> What about another cache=... value instead of adding more options? I'm
>>> quite sure you'll only ever need this with writeback caching. So we
>>> could have cache=none|writethrough|writeback|wb-noflush or something
>>> like that.
>>>        
> I think this table is misleading, and from a user perspective there are only 4
> interesting combinations:
>    

I agree that splitting cache would be unfortunate for users.

> cache=none:
>    No host caching. Reads and writes both go directly to underlying storage.
> Useful to avoid double-caching.
>
> cache=writethrough
>    Reads are cached. Writes go directly to underlying storage.  Useful for
> broken guests that aren't aware of drive caches.
>
> cache=writeback
>    Reads and writes are cached. Guest flushes are honoured. Note that this may
> include a guest visible knob that allows the write cache to be disabled. If
> the guest requests the cache be disabled then this should act the same as
> cache=writethrough. Closest to how real hardware works.
>
> cache=always (or a more scary name like cache=lie to defend against idiots)
>    Reads and writes are cached. Guest flushes are ignored.  Useful for dumb
> guests in non-critical environments.
>    

I really don't believe that we should support a cache=lie.  There are 
many other obtain the same results.  For instance, mount your guest 
filesystem with barrier=0.

Introducing a mode where data integrity isn't preserved means that it's 
possible for someone to accidentally use it or use it when they really 
shouldn't.  People already struggle with what caching mode they should 
use and this would just make the problem worse.

Regards,

Anthony Liguori

> IMO the other boxes in your table (disable cache but require guest flushes)
> don't make any sense.
>
> Paul
>
>
Paul Brook - May 11, 2010, 1:12 p.m.
> > cache=always (or a more scary name like cache=lie to defend against
> > idiots)
> > 
> >    Reads and writes are cached. Guest flushes are ignored.  Useful for
> >    dumb guests in non-critical environments.
> 
> I really don't believe that we should support a cache=lie.  There are
> many other obtain the same results.  For instance, mount your guest
> filesystem with barrier=0.

Ideally yes. However in practice I suspect this is still a useful option. Is 
it even possible to disable barriers in all cases (e.g. NTFS under windows)?

In a production environment it's probably not so useful - you're generally 
dealing with long lived, custom configured guests.

In a development environment the rules can be a bit different. For example if 
you're testing an OS installer then you really don't want to be passing magic 
mount options. If the host machine dies then you don't care about the state of 
the guest because you're going to start from scratch anyway.

Paul
Anthony Liguori - May 11, 2010, 1:20 p.m.
On 05/11/2010 08:12 AM, Paul Brook wrote:
>>> cache=always (or a more scary name like cache=lie to defend against
>>> idiots)
>>>
>>>     Reads and writes are cached. Guest flushes are ignored.  Useful for
>>>     dumb guests in non-critical environments.
>>>        
>> I really don't believe that we should support a cache=lie.  There are
>> many other obtain the same results.  For instance, mount your guest
>> filesystem with barrier=0.
>>      
> Ideally yes. However in practice I suspect this is still a useful option. Is
> it even possible to disable barriers in all cases (e.g. NTFS under windows)?
>
> In a production environment it's probably not so useful - you're generally
> dealing with long lived, custom configured guests.
>
> In a development environment the rules can be a bit different. For example if
> you're testing an OS installer then you really don't want to be passing magic
> mount options. If the host machine dies then you don't care about the state of
> the guest because you're going to start from scratch anyway.
>    

Then create a mount point on your host and mount the host file system 
under that mount with barrier=0.

The problem with options added for developers is that those options are 
very often accidentally used for production.

Regards,

Anthony Liguori

> Paul
>
>
Paul Brook - May 11, 2010, 1:50 p.m.
> > In a development environment the rules can be a bit different. For
> > example if you're testing an OS installer then you really don't want to
> > be passing magic mount options. If the host machine dies then you don't
> > care about the state of the guest because you're going to start from
> > scratch anyway.
> 
> Then create a mount point on your host and mount the host file system
> under that mount with barrier=0.
> 
> The problem with options added for developers is that those options are
> very often accidentally used for production.

I disagree.  We should not be removing or rejecting features just because they 
allow you to shoot yourself in the foot.  We probably shouldn't be enabling 
them by default, but that's a whole different question.

Mounting a filesystem with barrier=0 is not a good answer because it's a 
global setting.  While my qemu VM may be disposable, it's unlikely that the 
same is true of the rest of my machine.  By your argument linux shouldn't be 
allowing me to do that in the first place because a dumb sysadmin could use 
that option on the filesystem containing the mail store. In fact with the 
average user that's *likely* to happen because they'll only have a single 
partition on their machine.

Paul

N.B. in this context "developers" refers to users who happen to be using qemu 
as part of a toolset for development of $whatever, not development of qemu 
itself. c.f. "production" where qemu is part of the IT infrastructure used to 
directly provide an external service.
Alexander Graf - May 11, 2010, 3:18 p.m.
Paul Brook wrote:
>>> What about another cache=... value instead of adding more options? I'm
>>> quite sure you'll only ever need this with writeback caching. So we
>>> could have cache=none|writethrough|writeback|wb-noflush or something
>>> like that.
>>>       
>
> I agree.
>  
>   
>> The cache option really isn't too useful.  There's a matrix of 3x2
>> possible I/O modes for the posix backend, and right now we only expose
>> two of them.  I think we really should expose all of them, split into
>> two options:
>>
>>                                       caching
>>
>>                     | normal             | O_DIRECT           |
>>
>>             --------+--------------------+--------------------+
>>             none    | -                  | -                  |
>> integrity   O_DSYNC | cache=writeback    | -                  |
>>             fsync   | cache=writethrough | cache=none         |
>>             --------+--------------------+--------------------+
>>     
>
> I think this table is misleading, and from a user perspective there are only 4 
> interesting combinations:
>
> cache=none:
>   No host caching. Reads and writes both go directly to underlying storage. 
> Useful to avoid double-caching.
>
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage.  Useful for 
> broken guests that aren't aware of drive caches.
>
> cache=writeback
>   Reads and writes are cached. Guest flushes are honoured. Note that this may 
> include a guest visible knob that allows the write cache to be disabled. If 
> the guest requests the cache be disabled then this should act the same as 
> cache=writethrough. Closest to how real hardware works.
>
> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored.  Useful for dumb 
> guests in non-critical environments.
>   

How about cache=insecure?

But I agree. Having that as a cache= option seems to be the easiest way
to expose it to a user.


Alex
Anthony Liguori - May 11, 2010, 3:40 p.m.
On 05/11/2010 08:50 AM, Paul Brook wrote:
>>> In a development environment the rules can be a bit different. For
>>> example if you're testing an OS installer then you really don't want to
>>> be passing magic mount options. If the host machine dies then you don't
>>> care about the state of the guest because you're going to start from
>>> scratch anyway.
>>>        
>> Then create a mount point on your host and mount the host file system
>> under that mount with barrier=0.
>>
>> The problem with options added for developers is that those options are
>> very often accidentally used for production.
>>      
> I disagree.  We should not be removing or rejecting features just because they
> allow you to shoot yourself in the foot.  We probably shouldn't be enabling
> them by default, but that's a whole different question.
>    

I disagree and think the mentality severely hurts usability.  QEMU's 
role should be to enable features, not to simplify every obscure 
feature.  In general, if someone wants to accomplish something, we 
should try to provide a mechanism to accomplish it.  
cache=none|writeback|writethrough is an example of this.  No one other 
than QEMU can control how we open a file descriptor so we need to 
provide a knob for it.

But if the goal is to make sure that fsync's don't result in data 
actually being on disk, there are many other ways to accomplish this.  
First, for the vast majority of users, this is already the case because 
ext3 defaults to disabling barriers.  Alex is running into this issue 
only because SuSE enables barriers by default for ext3 and fsync()'s are 
horribly slow on ext3.  The fact that this is measurable is purely a 
statement of ext3 suckage and a conscious decision by the SuSE team to 
live with that suckage.  It shouldn't be nearly as bad on ext4.

But even in this context, it's not a huge burden to disable barriers 
within the guest or to create a dedicated filesystem mount for your 
images on the host that disabled barriers on the mount.

On the other hand, the cost of adding another caching option is pretty 
significant.  Very few users really understand what caching options they 
should use under what circumstance.  Adding another option will just 
make that worse.  The fact that this new option can result in data 
corruption that won't occur under normal circumstances is troubling.

> Mounting a filesystem with barrier=0 is not a good answer because it's a
> global setting.  While my qemu VM may be disposable, it's unlikely that the
> same is true of the rest of my machine.

You can have multiple mounts.  In fact, you can just make a loopback 
mount within your existing file system which means that you don't even 
have to muck with your disk.

If your VM is disposable, then shouldn't you be using -snapshot?


>    By your argument linux shouldn't be
> allowing me to do that in the first place because a dumb sysadmin could use
> that option on the filesystem containing the mail store. In fact with the
> average user that's *likely* to happen because they'll only have a single
> partition on their machine.
>    

We aren't preventing sophisticated users from doing sophisticated 
things.  But why should we simplify something that most people don't 
need and if they accidentally use it, bad things will happen?

Regards,

Anthony Liguori
Paul Brook - May 11, 2010, 3:53 p.m.
> > I disagree.  We should not be removing or rejecting features just because
> > they allow you to shoot yourself in the foot.  We probably shouldn't be
> > enabling them by default, but that's a whole different question.
> 
> I disagree and think the mentality severely hurts usability.  QEMU's
> role should be to enable features, not to simplify every obscure
> feature.  In general, if someone wants to accomplish something, we
> should try to provide a mechanism to accomplish it.
> cache=none|writeback|writethrough is an example of this.  No one other
> than QEMU can control how we open a file descriptor so we need to
> provide a knob for it.

Doesn't the same argument apply to the existing cache=writethrough?
i.e. if you want to avoid data loss you should make sure your guest issues 
flushes properly, and it's not something qemu should be trying to hack round 
be adding an implicit flushe after every write.

Paul
Jamie Lokier - May 11, 2010, 4:32 p.m.
Anthony Liguori wrote:
> On 05/11/2010 08:12 AM, Paul Brook wrote:
> >>>cache=always (or a more scary name like cache=lie to defend against
> >>>idiots)
> >>>
> >>>    Reads and writes are cached. Guest flushes are ignored.  Useful for
> >>>    dumb guests in non-critical environments.
> >>>       
> >>I really don't believe that we should support a cache=lie.  There are
> >>many other obtain the same results.  For instance, mount your guest
> >>filesystem with barrier=0.
> >>     
> >Ideally yes. However in practice I suspect this is still a useful option. 
> >Is
> >it even possible to disable barriers in all cases (e.g. NTFS under 
> >windows)?
> >
> >In a production environment it's probably not so useful - you're generally
> >dealing with long lived, custom configured guests.
> >
> >In a development environment the rules can be a bit different. For example 
> >if
> >you're testing an OS installer then you really don't want to be passing 
> >magic
> >mount options. If the host machine dies then you don't care about the 
> >state of
> >the guest because you're going to start from scratch anyway.
> >   
> 
> Then create a mount point on your host and mount the host file system 
> under that mount with barrier=0.

Two reasons that advice doesn't work:

1. It doesn't work in many environments.  You can't mount a filesystem
with barrier=0 in one place and barrier=1 on a different point, and
there's ofen only one host partition.

2. barrier=0 does _not_ provide the cache=off behaviour.  It only
disables barriers; it does not prevent writing to the disk hardware.

If you are doing a transient OS install, ideally you want an amount
equal to your free RAM not written to disk until the end.  barrier=0
does not achieve that.

> The problem with options added for developers is that those options are 
> very often accidentally used for production.

We already have risky cache= options.  Also, do we call fdatasync
(with barrier) on _every_ write for guests which disable the
emulated disk cache?

-- Jamie
Anthony Liguori - May 11, 2010, 5:09 p.m.
On 05/11/2010 10:53 AM, Paul Brook wrote:
>>> I disagree.  We should not be removing or rejecting features just because
>>> they allow you to shoot yourself in the foot.  We probably shouldn't be
>>> enabling them by default, but that's a whole different question.
>>>        
>> I disagree and think the mentality severely hurts usability.  QEMU's
>> role should be to enable features, not to simplify every obscure
>> feature.  In general, if someone wants to accomplish something, we
>> should try to provide a mechanism to accomplish it.
>> cache=none|writeback|writethrough is an example of this.  No one other
>> than QEMU can control how we open a file descriptor so we need to
>> provide a knob for it.
>>      
> Doesn't the same argument apply to the existing cache=writethrough?
> i.e. if you want to avoid data loss you should make sure your guest issues
> flushes properly, and it's not something qemu should be trying to hack round
> be adding an implicit flushe after every write.
>    

cache is the host page cache acting as an extended disk cache.  In 
writethrough mode, the behavior is identical to writethrough on a normal 
disk cache in that all operations are completed only when sent down to 
the next storage layer.  In writeback mode, you rely on the integrity of 
the host page cache to prevent data loss.

Data loss is different than data corruption though.  In neither mode 
will a correct application experience data corruption because if a guest 
issues a flush, we respect those requests.  However, the new proposed 
mode would introduce the possibility of data corruption because it 
becomes possible for writes to be written to disk outside the order 
requested by the guest.  In the event of power loss, the result is 
corruption.

Regards,

Anthony Liguori

> Paul
>
Anthony Liguori - May 11, 2010, 5:15 p.m.
On 05/11/2010 11:32 AM, Jamie Lokier wrote:
> Anthony Liguori wrote:
>    
>> On 05/11/2010 08:12 AM, Paul Brook wrote:
>>      
>>>>> cache=always (or a more scary name like cache=lie to defend against
>>>>> idiots)
>>>>>
>>>>>     Reads and writes are cached. Guest flushes are ignored.  Useful for
>>>>>     dumb guests in non-critical environments.
>>>>>
>>>>>            
>>>> I really don't believe that we should support a cache=lie.  There are
>>>> many other obtain the same results.  For instance, mount your guest
>>>> filesystem with barrier=0.
>>>>
>>>>          
>>> Ideally yes. However in practice I suspect this is still a useful option.
>>> Is
>>> it even possible to disable barriers in all cases (e.g. NTFS under
>>> windows)?
>>>
>>> In a production environment it's probably not so useful - you're generally
>>> dealing with long lived, custom configured guests.
>>>
>>> In a development environment the rules can be a bit different. For example
>>> if
>>> you're testing an OS installer then you really don't want to be passing
>>> magic
>>> mount options. If the host machine dies then you don't care about the
>>> state of
>>> the guest because you're going to start from scratch anyway.
>>>
>>>        
>> Then create a mount point on your host and mount the host file system
>> under that mount with barrier=0.
>>      
> Two reasons that advice doesn't work:
>
> 1. It doesn't work in many environments.  You can't mount a filesystem
> with barrier=0 in one place and barrier=1 on a different point, and
> there's ofen only one host partition.
>    

qemu-img create -f raw foo.img 10G
mkfs.ext3 foo.img
mount -oloop,rw,barrier=1 -t ext3 foo.img mnt

Works perfectly fine.

> 2. barrier=0 does _not_ provide the cache=off behaviour.  It only
> disables barriers; it does not prevent writing to the disk hardware.
>    

The proposal has nothing to do with cache=off.

>> The problem with options added for developers is that those options are
>> very often accidentally used for production.
>>      
> We already have risky cache= options.  Also, do we call fdatasync
> (with barrier) on _every_ write for guests which disable the
> emulated disk cache?
>    

None of our cache= options should result in data corruption on power 
loss.  If they do, it's a bug.

Regards,

Anthony Liguori

> -- Jamie
>
Jamie Lokier - May 11, 2010, 6:13 p.m.
Anthony Liguori wrote:
> qemu-img create -f raw foo.img 10G
> mkfs.ext3 foo.img
> mount -oloop,rw,barrier=1 -t ext3 foo.img mnt
> 
> Works perfectly fine.

Hmm, interesting.  Didn't know loop propagated barriers.

So you're suggesting to use qemu with a loop device, and ext2 (bit
faster than ext3) and barrier=0 (well, that's implied if you use
ext2), and a raw image file on the ext2/3 filesystem, to provide the
effect of flush=off, becuase the loop device caches block writes on
the host, except for explicit barrier requests from the fs, which are
turned off?

That wasn't obvious the first time :-)

Does the loop device cache fs writes instead of propagating them
immediately to the underlying fs?  I guess it probably does.

Does the loop device allow the backing file to grow sparsely, to get
behavious like qcow2?

That's ugly but it might just work.

> >2. barrier=0 does _not_ provide the cache=off behaviour.  It only
> >disables barriers; it does not prevent writing to the disk hardware.
> 
> The proposal has nothing to do with cache=off.

Sorry, I meant flush=off (the proposal).  Mounting the host filesystem
(i.e. not using a loop device anywhere) with barrier=0 doesn't have
even close to the same effect.

> >>The problem with options added for developers is that those options are
> >>very often accidentally used for production.
> >>     
> >We already have risky cache= options.  Also, do we call fdatasync
> >(with barrier) on _every_ write for guests which disable the
> >emulated disk cache?
> 
> None of our cache= options should result in data corruption on power 
> loss.  If they do, it's a bug.

(I might have the details below a bit off.)

If cache=none uses O_DIRECT without calling fdatasync for guest
barriers, then it will get data corruption on power loss.

If cache=none does call fdatasync for guest barriers, then it might
still get corruption on power loss; I am not sure if recent Linux host
behaviour of O_DIRECT+fdatasync (with no buffered writes to commit)
issues the necessary barriers.  I am quite sure that older kernels did not.

cache=writethrough will get data corruption on power loss with older
Linux host kernels.  O_DSYNC did not issue barriers.  I'm not sure if
the behaviour of O_DSYNC that was recently changed is now issuing
barriers after every write.

Provided all the cache= options call fdatasync/fsync when the guest
issues a cache flush, and call fdatasync/fsync following _every_ write
when the guest has disabled the emulated write cache, that should be
as good as Qemu can reasonably do.  It's up to the host from there.

-- Jamie
Jamie Lokier - May 11, 2010, 6:20 p.m.
Paul Brook wrote:
> cache=none:
>   No host caching. Reads and writes both go directly to underlying storage. 
> Useful to avoid double-caching.
> 
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage.  Useful for 
> broken guests that aren't aware of drive caches.

These are misleading descriptions - because cache=none does not push
writes down to powerfail-safe storage, while cache=writethrough might.

> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored.  Useful for dumb 
> guests in non-critical environments.

cache=unsafe would tell it like it is.

Even non-idiots could be excused for getting the wrong impression from
cache=always.

-- Jamie
Avi Kivity - May 11, 2010, 7:04 p.m.
On 05/11/2010 11:36 AM, Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
>    
>> Usually the guest can tell the host to flush data to disk. In some cases we
>> don't want to flush though, but try to keep everything in cache.
>>
>> So let's add a new parameter to -drive that allows us to set the flushing
>> behavior to "on" or "off", defaulting to enabling the guest to flush.
>>
>> Signed-off-by: Alexander Graf<agraf@suse.de>
>>      
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.
>    

I implemented this once and promptly forgot about it, as cache=volatile.
Avi Kivity - May 11, 2010, 7:11 p.m.
On 05/11/2010 06:40 PM, Anthony Liguori wrote:
>
> But if the goal is to make sure that fsync's don't result in data 
> actually being on disk, there are many other ways to accomplish this.  
> First, for the vast majority of users, this is already the case 
> because ext3 defaults to disabling barriers.  Alex is running into 
> this issue only because SuSE enables barriers by default for ext3 and 
> fsync()'s are horribly slow on ext3.  The fact that this is measurable 
> is purely a statement of ext3 suckage and a conscious decision by the 
> SuSE team to live with that suckage.  It shouldn't be nearly as bad on 
> ext4.
>

There is a huge difference between disabling barriers and cache=volatile.

With barrier=0, data is still forced out of host pagecache and into disk 
cache; it's simply not forced from disk cache to the platter.  But since 
the disk cache is very limited, you'll still be running at disk speed.

With cache=volatile and enough memory you'll never wait for the disk to 
write data.

>
> On the other hand, the cost of adding another caching option is pretty 
> significant.  Very few users really understand what caching options 
> they should use under what circumstance.  Adding another option will 
> just make that worse.  The fact that this new option can result in 
> data corruption that won't occur under normal circumstances is troubling.

I don't think it's a real problem with proper naming, but we can always 
hide the option behind a ./configure --enable-developer-options.

>
>> Mounting a filesystem with barrier=0 is not a good answer because it's a
>> global setting.  While my qemu VM may be disposable, it's unlikely 
>> that the
>> same is true of the rest of my machine.
>
> You can have multiple mounts.  In fact, you can just make a loopback 
> mount within your existing file system which means that you don't even 
> have to muck with your disk.
>
> If your VM is disposable, then shouldn't you be using -snapshot?

For my use case (autotest) the VM is not disposable (it's reused between 
tests) but I don't care about the data in case of a host crash.

>
>
>>    By your argument linux shouldn't be
>> allowing me to do that in the first place because a dumb sysadmin 
>> could use
>> that option on the filesystem containing the mail store. In fact with 
>> the
>> average user that's *likely* to happen because they'll only have a 
>> single
>> partition on their machine.
>
> We aren't preventing sophisticated users from doing sophisticated 
> things.  But why should we simplify something that most people don't 
> need and if they accidentally use it, bad things will happen?

We don't have a real alternative.
Paul Brook - May 11, 2010, 9:58 p.m.
> Paul Brook wrote:
> > cache=none:
> >   No host caching. Reads and writes both go directly to underlying
> >   storage.
> > 
> > Useful to avoid double-caching.
> > 
> > cache=writethrough
> > 
> >   Reads are cached. Writes go directly to underlying storage.  Useful for
> > broken guests that aren't aware of drive caches.
> 
> These are misleading descriptions - because cache=none does not push
> writes down to powerfail-safe storage, while cache=writethrough might.

If so, then this is a serious bug.

Paul
Paul Brook - May 11, 2010, 10:11 p.m.
> > Paul Brook wrote:
> > > cache=none:
> > >   No host caching. Reads and writes both go directly to underlying
> > >   storage.
> > > 
> > > Useful to avoid double-caching.
> > > 
> > > cache=writethrough
> > > 
> > >   Reads are cached. Writes go directly to underlying storage.  Useful
> > >   for
> > > 
> > > broken guests that aren't aware of drive caches.
> > 
> > These are misleading descriptions - because cache=none does not push
> > writes down to powerfail-safe storage, while cache=writethrough might.
> 
> If so, then this is a serious bug.

.. though it may be a kernel bug rather that a qemu bug, depending on the 
exact details.  Either way, I consider any mode that inhibits host filesystem 
write cache but not volatile drive cache to be pretty worthless.  Either we 
guaranteed data integrity on completion or we don't.

Paul
Paul Brook - May 11, 2010, 10:33 p.m.
> On 05/11/2010 10:53 AM, Paul Brook wrote:
> >>> I disagree.  We should not be removing or rejecting features just
> >>> because they allow you to shoot yourself in the foot.  We probably
> >>> shouldn't be enabling them by default, but that's a whole different
> >>> question.
> >> 
> >> I disagree and think the mentality severely hurts usability.  QEMU's
> >> role should be to enable features, not to simplify every obscure
> >> feature.  In general, if someone wants to accomplish something, we
> >> should try to provide a mechanism to accomplish it.
> >> cache=none|writeback|writethrough is an example of this.  No one other
> >> than QEMU can control how we open a file descriptor so we need to
> >> provide a knob for it.
> > 
> > Doesn't the same argument apply to the existing cache=writethrough?
> > i.e. if you want to avoid data loss you should make sure your guest
> > issues flushes properly, and it's not something qemu should be trying to
> > hack round be adding an implicit flushe after every write.
> 
> cache is the host page cache acting as an extended disk cache.  In
> writethrough mode, the behavior is identical to writethrough on a normal
> disk cache in that all operations are completed only when sent down to
> the next storage layer.

IMO this is a bug. Making host pagecache writethrough but still having a 
volatile writeback disk cache seems like a complete waste of time. I can see 
the advantage of disabling host pagecache (avoid double caching in host RAM), 
but having different levels of cache be writethrough/writeback seems extremely 
suspect.

It's also occurred to me that you're also basing your arguments on the 
assumption that host pagecache is volatile.  On a machine with a good UPS this 
is not true. In the even of external power failure the UPS will flush the host 
page cache and cleanly shut the machine down. As with battery-backed RAID 
cards, it's entirely reasonable to consider the cache to be non-volatile 
storage and ignore the flush requests.

If you don't trust your host OS in this situation then you're into a whole 
different level of pain, and raises obvious questions about the firmware 
running on your storage subsystem.

Paul
Jamie Lokier - May 12, 2010, 10:09 a.m.
Paul Brook wrote:
> > > Paul Brook wrote:
> > > > cache=none:
> > > >   No host caching. Reads and writes both go directly to underlying
> > > >   storage.
> > > > 
> > > > Useful to avoid double-caching.
> > > > 
> > > > cache=writethrough
> > > > 
> > > >   Reads are cached. Writes go directly to underlying storage.  Useful
> > > >   for
> > > > 
> > > > broken guests that aren't aware of drive caches.
> > > 
> > > These are misleading descriptions - because cache=none does not push
> > > writes down to powerfail-safe storage, while cache=writethrough might.
> > 
> > If so, then this is a serious bug.
> 
> .. though it may be a kernel bug rather that a qemu bug, depending on the 
> exact details.

It's not a kernel bug.  cache=none uses O_DIRECT, and O_DIRECT must
not force writes to powerfail-safe storage.  If it did, it would be
unusably slow for applications using O_DIRECT as a performance
enhancer / memory saver.  They can call fsync/fdatasync when they need
to for integrity.  (There might be kernel bugs in the latter department.)

> Either way, I consider any mode that inhibits host filesystem write
> cache but not volatile drive cache to be pretty worthless.

On the contrary, it greatly reduces host memory consumption so that
guest data isn't cached twice (it's already cached in the guest), and
it may improve performance by relaxing the POSIX write-serialisation
constraint (not sure if Linux cares; Solaris does).

> Either we guaranteed data integrity on completion or we don't.

The problem with the description of cache=none is it uses O_DIRECT,
which does always not push writes to powerfail-safe storage,.

O_DIRECT is effectively a hint.  It requests less caching in kernel
memory, may reduce memory usage and copying, may invoke direct DMA.

O_DIRECT does not tell the disk hardware to commit to powerfail-safe
storage.  I.e. it doesn't issue barriers or disable disk write caching.
(However, depending on a host setup, it might have that effect if disk
write cache is disabled by the admin).

Also, it doesn't even always write to disk: It falls back to buffered
in some circumstances, even on filesystems which support it - see
recent patches for btrfs which use buffered I/O for O_DIRECT for some
parts of some files.  (Many non-Linux OSes fall back to buffered
when any other process holds a non-O_DIRECT file descriptor, or when
requests don't meet some criteria).

The POSIX thing to use for cache=none would be O_DSYNC|O_RSYNC, and
that should work on some hosts, but Linux doesn't implement real O_RSYNC.

A combination which ought to work is O_DSYNC|O_DIRECT.  O_DIRECT is
the performance hint; O_DSYNC provides the commit request.  Christoph
Hellwig has mentioned that combination elsewhere on this thread.
It makes sense to me for cache=none.

O_DIRECT by itself is a useful performance & memory hint, so there
does need to be some option which maps onto O_DIRECT alone.  But it
shouldn't be documented as stronger than cache=writethrough, because
it isn't.

--  Jamie
Alexander Graf - May 12, 2010, 3:05 p.m.
Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
>   
>> Usually the guest can tell the host to flush data to disk. In some cases we
>> don't want to flush though, but try to keep everything in cache.
>>
>> So let's add a new parameter to -drive that allows us to set the flushing
>> behavior to "on" or "off", defaulting to enabling the guest to flush.
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>     
>
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.
>
>   

Yes, cache=volatile seems reasonable. Or cache=unsafe.

>> ---
>>  block/raw-posix.c |   13 +++++++++++++
>>     
>
> This is obviously wrong. If you want to introduce new behaviour to the
> block layer, you must do it consistently and not just for one block
> driver. So these changes should be made to the generic functions in
> block.c instead.
>   

How so? The callback functions are called using bdrv->drv->xxx. If I
modify that pointer, I end up affecting all other virtual disks as well.

Alex
Kevin Wolf - May 12, 2010, 3:36 p.m.
Am 12.05.2010 17:05, schrieb Alexander Graf:
> Kevin Wolf wrote:
>> Am 10.05.2010 23:51, schrieb Alexander Graf:
>>   
>>> Usually the guest can tell the host to flush data to disk. In some cases we
>>> don't want to flush though, but try to keep everything in cache.
>>>
>>> So let's add a new parameter to -drive that allows us to set the flushing
>>> behavior to "on" or "off", defaulting to enabling the guest to flush.
>>>
>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>     
>>
>> What about another cache=... value instead of adding more options? I'm
>> quite sure you'll only ever need this with writeback caching. So we
>> could have cache=none|writethrough|writeback|wb-noflush or something
>> like that.
>>
>>   
> 
> Yes, cache=volatile seems reasonable. Or cache=unsafe.
> 
>>> ---
>>>  block/raw-posix.c |   13 +++++++++++++
>>>     
>>
>> This is obviously wrong. If you want to introduce new behaviour to the
>> block layer, you must do it consistently and not just for one block
>> driver. So these changes should be made to the generic functions in
>> block.c instead.
>>   
> 
> How so? The callback functions are called using bdrv->drv->xxx. If I
> modify that pointer, I end up affecting all other virtual disks as well.

By doing the check in bdrv_flush/bdrv_aio_flush before this function
pointer is even called.

Kevin
Alexander Graf - May 12, 2010, 3:51 p.m.
Am 12.05.2010 um 17:36 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 12.05.2010 17:05, schrieb Alexander Graf:
>> Kevin Wolf wrote:
>>> Am 10.05.2010 23:51, schrieb Alexander Graf:
>>>
>>>> Usually the guest can tell the host to flush data to disk. In  
>>>> some cases we
>>>> don't want to flush though, but try to keep everything in cache.
>>>>
>>>> So let's add a new parameter to -drive that allows us to set the  
>>>> flushing
>>>> behavior to "on" or "off", defaulting to enabling the guest to  
>>>> flush.
>>>>
>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>
>>>
>>> What about another cache=... value instead of adding more options?  
>>> I'm
>>> quite sure you'll only ever need this with writeback caching. So we
>>> could have cache=none|writethrough|writeback|wb-noflush or something
>>> like that.
>>>
>>>
>>
>> Yes, cache=volatile seems reasonable. Or cache=unsafe.
>>
>>>> ---
>>>> block/raw-posix.c |   13 +++++++++++++
>>>>
>>>
>>> This is obviously wrong. If you want to introduce new behaviour to  
>>> the
>>> block layer, you must do it consistently and not just for one block
>>> driver. So these changes should be made to the generic functions in
>>> block.c instead.
>>>
>>
>> How so? The callback functions are called using bdrv->drv->xxx. If I
>> modify that pointer, I end up affecting all other virtual disks as  
>> well.
>
> By doing the check in bdrv_flush/bdrv_aio_flush before this function
> pointer is even called.

Sorry yeah, realized that 2 minutes after writing the mail myself :).

I put together an updated patch, but need to test it properly first.

Alex

>
Markus Armbruster - May 14, 2010, 9:16 a.m.
Christoph Hellwig <hch@lst.de> writes:

[...]
> Markus was looking into separating the block device state in host/
> guest portions lately, and splitting cache= would help with this.
> Driver cache enabled or not (fsync means enabled, O_DSYNC disabled,
> and none probably disabled given that we don't care), while O_DIRECT
> vs normal cached I/O is host state.

Let me provide a bit more context.  -drive is a happy mix of guest and
host parameters.  Some parameters are clearly host, some clearly guest,
and some are... complicated.  For instance, I'd argue that "readonly"
applies to both host and guest.

I'm working on turning all guest parameters proper qdev properties, and
collect the host parameters in a new -blockdev.  -drive stays for
compatibility and convenience (I'd like to deprecate if=none, though).

If you change -drive, please keep the coming separation in mind, and
avoid parameters that mix up guest and host.
Christoph Hellwig - May 17, 2010, 12:40 p.m.
On Tue, May 11, 2010 at 11:11:12PM +0100, Paul Brook wrote:
> .. though it may be a kernel bug rather that a qemu bug, depending on the 
> exact details.  Either way, I consider any mode that inhibits host filesystem 
> write cache but not volatile drive cache to be pretty worthless.  Either we 
> guaranteed data integrity on completion or we don't.

O_DIRECT just means bypassing the pagecache, it does not mean flushing
the disk cache on every access, which for certain workloads can be very
painful.  It also doesn't require a synchronous writeout of metadata
required to reach the data, e.g. in case when we have to allocate blocks
for a sparse image file.

To get the behaviour you want you need O_DIRECT|O_DSYNC, which is
something that is not exposed by qemu's current cache= suboption.

If we want to implement this properly we need to split the cache option,
as I already mentioned.  This would also have benefits in other areas,
but again refer to my previous mail for that.
Christoph Hellwig - May 17, 2010, 12:41 p.m.
On Fri, May 14, 2010 at 11:16:47AM +0200, Markus Armbruster wrote:
> If you change -drive, please keep the coming separation in mind, and
> avoid parameters that mix up guest and host.

This new option seems to be exactly that, unfortunately..
Alexander Graf - May 17, 2010, 12:42 p.m.
On 17.05.2010, at 14:41, Christoph Hellwig wrote:

> On Fri, May 14, 2010 at 11:16:47AM +0200, Markus Armbruster wrote:
>> If you change -drive, please keep the coming separation in mind, and
>> avoid parameters that mix up guest and host.
> 
> This new option seems to be exactly that, unfortunately..
> 

How so? It only affects the host. The guest never gets to see that flushing is disabled.


Alex

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7541ed2..2510b1b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -106,6 +106,7 @@  typedef struct BDRVRawState {
     int fd;
     int type;
     int open_flags;
+    int bdrv_flags;
 #if defined(__linux__)
     /* linux floppy specific */
     int64_t fd_open_time;
@@ -133,6 +134,7 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     BDRVRawState *s = bs->opaque;
     int fd, ret;
 
+    s->bdrv_flags = bdrv_flags;
     s->open_flags = open_flags | O_BINARY;
     s->open_flags &= ~O_ACCMODE;
     if (bdrv_flags & BDRV_O_RDWR) {
@@ -555,6 +557,11 @@  static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
     if (fd_open(bs) < 0)
         return NULL;
 
+    /* Don't flush? */
+    if (s->bdrv_flags & BDRV_O_NOFLUSH) {
+        return bdrv_aio_noop_em(bs, cb, opaque);
+    }
+
     return paio_submit(bs, s->fd, 0, NULL, 0, cb, opaque, QEMU_AIO_FLUSH);
 }
 
@@ -726,6 +733,12 @@  static int raw_create(const char *filename, QEMUOptionParameter *options)
 static void raw_flush(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
+
+    /* No flush means no flush */
+    if (s->bdrv_flags & BDRV_O_NOFLUSH) {
+        return;
+    }
+
     qemu_fdatasync(s->fd);
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index d500885..c358add 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -79,6 +79,9 @@  QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "flush",
+            .type = QEMU_OPT_BOOL,
         },
         { /* end if list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 12f6b51..69ae8de 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -120,6 +120,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native][,readonly=on|off]\n"
+    "       [,flush=on|off]\n"
     "                use 'file' as a drive image\n", QEMU_ARCH_ALL)
 STEXI
 @item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
@@ -151,6 +152,8 @@  These options have the same definition as they have in @option{-hdachs}.
 @var{cache} is "none", "writeback", or "writethrough" and controls how the host cache is used to access block data.
 @item aio=@var{aio}
 @var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
+@item flush=@var{flush}
+@var{flush} is "on" (default), or "off" and select whether the guest can trigger a host flush
 @item format=@var{format}
 Specify which disk @var{format} will be used rather than detecting
 the format.  Can be used to specifiy format=raw to avoid interpreting
diff --git a/vl.c b/vl.c
index 85bcc84..a7ca2c3 100644
--- a/vl.c
+++ b/vl.c
@@ -787,6 +787,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     int max_devs;
     int index;
     int ro = 0;
+    int flush = 1;
     int bdrv_flags = 0;
     int on_read_error, on_write_error;
     const char *devaddr;
@@ -819,6 +820,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
 
     snapshot = qemu_opt_get_bool(opts, "snapshot", 0);
     ro = qemu_opt_get_bool(opts, "readonly", 0);
+    flush = qemu_opt_get_bool(opts, "flush", 1);
 
     file = qemu_opt_get(opts, "file");
     serial = qemu_opt_get(opts, "serial");
@@ -1118,6 +1120,7 @@  DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     }
 
     bdrv_flags |= ro ? 0 : BDRV_O_RDWR;
+    bdrv_flags |= flush ? 0 : BDRV_O_NOFLUSH;
 
     if (bdrv_open(dinfo->bdrv, file, bdrv_flags, drv) < 0) {
         fprintf(stderr, "qemu: could not open disk image %s: %s\n",