Message ID: 20090831201651.GA4874@lst.de
State: Superseded
Christoph Hellwig wrote:
> We would have to claim one for cache=writeback to be safe, but for
> now I will follow Avi's opinion that it is a useless mode and should
> be our dedicated unsafe mode.  If anyone disagrees please start the
> flame thrower now and I will change it.  Otherwise a documentation
> patch will follow to explicitly document cache=writeback as unsafe.

*Opening flame thrower!*

Unsafe, useless?  It's the most useful mode when you're starting and
stopping VMs regularly, or if you can't use O_DIRECT.

It's safe if fdatasync is called - in other words, not advertising a
write cache is silly.

I haven't measured, but I'd expect it to be much faster than O_SYNC on
some host hardware, for the same reason that barriers + volatile write
cache are much faster on some host hardware than disabling the write
cache.

Right now, on a Linux host O_SYNC is unsafe with hardware that has a
volatile write cache.  That might not be changed, but if it is, then
performance with cache=writethrough will plummet (due to issuing a
CACHE FLUSH to the hardware after every write), while performance with
cache=writeback will be reasonable.

If an unsafe mode is desired (I think it is, for those throwaway
testing VMs, or during OS installs), I suggest adding cache=volatile:

    cache=none
        O_DIRECT, fdatasync, advertise volatile write cache

    cache=writethrough
        O_SYNC, do not advertise

    cache=writeback
        fdatasync, advertise volatile write cache

    cache=volatile
        nothing (perhaps fdatasync on QEMU blockdev close)

When using guest OSes which issue CACHE FLUSH commands (that's a guest
config issue), why would you ever use cache=writethrough?
cache=writeback should be faster and equally safe - provided you do
actually advertise the write cache!  So please do!  Not doing so is
silly.

-- Jamie
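To make the proposal concrete: the table above maps naturally onto
open(2) flags plus two booleans.  A minimal C sketch of that mapping
follows; the enum, struct, and function names are invented for
illustration and are not QEMU's actual option handling:

    /*
     * Sketch of the proposed cache modes as open(2) flags plus an
     * "advertise write cache" bit.  All names here are hypothetical.
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>

    enum cache_mode {
        CACHE_NONE,                 /* O_DIRECT, fdatasync, advertise */
        CACHE_WRITETHROUGH,         /* O_SYNC, do not advertise */
        CACHE_WRITEBACK,            /* fdatasync, advertise */
        CACHE_VOLATILE              /* nothing - the unsafe mode */
    };

    struct cache_policy {
        int open_flags;             /* extra flags OR'ed into open(2) */
        int flush_on_barrier;       /* fdatasync on guest CACHE FLUSH */
        int advertise_wcache;       /* tell guest a volatile cache exists */
    };

    static struct cache_policy policy_for(enum cache_mode m)
    {
        switch (m) {
        case CACHE_NONE:         return (struct cache_policy){ O_DIRECT, 1, 1 };
        case CACHE_WRITETHROUGH: return (struct cache_policy){ O_SYNC,   0, 0 };
        case CACHE_WRITEBACK:    return (struct cache_policy){ 0,        1, 1 };
        case CACHE_VOLATILE:     return (struct cache_policy){ 0,        0, 0 };
        }
        return (struct cache_policy){ O_SYNC, 0, 0 }; /* safest fallback */
    }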
On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> volatile write cache.  That might not be changed, but if it is, then
> performance with cache=writethrough will plummet (due to issuing a
> CACHE FLUSH to the hardware after every write), while performance
> with cache=writeback will be reasonable.

Currently all modes are more or less unsafe with volatile write
caches, at least when using ext3 or raw block device accesses.  XFS is
safe - two thirds due to doing the right thing and one third due to
sheer luck.

> If an unsafe mode is desired (I think it is, for those throwaway
> testing VMs, or during OS installs), I suggest adding cache=volatile:
>
>     cache=none
>         O_DIRECT, fdatasync, advertise volatile write cache
>
>     cache=writethrough
>         O_SYNC, do not advertise
>
>     cache=writeback
>         fdatasync, advertise volatile write cache
>
>     cache=volatile
>         nothing (perhaps fdatasync on QEMU blockdev close)

Fine with me, let the flame war begin :)

> When using guest OSes which issue CACHE FLUSH commands (that's a
> guest config issue), why would you ever use cache=writethrough?
> cache=writeback should be faster and equally safe - provided you do
> actually advertise the write cache!

And provided the guest OS actually issues cache flushes when it
should, something that at least Linux historically utterly failed at,
and some other operating systems haven't even tried.

cache=writethrough is the equivalent of turning off the volatile write
cache of the real disk.  It might be slower (which isn't even always
the case for real disks), but it is much safer.  E.g. if you want to
move your old SCO Unix box into a VM it's the only safe option.
Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > Right now, on a Linux host O_SYNC is unsafe with hardware that has
> > a volatile write cache. [...]
>
> Currently all modes are more or less unsafe with volatile write
> caches, at least when using ext3 or raw block device accesses.  XFS
> is safe - two thirds due to doing the right thing and one third due
> to sheer luck.

Right, but now you've made it worse.  By not calling fdatasync at all,
you've reduced the integrity.  Previously it would reach the drive's
cache, and take whatever (short) time it took to reach the platter.
Now you're leaving data in the host cache, where it can stay for much
longer and is vulnerable to host kernel crashes.

Oh, and QEMU could call whatever "hdparm -F" does when using raw block
devices ;-)

> > nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

Well, I'd like to start by pointing out that your patch introduces a
regression in the combination cache=writeback with emulated SCSI,
because it effectively removes the fdatasync calls in that case :-)
So please amend the patch before it gets applied, lest such silliness
be propagated.  Thanks :-)

I've actually been using cache=writeback with emulated IDE on deployed
server VMs, assuming that worked with KVM.  It's been an eye opener to
find that it was broken all along because the driver failed to set the
"has write cache" bit.  Thank you for the detective work.

It goes to show that no matter how hard we try, data integrity is a
slippery thing, where getting it wrong does not show up under normal
circumstances, only during catastrophic system failures.

Ironically, with emulated SCSI, I used cache=writethrough, thinking
guests would not issue CACHE FLUSH commands over SCSI because
historically performance has been reached by having overlapping writes
instead.

> > When using guest OSes which issue CACHE FLUSH commands (that's a
> > guest config issue), why would you ever use cache=writethrough?
> > cache=writeback should be faster and equally safe - provided you
> > do actually advertise the write cache!
>
> And provided the guest OS actually issues cache flushes when it
> should, something that at least Linux historically utterly failed
> at, and some other operating systems haven't even tried.

For hosts, yes - fsync/fdatasync/O_SYNC/O_DIRECT all utterly fail.
(Afaict, Windows hosts do it in some combinations.)  But I'd like to
think we're about to fix Linux hosts soon, thanks to your good work on
that elsewhere.

For guests, Linux has been good at issuing the necessary flushes for
ordinary journalling (in ext3's case, provided barrier=1 is given in
the mount options), which is quite important.  It failed with fsync,
which is also important to applications, but filesystem integrity is
the most important thing and it's been good at that for many years.

> cache=writethrough is the equivalent of turning off the volatile
> write cache of the real disk.  It might be slower (which isn't even
> always the case for real disks), but it is much safer.

When O_SYNC is made to flush the hardware cache on Linux hosts, it
will be excruciatingly slow: it'll have to seek twice for every write.
Once for the data, once for the inode update.  That's another reason
O_DSYNC is important.

> E.g. if you want to move your old SCO Unix box into a VM it's the
> only safe option.

I agree, and for that reason cache=writethrough or cache=none are the
only reasonable defaults.

By the way, all this has led me to another idea...  We may find that
O_SYNC is slower than batching several writes followed by one
fdatasync, whose completion allows several writes to report as
completed, when emulating SCSI or virtio-blk (anything which allows
overlapping write commands from the guest).

-- Jamie
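The batching idea at the end is worth a sketch.  Assuming a simple
queue of guest write requests (the names are hypothetical; this is not
QEMU's AIO layer), one fdatasync can complete a whole batch,
amortizing the flush that O_SYNC would pay on every single write:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_BATCH 64

    static long batch[MAX_BATCH];   /* ids of writes awaiting a flush */
    static int batch_len;

    static void flush_batch(int fd)
    {
        if (batch_len == 0)
            return;
        if (fdatasync(fd) == 0) {   /* one flush covers all queued writes */
            for (int i = 0; i < batch_len; i++)
                printf("write %ld durable, completing\n", batch[i]);
        }
        batch_len = 0;
    }

    static void guest_write(int fd, long id, const void *buf,
                            size_t len, off_t off)
    {
        if (batch_len == MAX_BATCH)
            flush_batch(fd);
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return;                 /* error handling elided in sketch */
        batch[batch_len++] = id;    /* completion deferred to the flush */
    }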
Christoph Hellwig wrote:
>> If an unsafe mode is desired (I think it is, for those throwaway
>> testing VMs, or during OS installs), I suggest adding cache=volatile:
>>
>>     cache=none
>>         O_DIRECT, fdatasync, advertise volatile write cache
>>
>>     cache=writethrough
>>         O_SYNC, do not advertise
>>
>>     cache=writeback
>>         fdatasync, advertise volatile write cache
>>
>>     cache=volatile
>>         nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

I think we should pity our poor users and avoid adding yet another
obscure option that is likely to be misunderstood.

Can someone do some benchmarking with cache=writeback and fdatasync
first and quantify what the real performance impact is?

I think the two reasonable options are:

1) make cache=writeback safe, avoiding a massive perf decrease in the
   process
2) keep cache=writeback as a no-guarantees option.

Regards,

Anthony Liguori
Anthony Liguori wrote:
> Christoph Hellwig wrote:
> >> If an unsafe mode is desired (I think it is, for those throwaway
> >> testing VMs, or during OS installs), I suggest adding
> >> cache=volatile: [...]
> >
> > Fine with me, let the flame war begin :)
>
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?
>
> I think the two reasonable options are 1) make cache=writeback safe,
> avoiding a massive perf decrease in the process 2) keep
> cache=writeback as a no-guarantees option.

Right now, cache=writeback does set the bit for SCSI emulation, which
makes it safe for guests which understand that.  Removing that is a
regression in safety, not merely a lack of change.

-- Jamie
On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

I can run benchmarks.  Any workloads that you are particularly looking
for?
Anthony Liguori wrote:
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

Unfortunately we can't yet quantify the impact on the hardware I care
about (ordinary consumer PCs with non-NCQ SATA disks), because Linux
hosts don't *yet* implement O_SYNC or fdatasync properly.

I would expect the performance difference to be much more significant
after those are implemented on the host.

-- Jamie
On Mon, Aug 31, 2009 at 11:46:45PM +0100, Jamie Lokier wrote:
> > Currently all modes are more or less unsafe with volatile write
> > caches, at least when using ext3 or raw block device accesses.
> > XFS is safe - two thirds due to doing the right thing and one
> > third due to sheer luck.
>
> Right, but now you've made it worse.  By not calling fdatasync at
> all, you've reduced the integrity.  Previously it would reach the
> drive's cache, and take whatever (short) time it took to reach the
> platter.  Now you're leaving data in the host cache, where it can
> stay for much longer and is vulnerable to host kernel crashes.

Your last comment is for cache=writeback, which in Avi's proposal that
I implemented would indeed lose any guarantees and be, for all
practical matters, unsafe.  It's not true for any of the other
options.

> Oh, and QEMU could call whatever "hdparm -F" does when using raw
> block devices ;-)

Actually, for ide/scsi, implementing cache control is on my todo list.
Not sure about virtio yet.

> Well, I'd like to start by pointing out that your patch introduces a
> regression in the combination cache=writeback with emulated SCSI,
> because it effectively removes the fdatasync calls in that case :-)

Yes, you already pointed this out above.

> It goes to show that no matter how hard we try, data integrity is a
> slippery thing, where getting it wrong does not show up under normal
> circumstances, only during catastrophic system failures.

Honestly, it should not.  Digging through all this was a bit of work,
but I was extremely surprised at how careless most people who touched
it before were.  It's not rocket science, and it can be tested quite
easily using various tools - qemu being the easiest nowadays, but
scsi_debug or an instrumented iscsi target would do the same thing.

> It failed with fsync, which is also important to applications, but
> filesystem integrity is the most important thing and it's been good
> at that for many years.

Users might disagree with that.  With my user hat on I couldn't care
less what state the internal metadata is in, as long as I get back the
data which the OS guaranteed had reached the disk after a successful
fsync/fdatasync/O_SYNC write.

> > E.g. if you want to move your old SCO Unix box into a VM it's the
> > only safe option.
>
> I agree, and for that reason cache=writethrough or cache=none are
> the only reasonable defaults.

Despite the extremely misleading name, cache=none is _NOT_ an
alternative, unless we make it open the image using O_DIRECT|O_SYNC.
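In code, the last point means a cache=none that is actually safe
against the drive cache would have to combine both flags, along the
lines of the following sketch (the function name is hypothetical;
note that O_DIRECT also imposes buffer alignment, handled here with
posix_memalign):

    /*
     * A genuinely uncached open: O_DIRECT bypasses the host page
     * cache, O_SYNC makes each write durable (once the kernel honours
     * that for the drive's volatile cache as well).
     */
    #define _GNU_SOURCE              /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int open_image_uncached(const char *path, void **buf, size_t buf_len)
    {
        int fd = open(path, O_RDWR | O_DIRECT | O_SYNC);
        if (fd < 0)
            return -1;

        /* O_DIRECT I/O needs buffers aligned to the logical block
           size; 512 bytes is assumed here, real code should query the
           device. */
        if (posix_memalign(buf, 512, buf_len) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }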
On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote:
> Anthony Liguori wrote:
> > Can someone do some benchmarking with cache=writeback and
> > fdatasync first and quantify what the real performance impact is?
>
> Unfortunately we can't yet quantify the impact on the hardware I
> care about (ordinary consumer PCs with non-NCQ SATA disks), because
> Linux hosts don't *yet* implement O_SYNC or fdatasync properly.

They do if you use XFS.
On Tue, Sep 01, 2009 at 01:06:46AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote:
> > Unfortunately we can't yet quantify the impact on the hardware I
> > care about (ordinary consumer PCs with non-NCQ SATA disks),
> > because Linux hosts don't *yet* implement O_SYNC or fdatasync
> > properly.
>
> They do if you use XFS.

And data=writeback + fdatasync also works when using reiserfs.  For
ext3/4 you need the patches I sent out today; for O_SYNC on everything
but XFS, you also need Jan Kara's patch series.
Christoph Hellwig wrote:
> > Oh, and QEMU could call whatever "hdparm -F" does when using raw
> > block devices ;-)
>
> Actually, for ide/scsi, implementing cache control is on my todo
> list.  Not sure about virtio yet.

I think hdparm -f -F does for some block devices what fdatasync
ideally does for files.  What I was getting at was that until we have
a perfect fdatasync on block devices for Linux, QEMU could use the
blockdev ioctls to accomplish the same thing on older kernels.

> > It goes to show that no matter how hard we try, data integrity is
> > a slippery thing, where getting it wrong does not show up under
> > normal circumstances, only during catastrophic system failures.
>
> Honestly, it should not.  Digging through all this was a bit of
> work, but I was extremely surprised at how careless most people who
> touched it before were.  It's not rocket science, and it can be
> tested quite easily using various tools - qemu being the easiest
> nowadays, but scsi_debug or an instrumented iscsi target would do
> the same thing.

Oh I agree - we have increasingly good debugging tools.  What's
missing is a dirty script^H^H^H^H^H^H a good validation test which
stresses the various combinations of ways to sync data on block
devices and various filesystems, and various types of emulated
hardware with/without caches enabled, and various mount options, and
checks that the I/O does what is desired in every case.

> Users might disagree with that.  With my user hat on I couldn't care
> less what state the internal metadata is in, as long as I get back
> the data which the OS guaranteed had reached the disk after a
> successful fsync/fdatasync/O_SYNC write.

I guess it depends what you're doing.  I've observed more instances of
filesystem corruption due to lack of barriers, resulting in an
inability to find files, than I've ever noticed missing data inside
files - but then I hardly ever keep large amounts of data in
databases.  And I get so much mail I wouldn't notice if a few got
lost ;-)

> Despite the extremely misleading name, cache=none is _NOT_ an
> alternative, unless we make it open the image using O_DIRECT|O_SYNC.

Good point about the misleading name, and good point about O_DIRECT
being insufficient too.

For a safe emulation default with reasonable performance, I wonder if
it would work to emulate the drive cache as _off_ at the beginning,
but with the capability for the guest to enable it?  The theory is
that old guests don't know about drive caches and will leave it off
and be safe (getting O_DSYNC or O_DIRECT|O_DSYNC)[*], and newer guests
will turn it on if they also implement barriers (getting nothing or
O_DIRECT, plus fdatasync when they issue barriers).

Do you think that would work with the typical guests we know about?

[*] O_DSYNC as opposed to O_SYNC strikes me as important once proper
cache flushes are implemented, as it may behave very similarly to real
hardware when doing data overwrites, whereas O_SYNC would seek back
and forth between the data and inode areas for every write, if it's
updating its nanosecond timestamps correctly.

-- Jamie
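The blockdev-ioctl stopgap mentioned above is small.  On a raw
IDE/SATA whole-disk node, roughly what hdparm -F does can be issued
directly; a sketch, assuming Linux's HDIO_DRIVE_CMD interface (this
drains only the drive's own volatile cache and is no substitute for a
working fdatasync on the host filesystem):

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_DRIVE_CMD, WIN_FLUSH_CACHE */

    /* Issue an ATA FLUSH CACHE (0xE7) to the drive itself, forcing
       its volatile write cache out to the platter. */
    static int flush_drive_cache(int fd)
    {
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };
        return ioctl(fd, HDIO_DRIVE_CMD, args);   /* 0 on success */
    }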
On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

Some preliminary numbers, because they are very interesting.  Note
that this is on a raid controller, not cheap ide disks.  To make up
for that I used an image file on ext3, which due to its horrible fsync
performance should be kind of a worst case.  All these runs are with
Linux 2.6.31-rc8 + my various barrier fixes on guest and host, using
ext3 with barrier=1 on both.

A kernel defconfig compile takes between 9m40s and 9m42s with
cache=writeback and barriers disabled; with fdatasync-backed barriers
enabled it is actually minimally faster, between 9m38s and 9m39s
(given that I've only done three runs each, this might fall within
measurement tolerances).

For comparison, the raw block device node with cache=none (just one
run) is 9m36.759s, which is not far apart.  A completely native run is
7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
related.
Christoph Hellwig wrote:
> Some preliminary numbers, because they are very interesting.  Note
> that this is on a raid controller, not cheap ide disks.  To make up
> for that I used an image file on ext3, which due to its horrible
> fsync performance should be kind of a worst case.  All these runs
> are with Linux 2.6.31-rc8 + my various barrier fixes on guest and
> host, using ext3 with barrier=1 on both.

Does barrier=0 make a performance difference?  IOW, would the typical
default ext3 deployment show worse behavior?

> A kernel defconfig compile takes between 9m40s and 9m42s with
> cache=writeback and barriers disabled; with fdatasync-backed
> barriers enabled it is actually minimally faster,

Is fdatasync different from fsync on ext3?  Does it result in a full
metadata commit?

If we think these numbers make sense, then I'd vote for enabling
fdatasync in master and we'll see if there are any corner cases.

> between 9m38s and 9m39s (given that I've only done three runs each,
> this might fall within measurement tolerances).
>
> For comparison, the raw block device node with cache=none (just one
> run) is 9m36.759s, which is not far apart.  A completely native run
> is 7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
> related.

If you're on pre-NHM or BCN then the slowdown from shadow paging would
be expected.

Regards,

Anthony Liguori
On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> Does barrier=0 make a performance difference?  IOW, would the
> typical default ext3 deployment show worse behavior?

I'll give it a spin.

> > A kernel defconfig compile takes between 9m40s and 9m42s with
> > cache=writeback and barriers disabled; with fdatasync-backed
> > barriers enabled it is actually minimally faster,
>
> Is fdatasync different from fsync on ext3?  Does it result in a full
> metadata commit?

ext3 honors the fdatasync flag and only does its horrible job if the
metadata we care about is dirty, that is, only if some non-timestamp
metadata is dirty.  Which means that for a non-sparse image file it
does well, while for a sparse image file needing allocations it will
cause trouble.  And now that you mention it, I've only tested the
non-sparse case, which is now the default for the management tools at
least on Fedora.
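The non-sparse point suggests a cheap mitigation: writing the image
out to its full length once at creation time, so later fdatasync calls
never have block allocation metadata to drag in.  A sketch of such
preallocation (the function name is hypothetical):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Fill the image with zeros so every block is allocated up front;
       subsequent guest writes are pure overwrites, which keeps ext3's
       fdatasync from needing a full metadata commit. */
    static int preallocate_image(int fd, long long size)
    {
        char zeros[65536];
        memset(zeros, 0, sizeof(zeros));

        for (long long off = 0; off < size; off += sizeof(zeros)) {
            size_t n = sizeof(zeros);
            if (size - off < (long long)n)
                n = size - off;
            if (pwrite(fd, zeros, n, off) != (ssize_t)n)
                return -1;
        }
        return fsync(fd);   /* commit the allocations once, up front */
    }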
On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> > performance should be kind of a worst case.  All these runs are
> > with Linux 2.6.31-rc8 + my various barrier fixes on guest and
> > host, using ext3 with barrier=1 on both.
>
> Does barrier=0 make a performance difference?  IOW, would the
> typical default ext3 deployment show worse behavior?

Note that for this typical ext3 deployment the barrier patches are
kinda useless, because we still don't have any data integrity
guarantees at all.  Anyway, here are the numbers with barrier=0 on
host and guest:

cache=writeback, no write cache advertised:

    9m37.890s, 9m38.303s, 9m38.423s, 9m38.861s, 9m39.599s

cache=writeback, write cache advertised (and backed by fdatasync):

    9m39.649s, 9m39.772s, 9m40.149s, 9m41.737s, 9m41.996s
Index: qemu-kvm/hw/scsi-disk.c
===================================================================
--- qemu-kvm.orig/hw/scsi-disk.c
+++ qemu-kvm/hw/scsi-disk.c
@@ -710,7 +710,9 @@ static int32_t scsi_send_command(SCSIDev
             memset(p,0,20);
             p[0] = 8;
             p[1] = 0x12;
-            p[2] = 4; /* WCE */
+            if (bdrv_enable_write_cache(s->bdrv)) {
+                p[2] = 4; /* WCE */
+            }
             p += 20;
         }
         if ((page == 0x3f || page == 0x2a)
Index: qemu-kvm/block.c
===================================================================
--- qemu-kvm.orig/block.c
+++ qemu-kvm/block.c
@@ -408,6 +408,16 @@ int bdrv_open2(BlockDriverState *bs, con
     }
     bs->drv = drv;
     bs->opaque = qemu_mallocz(drv->instance_size);
+
+    /*
+     * Yes, BDRV_O_NOCACHE aka O_DIRECT means we have to present a
+     * write cache to the guest.  We do need the fdatasync to flush
+     * out transactions for block allocations, and we maybe have a
+     * volatile write cache in our backing device to deal with.
+     */
+    if (flags & BDRV_O_NOCACHE)
+        bs->enable_write_cache = 1;
+
     /* Note: for compatibility, we open disk image files as RDWR, and
        RDONLY as fallback */
     if (!(flags & BDRV_O_FILE))
@@ -918,6 +928,11 @@ int bdrv_is_sg(BlockDriverState *bs)
     return bs->sg;
 }
 
+int bdrv_enable_write_cache(BlockDriverState *bs)
+{
+    return bs->enable_write_cache;
+}
+
 /* XXX: no longer used */
 void bdrv_set_change_cb(BlockDriverState *bs,
                         void (*change_cb)(void *opaque), void *opaque)
Index: qemu-kvm/block_int.h
===================================================================
--- qemu-kvm.orig/block_int.h
+++ qemu-kvm/block_int.h
@@ -152,6 +152,9 @@ struct BlockDriverState {
     /* the memory alignment required for the buffers handled by this driver */
     int buffer_alignment;
 
+    /* do we need to tell the guest if we have a volatile write cache? */
+    int enable_write_cache;
+
     /* NOTE: the following infos are only hints for real hardware
        drivers. They are not used by the block driver */
     int cyls, heads, secs, translation;
Index: qemu-kvm/block.h
===================================================================
--- qemu-kvm.orig/block.h
+++ qemu-kvm/block.h
@@ -120,6 +120,7 @@ int bdrv_get_translation_hint(BlockDrive
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
+int bdrv_enable_write_cache(BlockDriverState *bs);
 int bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
 int bdrv_is_locked(BlockDriverState *bs);
Index: qemu-kvm/hw/ide/core.c
===================================================================
--- qemu-kvm.orig/hw/ide/core.c
+++ qemu-kvm/hw/ide/core.c
@@ -148,8 +148,11 @@ static void ide_identify(IDEState *s)
     put_le16(p + 83, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10));
     /* 14=set to 1, 1=SMART self test, 0=SMART error logging */
     put_le16(p + 84, (1 << 14) | 0);
-    /* 14 = NOP supported, 0=SMART feature set enabled */
-    put_le16(p + 85, (1 << 14) | 1);
+    /* 14 = NOP supported, 5=WCACHE enabled, 0=SMART feature set enabled */
+    if (bdrv_enable_write_cache(s->bs))
+        put_le16(p + 85, (1 << 14) | (1 << 5) | 1);
+    else
+        put_le16(p + 85, (1 << 14) | 1);
     /* 13=flush_cache_ext,12=flush_cache,10=lba48 */
     put_le16(p + 86, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10));
     /* 14=set to 1, 1=smart self test, 0=smart error logging */
Add an enable_write_cache flag in the block driver state, and use it
to decide if we claim to have a volatile write cache that needs
controlled flushing from the guest.

Currently we only claim to have it when cache=none is specified.
While that might seem wrong, it actually is the case, as we still have
outstanding block allocations and host drive caches to flush.

We do not need to claim a write cache when we use cache=writethrough,
because O_SYNC writes are guaranteed to have the data on stable
storage.

We would have to claim one for cache=writeback to be safe, but for now
I will follow Avi's opinion that it is a useless mode and should be
our dedicated unsafe mode.  If anyone disagrees please start the flame
thrower now and I will change it.  Otherwise a documentation patch
will follow to explicitly document cache=writeback as unsafe.

Both scsi-disk and ide now use the new flag, changing from their
defaults of always off (ide) and always on (scsi-disk).

Signed-off-by: Christoph Hellwig <hch@lst.de>
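As a quick way to observe the patch's effect from inside a Linux
guest: the IDENTIFY bit it sets (word 85, bit 5) can be read back via
the HDIO_GET_IDENTITY ioctl, which is roughly what hdparm -W reports.
A sketch, assuming the cfs_enable_1 field of struct hd_driveid holds
IDENTIFY word 85:

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_GET_IDENTITY, struct hd_driveid */

    /* Returns 1 if the (emulated) IDE disk advertises an enabled
       write cache, 0 if not, -1 on error. */
    static int write_cache_enabled(int fd)
    {
        struct hd_driveid id;
        if (ioctl(fd, HDIO_GET_IDENTITY, &id) < 0)
            return -1;
        return (id.cfs_enable_1 >> 5) & 1;   /* word 85, WCACHE bit */
    }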