Message ID: 20090831201651.GA4874@lst.de
State: Superseded
Christoph Hellwig wrote:
> We would have to claim one for cache=writeback to be safe, but for
> now I will follow Avi's opinion that it is a useless mode and should
> be our dedicated unsafe mode.  If anyone disagrees please start the
> flame thrower now and I will change it.  Otherwise a documentation
> patch will follow to explicitly document cache=writeback as unsafe.

*Opening flame thrower!*

Unsafe, useless?  It's the most useful mode when you're starting and
stopping VMs regularly, or if you can't use O_DIRECT.

It's safe if fdatasync is called - in other words, not advertising a
write cache is silly.

I haven't measured, but I'd expect it to be much faster than O_SYNC on
some host hardware, for the same reason that barriers + volatile write
cache are much faster on some host hardware than disabling the write
cache.

Right now, on a Linux host O_SYNC is unsafe with hardware that has a
volatile write cache.  That might not be changed, but if it is, then
performance with cache=writethrough will plummet (due to issuing a
CACHE FLUSH to the hardware after every write), while performance with
cache=writeback will be reasonable.

If an unsafe mode is desired (I think it is, for those throwaway
testing VMs, or during OS installs), I suggest adding cache=volatile:

    cache=none
        O_DIRECT, fdatasync, advertise volatile write cache

    cache=writethrough
        O_SYNC, do not advertise

    cache=writeback
        fdatasync, advertise volatile write cache

    cache=volatile
        nothing (perhaps fdatasync on QEMU blockdev close)

When using guest OSes which issue CACHE FLUSH commands (that's a guest
config issue), why would you ever use cache=writethrough?
cache=writeback should be faster and equally safe - provided you do
actually advertise the write cache!  So please do!  Not doing so is
silly.

-- Jamie
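To make the proposal concrete: the table above maps naturally onto
open(2) flags plus two booleans.  A minimal C sketch of that mapping
follows; the enum, struct, and function names are invented for
illustration and are not QEMU's actual option handling:

    /*
     * Sketch of the proposed cache modes as open(2) flags plus an
     * "advertise write cache" bit.  All names here are hypothetical.
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>

    enum cache_mode {
        CACHE_NONE,                 /* O_DIRECT, fdatasync, advertise */
        CACHE_WRITETHROUGH,         /* O_SYNC, do not advertise */
        CACHE_WRITEBACK,            /* fdatasync, advertise */
        CACHE_VOLATILE              /* nothing - the unsafe mode */
    };

    struct cache_policy {
        int open_flags;             /* extra flags OR'ed into open(2) */
        int flush_on_barrier;       /* fdatasync on guest CACHE FLUSH */
        int advertise_wcache;       /* tell guest a volatile cache exists */
    };

    static struct cache_policy policy_for(enum cache_mode m)
    {
        switch (m) {
        case CACHE_NONE:         return (struct cache_policy){ O_DIRECT, 1, 1 };
        case CACHE_WRITETHROUGH: return (struct cache_policy){ O_SYNC,   0, 0 };
        case CACHE_WRITEBACK:    return (struct cache_policy){ 0,        1, 1 };
        case CACHE_VOLATILE:     return (struct cache_policy){ 0,        0, 0 };
        }
        return (struct cache_policy){ O_SYNC, 0, 0 }; /* safest fallback */
    }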
On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> volatile write cache.  That might not be changed, but if it is, then
> performance with cache=writethrough will plummet (due to issuing a
> CACHE FLUSH to the hardware after every write), while performance
> with cache=writeback will be reasonable.

Currently all modes are more or less unsafe with volatile write
caches, at least when using ext3 or raw block device accesses.  XFS is
safe - two thirds due to doing the right thing and one third due to
sheer luck.

> If an unsafe mode is desired (I think it is, for those throwaway
> testing VMs, or during OS installs), I suggest adding cache=volatile:
>
>     cache=none
>         O_DIRECT, fdatasync, advertise volatile write cache
>
>     cache=writethrough
>         O_SYNC, do not advertise
>
>     cache=writeback
>         fdatasync, advertise volatile write cache
>
>     cache=volatile
>         nothing (perhaps fdatasync on QEMU blockdev close)

Fine with me, let the flame war begin :)

> When using guest OSes which issue CACHE FLUSH commands (that's a
> guest config issue), why would you ever use cache=writethrough?
> cache=writeback should be faster and equally safe - provided you do
> actually advertise the write cache!

And provided the guest OS actually issues cache flushes when it
should, something that at least Linux historically utterly failed at,
and some other operating systems haven't even tried.

cache=writethrough is the equivalent of turning off the volatile write
cache of the real disk.  It might be slower (which isn't even always
the case for real disks), but it is much safer.  E.g. if you want to
move your old SCO Unix box into a VM it's the only safe option.
Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > Right now, on a Linux host O_SYNC is unsafe with hardware that has
> > a volatile write cache. [...]
>
> Currently all modes are more or less unsafe with volatile write
> caches, at least when using ext3 or raw block device accesses.  XFS
> is safe - two thirds due to doing the right thing and one third due
> to sheer luck.

Right, but now you've made it worse.  By not calling fdatasync at all,
you've reduced the integrity.  Previously it would reach the drive's
cache, and take whatever (short) time it took to reach the platter.
Now you're leaving data in the host cache, where it can stay for much
longer and is vulnerable to host kernel crashes.

Oh, and QEMU could call whatever "hdparm -F" does when using raw block
devices ;-)

> > nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

Well, I'd like to start by pointing out that your patch introduces a
regression in the combination cache=writeback with emulated SCSI,
because it effectively removes the fdatasync calls in that case :-)
So please amend the patch before it gets applied, lest such silliness
be propagated.  Thanks :-)

I've actually been using cache=writeback with emulated IDE on deployed
server VMs, assuming that worked with KVM.  It's been an eye opener to
find that it was broken all along because the driver failed to set the
"has write cache" bit.  Thank you for the detective work.

It goes to show that no matter how hard we try, data integrity is a
slippery thing, where getting it wrong does not show up under normal
circumstances, only during catastrophic system failures.

Ironically, with emulated SCSI, I used cache=writethrough, thinking
guests would not issue CACHE FLUSH commands over SCSI because
historically performance has been reached by having overlapping writes
instead.

> > When using guest OSes which issue CACHE FLUSH commands (that's a
> > guest config issue), why would you ever use cache=writethrough?
> > cache=writeback should be faster and equally safe - provided you
> > do actually advertise the write cache!
>
> And provided the guest OS actually issues cache flushes when it
> should, something that at least Linux historically utterly failed
> at, and some other operating systems haven't even tried.

For hosts, yes - fsync/fdatasync/O_SYNC/O_DIRECT all utterly fail.
(Afaict, Windows hosts do it in some combinations.)  But I'd like to
think we're about to fix Linux hosts soon, thanks to your good work on
that elsewhere.

For guests, Linux has been good at issuing the necessary flushes for
ordinary journalling (in ext3's case, provided barrier=1 is given in
the mount options), which is quite important.  It failed with fsync,
which is also important to applications, but filesystem integrity is
the most important thing and it's been good at that for many years.

> cache=writethrough is the equivalent of turning off the volatile
> write cache of the real disk.  It might be slower (which isn't even
> always the case for real disks), but it is much safer.

When O_SYNC is made to flush the hardware cache on Linux hosts, it
will be excruciatingly slow: it'll have to seek twice for every write.
Once for the data, once for the inode update.  That's another reason
O_DSYNC is important.

> E.g. if you want to move your old SCO Unix box into a VM it's the
> only safe option.

I agree, and for that reason cache=writethrough or cache=none are the
only reasonable defaults.

By the way, all this has led me to another idea...  We may find that
O_SYNC is slower than batching several writes followed by one
fdatasync, whose completion allows several writes to report as
completed, when emulating SCSI or virtio-blk (anything which allows
overlapping write commands from the guest).

-- Jamie
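The batching idea at the end is worth a sketch.  Assuming a simple
queue of guest write requests (the names are hypothetical; this is not
QEMU's AIO layer), one fdatasync can complete a whole batch,
amortizing the flush that O_SYNC would pay on every single write:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_BATCH 64

    static long batch[MAX_BATCH];   /* ids of writes awaiting a flush */
    static int batch_len;

    static void flush_batch(int fd)
    {
        if (batch_len == 0)
            return;
        if (fdatasync(fd) == 0) {   /* one flush covers all queued writes */
            for (int i = 0; i < batch_len; i++)
                printf("write %ld durable, completing\n", batch[i]);
        }
        batch_len = 0;
    }

    static void guest_write(int fd, long id, const void *buf,
                            size_t len, off_t off)
    {
        if (batch_len == MAX_BATCH)
            flush_batch(fd);
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return;                 /* error handling elided in sketch */
        batch[batch_len++] = id;    /* completion deferred to the flush */
    }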
Christoph Hellwig wrote:
>> If an unsafe mode is desired (I think it is, for those throwaway
>> testing VMs, or during OS installs), I suggest adding cache=volatile:
>>
>>     cache=none
>>         O_DIRECT, fdatasync, advertise volatile write cache
>>
>>     cache=writethrough
>>         O_SYNC, do not advertise
>>
>>     cache=writeback
>>         fdatasync, advertise volatile write cache
>>
>>     cache=volatile
>>         nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

I think we should pity our poor users and avoid adding yet another
obscure option that is likely to be misunderstood.

Can someone do some benchmarking with cache=writeback and fdatasync
first and quantify what the real performance impact is?

I think the two reasonable options are:

1) make cache=writeback safe, avoiding a massive perf decrease in the
   process
2) keep cache=writeback as a no-guarantees option.

Regards,

Anthony Liguori
Anthony Liguori wrote:
> Christoph Hellwig wrote:
> >> If an unsafe mode is desired (I think it is, for those throwaway
> >> testing VMs, or during OS installs), I suggest adding
> >> cache=volatile: [...]
> >
> > Fine with me, let the flame war begin :)
>
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?
>
> I think the two reasonable options are 1) make cache=writeback safe,
> avoiding a massive perf decrease in the process 2) keep
> cache=writeback as a no-guarantees option.

Right now, cache=writeback does set the bit for SCSI emulation, which
makes it safe for guests which understand that.  Removing that is a
regression in safety, not merely a lack of change.

-- Jamie
On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

I can run benchmarks.  Any workloads that you are particularly looking
for?
Anthony Liguori wrote:
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

Unfortunately we can't yet quantify the impact on the hardware I care
about (ordinary consumer PCs with non-NCQ SATA disks), because Linux
hosts don't *yet* implement O_SYNC or fdatasync properly.

I would expect the performance difference to be much more significant
after those are implemented on the host.

-- Jamie
On Mon, Aug 31, 2009 at 11:46:45PM +0100, Jamie Lokier wrote:
> > Currently all modes are more or less unsafe with volatile write
> > caches, at least when using ext3 or raw block device accesses.
> > XFS is safe - two thirds due to doing the right thing and one
> > third due to sheer luck.
>
> Right, but now you've made it worse.  By not calling fdatasync at
> all, you've reduced the integrity.  Previously it would reach the
> drive's cache, and take whatever (short) time it took to reach the
> platter.  Now you're leaving data in the host cache, where it can
> stay for much longer and is vulnerable to host kernel crashes.

Your last comment is for cache=writeback, which in Avi's proposal that
I implemented would indeed lose any guarantees and be, for all
practical matters, unsafe.  It's not true for any of the other
options.

> Oh, and QEMU could call whatever "hdparm -F" does when using raw
> block devices ;-)

Actually, for ide/scsi, implementing cache control is on my todo list.
Not sure about virtio yet.

> Well, I'd like to start by pointing out that your patch introduces a
> regression in the combination cache=writeback with emulated SCSI,
> because it effectively removes the fdatasync calls in that case :-)

Yes, you already pointed this out above.

> It goes to show that no matter how hard we try, data integrity is a
> slippery thing, where getting it wrong does not show up under normal
> circumstances, only during catastrophic system failures.

Honestly, it should not.  Digging through all this was a bit of work,
but I was extremely surprised at how careless most people who touched
it before were.  It's not rocket science, and it can be tested quite
easily using various tools - qemu being the easiest nowadays, but
scsi_debug or an instrumented iscsi target would do the same thing.

> It failed with fsync, which is also important to applications, but
> filesystem integrity is the most important thing and it's been good
> at that for many years.

Users might disagree with that.  With my user hat on I couldn't care
less what state the internal metadata is in, as long as I get back the
data which the OS guaranteed had reached the disk after a successful
fsync/fdatasync/O_SYNC write.

> > E.g. if you want to move your old SCO Unix box into a VM it's the
> > only safe option.
>
> I agree, and for that reason cache=writethrough or cache=none are
> the only reasonable defaults.

Despite the extremely misleading name, cache=none is _NOT_ an
alternative, unless we make it open the image using O_DIRECT|O_SYNC.
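In code, the last point means a cache=none that is actually safe
against the drive cache would have to combine both flags, along the
lines of the following sketch (the function name is hypothetical;
note that O_DIRECT also imposes buffer alignment, handled here with
posix_memalign):

    /*
     * A genuinely uncached open: O_DIRECT bypasses the host page
     * cache, O_SYNC makes each write durable (once the kernel honours
     * that for the drive's volatile cache as well).
     */
    #define _GNU_SOURCE              /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int open_image_uncached(const char *path, void **buf, size_t buf_len)
    {
        int fd = open(path, O_RDWR | O_DIRECT | O_SYNC);
        if (fd < 0)
            return -1;

        /* O_DIRECT I/O needs buffers aligned to the logical block
           size; 512 bytes is assumed here, real code should query the
           device. */
        if (posix_memalign(buf, 512, buf_len) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }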
On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote:
> Anthony Liguori wrote:
> > Can someone do some benchmarking with cache=writeback and
> > fdatasync first and quantify what the real performance impact is?
>
> Unfortunately we can't yet quantify the impact on the hardware I
> care about (ordinary consumer PCs with non-NCQ SATA disks), because
> Linux hosts don't *yet* implement O_SYNC or fdatasync properly.

They do if you use XFS.
On Tue, Sep 01, 2009 at 01:06:46AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote:
> > Unfortunately we can't yet quantify the impact on the hardware I
> > care about (ordinary consumer PCs with non-NCQ SATA disks),
> > because Linux hosts don't *yet* implement O_SYNC or fdatasync
> > properly.
>
> They do if you use XFS.

And data=writeback + fdatasync also works when using reiserfs.  For
ext3/4 you need the patches I sent out today; for O_SYNC on everything
but XFS, you also need Jan Kara's patch series.
Christoph Hellwig wrote:
> > Oh, and QEMU could call whatever "hdparm -F" does when using raw
> > block devices ;-)
>
> Actually, for ide/scsi, implementing cache control is on my todo
> list.  Not sure about virtio yet.

I think hdparm -f -F does for some block devices what fdatasync
ideally does for files.  What I was getting at was that until we have
a perfect fdatasync on block devices for Linux, QEMU could use the
blockdev ioctls to accomplish the same thing on older kernels.

> > It goes to show that no matter how hard we try, data integrity is
> > a slippery thing, where getting it wrong does not show up under
> > normal circumstances, only during catastrophic system failures.
>
> Honestly, it should not.  Digging through all this was a bit of
> work, but I was extremely surprised at how careless most people who
> touched it before were.  It's not rocket science, and it can be
> tested quite easily using various tools - qemu being the easiest
> nowadays, but scsi_debug or an instrumented iscsi target would do
> the same thing.

Oh I agree - we have increasingly good debugging tools.  What's
missing is a dirty script^H^H^H^H^H^H a good validation test which
stresses the various combinations of ways to sync data on block
devices and various filesystems, and various types of emulated
hardware with/without caches enabled, and various mount options, and
checks that the I/O does what is desired in every case.

> Users might disagree with that.  With my user hat on I couldn't care
> less what state the internal metadata is in, as long as I get back
> the data which the OS guaranteed had reached the disk after a
> successful fsync/fdatasync/O_SYNC write.

I guess it depends what you're doing.  I've observed more instances of
filesystem corruption due to lack of barriers, resulting in an
inability to find files, than I've ever noticed missing data inside
files - but then I hardly ever keep large amounts of data in
databases.  And I get so much mail I wouldn't notice if a few got
lost ;-)

> Despite the extremely misleading name, cache=none is _NOT_ an
> alternative, unless we make it open the image using O_DIRECT|O_SYNC.

Good point about the misleading name, and good point about O_DIRECT
being insufficient too.

For a safe emulation default with reasonable performance, I wonder if
it would work to emulate the drive cache as _off_ at the beginning,
but with the capability for the guest to enable it?  The theory is
that old guests don't know about drive caches and will leave it off
and be safe (getting O_DSYNC or O_DIRECT|O_DSYNC)[*], and newer guests
will turn it on if they also implement barriers (getting nothing or
O_DIRECT, plus fdatasync when they issue barriers).

Do you think that would work with the typical guests we know about?

[*] O_DSYNC as opposed to O_SYNC strikes me as important once proper
cache flushes are implemented, as it may behave very similarly to real
hardware when doing data overwrites, whereas O_SYNC would seek back
and forth between the data and inode areas for every write, if it's
updating its nanosecond timestamps correctly.

-- Jamie
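The blockdev-ioctl stopgap mentioned above is small.  On a raw
IDE/SATA whole-disk node, roughly what hdparm -F does can be issued
directly; a sketch, assuming Linux's HDIO_DRIVE_CMD interface (this
drains only the drive's own volatile cache and is no substitute for a
working fdatasync on the host filesystem):

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_DRIVE_CMD, WIN_FLUSH_CACHE */

    /* Issue an ATA FLUSH CACHE (0xE7) to the drive itself, forcing
       its volatile write cache out to the platter. */
    static int flush_drive_cache(int fd)
    {
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };
        return ioctl(fd, HDIO_DRIVE_CMD, args);   /* 0 on success */
    }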
On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

Some preliminary numbers, because they are very interesting.  Note
that this is on a raid controller, not cheap ide disks.  To make up
for that I used an image file on ext3, which due to its horrible fsync
performance should be kind of a worst case.  All these runs are with
Linux 2.6.31-rc8 + my various barrier fixes on guest and host, using
ext3 with barrier=1 on both.

A kernel defconfig compile takes between 9m40s and 9m42s with
cache=writeback and barriers disabled; with fdatasync-backed barriers
enabled it is actually minimally faster, between 9m38s and 9m39s
(given that I've only done three runs each, this might fall within
measurement tolerances).

For comparison, the raw block device node with cache=none (just one
run) is 9m36.759s, which is not far apart.  A completely native run is
7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
related.
Christoph Hellwig wrote:
> Some preliminary numbers, because they are very interesting.  Note
> that this is on a raid controller, not cheap ide disks.  To make up
> for that I used an image file on ext3, which due to its horrible
> fsync performance should be kind of a worst case.  All these runs
> are with Linux 2.6.31-rc8 + my various barrier fixes on guest and
> host, using ext3 with barrier=1 on both.

Does barrier=0 make a performance difference?  IOW, would the typical
default ext3 deployment show worse behavior?

> A kernel defconfig compile takes between 9m40s and 9m42s with
> cache=writeback and barriers disabled; with fdatasync-backed
> barriers enabled it is actually minimally faster,

Is fdatasync different from fsync on ext3?  Does it result in a full
metadata commit?

If we think these numbers make sense, then I'd vote for enabling
fdatasync in master and we'll see if there are any corner cases.

> between 9m38s and 9m39s (given that I've only done three runs each,
> this might fall within measurement tolerances).
>
> For comparison, the raw block device node with cache=none (just one
> run) is 9m36.759s, which is not far apart.  A completely native run
> is 7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
> related.

If you're on pre-NHM or BCN then the slowdown from shadow paging would
be expected.

Regards,

Anthony Liguori
On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> Does barrier=0 make a performance difference?  IOW, would the
> typical default ext3 deployment show worse behavior?

I'll give it a spin.

> > A kernel defconfig compile takes between 9m40s and 9m42s with
> > cache=writeback and barriers disabled; with fdatasync-backed
> > barriers enabled it is actually minimally faster,
>
> Is fdatasync different from fsync on ext3?  Does it result in a full
> metadata commit?

ext3 honors the fdatasync flag and only does its horrible job if the
metadata we care about is dirty, that is, only if some non-timestamp
metadata is dirty.  Which means that for a non-sparse image file it
does well, while for a sparse image file needing allocations it will
cause trouble.  And now that you mention it, I've only tested the
non-sparse case, which is now the default for the management tools at
least on Fedora.
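The non-sparse point suggests a cheap mitigation: writing the image
out to its full length once at creation time, so later fdatasync calls
never have block allocation metadata to drag in.  A sketch of such
preallocation (the function name is hypothetical):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Fill the image with zeros so every block is allocated up front;
       subsequent guest writes are pure overwrites, which keeps ext3's
       fdatasync from needing a full metadata commit. */
    static int preallocate_image(int fd, long long size)
    {
        char zeros[65536];
        memset(zeros, 0, sizeof(zeros));

        for (long long off = 0; off < size; off += sizeof(zeros)) {
            size_t n = sizeof(zeros);
            if (size - off < (long long)n)
                n = size - off;
            if (pwrite(fd, zeros, n, off) != (ssize_t)n)
                return -1;
        }
        return fsync(fd);   /* commit the allocations once, up front */
    }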
On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> > performance should be kind of a worst case.  All these runs are
> > with Linux 2.6.31-rc8 + my various barrier fixes on guest and
> > host, using ext3 with barrier=1 on both.
>
> Does barrier=0 make a performance difference?  IOW, would the
> typical default ext3 deployment show worse behavior?

Note that for this typical ext3 deployment the barrier patches are
kinda useless, because we still don't have any data integrity
guarantees at all.  Anyway, here are the numbers with barrier=0 on
host and guest:

cache=writeback, no write cache advertised:

    9m37.890s, 9m38.303s, 9m38.423s, 9m38.861s, 9m39.599s

cache=writeback, write cache advertised (and backed by fdatasync):

    9m39.649s, 9m39.772s, 9m40.149s, 9m41.737s, 9m41.996s
Index: qemu-kvm/hw/scsi-disk.c
===================================================================
--- qemu-kvm.orig/hw/scsi-disk.c
+++ qemu-kvm/hw/scsi-disk.c
@@ -710,7 +710,9 @@ static int32_t scsi_send_command(SCSIDev
             memset(p,0,20);
             p[0] = 8;
             p[1] = 0x12;
-            p[2] = 4; /* WCE */
+            if (bdrv_enable_write_cache(s->bdrv)) {
+                p[2] = 4; /* WCE */
+            }
             p += 20;
         }
         if ((page == 0x3f || page == 0x2a)
Index: qemu-kvm/block.c
===================================================================
--- qemu-kvm.orig/block.c
+++ qemu-kvm/block.c
@@ -408,6 +408,16 @@ int bdrv_open2(BlockDriverState *bs, con
     }
     bs->drv = drv;
     bs->opaque = qemu_mallocz(drv->instance_size);
+
+    /*
+     * Yes, BDRV_O_NOCACHE aka O_DIRECT means we have to present a
+     * write cache to the guest.  We do need the fdatasync to flush
+     * out transactions for block allocations, and we maybe have a
+     * volatile write cache in our backing device to deal with.
+     */
+    if (flags & BDRV_O_NOCACHE)
+        bs->enable_write_cache = 1;
+
     /* Note: for compatibility, we open disk image files as RDWR, and
        RDONLY as fallback */
     if (!(flags & BDRV_O_FILE))
@@ -918,6 +928,11 @@ int bdrv_is_sg(BlockDriverState *bs)
     return bs->sg;
 }
 
+int bdrv_enable_write_cache(BlockDriverState *bs)
+{
+    return bs->enable_write_cache;
+}
+
 /* XXX: no longer used */
 void bdrv_set_change_cb(BlockDriverState *bs,
                         void (*change_cb)(void *opaque), void *opaque)
Index: qemu-kvm/block_int.h
===================================================================
--- qemu-kvm.orig/block_int.h
+++ qemu-kvm/block_int.h
@@ -152,6 +152,9 @@ struct BlockDriverState {
     /* the memory alignment required for the buffers handled by this driver */
     int buffer_alignment;
 
+    /* do we need to tell the guest if we have a volatile write cache? */
+    int enable_write_cache;
+
     /* NOTE: the following infos are only hints for real hardware
        drivers. They are not used by the block driver */
     int cyls, heads, secs, translation;
Index: qemu-kvm/block.h
===================================================================
--- qemu-kvm.orig/block.h
+++ qemu-kvm/block.h
@@ -120,6 +120,7 @@ int bdrv_get_translation_hint(BlockDrive
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
+int bdrv_enable_write_cache(BlockDriverState *bs);
 int bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
 int bdrv_is_locked(BlockDriverState *bs);
Index: qemu-kvm/hw/ide/core.c
===================================================================
--- qemu-kvm.orig/hw/ide/core.c
+++ qemu-kvm/hw/ide/core.c
@@ -148,8 +148,11 @@ static void ide_identify(IDEState *s)
     put_le16(p + 83, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10));
     /* 14=set to 1, 1=SMART self test, 0=SMART error logging */
     put_le16(p + 84, (1 << 14) | 0);
-    /* 14 = NOP supported, 0=SMART feature set enabled */
-    put_le16(p + 85, (1 << 14) | 1);
+    /* 14 = NOP supported, 5=WCACHE enabled, 0=SMART feature set enabled */
+    if (bdrv_enable_write_cache(s->bs))
+        put_le16(p + 85, (1 << 14) | (1 << 5) | 1);
+    else
+        put_le16(p + 85, (1 << 14) | 1);
     /* 13=flush_cache_ext,12=flush_cache,10=lba48 */
     put_le16(p + 86, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10));
     /* 14=set to 1, 1=smart self test, 0=smart error logging */
Add an enable_write_cache flag in the block driver state, and use it
to decide if we claim to have a volatile write cache that needs
controlled flushing from the guest.

Currently we only claim to have it when cache=none is specified.
While that might seem wrong, it actually is the case, as we still have
outstanding block allocations and host drive caches to flush.

We do not need to claim a write cache when we use cache=writethrough,
because O_SYNC writes are guaranteed to have the data on stable
storage.

We would have to claim one for cache=writeback to be safe, but for now
I will follow Avi's opinion that it is a useless mode and should be
our dedicated unsafe mode.  If anyone disagrees please start the flame
thrower now and I will change it.  Otherwise a documentation patch
will follow to explicitly document cache=writeback as unsafe.

Both scsi-disk and ide now use the new flag, changing from their
defaults of always off (ide) and always on (scsi-disk).

Signed-off-by: Christoph Hellwig <hch@lst.de>
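As a quick way to observe the patch's effect from inside a Linux
guest: the IDENTIFY bit it sets (word 85, bit 5) can be read back via
the HDIO_GET_IDENTITY ioctl, which is roughly what hdparm -W reports.
A sketch, assuming the cfs_enable_1 field of struct hd_driveid holds
IDENTIFY word 85:

    #include <sys/ioctl.h>
    #include <linux/hdreg.h>   /* HDIO_GET_IDENTITY, struct hd_driveid */

    /* Returns 1 if the (emulated) IDE disk advertises an enabled
       write cache, 0 if not, -1 on error. */
    static int write_cache_enabled(int fd)
    {
        struct hd_driveid id;
        if (ioctl(fd, HDIO_GET_IDENTITY, &id) < 0)
            return -1;
        return (id.cfs_enable_1 >> 5) & 1;   /* word 85, WCACHE bit */
    }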