Message ID | 1273528310-7051-3-git-send-email-agraf@suse.de |
---|---|
State | New |
Am 10.05.2010 23:51, schrieb Alexander Graf:
> Usually the guest can tell the host to flush data to disk. In some cases we
> don't want to flush though, but try to keep everything in cache.
>
> So let's add a new parameter to -drive that allows us to set the flushing
> behavior to "on" or "off", defaulting to enabling the guest to flush.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>

What about another cache=... value instead of adding more options? I'm
quite sure you'll only ever need this with writeback caching. So we
could have cache=none|writethrough|writeback|wb-noflush or something
like that.

> ---
>  block/raw-posix.c |   13 +++++++++++++

This is obviously wrong. If you want to introduce new behaviour to the
block layer, you must do it consistently and not just for one block
driver. So these changes should be made to the generic functions in
block.c instead.

Kevin
On Tue, May 11, 2010 at 10:36:24AM +0200, Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
>> Usually the guest can tell the host to flush data to disk. In some
>> cases we don't want to flush though, but try to keep everything in
>> cache.
>>
>> So let's add a new parameter to -drive that allows us to set the
>> flushing behavior to "on" or "off", defaulting to enabling the guest
>> to flush.
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.

The cache option really isn't too useful. There's a matrix of 3x2
possible I/O modes for the posix backend, and right now we only expose
two of them. I think we really should expose all of them, split into
two options:

                               caching
                      | normal             | O_DIRECT   |
             ---------+--------------------+------------+
             none     | -                  | -          |
   integrity O_DSYNC  | cache=writeback    | -          |
             fsync    | cache=writethrough | cache=none |
             ---------+--------------------+------------+

While we have to keep the cache= option for legacy purposes we're much
better off having two new options for the I/O mode and integrity to
express everything we need. Especially given that I'm looking into
allowing the guest to tweak the volatile write cache setting of the
emulated disk, which will allow us to turn O_SYNC on/off on the fly.
In the current scheme that would introduce an O_DSYNC|O_DIRECT mode
that's not currently accessible through the command line/monitor.

Markus was looking into separating the block device state in host/
guest portions lately, and splitting cache= would help with this.
Driver cache enabled or not (fsync means enabled, O_DSYNC disabled,
and none probably disabled given that we don't care), while O_DIRECT
vs normal cached I/O is host state.
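Read as two independent axes, the matrix could map to open(2) flags
roughly as in this sketch (names invented, not qemu code):

    #define _GNU_SOURCE
    #include <fcntl.h>

    enum integrity { INTEGRITY_NONE, INTEGRITY_ODSYNC, INTEGRITY_FSYNC };

    /* Column choice: host-cached vs O_DIRECT.  Row choice: none /
     * O_DSYNC / fsync-on-flush.  Only the O_DSYNC row needs an extra
     * open flag. */
    static int backend_open_flags(int host_cached, enum integrity how)
    {
        int flags = O_RDWR;
        if (!host_cached)
            flags |= O_DIRECT;   /* bypass the host page cache */
        if (how == INTEGRITY_ODSYNC)
            flags |= O_DSYNC;    /* every write reaches stable storage */
        /* INTEGRITY_FSYNC: no flag; call fdatasync() on guest flush.
         * INTEGRITY_NONE: never flush at all. */
        return flags;
    }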
> > What about another cache=... value instead of adding more options? I'm
> > quite sure you'll only ever need this with writeback caching. So we
> > could have cache=none|writethrough|writeback|wb-noflush or something
> > like that.

I agree.

> The cache option really isn't too useful. There's a matrix of 3x2
> possible I/O modes for the posix backend, and right now we only expose
> two of them. I think we really should expose all of them, split into
> two options:
>
>                                caching
>                       | normal             | O_DIRECT   |
>              ---------+--------------------+------------+
>              none     | -                  | -          |
>    integrity O_DSYNC  | cache=writeback    | -          |
>              fsync    | cache=writethrough | cache=none |
>              ---------+--------------------+------------+

I think this table is misleading, and from a user perspective there are
only 4 interesting combinations:

cache=none:
  No host caching. Reads and writes both go directly to underlying
  storage. Useful to avoid double-caching.

cache=writethrough
  Reads are cached. Writes go directly to underlying storage. Useful
  for broken guests that aren't aware of drive caches.

cache=writeback
  Reads and writes are cached. Guest flushes are honoured. Note that
  this may include a guest visible knob that allows the write cache to
  be disabled. If the guest requests the cache be disabled then this
  should act the same as cache=writethrough. Closest to how real
  hardware works.

cache=always (or a more scary name like cache=lie to defend against idiots)
  Reads and writes are cached. Guest flushes are ignored. Useful for
  dumb guests in non-critical environments.

IMO the other boxes in your table (disable cache but require guest
flushes) don't make any sense.

Paul
On 05/11/2010 07:15 AM, Paul Brook wrote:
>>> What about another cache=... value instead of adding more options?
>>> I'm quite sure you'll only ever need this with writeback caching. So
>>> we could have cache=none|writethrough|writeback|wb-noflush or
>>> something like that.
>
> I think this table is misleading, and from a user perspective there
> are only 4 interesting combinations:

I agree that splitting cache would be unfortunate for users.

> cache=none:
>   No host caching. Reads and writes both go directly to underlying
>   storage. Useful to avoid double-caching.
>
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage. Useful
>   for broken guests that aren't aware of drive caches.
>
> cache=writeback
>   Reads and writes are cached. Guest flushes are honoured. Note that
>   this may include a guest visible knob that allows the write cache to
>   be disabled. If the guest requests the cache be disabled then this
>   should act the same as cache=writethrough. Closest to how real
>   hardware works.
>
> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored. Useful for
>   dumb guests in non-critical environments.

I really don't believe that we should support a cache=lie. There are
many other ways to obtain the same results. For instance, mount your
guest filesystem with barrier=0.

Introducing a mode where data integrity isn't preserved means that it's
possible for someone to accidentally use it, or use it when they really
shouldn't. People already struggle with what caching mode they should
use, and this would just make the problem worse.

Regards,

Anthony Liguori

> IMO the other boxes in your table (disable cache but require guest
> flushes) don't make any sense.
>
> Paul
> > cache=always (or a more scary name like cache=lie to defend against
> > idiots)
> >
> >   Reads and writes are cached. Guest flushes are ignored. Useful for
> >   dumb guests in non-critical environments.
>
> I really don't believe that we should support a cache=lie. There are
> many other ways to obtain the same results. For instance, mount your
> guest filesystem with barrier=0.

Ideally yes. However in practice I suspect this is still a useful
option. Is it even possible to disable barriers in all cases (e.g. NTFS
under windows)?

In a production environment it's probably not so useful - you're
generally dealing with long lived, custom configured guests.

In a development environment the rules can be a bit different. For
example if you're testing an OS installer then you really don't want to
be passing magic mount options. If the host machine dies then you don't
care about the state of the guest because you're going to start from
scratch anyway.

Paul
On 05/11/2010 08:12 AM, Paul Brook wrote:
>>> cache=always (or a more scary name like cache=lie to defend against
>>> idiots)
>>>
>>>   Reads and writes are cached. Guest flushes are ignored. Useful for
>>>   dumb guests in non-critical environments.
>>
>> I really don't believe that we should support a cache=lie. There are
>> many other ways to obtain the same results. For instance, mount your
>> guest filesystem with barrier=0.
>
> Ideally yes. However in practice I suspect this is still a useful
> option. Is it even possible to disable barriers in all cases (e.g.
> NTFS under windows)?
>
> In a production environment it's probably not so useful - you're
> generally dealing with long lived, custom configured guests.
>
> In a development environment the rules can be a bit different. For
> example if you're testing an OS installer then you really don't want
> to be passing magic mount options. If the host machine dies then you
> don't care about the state of the guest because you're going to start
> from scratch anyway.

Then create a mount point on your host and mount the host file system
under that mount with barrier=0.

The problem with options added for developers is that those options are
very often accidentally used for production.

Regards,

Anthony Liguori

> Paul
> > In a development environment the rules can be a bit different. For
> > example if you're testing an OS installer then you really don't want
> > to be passing magic mount options. If the host machine dies then you
> > don't care about the state of the guest because you're going to
> > start from scratch anyway.
>
> Then create a mount point on your host and mount the host file system
> under that mount with barrier=0.
>
> The problem with options added for developers is that those options
> are very often accidentally used for production.

I disagree. We should not be removing or rejecting features just
because they allow you to shoot yourself in the foot. We probably
shouldn't be enabling them by default, but that's a whole different
question.

Mounting a filesystem with barrier=0 is not a good answer because it's
a global setting. While my qemu VM may be disposable, it's unlikely
that the same is true of the rest of my machine. By your argument linux
shouldn't be allowing me to do that in the first place, because a dumb
sysadmin could use that option on the filesystem containing the mail
store. In fact with the average user that's *likely* to happen because
they'll only have a single partition on their machine.

Paul

N.B. in this context "developers" refers to users who happen to be
using qemu as part of a toolset for development of $whatever, not
development of qemu itself. c.f. "production" where qemu is part of the
IT infrastructure used to directly provide an external service.
Paul Brook wrote:
>>> What about another cache=... value instead of adding more options?
>>> I'm quite sure you'll only ever need this with writeback caching. So
>>> we could have cache=none|writethrough|writeback|wb-noflush or
>>> something like that.
>
> I agree.
>
>> The cache option really isn't too useful. There's a matrix of 3x2
>> possible I/O modes for the posix backend, and right now we only
>> expose two of them. I think we really should expose all of them,
>> split into two options:
>>
>>                                caching
>>                       | normal             | O_DIRECT   |
>>              ---------+--------------------+------------+
>>              none     | -                  | -          |
>>    integrity O_DSYNC  | cache=writeback    | -          |
>>              fsync    | cache=writethrough | cache=none |
>>              ---------+--------------------+------------+
>
> I think this table is misleading, and from a user perspective there
> are only 4 interesting combinations:
>
> cache=none:
>   No host caching. Reads and writes both go directly to underlying
>   storage. Useful to avoid double-caching.
>
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage. Useful
>   for broken guests that aren't aware of drive caches.
>
> cache=writeback
>   Reads and writes are cached. Guest flushes are honoured. Note that
>   this may include a guest visible knob that allows the write cache to
>   be disabled. If the guest requests the cache be disabled then this
>   should act the same as cache=writethrough. Closest to how real
>   hardware works.
>
> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored. Useful for
>   dumb guests in non-critical environments.

How about cache=insecure? But I agree. Having that as a cache= option
seems to be the easiest way to expose it to a user.

Alex
On 05/11/2010 08:50 AM, Paul Brook wrote:
>>> In a development environment the rules can be a bit different. For
>>> example if you're testing an OS installer then you really don't want
>>> to be passing magic mount options. If the host machine dies then you
>>> don't care about the state of the guest because you're going to
>>> start from scratch anyway.
>>
>> Then create a mount point on your host and mount the host file system
>> under that mount with barrier=0.
>>
>> The problem with options added for developers is that those options
>> are very often accidentally used for production.
>
> I disagree. We should not be removing or rejecting features just
> because they allow you to shoot yourself in the foot. We probably
> shouldn't be enabling them by default, but that's a whole different
> question.

I disagree and think the mentality severely hurts usability. QEMU's
role should be to enable features, not to simplify every obscure
feature. In general, if someone wants to accomplish something, we
should try to provide a mechanism to accomplish it.
cache=none|writeback|writethrough is an example of this. No one other
than QEMU can control how we open a file descriptor so we need to
provide a knob for it.

But if the goal is to make sure that fsync's don't result in data
actually being on disk, there are many other ways to accomplish this.
First, for the vast majority of users, this is already the case because
ext3 defaults to disabling barriers. Alex is running into this issue
only because SuSE enables barriers by default for ext3 and fsync()'s
are horribly slow on ext3. The fact that this is measurable is purely a
statement of ext3 suckage and a conscious decision by the SuSE team to
live with that suckage. It shouldn't be nearly as bad on ext4.

But even in this context, it's not a huge burden to disable barriers
within the guest, or to create a dedicated filesystem mount for your
images on the host that disables barriers on the mount.

On the other hand, the cost of adding another caching option is pretty
significant. Very few users really understand what caching options they
should use under what circumstance. Adding another option will just
make that worse. The fact that this new option can result in data
corruption that won't occur under normal circumstances is troubling.

> Mounting a filesystem with barrier=0 is not a good answer because it's
> a global setting. While my qemu VM may be disposable, it's unlikely
> that the same is true of the rest of my machine.

You can have multiple mounts. In fact, you can just make a loopback
mount within your existing file system, which means that you don't even
have to muck with your disk.

If your VM is disposable, then shouldn't you be using -snapshot?

> By your argument linux shouldn't be allowing me to do that in the
> first place, because a dumb sysadmin could use that option on the
> filesystem containing the mail store. In fact with the average user
> that's *likely* to happen because they'll only have a single partition
> on their machine.

We aren't preventing sophisticated users from doing sophisticated
things. But why should we simplify something that most people don't
need, and that will cause bad things to happen if they accidentally
use it?

Regards,

Anthony Liguori
>> I disagree. We should not be removing or rejecting features just
>> because they allow you to shoot yourself in the foot. We probably
>> shouldn't be enabling them by default, but that's a whole different
>> question.
>
> I disagree and think the mentality severely hurts usability. QEMU's
> role should be to enable features, not to simplify every obscure
> feature. In general, if someone wants to accomplish something, we
> should try to provide a mechanism to accomplish it.
> cache=none|writeback|writethrough is an example of this. No one other
> than QEMU can control how we open a file descriptor so we need to
> provide a knob for it.

Doesn't the same argument apply to the existing cache=writethrough?
i.e. if you want to avoid data loss you should make sure your guest
issues flushes properly, and it's not something qemu should be trying
to hack around by adding an implicit flush after every write.

Paul
Anthony Liguori wrote:
> On 05/11/2010 08:12 AM, Paul Brook wrote:
>>>> cache=always (or a more scary name like cache=lie to defend against
>>>> idiots)
>>>>
>>>>   Reads and writes are cached. Guest flushes are ignored. Useful
>>>>   for dumb guests in non-critical environments.
>>>
>>> I really don't believe that we should support a cache=lie. There are
>>> many other ways to obtain the same results. For instance, mount your
>>> guest filesystem with barrier=0.
>>
>> Ideally yes. However in practice I suspect this is still a useful
>> option. Is it even possible to disable barriers in all cases (e.g.
>> NTFS under windows)?
>>
>> In a production environment it's probably not so useful - you're
>> generally dealing with long lived, custom configured guests.
>>
>> In a development environment the rules can be a bit different. For
>> example if you're testing an OS installer then you really don't want
>> to be passing magic mount options. If the host machine dies then you
>> don't care about the state of the guest because you're going to start
>> from scratch anyway.
>
> Then create a mount point on your host and mount the host file system
> under that mount with barrier=0.

Two reasons that advice doesn't work:

1. It doesn't work in many environments. You can't mount a filesystem
   with barrier=0 in one place and barrier=1 on a different point, and
   there's often only one host partition.

2. barrier=0 does _not_ provide the cache=off behaviour. It only
   disables barriers; it does not prevent writing to the disk hardware.
   If you are doing a transient OS install, ideally you want an amount
   equal to your free RAM not written to disk until the end. barrier=0
   does not achieve that.

> The problem with options added for developers is that those options
> are very often accidentally used for production.

We already have risky cache= options. Also, do we call fdatasync (with
barrier) on _every_ write for guests which disable the emulated disk
cache?

-- Jamie
On 05/11/2010 10:53 AM, Paul Brook wrote:
>>> I disagree. We should not be removing or rejecting features just
>>> because they allow you to shoot yourself in the foot. We probably
>>> shouldn't be enabling them by default, but that's a whole different
>>> question.
>>
>> I disagree and think the mentality severely hurts usability. QEMU's
>> role should be to enable features, not to simplify every obscure
>> feature. In general, if someone wants to accomplish something, we
>> should try to provide a mechanism to accomplish it.
>> cache=none|writeback|writethrough is an example of this. No one other
>> than QEMU can control how we open a file descriptor so we need to
>> provide a knob for it.
>
> Doesn't the same argument apply to the existing cache=writethrough?
> i.e. if you want to avoid data loss you should make sure your guest
> issues flushes properly, and it's not something qemu should be trying
> to hack around by adding an implicit flush after every write.

cache is the host page cache acting as an extended disk cache. In
writethrough mode, the behavior is identical to writethrough on a
normal disk cache in that all operations are completed only when sent
down to the next storage layer. In writeback mode, you rely on the
integrity of the host page cache to prevent data loss.

Data loss is different than data corruption though. In neither mode
will a correct application experience data corruption, because if a
guest issues a flush, we respect those requests.

However, the new proposed mode would introduce the possibility of data
corruption because it becomes possible for writes to be written to disk
outside the order requested by the guest. For example, a guest
journalling filesystem expects its journal commit record to reach disk
only after the journal entries it covers; reorder those writes and lose
power in between, and the journal no longer matches reality. In the
event of power loss, the result is corruption.

Regards,

Anthony Liguori

> Paul
On 05/11/2010 11:32 AM, Jamie Lokier wrote:
> Anthony Liguori wrote:
>> On 05/11/2010 08:12 AM, Paul Brook wrote:
>>>>> cache=always (or a more scary name like cache=lie to defend
>>>>> against idiots)
>>>>>
>>>>>   Reads and writes are cached. Guest flushes are ignored. Useful
>>>>>   for dumb guests in non-critical environments.
>>>>
>>>> I really don't believe that we should support a cache=lie. There
>>>> are many other ways to obtain the same results. For instance, mount
>>>> your guest filesystem with barrier=0.
>>>
>>> Ideally yes. However in practice I suspect this is still a useful
>>> option. Is it even possible to disable barriers in all cases (e.g.
>>> NTFS under windows)?
>>>
>>> In a production environment it's probably not so useful - you're
>>> generally dealing with long lived, custom configured guests.
>>>
>>> In a development environment the rules can be a bit different. For
>>> example if you're testing an OS installer then you really don't want
>>> to be passing magic mount options. If the host machine dies then you
>>> don't care about the state of the guest because you're going to
>>> start from scratch anyway.
>>
>> Then create a mount point on your host and mount the host file system
>> under that mount with barrier=0.
>
> Two reasons that advice doesn't work:
>
> 1. It doesn't work in many environments. You can't mount a filesystem
>    with barrier=0 in one place and barrier=1 on a different point, and
>    there's often only one host partition.

qemu-img create -f raw foo.img 10G
mkfs.ext3 foo.img
mount -oloop,rw,barrier=1 -t ext3 foo.img mnt

Works perfectly fine.

> 2. barrier=0 does _not_ provide the cache=off behaviour. It only
>    disables barriers; it does not prevent writing to the disk
>    hardware.

The proposal has nothing to do with cache=off.

>> The problem with options added for developers is that those options
>> are very often accidentally used for production.
>
> We already have risky cache= options. Also, do we call fdatasync (with
> barrier) on _every_ write for guests which disable the emulated disk
> cache?

None of our cache= options should result in data corruption on power
loss. If they do, it's a bug.

Regards,

Anthony Liguori

> -- Jamie
Anthony Liguori wrote:
> qemu-img create -f raw foo.img 10G
> mkfs.ext3 foo.img
> mount -oloop,rw,barrier=1 -t ext3 foo.img mnt
>
> Works perfectly fine.

Hmm, interesting. Didn't know loop propagated barriers.

So you're suggesting to use qemu with a loop device, and ext2 (bit
faster than ext3) and barrier=0 (well, that's implied if you use ext2),
and a raw image file on the ext2/3 filesystem, to provide the effect of
flush=off, because the loop device caches block writes on the host,
except for explicit barrier requests from the fs, which are turned off?

That wasn't obvious the first time :-)

Does the loop device cache fs writes instead of propagating them
immediately to the underlying fs? I guess it probably does.

Does the loop device allow the backing file to grow sparsely, to get
behaviour like qcow2? That's ugly but it might just work.

>> 2. barrier=0 does _not_ provide the cache=off behaviour. It only
>>    disables barriers; it does not prevent writing to the disk
>>    hardware.
>
> The proposal has nothing to do with cache=off.

Sorry, I meant flush=off (the proposal). Mounting the host filesystem
(i.e. not using a loop device anywhere) with barrier=0 doesn't have
even close to the same effect.

>>> The problem with options added for developers is that those options
>>> are very often accidentally used for production.
>>
>> We already have risky cache= options. Also, do we call fdatasync
>> (with barrier) on _every_ write for guests which disable the emulated
>> disk cache?
>
> None of our cache= options should result in data corruption on power
> loss. If they do, it's a bug.

(I might have the details below a bit off.)

If cache=none uses O_DIRECT without calling fdatasync for guest
barriers, then it will get data corruption on power loss.

If cache=none does call fdatasync for guest barriers, then it might
still get corruption on power loss; I am not sure if recent Linux host
behaviour of O_DIRECT+fdatasync (with no buffered writes to commit)
issues the necessary barriers. I am quite sure that older kernels did
not.

cache=writethrough will get data corruption on power loss with older
Linux host kernels. O_DSYNC did not issue barriers. I'm not sure if the
behaviour of O_DSYNC that was recently changed is now issuing barriers
after every write.

Provided all the cache= options call fdatasync/fsync when the guest
issues a cache flush, and call fdatasync/fsync following _every_ write
when the guest has disabled the emulated write cache, that should be as
good as Qemu can reasonably do. It's up to the host from there.

-- Jamie
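A minimal sketch of the rule in that last paragraph, assuming a
hypothetical per-device write_cache flag for the guest-visible emulated
disk cache (qemu is not actually structured this way):

    #include <errno.h>
    #include <stdbool.h>
    #include <unistd.h>

    /* If the guest disabled the emulated write cache, commit every
     * write; a guest cache flush always maps to fdatasync(). */
    static ssize_t write_committed(int fd, const void *buf, size_t len,
                                   off_t off, bool write_cache)
    {
        ssize_t ret = pwrite(fd, buf, len, off);
        if (ret < 0)
            return -errno;
        if (!write_cache)
            fdatasync(fd);   /* cache off: the write must be stable now */
        return ret;
    }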
Paul Brook wrote:
> cache=none:
>   No host caching. Reads and writes both go directly to underlying
>   storage. Useful to avoid double-caching.
>
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage. Useful
>   for broken guests that aren't aware of drive caches.

These are misleading descriptions - because cache=none does not push
writes down to powerfail-safe storage, while cache=writethrough might.

> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored. Useful for
>   dumb guests in non-critical environments.

cache=unsafe would tell it like it is. Even non-idiots could be excused
for getting the wrong impression from cache=always.

-- Jamie
On 05/11/2010 11:36 AM, Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
>> Usually the guest can tell the host to flush data to disk. In some
>> cases we don't want to flush though, but try to keep everything in
>> cache.
>>
>> So let's add a new parameter to -drive that allows us to set the
>> flushing behavior to "on" or "off", defaulting to enabling the guest
>> to flush.
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.

I implemented this once and promptly forgot about it, as cache=volatile.
On 05/11/2010 06:40 PM, Anthony Liguori wrote:
> But if the goal is to make sure that fsync's don't result in data
> actually being on disk, there are many other ways to accomplish this.
> First, for the vast majority of users, this is already the case
> because ext3 defaults to disabling barriers. Alex is running into this
> issue only because SuSE enables barriers by default for ext3 and
> fsync()'s are horribly slow on ext3. The fact that this is measurable
> is purely a statement of ext3 suckage and a conscious decision by the
> SuSE team to live with that suckage. It shouldn't be nearly as bad on
> ext4.

There is a huge difference between disabling barriers and
cache=volatile. With barrier=0, data is still forced out of host
pagecache and into disk cache; it's simply not forced from disk cache
to the platter. But since the disk cache is very limited, you'll still
be running at disk speed. With cache=volatile and enough memory you'll
never wait for the disk to write data.

> On the other hand, the cost of adding another caching option is pretty
> significant. Very few users really understand what caching options
> they should use under what circumstance. Adding another option will
> just make that worse. The fact that this new option can result in data
> corruption that won't occur under normal circumstances is troubling.

I don't think it's a real problem with proper naming, but we can always
hide the option behind a ./configure --enable-developer-options.

>> Mounting a filesystem with barrier=0 is not a good answer because
>> it's a global setting. While my qemu VM may be disposable, it's
>> unlikely that the same is true of the rest of my machine.
>
> You can have multiple mounts. In fact, you can just make a loopback
> mount within your existing file system, which means that you don't
> even have to muck with your disk.
>
> If your VM is disposable, then shouldn't you be using -snapshot?

For my use case (autotest) the VM is not disposable (it's reused
between tests) but I don't care about the data in case of a host crash.

>> By your argument linux shouldn't be allowing me to do that in the
>> first place, because a dumb sysadmin could use that option on the
>> filesystem containing the mail store. In fact with the average user
>> that's *likely* to happen because they'll only have a single
>> partition on their machine.
>
> We aren't preventing sophisticated users from doing sophisticated
> things. But why should we simplify something that most people don't
> need, and that will cause bad things to happen if they accidentally
> use it?

We don't have a real alternative.
> Paul Brook wrote:
>> cache=none:
>>   No host caching. Reads and writes both go directly to underlying
>>   storage. Useful to avoid double-caching.
>>
>> cache=writethrough
>>   Reads are cached. Writes go directly to underlying storage. Useful
>>   for broken guests that aren't aware of drive caches.
>
> These are misleading descriptions - because cache=none does not push
> writes down to powerfail-safe storage, while cache=writethrough might.

If so, then this is a serious bug.

Paul
>> Paul Brook wrote:
>>> cache=none:
>>>   No host caching. Reads and writes both go directly to underlying
>>>   storage. Useful to avoid double-caching.
>>>
>>> cache=writethrough
>>>   Reads are cached. Writes go directly to underlying storage. Useful
>>>   for broken guests that aren't aware of drive caches.
>>
>> These are misleading descriptions - because cache=none does not push
>> writes down to powerfail-safe storage, while cache=writethrough
>> might.
>
> If so, then this is a serious bug.

.. though it may be a kernel bug rather than a qemu bug, depending on
the exact details. Either way, I consider any mode that inhibits host
filesystem write cache but not volatile drive cache to be pretty
worthless. Either we guarantee data integrity on completion or we
don't.

Paul
> On 05/11/2010 10:53 AM, Paul Brook wrote:
>>>> I disagree. We should not be removing or rejecting features just
>>>> because they allow you to shoot yourself in the foot. We probably
>>>> shouldn't be enabling them by default, but that's a whole different
>>>> question.
>>>
>>> I disagree and think the mentality severely hurts usability. QEMU's
>>> role should be to enable features, not to simplify every obscure
>>> feature. In general, if someone wants to accomplish something, we
>>> should try to provide a mechanism to accomplish it.
>>> cache=none|writeback|writethrough is an example of this. No one
>>> other than QEMU can control how we open a file descriptor so we need
>>> to provide a knob for it.
>>
>> Doesn't the same argument apply to the existing cache=writethrough?
>> i.e. if you want to avoid data loss you should make sure your guest
>> issues flushes properly, and it's not something qemu should be trying
>> to hack around by adding an implicit flush after every write.
>
> cache is the host page cache acting as an extended disk cache. In
> writethrough mode, the behavior is identical to writethrough on a
> normal disk cache in that all operations are completed only when sent
> down to the next storage layer.

IMO this is a bug. Making host pagecache writethrough but still having
a volatile writeback disk cache seems like a complete waste of time. I
can see the advantage of disabling host pagecache (avoid double caching
in host RAM), but having different levels of cache be
writethrough/writeback seems extremely suspect.

It's also occurred to me that you're basing your arguments on the
assumption that host pagecache is volatile. On a machine with a good
UPS this is not true. In the event of external power failure the UPS
will flush the host page cache and cleanly shut the machine down. As
with battery-backed RAID cards, it's entirely reasonable to consider
the cache to be non-volatile storage and ignore the flush requests.

If you don't trust your host OS in this situation then you're into a
whole different level of pain, and that raises obvious questions about
the firmware running on your storage subsystem.

Paul
Paul Brook wrote:
>>> Paul Brook wrote:
>>>> cache=none:
>>>>   No host caching. Reads and writes both go directly to underlying
>>>>   storage. Useful to avoid double-caching.
>>>>
>>>> cache=writethrough
>>>>   Reads are cached. Writes go directly to underlying storage.
>>>>   Useful for broken guests that aren't aware of drive caches.
>>>
>>> These are misleading descriptions - because cache=none does not push
>>> writes down to powerfail-safe storage, while cache=writethrough
>>> might.
>>
>> If so, then this is a serious bug.
>
> .. though it may be a kernel bug rather than a qemu bug, depending on
> the exact details.

It's not a kernel bug. cache=none uses O_DIRECT, and O_DIRECT must not
force writes to powerfail-safe storage. If it did, it would be unusably
slow for applications using O_DIRECT as a performance enhancer / memory
saver. They can call fsync/fdatasync when they need to for integrity.
(There might be kernel bugs in the latter department.)

> Either way, I consider any mode that inhibits host filesystem write
> cache but not volatile drive cache to be pretty worthless.

On the contrary, it greatly reduces host memory consumption so that
guest data isn't cached twice (it's already cached in the guest), and
it may improve performance by relaxing the POSIX write-serialisation
constraint (not sure if Linux cares; Solaris does).

> Either we guarantee data integrity on completion or we don't.

The problem with the description of cache=none is that it uses
O_DIRECT, which does not always push writes to powerfail-safe storage.
O_DIRECT is effectively a hint. It requests less caching in kernel
memory, may reduce memory usage and copying, and may invoke direct DMA.

O_DIRECT does not tell the disk hardware to commit to powerfail-safe
storage. I.e. it doesn't issue barriers or disable disk write caching.
(However, depending on the host setup, it might have that effect if
disk write cache is disabled by the admin.)

Also, it doesn't even always write to disk: it falls back to buffered
in some circumstances, even on filesystems which support it - see
recent patches for btrfs which use buffered I/O for O_DIRECT for some
parts of some files. (Many non-Linux OSes fall back to buffered when
any other process holds a non-O_DIRECT file descriptor, or when
requests don't meet some criteria.)

The POSIX thing to use for cache=none would be O_DSYNC|O_RSYNC, and
that should work on some hosts, but Linux doesn't implement real
O_RSYNC. A combination which ought to work is O_DSYNC|O_DIRECT.
O_DIRECT is the performance hint; O_DSYNC provides the commit request.
Christoph Hellwig has mentioned that combination elsewhere on this
thread. It makes sense to me for cache=none.

O_DIRECT by itself is a useful performance & memory hint, so there does
need to be some option which maps onto O_DIRECT alone. But it shouldn't
be documented as stronger than cache=writethrough, because it isn't.

-- Jamie
Kevin Wolf wrote:
> Am 10.05.2010 23:51, schrieb Alexander Graf:
>> Usually the guest can tell the host to flush data to disk. In some
>> cases we don't want to flush though, but try to keep everything in
>> cache.
>>
>> So let's add a new parameter to -drive that allows us to set the
>> flushing behavior to "on" or "off", defaulting to enabling the guest
>> to flush.
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> What about another cache=... value instead of adding more options? I'm
> quite sure you'll only ever need this with writeback caching. So we
> could have cache=none|writethrough|writeback|wb-noflush or something
> like that.

Yes, cache=volatile seems reasonable. Or cache=unsafe.

>> ---
>>  block/raw-posix.c |   13 +++++++++++++
>
> This is obviously wrong. If you want to introduce new behaviour to the
> block layer, you must do it consistently and not just for one block
> driver. So these changes should be made to the generic functions in
> block.c instead.

How so? The callback functions are called using bdrv->drv->xxx. If I
modify that pointer, I end up affecting all other virtual disks as
well.

Alex
Am 12.05.2010 17:05, schrieb Alexander Graf:
> Kevin Wolf wrote:
>> Am 10.05.2010 23:51, schrieb Alexander Graf:
>>> Usually the guest can tell the host to flush data to disk. In some
>>> cases we don't want to flush though, but try to keep everything in
>>> cache.
>>>
>>> So let's add a new parameter to -drive that allows us to set the
>>> flushing behavior to "on" or "off", defaulting to enabling the guest
>>> to flush.
>>>
>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>
>> What about another cache=... value instead of adding more options?
>> I'm quite sure you'll only ever need this with writeback caching. So
>> we could have cache=none|writethrough|writeback|wb-noflush or
>> something like that.
>
> Yes, cache=volatile seems reasonable. Or cache=unsafe.
>
>>> ---
>>>  block/raw-posix.c |   13 +++++++++++++
>>
>> This is obviously wrong. If you want to introduce new behaviour to
>> the block layer, you must do it consistently and not just for one
>> block driver. So these changes should be made to the generic
>> functions in block.c instead.
>
> How so? The callback functions are called using bdrv->drv->xxx. If I
> modify that pointer, I end up affecting all other virtual disks as
> well.

By doing the check in bdrv_flush/bdrv_aio_flush before this function
pointer is even called.

Kevin
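A minimal sketch of what Kevin is suggesting, reusing the patch's
BDRV_O_NOFLUSH flag in the generic layer (assuming the open flags are
kept on the BlockDriverState; the field name here is a guess, not the
actual code):

    /* block.c - generic flush entry point (sketch) */
    void bdrv_flush(BlockDriverState *bs)
    {
        if (bs->open_flags & BDRV_O_NOFLUSH) {
            return;                   /* drop the flush for every driver */
        }
        if (bs->drv && bs->drv->bdrv_flush) {
            bs->drv->bdrv_flush(bs);  /* only then dispatch to the driver */
        }
    }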
Am 12.05.2010 um 17:36 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 12.05.2010 17:05, schrieb Alexander Graf:
>> Kevin Wolf wrote:
>>> Am 10.05.2010 23:51, schrieb Alexander Graf:
>>>> Usually the guest can tell the host to flush data to disk. In some
>>>> cases we don't want to flush though, but try to keep everything in
>>>> cache.
>>>>
>>>> So let's add a new parameter to -drive that allows us to set the
>>>> flushing behavior to "on" or "off", defaulting to enabling the
>>>> guest to flush.
>>>>
>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>
>>> What about another cache=... value instead of adding more options?
>>> I'm quite sure you'll only ever need this with writeback caching. So
>>> we could have cache=none|writethrough|writeback|wb-noflush or
>>> something like that.
>>
>> Yes, cache=volatile seems reasonable. Or cache=unsafe.
>>
>>>> ---
>>>>  block/raw-posix.c |   13 +++++++++++++
>>>
>>> This is obviously wrong. If you want to introduce new behaviour to
>>> the block layer, you must do it consistently and not just for one
>>> block driver. So these changes should be made to the generic
>>> functions in block.c instead.
>>
>> How so? The callback functions are called using bdrv->drv->xxx. If I
>> modify that pointer, I end up affecting all other virtual disks as
>> well.
>
> By doing the check in bdrv_flush/bdrv_aio_flush before this function
> pointer is even called.

Sorry yeah, realized that 2 minutes after writing the mail myself :).
I put together an updated patch, but need to test it properly first.

Alex
Christoph Hellwig <hch@lst.de> writes:

[...]

> Markus was looking into separating the block device state in host/
> guest portions lately, and splitting cache= would help with this.
> Driver cache enabled or not (fsync means enabled, O_DSYNC disabled,
> and none probably disabled given that we don't care), while O_DIRECT
> vs normal cached I/O is host state.

Let me provide a bit more context.

-drive is a happy mix of guest and host parameters. Some parameters are
clearly host, some clearly guest, and some are... complicated. For
instance, I'd argue that "readonly" applies to both host and guest.

I'm working on turning all guest parameters into proper qdev
properties, and collecting the host parameters in a new -blockdev.
-drive stays for compatibility and convenience (I'd like to deprecate
if=none, though).

If you change -drive, please keep the coming separation in mind, and
avoid parameters that mix up guest and host.
On Tue, May 11, 2010 at 11:11:12PM +0100, Paul Brook wrote:
> .. though it may be a kernel bug rather than a qemu bug, depending on
> the exact details. Either way, I consider any mode that inhibits host
> filesystem write cache but not volatile drive cache to be pretty
> worthless. Either we guarantee data integrity on completion or we
> don't.

O_DIRECT just means bypassing the pagecache, it does not mean flushing
the disk cache on every access, which for certain workloads can be very
painful. It also doesn't require a synchronous writeout of the metadata
required to reach the data, e.g. in the case where we have to allocate
blocks for a sparse image file.

To get the behaviour you want you need O_DIRECT|O_DSYNC, which is
something that is not exposed by qemu's current cache= suboption. If we
want to implement this properly we need to split the cache option, as I
already mentioned. This would also have benefits in other areas, but
again refer to my previous mail for that.
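To make the split concrete, the parsing side could look roughly like
this (the option names and the BDRV_O_DSYNC flag are invented for
illustration; of these flags only BDRV_O_NOCACHE exists today):

    /* Hypothetical: -drive ...,hostcache=on|off,integrity=none|dsync|flush */
    const char *integrity = qemu_opt_get(opts, "integrity");
    int bdrv_flags = 0;

    if (!qemu_opt_get_bool(opts, "hostcache", 1)) {
        bdrv_flags |= BDRV_O_NOCACHE;   /* open the image O_DIRECT */
    }
    if (integrity && !strcmp(integrity, "dsync")) {
        bdrv_flags |= BDRV_O_DSYNC;     /* invented flag: open O_DSYNC */
    }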
On Fri, May 14, 2010 at 11:16:47AM +0200, Markus Armbruster wrote:
> If you change -drive, please keep the coming separation in mind, and
> avoid parameters that mix up guest and host.

This new option seems to be exactly that, unfortunately..
On 17.05.2010, at 14:41, Christoph Hellwig wrote:

> On Fri, May 14, 2010 at 11:16:47AM +0200, Markus Armbruster wrote:
>> If you change -drive, please keep the coming separation in mind, and
>> avoid parameters that mix up guest and host.
>
> This new option seems to be exactly that, unfortunately..

How so? It only affects the host. The guest never gets to see that
flushing is disabled.

Alex
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7541ed2..2510b1b 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -106,6 +106,7 @@ typedef struct BDRVRawState {
     int fd;
     int type;
     int open_flags;
+    int bdrv_flags;
 #if defined(__linux__)
     /* linux floppy specific */
     int64_t fd_open_time;
@@ -133,6 +134,7 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
     BDRVRawState *s = bs->opaque;
     int fd, ret;
 
+    s->bdrv_flags = bdrv_flags;
     s->open_flags = open_flags | O_BINARY;
     s->open_flags &= ~O_ACCMODE;
     if (bdrv_flags & BDRV_O_RDWR) {
@@ -555,6 +557,11 @@ static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
     if (fd_open(bs) < 0)
         return NULL;
 
+    /* Don't flush? */
+    if (s->bdrv_flags & BDRV_O_NOFLUSH) {
+        return bdrv_aio_noop_em(bs, cb, opaque);
+    }
+
     return paio_submit(bs, s->fd, 0, NULL, 0, cb, opaque, QEMU_AIO_FLUSH);
 }
 
@@ -726,6 +733,12 @@ static int raw_create(const char *filename, QEMUOptionParameter *options)
 static void raw_flush(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
+
+    /* No flush means no flush */
+    if (s->bdrv_flags & BDRV_O_NOFLUSH) {
+        return;
+    }
+
     qemu_fdatasync(s->fd);
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index d500885..c358add 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -79,6 +79,9 @@ QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "flush",
+            .type = QEMU_OPT_BOOL,
         },
         { /* end if list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index 12f6b51..69ae8de 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -120,6 +120,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native][,readonly=on|off]\n"
+    "       [,flush=on|off]\n"
     "                use 'file' as a drive image\n", QEMU_ARCH_ALL)
 STEXI
 @item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
@@ -151,6 +152,8 @@ These options have the same definition as they have in @option{-hdachs}.
 @var{cache} is "none", "writeback", or "writethrough" and controls how the host cache is used to access block data.
 @item aio=@var{aio}
 @var{aio} is "threads", or "native" and selects between pthread based disk I/O and native Linux AIO.
+@item flush=@var{flush}
+@var{flush} is "on" (default), or "off" and selects whether the guest can trigger a host flush
 @item format=@var{format}
 Specify which disk @var{format} will be used rather than detecting the format.  Can be used to specifiy format=raw to avoid interpreting
diff --git a/vl.c b/vl.c
index 85bcc84..a7ca2c3 100644
--- a/vl.c
+++ b/vl.c
@@ -787,6 +787,7 @@ DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     int max_devs;
     int index;
     int ro = 0;
+    int flush = 1;
     int bdrv_flags = 0;
     int on_read_error, on_write_error;
     const char *devaddr;
@@ -819,6 +820,7 @@ DriveInfo *drive_init(QemuOpts *opts, void *opaque,
 
     snapshot = qemu_opt_get_bool(opts, "snapshot", 0);
     ro = qemu_opt_get_bool(opts, "readonly", 0);
+    flush = qemu_opt_get_bool(opts, "flush", 1);
 
     file = qemu_opt_get(opts, "file");
     serial = qemu_opt_get(opts, "serial");
@@ -1118,6 +1120,7 @@ DriveInfo *drive_init(QemuOpts *opts, void *opaque,
     }
 
     bdrv_flags |= ro ? 0 : BDRV_O_RDWR;
+    bdrv_flags |= flush ? 0 : BDRV_O_NOFLUSH;
 
     if (bdrv_open(dinfo->bdrv, file, bdrv_flags, drv) < 0) {
         fprintf(stderr, "qemu: could not open disk image %s: %s\n",
Usually the guest can tell the host to flush data to disk. In some cases we
don't want to flush though, but try to keep everything in cache.

So let's add a new parameter to -drive that allows us to set the flushing
behavior to "on" or "off", defaulting to enabling the guest to flush.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 block/raw-posix.c |   13 +++++++++++++
 qemu-config.c     |    3 +++
 qemu-options.hx   |    3 +++
 vl.c              |    3 +++
 4 files changed, 22 insertions(+), 0 deletions(-)
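For reference, an invocation using the proposed option might look like
this (image name made up; flush defaults to on):

    qemu -drive file=disk.img,if=virtio,cache=writeback,flush=off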