Message ID: 20130621143347.GF10730@thunk.org
hi,

On Fri, Jun 21, 2013, at 10:33, Theodore Ts'o wrote:
> Based on how the implementation is currently implemented, any modified
> blocks belonging to the inode will be staged out to disk --- although
> without an explicit CACHE FLUSH command, which is ***extremely***
> expensive.

Okay -- so any modified blocks, not just unallocated ones, therefore fallocate() doesn't affect us here.... Good.

So why are we seeing the problem happen so often? Do you really think this is related to a bug that was introduced in the block layer in 3.0, and that once that bug is fixed, replace-by-rename without fsync() will become "relatively" safe again?

> Why are you using fallocate, by the way?  For small files, fallocate
> is largely pointless.  All of the modern file systems which use
> delayed allocation can do the right thing without fallocate(2).  It
> won't hurt, but it won't help, either.

g_file_set_contents() is a very general-purpose API used by dconf but also by many other things. It is being used to write all kinds of files, large and small. I understand how delayed allocation on ext4 is essentially giving me the same thing automatically for small files that manage to be written out before the kernel decides to do the allocation, but doing this explicitly will mean that I'm always giving the kernel the information it needs, up front, to avoid fragmentation to the greatest extent possible. I see it as "won't hurt and may help" and therefore I do it. I'm happy to remove it on your (justified) advice, but keep in mind that people are using this API for larger files as well...

> The POSIX API is pretty clear: if you care about data being on disk,
> you have to use fsync().

Well, in fairness, it's not even clear on this point. POSIX doesn't really talk about any sort of guarantees across system crashes at all... and I can easily imagine that fsync() still doesn't get me what I want in some really bizarre cases (like an ecryptfs over NFS from a virtual server using an lvm setup running inside of kvm on a machine with hard drives that have buggy firmware). I guess I'm trying to solve for the case of "normal ext4 on a normal partition on real metal with properly working hardware". Subject to those constraints, I'm happy to call fsync().

> As file system designers, we're generally rather hesitant to make
> guarantees beyond that, since most users are hypersensitive about
> performance, and every single time we make additional guarantees
> beyond what is specified by POSIX, it further constrains us, and it
> may be that one of these guarantees is one you don't care about, but
> will impact your performance.  Just as the cost of some guarantee you
> *do* care about may impact the performance of some other application
> which happens to be renaming files but who doesn't necessarily care
> about making sure things are forced to disk atomically in that
> particular instance.

That's fair... and I do realise the pain in the ass that it is if I call fsync() on a filesystem that has an ordered metadata guarantee. I'm asking you to immediately write out my metadata changes that came after other metadata, so you have to write all of it out first. This is part of why I'd rather avoid the fsync entirely...

aside: what's your opinion on fdatasync()? Seems like it wouldn't be good enough for my usecase because I'm changing the size of the file....

another aside: why do you make global guarantees about metadata changes being well-ordered?
It seems quite likely that what's going on on one part of the disk by one user is totally unrelated to what's going on on another part by a different user... ((and I do appreciate the irony that I am committing by complaining about "other guarantees that I don't care about")).

> There are all sorts of rather tricky implications with this.  For
> example, consider what happens if some disk editor does this with a
> small text file.  OK, fine.  Future reads of this text file will get
> the new contents, but if the system crashes, when the file is read, they
> will get the old value.  Now suppose other files are created based on
> that text file.  For example, suppose the text file is a C source
> file, and the compiler writes out an object file based on the source
> file --- and then the system crashes.  What guarantees do we have to
> give about the state of the object file after the crash?  What if the
> object file contains the compiled version of the "new" source file,
> but that source file has reverted to its original value.  Can you
> imagine how badly make would get confused with such a thing?

Ya... I can see this. I don't think it's important for normal users, but this is an argument that goes to the heart of "what is a normal user" and is honestly not a useful discussion to have here... I guess in fact this answers my previous question about "why do you care about metadata changes being well ordered?" The answer is "make".

In any case, I don't expect that you'd change your existing guarantees about the filesystem. I'm suggesting that using this new 'replace file with contents' API, however, would indicate that I am only interested in this one thing happening, and I don't care how it relates to anything else. If we wanted a way to express that one file's contents should only be replaced after another file's contents (which might be useful, but doesn't concern me) then the API could be made more advanced...

> Beyond the semantic difficulties of such an interface, while I can
> technically think of ways that this might be workable for small files,
> the problem with the file system API is that it's highly generalized,
> and while you might promise that you'd only use it for files less than
> 64k, say, inevitably someone would try to use the exact same interface
> with a multi-megabyte file.  And then complain when it didn't work,
> didn't fulfill the guarantees, or OOM-killed their process, or trashed
> the performance of their entire system....

I know essentially nothing about the block layer or filesystems, but I don't understand why it shouldn't work for files larger than 64k. I would expect this to be reasonably implementable for files up to a non-trivial fraction of the available memory in the system (say 20%). The user has successfully allocated a buffer of this size already, after all... I would certainly not expect anything in the range of "multi-megabyte" (or even up to 10s or 100s of megabytes) to be a problem at all on a system with 4 to 8GB of RAM. If memory pressure really became a big problem you would have two easy outs before reaching for the OOM stick: force the cached data out to disk in real time, or return ENOMEM from this new API (instructing userspace to go about more traditional means of getting their data on disk).
There are a few different threads in this discussion and I think we've gotten away from the original point of my email (which wasn't addressed in your most recent reply): I think you need to update the ext4 documentation to more clearly state that if you care about your data, you really must call fsync().

Thanks
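[Editorial note: for reference, a minimal user-space sketch of the replace-by-rename pattern this thread keeps returning to, with the fsync() step that "if you care about your data, you really must call fsync()" implies. The function name, temp-file name and error handling are illustrative only, not taken from g_file_set_contents().]

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: replace "path" with new contents by writing a temp file,
 * forcing it to disk with fsync(), then renaming it into place.
 * Error handling and partial-write handling are deliberately minimal. */
static int replace_file_contents(const char *path, const char *tmp,
                                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len ||  /* write the new data */
        fsync(fd) != 0) {                       /* force data and metadata to stable storage */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* Atomically swap the new file into place over the old one. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```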
So I've been taking a closer look at the rename code, and there's something I can do which will improve the chances of avoiding data loss on a crash after an application tries to replace file contents via:

1) write foo.new
2) <omit fsync of foo.new>
3) rename foo.new to foo

Those are the kernel patches that I cc'ed you on.

The reason why it's still not a guarantee is because we are not doing a file integrity writeback; this is not as important for small files, but if foo.new is several megabytes, not all of the data blocks will be flushed out before the rename, and this will kill performance, and in some cases it might not be necessary.

Still, for small files ("most config files are smaller than 100k"), this should serve you just fine. Of course, it's not going to be in currently deployed kernels, so I don't know how much these proposed patches will help you. I'm doing it mainly because it helps protect users against (in my mind) unwise application programmers, and it doesn't cost us any extra performance from what we are currently doing, so why not improve things a little?

If you want better guarantees than that, this is the best you can do:

1) write foo.new using file descriptor fd
2) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
3) rename foo.new to foo

This will work on today's kernels, and it should be safe to do for all file systems.

What sync_file_range() will do is to force out all of the data blocks (and if foo.new is some gargantuan DVD ISO image, you may stall for seconds or minutes while the data blocks are written back). It does not force out any of the metadata blocks, or issue a CACHE FLUSH command. But that's OK, because after the rename() operation, at the next journal commit the metadata blocks will be flushed out, and the journal commit will issue a CACHE FLUSH command to the disk. So this is just as safe as using fsync(), and it will be more performant.

However, sync_file_range(2) is a Linux-specific system call, so if you care about portability to other operating systems, you'll have to use fsync() instead on legacy Unix systems. :-)

On Fri, Jun 21, 2013 at 11:24:45AM -0400, Ryan Lortie wrote:
> So why are we seeing the problem happen so often? Do you really think
> this is related to a bug that was introduced in the block layer in 3.0
> and that once that bug is fixed replace-by-rename without fsync() will
> become "relatively" safe again?

So there are a couple of explanations I can think of. As I said, at least one of the test programs was not actually doing a rename() operation to overwrite an existing file. So in that case, seeing a zero-length file after a crash really isn't unexpected. The btrfs wiki also makes it clear that if you aren't doing a rename which deletes an old file, there are no guarantees.

For situations where the application really was doing rename() to overwrite an existing file, we were indeed initiating a non-data-integrity writeback after the rename. So most of the time, users should have been OK. Now, if the application is still issuing lots and lots of updates, say multiple times a second while a window is being moved around, or even once for every single window resize/move, it could just have been a case of bad luck. Another example of bad luck might be the case of Tux Racer writing its high score file, and then shutting down its OpenGL context, which promptly caused the Nvidia driver to crash the entire system.
In that case, the two events would be highly correlated, so the chances that the user would get screwed would thus be much, much higher.

Yet another possible cause is crappy flash devices. Not all flash devices are forcing out their internal flash translation layer (FTL) metadata to stable store on a CACHE FLUSH command --- precisely because if they did, it would trash their performance, and getting good scores on AnandTech rankings might be more important to them than the safety of their users' data. As a result, even for an application which is properly calling fsync(), you could see data loss or even file system corruption after a power failure. I recently helped out an embedded systems engineer who was trying to use ext4 in an appliance, and he was complaining that with his Intel SSD things worked fine, but with his Brand X SSD (name omitted to protect the guilty) he was seeing file system corruption after a power plug pull test. I had to tell him that there was nothing I could do.

If the storage device isn't flushing everything to stable store upon receipt of a CACHE FLUSH command, that would be like a file system which didn't properly implement fsync(). If the application doesn't call fsync(), then it's on the application. But if the application calls fsync(), and data loss still occurs, then it's on either the file system or the storage device. Similarly, if the file system doesn't send a CACHE FLUSH command, it's on the file system (or the administrator, if he or she uses the nobarrier mount option, which disables the CACHE FLUSH command). But if the file system does send the CACHE FLUSH command, and the device isn't guaranteeing that all data sent to the storage device can be read after a power pull, then it's on the storage device, and it's the storage device which is buggy.

> g_file_set_contents() is a very general purpose API used by dconf but
> also many other things.  It is being used to write all kinds of files,
> large and small.  I understand how delayed allocation on ext4 is
> essentially giving me the same thing automatically for small files that
> manage to be written out before the kernel decides to do the allocation
> but doing this explicitly will mean that I'm always giving the kernel
> the information it needs, up front, to avoid fragmentation to the
> greatest extent possible.  I see it as "won't hurt and may help" and
> therefore I do it.

So that would be the problem if we defined some new interface which implemented a replace-data-contents functionality. Inevitably it would get used by some crazy application which tried to write a multi-gigabyte file....

If I were to define such a new syscall, what I'd probably do is export it as a set_contents()-type interface. So you would *not* use read or write, but you would send down the new contents in a single data buffer, and if it is too big (where "too big" is completely at the discretion of the kernel) you would get back an error, and it would be up to the application to fall back to the traditional "write to foo.new, rename foo.new to foo" scheme. I don't know if I could get the rest of the file system developers to agree to such an interface, but if we were to do such a thing, that's the proposal I would make.

On the other hand, you keep telling me that dconf is only intended to be used for small config files, and applications should only be calling it rarely. In that case, does it really matter if g_file_set_contents() takes a tenth of a second or so?
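[Editorial note: the set_contents()-style interface described just above exists only in prose; the declaration below is a purely hypothetical sketch of what such a call could look like. The name, signature and errno values are invented here for illustration -- no such system call exists.]

```c
#include <stddef.h>

/* HYPOTHETICAL interface only -- this system call does not exist.
 * The whole new file contents are handed over in a single buffer; the
 * kernel either replaces the file's contents as one operation or
 * refuses (e.g. with ENOMEM) if the buffer is "too big", entirely at
 * its own discretion.  On refusal, the caller is expected to fall back
 * to the traditional "write foo.new; rename foo.new to foo" scheme. */
int set_contents(const char *path, const void *buf, size_t len);
```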
I can see needing to optimize things if g_file_set_contents() is getting called several times a second as the window is getting moved or resized, but I thought we'd agreed that's an abusive use of the interface, and hence not one we should be spending huge amounts of programming effort trying to optimize.

> > The POSIX API is pretty clear: if you care about data being on disk,
> > you have to use fsync().
>
> Well, in fairness, it's not even clear on this point.  POSIX doesn't
> really talk about any sort of guarantees across system crashes at all...

Actually, POSIX does have clear words about this: "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."

And yes, Linux is intended to be implemented with _POSIX_SYNCHRONIZED_IO defined.

There are operating systems where this might not be true, though. For example, Mac OS X. Fun trivia fact: on Mac OS, fsync doesn't result in a CACHE FLUSH command, so the contents of your data writes are not guaranteed after a power failure. If you really want fsync(2) to perform as specified by POSIX, you must use fcntl and F_FULLFSYNC on Mac OS. The reason? Surprise, surprise --- performance. And probably because there are too many applications which were calling fsync() several times a second during a window resize on the display thread, or some such, and Steve Jobs wanted things to be silky smooth. :-)

Another fun fact: Firefox used to call fsync() on its UI thread. Guess who got blamed for the resulting UI jank and got all of the hate mail from the users? Hint: not the Firefox developers.... and this sort of thing probably explains Mac OS's design decision vis-a-vis fsync(). There are a lot of application programmers out there, and they outnumber the file system developers --- and not all of them are competent.

> and I can easily imagine that fsync() still doesn't get me what I want
> in some really bizarre cases (like an ecryptfs over NFS from a virtual
> server using an lvm setup running inside of kvm on a machine with hard
> drives that have buggy firmware).

Yes, there are file systems such as NFS which are not POSIX compliant. And yes, you can always have buggy hardware, such as the crap SSD's that are out there. But there's nothing anyone can do if the hardware is crap. (There's a reason why I tend to stick to Intel SSD's on my personal laptops. More recently I've experimented with using a Samsung SSD on one of my machines, but in general, I don't use SSD's from random manufacturers precisely because I don't trust them.....)

> This is part of why I'd rather avoid the fsync entirely...

Well, for your use case, sync_file_range() should actually be sufficient, and it's what I would recommend instead of fsync(), at least for Linux, which has this syscall.

> aside: what's your opinion on fdatasync()?  Seems like it wouldn't be
> good enough for my usecase because I'm changing the size of the file....

fdatasync() is basically sync_file_range() plus a CACHE FLUSH command. Like sync_file_range, it doesn't sync the metadata (and by the way, this includes things like indirect blocks for ext2/3 or extent tree blocks for ext4).
However, for the "write foo.new ; rename foo.new to foo" use case, either sync_file_range() or fdatasync() is fine, since at the point where the rename() is committed via the file system transaction, all of the metadata will be forced out to disk, and there will also be a CACHE FLUSH sent to the device.

So if the desired guarantee is "file foo contains either the old or new data", fsync(), fdatasync() or sync_file_range() will all do, but sync_file_range() will be the best from a performance POV, since it eliminates a duplicate and expensive CACHE FLUSH command from being sent to the disk.

> another aside: why do you make global guarantees about metadata changes
> being well-ordered?  It seems quite likely that what's going on on one
> part of the disk by one user is totally unrelated to what's going on on
> another part by a different user... ((and I do appreciate the
> irony that I am committing by complaining about "other guarantees that I
> don't care about")).

There are two answers. One is that very often there are dependencies between files --- and at the file system level, we don't necessarily know what these dependencies are (for example, between a .y and .c file, and a .c and a .o file with respect to make). There may be (many) files for which it doesn't matter, but how do you tell the file system which file dependencies matter and which don't?

There's another reason why, which is that trying to do this would be horrifically complicated and/or much, much slower. The problem is entangled updates. If you modify an inode, you have to write back the entire inode table block. There may be other inodes that are also in the process of being modified. Similarly, you might have a rename operation, an unlink operation, and a file write operation that all result in changes to a block allocation bitmap. If you want to keep these operations separate, then you need to have very complicated transaction machinery, with intent logs, and rollback logs, and all of the rest of the massive complexity that comes with a full-fledged transactional database. There have been attempts to use a general database engine to implement a generic file system, but they have generally been a performance disaster. One such example happened in the early 90's, when Oracle tried to push this concept in order to sell more OracleDB licenses, but that went over like a lead balloon, and not just because of the pricing issue.

- Ted
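[Editorial note: a small sketch of the Mac OS caveat mentioned above -- on that platform plain fsync() does not force a drive cache flush, so code that wants POSIX-style durability typically follows up with fcntl(F_FULLFSYNC) where it is available. The helper name is invented and error handling is simplified.]

```c
#include <fcntl.h>
#include <unistd.h>

/* Force data to stable storage, covering Mac OS where fsync() alone
 * does not issue a drive cache flush.  Helper name is illustrative. */
static int full_fsync(int fd)
{
#ifdef F_FULLFSYNC
    /* On Mac OS, F_FULLFSYNC asks the drive to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* Fall through to plain fsync() if F_FULLFSYNC is not supported. */
#endif
    return fsync(fd);
}
```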
On Fri, Jun 21, 2013 at 04:35:47PM -0400, Theodore Ts'o wrote:
> So I've been taking a closer look at the rename code, and there's
> something I can do which will improve the chances of avoiding data
> loss on a crash after an application tries to replace file contents
> via:
>
> 1) write foo.new
> 2) <omit fsync of foo.new>
> 3) rename foo.new to foo
>
> Those are the kernel patches that I cc'ed you on.
>
> The reason why it's still not a guarantee is because we are not doing
> a file integrity writeback; this is not as important for small files,
> but if foo.new is several megabytes, not all of the data blocks will
> be flushed out before the rename, and this will kill performance, and
> in some cases it might not be necessary.
>
> Still, for small files ("most config files are smaller than 100k"),
> this should serve you just fine.  Of course, it's not going to be in
> currently deployed kernels, so I don't know how much these proposed
> patches will help you.  I'm doing it mainly because it helps protect
> users against (in my mind) unwise application programmers, and it
> doesn't cost us any extra performance from what we are currently
> doing, so why not improve things a little?
>
> If you want better guarantees than that, this is the best you can do:
>
> 1) write foo.new using file descriptor fd
> 2) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
> 3) rename foo.new to foo
>
> This will work on today's kernels, and it should be safe to do for all
> file systems.

No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion, and not all filesystems synchronise journal flushes with data write IO completion.

Indeed, we have a current "data corruption after power failure" problem found on Ceph storage clusters using XFS for the OSD storage that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE rather than using fsync() to get data to disk.

http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

The question of whether sync_file_range() was safe on ext4 was raised, and my response was:

http://oss.sgi.com/pipermail/xfs/2013-June/027452.html

"> Is sync_file_range(2) similarly problematic with ext4?

In data=writeback mode, most definitely. For data=ordered, I have no idea - the writeback paths in ext4 are ... convoluted, and I hurt my brain every time I look at them. I wouldn't be surprised if there are problems, but they'll be different problems because ext4 doesn't do speculative prealloc..."

.....

> > aside: what's your opinion on fdatasync()?  Seems like it wouldn't be
> > good enough for my usecase because I'm changing the size of the file....
>
> fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
> Like sync_file_range, it doesn't sync the metadata (and by the way,
> this includes things like indirect blocks for ext2/3 or extent tree
> blocks for ext4).

If fdatasync() on ext4 doesn't sync metadata blocks required to access the data that was just written by the fdatasync() call, then it is broken. fdatasync() is supposed to guarantee that all the data in the file and all the metadata *needed to access that data* is on stable storage by the time the fdatasync() completes.

i.e. fdatasync() might just be a data write and cache flush, but in the case where allocation, file size changes, etc. have occurred, it is effectively the equivalent of a full fsync().
So, fdatasync() will do what you want, but the performance overhead will be no different to fsync() in the rename case, because all the metadata pointing to the tmp file needs to be committed as well...

----

But, let me make a very important point here. Nobody should be trying to optimise a general purpose application for a specific filesystem's data integrity behaviour. fsync() and fdatasync() are the gold standard, as they are consistently implemented across all Linux filesystems.

The reason I say this is that we've been down this road before and we should have learnt better from it. Ted, you should recognise this because you were front and centre in the fallout of it:

http://tytso.livejournal.com/61989.html

".... Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds, and uses a journalling mode called data=ordered. What does this mean? ....

... Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data - even though POSIX never really made any such guarantee. ..."

This discussion of "how can we abuse ext4 data=ordered semantics to avoid using fsync()" is heading right back down this path again. It is starting from "fsync on ext4 is too slow", and solutions are being proposed that assume either that everyone is using ext4 (patently untrue) or that all filesystems behave like ext4 (also patently untrue).

To all the application developers reading this: just use fsync()/fdatasync() for operations that require data integrity. Your responsibility is to your users: using methods that don't guarantee data integrity and therefore will result in data loss is indicating you don't place any value on your users' data whatsoever.

There is no place for being fancy when it comes to data integrity - it needs to be reliable and rock-solid. If that means your application is slow, then you need to explain why it is slow to your users and how they can change a knob to make it fast by trading off data integrity. The user can make the choice at that point, and they have no grounds to complain if they lose data at that point because they made a conscious choice to configure their system that way.

IOWs, the choice of whether data can be lost on a crash is one that only the user can make. As such, applications need to be safe by default when it comes to data integrity.

Cheers,

Dave.
On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > This will work on today's kernels, and it should be safe to do for all
> > file systems.
>
> No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> and not all filesystems synchronise journal flushes with data write
> IO completion.

Sorry, what I should have said is:

sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

This *does* wait for I/O completion; the result is the equivalent of filemap_fdatawrite() followed by filemap_fdatawait().

> Indeed, we have a current "data corruption after power failure"
> problem found on Ceph storage clusters using XFS for the OSD storage
> that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> rather than using fsync() to get data to disk.
>
> http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

This wouldn't be a problem in the sequence I suggested.

1) write foo.new
2) sync_file_range(...)
3) rename foo.new to foo

If the system crashes right after foo.new is written, yes, there's no guarantee, since the metadata blocks may not have been written yet. (Although if XFS is exposing stale data as a result of sync_file_range, that's arguably a security bug, since sync_file_range doesn't require root to execute, and it _has_ been part of the Linux interface for a long time.)

Unlike in the Ceph case, in this sequence we are calling sync_file_range on the new file (foo.new). And if we haven't done the rename yet, we don't care about the contents of foo.new. (Although if the file system is causing stale data to be revealed, I'd claim that's a fs bug.) If the rename has completed, the metadata blocks will definitely be journalled and committed for both ext3 and ext4.

So for ext3 and ext4, this sequence will guarantee that the file named "foo" will have either the old data or the new data --- and this is true for either data=ordered or data=writeback. You're the expert for xfs, but I didn't think this sequence was depending on anything file system specific, since filemap_fdatawrite() and filemap_fdatawait() are part of the core Linux FS/MM layer.

> But, let me make a very important point here. Nobody should be
> trying to optimise a general purpose application for a specific
> filesystem's data integrity behaviour. fsync() and fdatasync() are
> the gold standard, as they are consistently implemented across all
> Linux filesystems.

From a philosophical point of view, I agree with you. As I wrote in my earlier messages, assuming the applications aren't abusively calling g_file_set_contents() several times per second, I don't understand why Ryan is trying so hard to optimize it. The fact that he's trying to optimize it seems, at least to me, to indicate a simple admission that there *are* broken applications out there, some of which may be calling it with high frequency, perhaps out of the UI thread.

And having general applications or generic desktop libraries trying to depend on specific implementation details of file systems is really ugly. So it's not something I'm all that excited about.

However, regardless of whether we wish it or not, abusive applications written by incompetent application authors *will* exist, and whether we like it or not, desktop library authors are going to try to coddle said abusive applications by doing these filesystem-implementation-dependent optimizations. GNOME is *already* detecting btrfs and has made optimization decisions based on it in its libraries.
Trying to prevent this is (in my opinion) the equivalent of claiming that the command of King Canute could hold back the tide. Given that, my thinking was to try to suggest the least harmful way of doing so, and so my eye fell on sync_file_range(2) as a generic system call that is not file system specific, with some relatively well defined semantics. And given that we don't care about foo.new until after the rename operation has completed, it seemed to me that this should be safe for more than just ext3 and ext4 (where I am quite sure it is safe).

But if XFS is doing something sufficiently clever that sync_file_range() isn't going to do the right thing, and if we presume that abusive applications will always exist, then maybe it's time to consider implementing a new system call which has very well defined semantics, for those applications that insist on updating a file hundreds of times an hour, demand good performance, and aren't picky about consistency semantics, so long as some version of the file contents is available after a crash.

This system call would of course have to be optional, and so for file systems that don't support it, applications will have to fall back to more traditional approaches, whether that involves fsync() or perhaps sync_file_range(), if that can be made safe and generic for all/most file systems.

Personally, I think application programmers *shouldn't* need such a facility, if their applications are competently designed and implemented. But unfortunately, they outnumber us file system developers, and apparently many of them seem to want to do things their way, whether we like it or not.

Regards,

- Ted
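[Editorial note: for concreteness, a C sketch of the sequence Ted proposes, using the corrected flag combination SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER from his earlier message. As the replies below argue, this is only known to behave as intended on ext3/ext4, so it is a sketch of the proposal, not portable data-integrity code. Names and error handling are illustrative.]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: write foo.new, push its data blocks out and wait for the I/O
 * to complete, then rename over foo.  Unlike fsync(), this issues no
 * cache flush and commits no metadata; whether that is enough is
 * filesystem-dependent (see the discussion in this thread). */
static int replace_with_sync_file_range(const char *path, const char *tmp,
                                        const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len)
        goto fail;

    /* Start writeback of all dirty pages of the file and wait for it. */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        goto fail;

    close(fd);
    return rename(tmp, path);

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```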
> From a philosophical point of view, I agree with you.  As I wrote in
> my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it.  The fact that
> he's trying to optimize it seems, at least to me, to indicate a simple
> admission that there *are* broken applications out there, some of
> which may be calling it with high frequency, perhaps out of the UI
> thread.

Well, one application calling fsync is almost nothing to care about. On the other hand, tens and hundreds of apps doing fsync's is a disaster.

> And having general applications or generic desktop libraries trying to
> depend on specific implementation details of file systems is really
> ugly.  So it's not something I'm all that excited about.

Me too, but people have to do that because the fs API is too generic, and at the same time one has to account for fs specifics in order to make their app take the most advantage, or at least to avoid inefficiencies.

For example, I have an app that constantly does appending writes to about 15 files, and I must ensure that no more than 5 seconds will be lost in the event of a system crash or power loss. How would you do that in a generic way?

The generic and portable way to do it is to start 15 threads and call fsyncs on those fds at the same time. That works fine with JFS since it doesn't do flushes, and it works fine with ext4 because all those fsync's are likely to complete within a single transaction. However, that doesn't scale well and it forces the app to do bursts. A scalable, but still bursty, solution could be io_submit, but afaik no fs currently supports async fsync.

What if you want to distribute the load? A single dedicated thread calling fsync's works fine with JFS, but sucks with ext4. Ok, there is sync_file_range, let's try it out. Luckily I have control over the commit=N option of the underlying ext4 fs, which I leave at the default 5s. Otherwise I would like to have an ioctl to ext4 to force a commit (I'm not sure if fsync on a single fd will commit the currently running transaction). The sync thread calls sync_file_range evenly over the 5s interval, and ext4 does commits every 5s. Nice! But it doesn't work with JFS. Therefore I have two implementations for different file systems.

> Personally, I think application programmers *shouldn't* need such a
> facility, if their applications are competently designed and
> implemented.  But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

I would argue : ) fsync is not the one to rule them all. Its semantics are clear: write all those bytes NOW. The fact that fsync can be used as a barrier doesn't mean it's the best way to do it. There are quite a few cases where write-right-now semantics is absolutely required. More often apps just want atomic file updates and some sort of writeback control, which is available only as a system-wide knob.

As for atomic updates, I'm thinking of something like io_exec() or io_submit_atomic() or whatever name is best for it. Probably it shouldn't be tied to kaio. This syscall would accept an array of iocb's and guarantee atomicity of the update. This shouldn't be a big deal for ext4 to support, because it already supports data journalling, which is however only block/page-wise atomic. Such a syscall wouldn't be undervalued if the majority of file systems supported it.

Regards,
Andrey.
On Sat, Jun 22, 2013 at 01:40:26PM +0000, Sidorov, Andrei wrote:
> Me too, but people have to do that because the fs API is too generic,
> and at the same time one has to account for fs specifics in order to
> make their app take the most advantage, or at least to avoid
> inefficiencies.  For example, I have an app that constantly does
> appending writes to about 15 files, and I must ensure that no more
> than 5 seconds will be lost in the event of a system crash or power
> loss.

Stop right there. If you are doing lots of appending writes, and you don't want to lose more than 5 seconds, why do you have 15 files? Why not send your appending writes into a single file and then, later on, disaggregate the logs when you need to read them? With 15 files, no matter what you are doing, you will be forcing the heads to seek all over the place, and since the 4k blocks won't be full when you sync them, you're wasting write cycles and disk bandwidth.

See what I mean? There's a reason why I said the fault was incompetent application design, not just incompetent implementation.

- Ted
> Stop right there.  If you are doing lots of appending writes, and you
> don't want to lose more than 5 seconds, why do you have 15 files?
> Why not send your appending writes into a single file and then, later on,
> disaggregate the logs when you need to read them?

There are 15 files because those are different entities. There might be fewer of them; some of them can be removed. There is no reason to place those files close to each other, since those are just metadata related to an mpeg stream, and in fact it is best to put the stream and its metadata close to each other. If I wanted to localize those small files, I would just put them on a different partition. Anyway, if I have to implement a file system inside a file, why do I need the top-layer file system? : )

> With 15 files,
> no matter what you are doing, you will be forcing the heads to seek
> all over the place, and since the 4k blocks won't be full when you
> sync them, you're wasting write cycles and disk bandwidth.

Yes, seeks are unavoidable, but that's something I have to account for in the worst-case scenario anyway.

Huh, yes. Another advantage of ext4 is that I can control writeback on a specific file range. I sync only complete pages, and ext4 is happy to commit whatever was synced. A generic solution would be to cache pages in the app and submit only complete ones. The disadvantage of the generic solution is that data cached by the app is invisible to others.

Regards,
Andrey.
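[Editorial note: a rough illustration of the approach Andrey describes above -- syncing only the already-complete pages of an append-only file with sync_file_range() so the filesystem can write them back, while the partially filled tail page stays dirty. The function and variable names are invented for this example and error handling is omitted.]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Illustration only: push the fully written pages of an append-only
 * file toward disk, leaving the incomplete tail page dirty. */
static void sync_complete_pages(int fd, off_t bytes_written)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t complete = (bytes_written / page) * page;  /* round down to a page boundary */

    if (complete > 0)
        sync_file_range(fd, 0, complete,
                        SYNC_FILE_RANGE_WRITE);      /* start writeback, don't wait */
}
```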
On Sat, Jun 22, 2013 at 12:47:18AM -0400, Theodore Ts'o wrote:
> On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > > This will work on today's kernels, and it should be safe to do for all
> > > file systems.
> >
> > No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> > and not all filesystems synchronise journal flushes with data write
> > IO completion.
>
> Sorry, what I should have said is:
>
> sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
>
> This *does* wait for I/O completion; the result is the equivalent of
> filemap_fdatawrite() followed by filemap_fdatawait().

Right, and now it is effectively identical to fsync() in terms of speed. And when performance is the reason for avoiding fsync()....

> > Indeed, we have a current "data corruption after power failure"
> > problem found on Ceph storage clusters using XFS for the OSD storage
> > that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> > rather than using fsync() to get data to disk.
> >
> > http://oss.sgi.com/pipermail/xfs/2013-June/027390.html
>
> This wouldn't be a problem in the sequence I suggested.
>
> 1) write foo.new
> 2) sync_file_range(...)
> 3) rename foo.new to foo

You miss the point - it's not a data integrity operation, and to present it as a way of obtaining data integrity across *all filesystems* because it would work on ext4 is extremely dangerous. You're relying on side effects of a specific filesystem implementation to give you the desired result.

> If the system crashes right after foo.new is written, yes, there's no
> guarantee, since the metadata blocks may not have been written yet.

Sure, but focussing on ext4 behaviour misses the larger issue - sync_file_range() has no well defined integrity rules and hence exposes applications to unexpected behaviours on different filesystems.

> (Although if XFS
> is exposing stale data as a result of sync_file_range, that's
> arguably a security bug, since sync_file_range doesn't require
> root to execute, and it _has_ been part of the Linux interface for
> a long time.)

Yes, it is. But doesn't this point out one of the problems with moving from a well known interface to something that almost nobody uses and (obviously) has never tested in a data integrity test bench previously? It's taken how many years for this problem to be exposed?

We test fsync crash integrity all the time, but do we test sync_file_range()? No, we don't, because it doesn't have any data integrity guarantees that we can validate.

So while the interface may have been around for years, the first serious storage product I've heard of that has tried to use it to optimise away fsync() calls has found that:

a) it exposes strange behaviours on different filesystems;
b) it doesn't provide data integrity guarantees worth a damn; and
c) it isn't any faster than using fsync/fdatasync() in the first place.

And so has reverted back to using fsync()....

> So for ext3 and ext4, this sequence will guarantee that the file

Exactly my point. You're designing for ext3/4 data=ordered behaviour, and so it's not a replacement for fsync/fdatasync...

> > But, let me make a very important point here. Nobody should be
> > trying to optimise a general purpose application for a specific
> > filesystem's data integrity behaviour. fsync() and fdatasync()
> > are the gold standard, as they are consistently implemented across
> > all Linux filesystems.
>
> From a philosophical point of view, I agree with you.
> As I wrote
> in my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it.  The fact
> that he's trying to optimize it seems, at least to me, to indicate a
> simple admission that there *are* broken applications out there,
> some of which may be calling it with high frequency, perhaps out
> of the UI thread.

That's not a problem that should be solved by trading off data integrity and losing users' configurations. Have a backbone, Ted, and tell people they are doing something stupid when they are doing something stupid.

> And having general applications or generic desktop libraries
> trying to depend on specific implementation details of file
> systems is really ugly.  So it's not something I'm all that
> excited about.

Not excited? That's an understatement - it's the sort of thing that gives me nightmares. We've *been here before*. We've been telling application developers to use fsync/fdatasync() since the O_PONIES saga so we don't end up back in that world of hurt. Yet here you are advocating that application developers go back down the rat hole of relying on implicit side effects of the design of a specific filesystem configuration to provide them with data integrity guarantees....

> But if XFS is doing something sufficiently clever that
> sync_file_range() isn't going to do the right thing, and if we
> presume that abusive applications will always exist, then maybe
> it's time to consider implementing a new system call which has
> very well defined semantics,

That's fine - go and define the requirements for a rename_datasync() syscall, write a bunch of xfstests to ensure the API is usable and provides the required behaviour. Then do a slow generic implementation that you wire all the filesystems up to, so it's available to everyone (i.e. no fs probing needed). Applications can start using it immediately, knowing it will work on every filesystem. i.e.:

generic_rename_datasync()
{
	vfs_fsync(src);
	->rename()
}

At that point you can implement a fast, optimal implementation in ext3/4 that nobody outside ext3/4 needs to care about. The BTRFS guys can optimise it appropriately in their own time, as can us XFS people. But the most important point is that we start with an implementation that *works everywhere*...

> Personally, I think application programmers *shouldn't* need such
> a facility, if their applications are competently designed and
> implemented.  But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

And you're too afraid to call out someone when they are doing something stupid. If you don't call them out, they'll keep doing stupid things and the problems will never get solved.

Cheers,

Dave.
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..85f5c85 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,17 +58,24 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
 	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+	bool wakeup_bdi = false;
 
 	if (unlikely(dst == old))		/* deadlock avoidance */
 		return;
 	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
+	if (inode->i_state & I_DIRTY) {
+		if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
+			wakeup_bdi = true;
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	}
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&old->wb.list_lock);
 	spin_unlock(&dst->wb.list_lock);
+
+	if (wakeup_bdi)
+		bdi_wakeup_thread_delayed(dst);
 }
 
 /* Kill _all_ buffers and pagecache , dirty or not.. */