Patchwork ext4 file replace guarantees

Submitter Theodore Ts'o
Date June 21, 2013, 2:33 p.m.
Message ID <20130621143347.GF10730@thunk.org>
Permalink /patch/253230/
State New

Comments

Theodore Ts'o - June 21, 2013, 2:33 p.m.
On Fri, Jun 21, 2013 at 09:51:47AM -0400, Ryan Lortie wrote:
> 
> > #2) If an existing file is removed via a rename, if there are any
> >     delayed allocation blocks in the new file, they will be flushed
> >     out.
> 
> Can you describe to me exactly what this means in terms of the syscall
> API?  Do I void these "delayed allocation blocks" guarantees by using
> fallocate() before-hand?

Based on how this is currently implemented, any modified
blocks belonging to the inode will be staged out to disk --- although
without an explicit CACHE FLUSH command, which is ***extremely***
expensive.

Why are you using fallocate, by the way?  For small files, fallocate
is largely pointless.  All of the modern file systems which use
delayed allocation can do the right thing without fallocate(2).  It
won't hurt, but it won't help, either.

Fallocate is primarily useful for very large files, where the file
system has to start writing the data out to disk due to memory
pressure before the entire file has been written to the page cache.
And it sounds like multi-megabyte dconf files are a bit out of your
design scope.  :-)

> I have a legitimate desire to optimise for my second concern without
> sacrificing my first concern and I don't think that it's unreasonable
> for me to ask where the line is.  I don't really accept this "surf the
> boundaries" business about fuzzy APIs that "should do the right thing
> usually, but no guarantees if you get too close to the line".  I just
> want to know what I have to do in order to be safe.

The POSIX API is pretty clear: if you care about data being on disk,
you have to use fsync().

As file system designers, we're generally rather hesitant to make
guarantees beyond that, since most users are hypersensitive about
performance, and every single time we make additional guarantees
beyond what is specified by POSIX, it further constrains us, and it
may be that one of these guarantees is one you don't care about, but
will impact your performance.  Just as the cost of some guarantee you
*do* care about may impact the performance of some other application
which happens to be renaming files but who doesn't necessarily care
about making sure things are forced to disk atomically in that
particular instance.

> But it brings me to another interesting point: I don't want to sync the
> filesystem.  I really don't care when the data makes it to disk.  I
> don't want to waste battery and SSD life.  But I am forced to.  I only
> want one thing, which is what I said before: I want to know that either
> the old data will be there, or the new data.  The fact that I have to
> force the hard drive to wake up in order to get this is pretty
> unreasonable.
> 
> Why can't we get what we really need?
> 
> Honestly, what would be absolutely awesome is g_file_set_contents() in
> the kernel, with the "old or new" guarantee.  I have all of the data in
> a single buffer and I know the size of it in advance.  I just want you
> to get it onto the disk, at some point that is convenient to you, in
> such a way that I don't end up destroying the old data while still not
> having the new.

There are all sorts of rather tricky implications with this.  For
example, consider what happens if some disk editor does this with a
small text file.  OK, fine.  Future reads of this text file will get
the new contents, but if the system crashes, when the file is read,
readers will get the old value.  Now suppose other files are created
based on that text file.  For example, suppose the text file is a C
source file, and the compiler writes out an object file based on the
source file --- and then the system crashes.  What guarantees do we
have to give about the state of the object file after the crash?  What
if the object file contains the compiled version of the "new" source
file, but that source file has reverted to its original value.  Can
you imagine how badly make would get confused by such a thing?

Beyond the semantic difficulties of such an interface, while I can
technically think of ways that this might be workable for small files,
the problem with the file system API is that it's highly generalized,
and while you might promise that you'd only use it for files less than
64k, say, inevitably someone would try to use the exact same interface
with a multi-megabyte file.  And then complain when it didn't work,
didn't fulfill the guarantees, OOM-killed their process, or trashed
the performance of their entire system....

By the way, if the old and new contents of the data you are replacing
are less than 4k, just open the file using O_DIRECT, and do a 4k
replacement write.  Keep the unused portion of the file padded out
with zeros, so you don't have to worry about the inode's i_size
update, and just do a 4k write.  *Boom*.  This is fast, guaranteed
to work on all file systems, and will result in the least amount of
SSD wear.


By the way, I think I know why people have been seeing more instances
of data loss after a power failure.  There's a bug, introduced back in
the 3.0 kernel, which disabled the writeback timer under some
circumstances.  This is the timer which forces inodes (for all file
systems; this was a generic writeback bug) to be written back after
/proc/sys/vm/dirty_writeback_centisecs.  As a result, files that were
written which did _not_ meet the magic overwrite-via-rename conditions
and did not use fsync were more susceptible to data loss after a
crash.  The patch to fix this was discovered about three weeks ago,
and is attached below.

						- Ted

From: Jan Kara <jack@suse.cz>
To: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
	linux-fsdevel@vger.kernel.org,
	Bert De Jonghe <Bert.DeJonghe@amplidata.com>,
	Jan Kara <jack@suse.cz>, stable@vger.kernel.org
Subject: [PATCH] writeback: Fix periodic writeback after fs mount
Date: Thu, 30 May 2013 10:44:19 +0200
Message-Id: <1369903459-10295-1-git-send-email-jack@suse.cz>
X-Mailer: git-send-email 1.8.1.4

Code in blkdev.c moves a device inode to default_backing_dev_info when
the last reference to the device is put and moves the device inode back
to its bdi when the first reference is acquired. This includes moving to
wb.b_dirty list if the device inode is dirty. The code however doesn't
setup timer to wake corresponding flusher thread and while wb.b_dirty
list is non-empty __mark_inode_dirty() will not set it up either. Thus
periodic writeback is effectively disabled until a sync(2) call which can
lead to unexpected data loss in case of crash or power failure.

Fix the problem by setting up a timer for periodic writeback in case we
add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().

Reported-by: Bert De Jonghe <Bert.DeJonghe@amplidata.com>
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
Ryan Lortie - June 21, 2013, 3:24 p.m.
hi,

On Fri, Jun 21, 2013, at 10:33, Theodore Ts'o wrote:
> Based on how this is currently implemented, any modified
> blocks belonging to the inode will be staged out to disk --- although
> without an explicit CACHE FLUSH command, which is ***extremely***
> expensive.

Okay -- so any modified blocks, not just unallocated ones, therefore
fallocate() doesn't affect us here.... Good.

So why are we seeing the problem happen so often?  Do you really think
this is related to a bug that was introduced in the block layer in 3.0
and that once that bug is fixed replace-by-rename without fsync() will
become "relatively" safe again?

> Why are you using fallocate, by the way?  For small files, fallocate
> is largely pointless.  All of the modern file systems which use
> delayed allocation can do the right thing without fallocate(2).  It
> won't hurt, but it won't help, either.

g_file_set_contents() is a very general purpose API used by dconf but
also many other things.  It is being used to write all kinds of files,
large and small.  I understand how delayed allocation on ext4 is
essentially giving me the same thing automatically for small files that
manage to be written out before the kernel decides to do the allocation
but doing this explicitly will mean that I'm always giving the kernel
the information it needs, up front, to avoid fragmentation to the
greatest extent possible.  I see it as "won't hurt and may help" and
therefore I do it.

I'm happy to remove it on your (justified) advice, but keep in mind that
people are using this API for larger files as well...

> The POSIX API is pretty clear: if you care about data being on disk,
> you have to use fsync().

Well, in fairness, it's not even clear on this point.  POSIX doesn't
really talk about any sort of guarantees across system crashes at all...
and I can easily imagine that fsync() still doesn't get me what I want
in some really bizarre cases (like an ecryptfs over NFS from a virtual
server using an lvm setup running inside of kvm on a machine with hard
drives that have buggy firmware).

I guess I'm trying to solve for the case of "normal ext4 on a normal
partition on real metal with properly working hardware".  Subject to
those constraints, I'm happy to call fsync().

> As file system designers, we're generally rather hesitant to make
> guarantees beyond that, since most users are hypersensitive about
> performance, and every single time we make additional guarantees
> beyond what is specified by POSIX, it further constrains us, and it
> may be that one of these guarantees is one you don't care about, but
> will impact your performance.  Just as the cost of some guarantee you
> *do* care about may impact the performance of some other application
> which happens to be renaming files but who doesn't necessarily care
> about making sure things are forced to disk atomically in that
> particular instance.

That's fair... and I do realise the pain in the ass that it is if I call
fsync() on a filesystem that has an ordered metadata guarantee.  I'm
asking you to immediately write out my metadata changes that came after
other metadata, so you have to write all of it out first.

This is part of why I'd rather avoid the fsync entirely...

aside: what's your opinion on fdatasync()?  Seems like it wouldn't be
good enough for my use case because I'm changing the size of the file....

another aside: why do you make global guarantees about metadata changes
being well-ordered?  It seems quite likely that what's going on on one
part of the disk by one user is totally unrelated to what's going on on
another other part by a different user... ((and I do appreciate the
irony that I am committing by complaining about "other guarantees that I
don't care about")).

> There are all sorts of rather tricky implications with this.  For
> example, consider what happens if some disk editor does this with a
> small text file.  OK, fine.  Future reads of this text file will get
> the new contents, but if the system crashes, when the file is read,
> readers will get the old value.  Now suppose other files are created
> based on that text file.  For example, suppose the text file is a C
> source file, and the compiler writes out an object file based on the
> source file --- and then the system crashes.  What guarantees do we
> have to give about the state of the object file after the crash?  What
> if the object file contains the compiled version of the "new" source
> file, but that source file has reverted to its original value.  Can
> you imagine how badly make would get confused by such a thing?

Ya... I can see this.  I don't think it's important for normal users,
but this is an argument that goes to the heart of "what is a normal
user" and is honestly not a useful discussion to have here...

I guess in fact this answers my previous question about "why do you care
about metadata changes being well ordered?"  The answer is "make".

In any case, I don't expect that you'd change your existing guarantees
about the filesystem.  I'm suggesting that using this new 'replace file
with contents' API, however, would indicate that I am only interested in
this one thing happening, and I don't care how it relates to anything
else.

If we wanted a way to express that one file's contents should only be
replaced after another file's contents (which might be useful, but
doesn't concern me) then the API could be made more advanced...

> Beyond the semantic difficulties of such an interface, while I can
> technically think of ways that this might be workable for small files,
> the problem with the file system API is that it's highly generalized,
> and while you might promise that you'd only use it for files less than
> 64k, say, inevitably someone would try to use the exact same interface
> with a multi-megabyte file.  And then complain when it didn't work,
> didn't fulfill the guarantees, OOM-killed their process, or trashed
> the performance of their entire system....

I know essentially nothing about the block layer or filesystems, but I
don't understand why it shouldn't work for files larger than 64k.  I
would expect this to be reasonably implementable for files up to a
non-trivial fraction of the available memory in the system (say 20%). 
The user has successfully allocated a buffer of this size already, after
all...  I would certainly not expect anything in the range of
"multi-megabyte" (or even up to 10s or 100s of megabytes) to be a
problem at all on a system with 4 to 8GB of RAM.

If memory pressure really became a big problem you would have two easy
outs before reaching for the OOM stick: force the cached data out to
disk in real time, or return ENOMEM from this new API (instructing
userspace to go about more traditional means of getting their data on
disk).


There are a few different threads in this discussion and I think we've
gotten away from the original point of my email (which wasn't addressed
in your most recent reply): I think you need to update the ext4
documentation to more clearly state that if you care about your data,
you really must call fsync().

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - June 21, 2013, 8:35 p.m.
So I've been taking a closer look at the rename code, and there's
something I can do which will improve the chances of avoiding data
loss on a crash after an application tries to replace file contents
via:

1)  write foo.new
2)  <omit fsync of foo.new>
3)  rename foo.new to foo

Those are the kernel patches that I cc'ed you on.

The reason why it's still not a guarantee is because we are not doing
a file integrity writeback; this is not as important for small files,
but if foo.new is several megabytes, not all of the data blocks will
be flushed out before the rename, and this will kill performance, and
in some cases it might not be necessary.

Still, for small files ("most config files are smaller than 100k"),
this should serve you just fine.  Of course, it's not going to be in
currently deployed kernels, so I don't know how much these proposed
patches will help you.  I'm doing it mainly because it helps protect
users against (in my mind) unwise application programmers, and it
doesn't cost us any extra performance from what we are currently
doing, so why not improve things a little?


If you want better guarantees than that, this is the best you can do:

1)  write foo.new using file descriptor fd
2)  sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
3)  rename foo.new to foo

This will work on today's kernels, and it should be safe to do for all
file systems.  What sync_file_range() will do is to force out all of
the data blocks (and if foo.new is some gargantuan DVD ISO image,
you may stall for seconds or minutes while the data blocks are written
back).  It does not force out any of the metadata blocks, or issue a
CACHE_FLUSH command.  But that's OK, because after the rename()
operation, at the next journal commit the metadata blocks will be
flushed out, and the journal commit will issue a CACHE FLUSH command
to the disk.

So this is just as safe as using fsync(), and it will be more
performant.  However, sync_file_range(2) is a Linux-specific system
call, so if you care about portability to other operating systems,
you'll have to use fsync() instead on legacy Unix systems.  :-)


On Fri, Jun 21, 2013 at 11:24:45AM -0400, Ryan Lortie wrote:
> 
> So why are we seeing the problem happen so often?  Do you really think
> this is related to a bug that was introduced in the block layer in 3.0
> and that once that bug is fixed replace-by-rename without fsync() will
> become "relatively" safe again?

So there are a couple of explanations I can think of.  As I said, at
least one of the test programs was not actually doing a rename()
operation to overwrite an existing file.  So in that case, seeing a
zero-length file after a crash really isn't unexpected.  The btrfs
wiki also makes it clear that if you aren't doing a rename which
deletes an old file, there are no guarantees.

For situations where the application really was doing rename() to
overwrite an existing file, we were indeed initiating a
non-data-integrity writeback after the rename.  So most of the time,
users should have been OK.  Now, if the application is still issuing
lots and lots of updates, say multiple times a second while a window
is being moved around, or even once for every single window
resize/move, it could just have been a case of bad luck.

Another example of bad luck might be the case of Tux Racer writing its
high score file, and then shutting down its Open GL context, which
promptly caused the Nvidia driver to crash the entire system.  In that
case, the two events would be highly correlated together, so the
chances that the user would get screwed would thus be much, much
higher.

Yet another possible cause is crappy flash devices.  Not all flash
devices are forcing out their internal flash translation layer (FTL)
metadata to stable store on a CACHE FLUSH command --- precisely
because if they did, it would trash their performance, and getting
good scores on AnandTech rankings might be more important to them than
the safety of their user's data.

As a result, even for an application which is properly calling
fsync(), you could see data loss or even file system corruption after
a power failure.  I recently helped out an embedded systems engineer
who was trying to use ext4 in an appliance, and he was complaining
that with this Intel SSD things worked fine, but with his Brand X SSD
(name omitted to protect the guilty) he was seeing file system
corruption after a power plug pull test.  I had to tell him that
there was nothing I could do.  If the storage device isn't flushing
everything to stable store upon receipt of a CACHE FLUSH command, that
would be like a file system which didn't properly implement fsync().
If the application doesn't call fsync(), then it's on the application.
But if the application calls fsync(), and data loss still occurs, then
it's on either the file system or the storage device.

Similarly, if the file system doesn't send a CACHE FLUSH command, it's
on the file system (or the administrator, if he or she uses the
nobarrier mount option, which disables the CACHE FLUSH command).  But
if the file system does send the CACHE FLUSH command, and the device
isn't guaranteeing that all data sent to the storage device can be
read after a power pull, then it's on the storage device, and it's the
storage device which is buggy.

> g_file_set_contents() is a very general purpose API used by dconf but
> also many other things.  It is being used to write all kinds of files,
> large and small.  I understand how delayed allocation on ext4 is
> essentially giving me the same thing automatically for small files that
> manage to be written out before the kernel decides to do the allocation
> but doing this explicitly will mean that I'm always giving the kernel
> the information it needs, up front, to avoid fragmentation to the
> greatest extent possible.  I see it as "won't hurt and may help" and
> therefore I do it.

So that would be the problem if we defined some new interface which
implemented a replace data contents functionality.  Inevitably it
would get used by some crazy application which tried to write a
multi-gigabyte file....

If I were to define such a new syscall, what I'd probably do is
export it as a set_contents() type interface.  So you would *not* use
read or write, but you would send down the new contents in a single
data buffer, and if it is too big (where too big is completely at the
discretion of the kernel) you would get back an error, and it would be
up to the application to fall back to the traditional "write to
foo.new, rename foo.new to foo" scheme.

I don't know if I could get the rest of the file system developers to
agree to such an interface, but if we were to do such a thing, that's
the proposal I would make.  On the other hand, you keep telling me
that dconf is only intended to be used for small config files, and
applications should only be calling it rarely.

In that case, does it really matter if g_file_set_contents() takes a
tenth of a second or so?  I can see needing to optimize things if
g_file_set_contents() is getting called several times a second as the
window is getting moved or resized, but I thought we've agreed that's
an abusive use of the interface, and hence not one we should be trying
to spend huge amounts of programming effort trying to optimize.

> > The POSIX API is pretty clear: if you care about data being on disk,
> > you have to use fsync().
> 
> Well, in fairness, it's not even clear on this point.  POSIX doesn't
> really talk about any sort of guarantees across system crashes at all...

Actually, POSIX does have clear words about this:

    "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
    force all currently queued I/O operations associated with the file
    indicated by file descriptor fildes to the synchronized I/O
    completion state. All I/O operations shall be completed as defined
    for synchronized I/O file integrity completion"

And yes, Linux is intended to be implemented with the
_POSIX_SYNCHRONIZED_IO defined.  There are operating systems where
this might not be true, though.  For example, Mac OS X.

Fun trivia fact: on Mac OS, fsync doesn't result in a CACHE FLUSH
command, so the contents of your data writes are not guaranteed after
a power failure.  If you really want fsync(2) to perform as specified
by POSIX, you must use fcntl and F_FULLFSYNC on Mac OS.  The reason?
Surprise, surprise --- performance.  And probably because there are
too many applications which were calling fsync() several times a
second during a window resize on the display thread, or some such, and
Steve Jobs wanted things to be silky smooth.  :-)

Another fun fact: firefox used to call fsync() on its UI thread.
Guess who got blamed for the resulting UI Jank and got all of the hate
mail from the users?  Hint: not the Firefox developers.... and this
sort of thing probably explains Mac OS's design decision vis-a-vis
fsync().  There are a lot of application programmers out there, and
they outnumber the file system developers --- and not all of them are
competent.

> and I can easily imagine that fsync() still doesn't get me what I want
> in some really bizarre cases (like an ecryptfs over NFS from a virtual
> server using an lvm setup running inside of kvm on a machine with hard
> drives that have buggy firmware).

Yes, there are file systems such as NFS which are not POSIX compliant.
And yes, you can always have buggy hardware, such as the crap SSD's
that are out there.  But there's nothing anyone can do if the hardware
is crap.  

(There's a reason why I tend to stick to Intel SSD's on my personal
laptops.  More recently I've experimented with using a Samsung SSD on
one of my machines, but in general, I don't use SSD's from random
manufacturers precisely because I don't trust them.....)

> 
> This is part of why I'd rather avoid the fsync entirely...

Well, for your use case, sync_file_range() should actually be
sufficient, and it's what I would recommend instead of fsync(), at
least for Linux which has this syscall.

> aside: what's your opinion on fdatasync()?  Seems like it wouldn't be
> good enough for my use case because I'm changing the size of the file....

fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
Like sync_file_range, it doesn't sync the metadata (and by the way,
this includes things like indirect blocks for ext2/3 or extent tree
blocks for ext4).

However, for the "write foo.new ; rename foo.new to foo" use case,
either sync_file_range() or fdatasync() is fine, since at the point
where the rename() is committed via the file system transaction, all
of the metadata will be forced out to disk, and there will also be a
CACHE FLUSH sent to the device.  So if the desired guarantee is "file
foo contains either the old or new data", fsync(), fdatasync() or
sync_file_range() will all do, but the sync_file_range() will be the
best from a performance POV, since it eliminates a duplicate and
expensive CACHE FLUSH command from being sent to the disk.

> another aside: why do you make global guarantees about metadata changes
> being well-ordered?  It seems quite likely that what's going on on one
> part of the disk by one user is totally unrelated to what's going on on
> another other part by a different user... ((and I do appreciate the
> irony that I am committing by complaining about "other guarantees that I
> don't care about")).

There are two answers.  One is that very often there are dependencies
between files --- and at the file system level, we don't necessarily
know what these dependencies are (for example, between a .y and a .c
file, and a .c and a .o file with respect to make).  There may be
(many) files for which it doesn't matter, but how do you tell the file
system which file dependencies matter and which don't?

There's another reason why, which is that trying to do this would be
horrifically complicated and/or much, much slower.  The problem is
entangled updates.  If you modify an inode, you have to write back the
entire inode table block.  There may be other inodes that are also in
the process of being modified.  Similarly, you might have a rename
operation, an unlink operation, and a file write operation that all
result in changes to a block allocation bitmap.  If you want to keep
these operations separate, then you need to have a very complicated
transaction machinery, with intent logs, and rollback logs, and all of
the rest of the massive complexity that comes with a full-fledged
transactional database.  There have been attempts to use a general
database engine to implement a generic file system, but they have
generally been a performance disaster.  One such example happened in
the early 90's, when Oracle tried to push this concept in order to sell
more OracleDB licenses, but that went over like a lead balloon, and
not just because of the pricing issue.

						- Ted
Dave Chinner - June 22, 2013, 3:29 a.m.
On Fri, Jun 21, 2013 at 04:35:47PM -0400, Theodore Ts'o wrote:
> So I've been taking a closer look at the rename code, and there's
> something I can do which will improve the chances of avoiding data
> loss on a crash after an application tries to replace file contents
> via:
> 
> 1)  write foo.new
> 2)  <omit fsync of foo.new>
> 3)  rename foo.new to foo
> 
> Those are the kernel patches that I cc'ed you on.
> 
> The reason why it's still not a guarantee is because we are not doing
> a file integrity writeback; this is not as important for small files,
> but if foo.new is several megabytes, not all of the data blocks will
> be flushed out before the rename, and this will kill performance, and
> in some cases it might not be necessary.
> 
> Still, for small files ("most config files are smaller than 100k"),
> this should serve you just fine.  Of course, it's not going to be in
> currently deployed kernels, so I don't know how much these proposed
> patches will help you.  I'm doing it mainly because it helps protect
> users against (in my mind) unwise application programmers, and it
> doesn't cost us any extra performance from what we are currently
> doing, so why not improve things a little?
> 
> 
> If you want better guarantees than that, this is the best you can do:
> 
> 1)  write foo.new using file descriptor fd
> 2)  sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
> 3)  rename foo.new to foo
> 
> This will work on today's kernels, and it should be safe to do for all
> file systems.

No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
and not all filesystems synchronise journal flushes with data write
IO completion.

Indeed, we have a current "data corruption after power failure"
problem found on Ceph storage clusters using XFS for the OSD storage
that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
rather than using fsync() to get data to disk.

http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

The question was raised as to whether sync_file_range() was safe on
ext4, and my response was:

http://oss.sgi.com/pipermail/xfs/2013-June/027452.html

"> Is sync_file_range(2) similarly problematic with ext4?

In data=writeback mode, most definitely. For data=ordered, I have no
idea - the writeback paths in ext4 are ... convoluted, and I hurt my
brain every time I look at them. I wouldn't be surprised if there
are problems, but they'll be different problems because ext4 doesn't
do speculative prealloc..."

.....

> > aside: what's your opinion on fdatasync()?  Seems like it wouldn't be
> > good enough for my usecase because I'm changing the size of the file....
> 
> fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
> Like sync_file_range, it doesn't sync the metadata (and by the way,
> this includes things like indirect blocks for ext2/3 or extent tree
> blocks for ext4).

If fdatasync() on ext4 doesn't sync metadata blocks required to
access the data that was just written by the fdatasync() call, then
it is broken.

fdatasync() is supposed to guarantee all the data in the file and
all the metadata *needed to access that data* is on stable storage
by the time the fdatasync() completes. i.e. fdatasync() might just
be a data write and cache flush, but in the case where allocation,
file size changes, etc have occurred, it is effectively the
equivalent of a full fsync().

So, fdatasync() will do what you want, but the performance overhead
will be no different to fsync() in the rename case because all the
metadata pointing to the tmp file needs to be committed as well...

----

But, let me make a very important point here.  Nobody should be
trying to optimise a general purpose application for a specific
filesystem's data integrity behaviour.  fsync() and fdatasync() are
the gold standard, as they are consistently implemented across all
Linux filesystems.

The reason I say this is that we've been down this road before and
we should have learnt better from it.  Ted, you should recognise this
because you were front and centre in the fallout of it:

http://tytso.livejournal.com/61989.html

".... Application writers had gotten lazy, because ext3 by default
has a commit interval of 5 seconds, and uses a journalling mode
called data=ordered. What does this mean? ....

... Since ext3 became the dominant filesystem for Linux, application
writers and users have started depending on this, and so they become
shocked and angry when their system locks up and they lose data -
even though POSIX never really made any such guarantee. ..."

This discussion of "how can we abuse ext4 data=ordered semantics to
avoid using fsync()" is heading right down this path again.
It is starting from "fsync on ext4 is too slow", and solutions are
being proposed that assume either that everyone is using ext4
(patently untrue) or that all filesystems behave like ext4 (also
patently untrue).

To all the application developers reading this: just use
fsync()/fdatasync() for operations that require data integrity. Your
responsibility is to your users: using methods that don't guarantee
data integrity and therefore will result in data loss is indicating
you don't place any value on your users' data whatsoever. There is
no place for being fancy when it comes to data integrity - it needs
to be reliable and rock-solid.

If that means your application is slow, then you need to explain why
it is slow to your users and how they can change a knob to make it
fast by trading off data integrity. The user can make the choice at
that point, and they have no grounds to complain if they lose data
at that point because they made a conscious choice to configure
their system that way.

IOWs, the choice of whether data can be lost on a crash is one that
only the user can make. As such, applications need be
safe-by-default when it comes to data integrity.

Cheers,

Dave.
Theodore Ts'o - June 22, 2013, 4:47 a.m.
On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > This will work on today's kernels, and it should be safe to do for all
> > file systems.
> 
> No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> > and not all filesystems synchronise journal flushes with data write
> IO completion.

Sorry, what I should have said is:

sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

This *does* wait for I/O completion; the result is the equivalent of
filemap_fdatawrite() followed by filemap_fdatawait().

> Indeed, we have a current "data corruption after power failure"
> problem found on Ceph storage clusters using XFS for the OSD storage
> that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> rather than using fsync() to get data to disk.
> 
> http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

This wouldn't be a problem in the sequence I suggested.

1)  write foo.new
2)  sync_file_range(...)
3)  rename foo.new to foo

If the system crashes right after writing foo.new, yes, there's no
guarantee, since the metadata blocks may not have been written yet.
(Although if XFS is exposing
stale data as a result of sync_file_range, that's arguably a security
bug, since sync_file_range doesn't require root to execute, and it
_has_ been part of the Linux interface for a long time.)

Unlike in the Ceph case, in this sequence we are calling
sync_file_range on the new file (foo.new).  And if we haven't done the
rename yet, we don't care about the contents of foo.new.  (Although if
the file system is causing stale data to be revealed, I'd claim that's
a fs bug.)  If the rename has completed, the metadata blocks will
definitely be journalled and committed for both ext3 and ext4.

So for ext3 and ext4, this sequence will guarantee that the file named
"foo" will have either the old data or the new data --- and this is
true for either data=ordered, or data=writeback.  You're the expert
for xfs, but I didn't think this sequence was depending on anything
file system specific, since filemap_fdatawrite() and
filemap_fdatawait() are part of the core Linux FS/MM layer.

> But, let me make a very important point here. Nobody should be
> trying to optimise a general purpose application for a specific
> filesystem's data integrity behaviour. fsync() and fdatasync() are
> the gold standards as it is consistently implemented across all
> Linux filesystems.

From a philosophical point of view, I agree with you.  As I wrote in
my earlier messages, assuming the applications aren't abusively
calling g_file_set_contents() several times per second, I don't
understand why Ryan is trying so hard to optimize it.  The fact that
he's trying to optimize it at least to me seems to indicate a simple
admission that there *are* broken applications out there, some of
which may be calling it with high frequency, perhaps from the UI
thread.

And having general applications or generic desktop libraries trying to
depend on specific implementation details of file systems is really
ugly.  So it's not something I'm all that excited about.

However, regardless of whether wish it or not, abusive applications
written by incompetent application authors *will* exist, and whether
we like it or not, desktop library authors are going to try to coddle
said abusive applications by doing these filesystem-implementation-dependent
optimizations.  GNOME is *already* detecting btrfs and has
made optimization decisions based on it in its libraries.  Trying to
prevent this is (in my opinion) the equivalent of claiming that the
command of King Canute could hold back the tide.

Given that, my thinking was to try to suggest the least harmful way of
doing so, and so my eye fell on sync_file_range(2) as a generic system
call that is not file system specific, with some relatively well
defined semantics.  And given that we don't care about foo.new until
after the rename operation has completed, it seemed to me that this
should be safe for more than just ext3 and ext4 (where I am quite sure
it is safe).

But if XFS is doing something sufficiently clever that
sync_file_range() isn't going to do the right thing, and if we presume
that abusive applications will always exist, then maybe it's time to
consider implementing a new system call which has very well defined
semantics, for those applications that insist on updating a file
hundreds of times an hour, demand good performance, and aren't picky
about consistency semantics, so long as some version of the file
contents is available after a crash.

This system call would of course have to be optional, and so for file
systems that don't support it, applications will have to fall back
to more traditional approaches, whether that involves fsync() or perhaps
sync_file_range(), if that can be made safe and generic for all/most
file systems.

Personally, I think application programmers *shouldn't* need such a
facility, if their applications are competently designed and
implemented.  But unfortunately, they outnumber us file system
developers, and apparently many of them seem to want to do things
their way, whether we like it or not.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sidorov, Andrei - June 22, 2013, 1:40 p.m.
> From a philosophical point of view, I agree with you.  As I wrote in
> my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it.  The fact that
> he's trying to optimize it at least to me seems to indicate a simple
> admission that there *are* broken applications out there, some of
> which may be calling it with high frequency, perhaps out of the UI
> thread.

Well, one application calling fsync is almost nothing to care about.
On the other hand tens and hundreds of apps doing fsync's is a disaster.

> And having general applications or generic desktop libraries trying to
> depend on specific implementation details of file systems is really
> ugly.  So it's not something I'm all that excited about.

Me too, but people have to do that because the fs API is too generic, and at
the same time one has to account for fs specifics in order to make their app
take the most advantage or at least to avoid inefficiencies.
For example I have an app that constantly does appending writes to about 15
files and I must ensure that no more than 5 seconds will be lost in the event
of system crash or power loss. How would you do that in a generic way?
The generic and portable way to do it is to start 15 threads and call fsyncs on
those fds at the same time. That works fine with JFS since it doesn't do
flushes and it works fine with ext4 because all those fsync's are likely to
complete within single transaction.
However that doesn't scale well and it forces app to do bursts. Scalable, but
still bursty solution could be io_submit, but afaik no fs currently supports
async fsync.
What if you want to distribute the load? Single dedicated thread calling
fsync's works fine with JFS, but sucks with ext4. Ok, there is a
sync_file_range, let's try it out. Luckily I have control over commit=N option
to underlying ext4 fs which I leave at default 5s. Otherwise I would like to
have an ioctl to ext4 to force commit (I'm not sure if fsync on a single fd
will commit currently running transaction). Sync thread calls sync_file_range
evenly over 5s interval, ext4 does commits every 5s. Nice! But it doesn't work
with JFS. Therefore I have two implementations for different file systems.

> Personally, I think application programmers *shouldn't* need such a
> facility, if their applications are competently designed and
> implemented.  But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

I would argue : )
fsync is not the one to rule them all. Its semantics are clear: write all
those bytes NOW.
The fact fsync can be used as a barrier doesn't mean it's the best way to do
it. There are quite a few cases where write-right-now semantics is
absolutely required. More often, apps just want atomic file updates and
a sort of writeback control which is available only as a system-wide knob.

As for atomic updates, I'm thinking of something like io_exec() or
io_submit_atomic() or whatever name is best for it. Probably it shouldn't be
tied to kaio.
This syscall would accept an array of iocb's and guarantee atomicity of the
update. This shouldn't be a big deal for ext4 to support because it already
supports data journalling, which is, however, only block/page-wise atomic.
Such a syscall wouldn't be undervalued if the majority of file systems
supported it.

Regards,
Andrey.

Theodore Ts'o - June 22, 2013, 2:06 p.m.
On Sat, Jun 22, 2013 at 01:40:26PM +0000, Sidorov, Andrei wrote:
> Me too, but people have to do that because fs api is too generic and
> at the same time one has to account fs specifics in order to make
> their app take most advantage or at least to avoid inefficiencies.
> For example I have an app that constantly does appending writes to
> about 15 files and I must ensure that no more than 5 seconds will be
> lost in an event of system crash or power loss.

Stop right there.  If you are doing lots of appending writes, and you
don't want to lose more than 5 seconds worth, why do you have 15 files?
Why not send your appending writes into a single file and then later on,
disaggregate out the logs when you need to read them?  With 15 files,
no matter what you are doing, you will be forcing the heads to seek
all over the place, and since the 4k blocks won't be full when you
sync them, you're wasting write cycles and disk bandwidth.

See what I mean?  There's a reason why I said the fault was
incompetent application design, not just incompetent implementation.

	    			    	 	      - Ted
Sidorov, Andrei - June 22, 2013, 2:41 p.m.
> Stop right there.  If you are doing lots of appending writes, and you
> don't want to lose more than 5 seconds worth, why do you have 15 files?
> Why not send your appending writes into a single file and then later on,
> disaggregate out the logs when you need to read them?

There are 15 files because those are different entities. There might be fewer
of them, and some of them can be removed. There is no reason to place those
files close to each other since those are just metadata related to the mpeg
stream, and in fact it is best to put the stream and its metadata close to
each other.
If I wanted to localize those small files, I would just put them on a different
partition.

Anyway, if I have to implement a file system inside a file, why do I need
the top-layer file system? : )

> With 15 files,
> no matter what you are doing, you will be forcing the heads to seek
> all over the place, and since the 4k blocks won't be full when you
> sync them, you're wasting write cycles and disk bandwidth.

Yes, seeks are unavoidable, but that's something I have to account for in the
worst-case scenario anyway.
Huh, yes. Another advantage of ext4 is that I can control writeback on a
specific file range. I sync only complete pages and ext4 is happy to commit
whatever was synced. Generic solution would be to cache pages in app and
submit only complete ones. Disadvantage of generic solution is that data
cached by app is invisible to others.

Regards,
Andrey.
Dave Chinner - June 23, 2013, 1:58 a.m.
On Sat, Jun 22, 2013 at 12:47:18AM -0400, Theodore Ts'o wrote:
> On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > > This will work on today's kernels, and it should be safe to do for all
> > > file systems.
> > 
> > No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> > and not all filesystems synchronise journal flushes with data write
> > IO completion.
> 
> Sorry, what I should have said is:
> 
> sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> 
> This *does* wait for I/O completion; the result is the equivalent of
> filemap_fdatawrite() followed by filemap_fdatawait().

Right, and now it is effectively identical to fsync() in terms of
speed. And when performance is the reason for avoiding fsync()....

> > Indeed, we have a current "data corruption after power failure"
> > problem found on Ceph storage clusters using XFS for the OSD storage
> > that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> > rather than using fsync() to get data to disk.
> > 
> > http://oss.sgi.com/pipermail/xfs/2013-June/027390.html
> 
> This wouldn't be a problem in the sequence I suggested.
> 
> 1)  write foo.new
> 2)  sync_file_range(...)
> 3)  rename foo.new to foo

You miss the point - it's not a data integrity operation, and to
present it as a way of obtaining data integrity across *all
filesystems* because it would work on ext4 is extremely dangerous.
You're relying on side effects of a specific filesystem implementation
to give you the desired result.

> If the system crashes right after writing foo.new, yes, there's no
> guarantee, since the metadata blocks may not have been written.

Sure, but focussing on ext4 behaviour misses the larger issue -
sync_file_range() has no well defined integrity rules and hence
exposes applications to unexpected behaviours on different
filesystems.

> (Although if XFS
> is exposing stale data as a result of sync_file_range, that's
> arguably a security bug, since sync_file_range doesn't require
> root to execute, and it _has_ been part of the Linux interface for
> a long time.)

Yes, it is.

But doesn't this point out one of the problems with
moving from a well known interface to something that almost nobody
uses and (obviously) has never tested in a data integrity test bench
previously? It's taken how many years for this problem to be
exposed? We test fsync crash integrity all the time, but do we test
sync_file_range()? No, we don't, because it doesn't have any data
integrity guarantees that we can validate.

So while the interface may have been around for years, the first
serious storage product I've heard of that has tried to use it to
optimise away fsync() calls has found that:

	a) it exposes strange behaviours on different filesystems;
	b) doesn't provide data integrity guarantees worth a damn;
	and
	c) isn't any faster than using fsync/fdatasync() in the
	   first place.

And so has reverted back to using fsync()....

> So for ext3 and ext4, this sequence will guarantee that the file

Exactly my point. You're designing for ext3/4 data=ordered behaviour
and so it's not a replacement for fsync/fdatasync...

> > But, let me make a very important point here. Nobody should be
> > trying to optimise a general purpose application for a specific
> > filesystem's data integrity behaviour. fsync() and fdatasync()
> > are the gold standards as it is consistently implemented across
> > all Linux filesystems.
> 
> From a philosophical point of view, I agree with you.  As I wrote
> in my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it.  The fact
> that he's trying to optimize it at least to me seems to indicate a
> simple admission that there *are* broken applications out there,
> some of which may be calling it with high frequency, perhaps out
> of the UI thread.  

That's not a problem that should be solved by trading off data
integrity and losing user's configurations. Have a backbone, Ted,
and tell people they are doing something stupid when they are doing
something stupid.

> And having general applications or generic desktop libraries
> trying to depend on specific implementation details of file
> systems is really ugly.  So it's not something I'm all that
> excited about.

Not excited? That's an understatement - it's the sort of thing that
gives me nightmares.

We've *been here before*. We've been telling application developers
to use fsync/fdatasync() since the O_PONIES saga so we don't end up
back in that world of hurt. Yet here you are advocating that
application developers go back down the rat hole of relying on
implicit side effects of the design of a specific filesystem
configuration to provide them with data integrity guarantees....

> But if XFS is doing something sufficiently clever that
> sync_file_range() isn't going to do the right thing, and if we
> presume that abusive applications will always exist, then maybe
> it's time to consider implementing a new system call which has
> very well defined semantics,

That's fine - go and define the requirements for an
rename_datasync() syscall, write a bunch of xfstests to ensure the
API is usable and provides the required behaviour. Then do a slow
generic implementation that you wire all the filesystems up to so
it's available to everyone (i.e.  no fs probing needed).
Applications can start using it immediately, knowing it will work on
every filesystem. i.e:

generic_rename_datasync()
{
	vfs_fsync(src);
	->rename()
}

At that point you can implement a fast, optimal implementation in
ext3/4 that nobody outside ext3/4 needs to care about how it works.
The BTRFS guys can optimise it appropriately in their own time, as
can us XFS people. But the most important point is that we start
with an implementation that *works everywhere*...

> Personally, I think application programmers *shouldn't* need such
> a facility, if their applications are competently designed and
> implemented.  But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

And you're too afraid to call out someone when they are doing something
stupid. If you don't call them out, they'll keep doing stupid things
and the problems will never get solved.

Cheers,

Dave.

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..85f5c85 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,17 +58,24 @@  static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
 	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+	bool wakeup_bdi = false;
 
 	if (unlikely(dst == old))		/* deadlock avoidance */
 		return;
 	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
+	if (inode->i_state & I_DIRTY) {
+		if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
+			wakeup_bdi = true;
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	}
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&old->wb.list_lock);
 	spin_unlock(&dst->wb.list_lock);
+
+	if (wakeup_bdi)
+		bdi_wakeup_thread_delayed(dst);
 }
 
 /* Kill _all_ buffers and pagecache , dirty or not.. */