Patchwork [RFC] vfs: Call filesystem callback when backing device caches should be flushed

login
register
mail settings
Submitter Jan Kara
Date Jan. 20, 2009, 4:05 p.m.
Message ID <20090120160527.GA17067@duck.suse.cz>
Download mbox | patch
Permalink /patch/19517/
State New
Headers show

Comments

Jan Kara - Jan. 20, 2009, 4:05 p.m.
Hi,

  we noted in our testing that ext2 (and it seems some other filesystems as
well) don't flush disk's write caches on cases like fsync() or changing
DIRSYNC directory. This is my attempt to solve the problem in a generic way
by calling a filesystem callback from VFS at appropriate place as Andrew
suggested. For ext2 what I did is enough (it just then fills in
block_flush_device() as .flush_device callback) and I think it could be
fine for other filesystems as well.
  There is one remaining issue though: Should we call .flush_device in
generic_sync_sb_inodes()? Generally places like __fsync_super() or
do_sync() seem more appropriate (although in do_sync() we'd have to
do one more traversal of superblocks) to me. But maybe question which
should be answered first is: Is it correct how we implement __fsync_super()
or do_sync()? E.g. in do_sync() we do:

        sync_inodes(0);         /* All mappings, inodes and their blockdevs */
        DQUOT_SYNC(NULL);
        sync_supers();          /* Write the superblocks */
        sync_filesystems(0);    /* Start syncing the filesystems */
        sync_filesystems(wait); /* Waitingly sync the filesystems */
        sync_inodes(wait);      /* Mappings, inodes and blockdevs, again.  */

  But sync_inodes(0) results in writeback done with WB_SYNC_NONE which does
not have to flush all the dirty data. So until last sync_inodes(wait) we
cannot be sure that all the dirty data has been submitted to disk. But such
writes could in theory dirty the superblock or similar structures again. So
shouldn't we rather do:
	...
        sync_inodes(wait);      /* Mappings, inodes and blockdevs, again.  */
        sync_filesystems(wait); /* Waitingly sync the filesystems */


  And one more question: Should we also call .flush_device after fsync?
Filesystems can already issue the flush in their .fsync callbacks so it's
not necessary. The question is what's easier to use...
  Thanks for comments.

									Honza
Joel Becker - Jan. 20, 2009, 11:16 p.m.
On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
>   we noted in our testing that ext2 (and it seems some other filesystems as
> well) don't flush disk's write caches on cases like fsync() or changing
> DIRSYNC directory. This is my attempt to solve the problem in a generic way
> by calling a filesystem callback from VFS at appropriate place as Andrew
> suggested. For ext2 what I did is enough (it just then fills in
> block_flush_device() as .flush_device callback) and I think it could be
> fine for other filesystems as well.

	The only question I have is why this would be optional.  It
would seem that this would be the preferred default behavior for all
block filesystems.  We have the backing_dev_info and a way to override
the default if a filesystem needs something special.

Joel
Jamie Lokier - Jan. 21, 2009, 12:16 a.m.
Joel Becker wrote:
> On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> >   we noted in our testing that ext2 (and it seems some other filesystems as
> > well) don't flush disk's write caches on cases like fsync() or changing
> > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > by calling a filesystem callback from VFS at appropriate place as Andrew
> > suggested. For ext2 what I did is enough (it just then fills in
> > block_flush_device() as .flush_device callback) and I think it could be
> > fine for other filesystems as well.
> 
> 	The only question I have is why this would be optional.  It
> would seem that this would be the preferred default behavior for all
> block filesystems.  We have the backing_dev_info and a way to override
> the default if a filesystem needs something special.

I agree, it should be done by default.  Not only that, if you have
several concurrent fsync() calls (could be unrelated but on the same
disk), it could perhaps delay slightly and coalesce the flushes for
better throughput.

What about O_SYNC writes though?  A device flush after each one would
be expensive, but that's what equivalence to fsync() implies is
needed.

O_DIRECT writes shouldn't do block_flush_device(), but an app may
still need a way to commit data for integrity.  So fsync() or
fdatasync() called after a series of O_DIRECT writes should call
block_flush_device() _even_ though there's no page-cache dirty data to
commit, and even if there's no inode change to commit.

Since you want to avoid issuing two device flushes in a row (they're
not free), and a journalling fs may issue one separately, as Joel says
a filesystem could override this.

But I suspect it would be better to keep the generic call to
block_flush_device() from fsync(), and at the block layer discard
duplicate flushes that have no writes in between.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Jan. 21, 2009, 12:55 p.m.
On Tue 20-01-09 15:16:48, Joel Becker wrote:
> On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> >   we noted in our testing that ext2 (and it seems some other filesystems as
> > well) don't flush disk's write caches on cases like fsync() or changing
> > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > by calling a filesystem callback from VFS at appropriate place as Andrew
> > suggested. For ext2 what I did is enough (it just then fills in
> > block_flush_device() as .flush_device callback) and I think it could be
> > fine for other filesystems as well.
> 
> 	The only question I have is why this would be optional.  It
> would seem that this would be the preferred default behavior for all
> block filesystems.  We have the backing_dev_info and a way to override
> the default if a filesystem needs something special.
  The reason why I've decided for NOP to be the default is that filesystems
doing proper journalling with barriers should not need this (as the barrier
in the transaction commit already does the job for them).  So these would
have to override the operation to NOP which seems a bit silly. Also virtual
filesystems without backing device would have to override this to NOP.
  Finally, I prefer maintainers of the filesystems themselves to decide
whether their filesystem needs flushing and thus knowingly impose this
performance penalty on them...

								Honza
Jan Kara - Jan. 21, 2009, 3:05 p.m.
On Wed 21-01-09 00:16:30, Jamie Lokier wrote:
> Joel Becker wrote:
> > On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> > >   we noted in our testing that ext2 (and it seems some other filesystems as
> > > well) don't flush disk's write caches on cases like fsync() or changing
> > > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > > by calling a filesystem callback from VFS at appropriate place as Andrew
> > > suggested. For ext2 what I did is enough (it just then fills in
> > > block_flush_device() as .flush_device callback) and I think it could be
> > > fine for other filesystems as well.
> > 
> > 	The only question I have is why this would be optional.  It
> > would seem that this would be the preferred default behavior for all
> > block filesystems.  We have the backing_dev_info and a way to override
> > the default if a filesystem needs something special.
> 
> I agree, it should be done by default.  Not only that, if you have
> several concurrent fsync() calls (could be unrelated but on the same
> disk), it could perhaps delay slightly and coalesce the flushes for
> better throughput.
  Well, that would be nice but you cannot return from fsync() until you've
done the flush. So you have to be careful not to wait for too long. JBD
actually plays these tricks with sync transaction batching and it's not
trivial to get this right. So I'd rather avoid it.

> What about O_SYNC writes though?  A device flush after each one would
> be expensive, but that's what equivalence to fsync() implies is
> needed.
  Yes.

> O_DIRECT writes shouldn't do block_flush_device(), but an app may
> still need a way to commit data for integrity.  So fsync() or
> fdatasync() called after a series of O_DIRECT writes should call
> block_flush_device() _even_ though there's no page-cache dirty data to
> commit, and even if there's no inode change to commit.
  Hmm, this is an interesting point. You're right that we currently miss
the flushes and we probably need some dirty inode flag like needs_flush or
so.

> Since you want to avoid issuing two device flushes in a row (they're
> not free), and a journalling fs may issue one separately, as Joel says
> a filesystem could override this.
  Yes, journalling filesystems usually take care themselves.

> But I suspect it would be better to keep the generic call to
> block_flush_device() from fsync(), and at the block layer discard
> duplicate flushes that have no writes in between.
  Hmm, probably this won't be too hard to implement. OTOH it won't catch
those cases where some other process manages to squeeze in some writes
between the two flushes. So I'm not sure if we really want to design things
this way unless really necessary.

								Honza
Jamie Lokier - Jan. 21, 2009, 9:41 p.m.
Jan Kara wrote:
>   Well, that would be nice but you cannot return from fsync() until you've
> done the flush. So you have to be careful not to wait for too long. JBD
> actually plays these tricks with sync transaction batching and it's not
> trivial to get this right. So I'd rather avoid it.

Didn't extN for some N do/did something similar?

> > What about O_SYNC writes though?  A device flush after each one would
> > be expensive, but that's what equivalence to fsync() implies is
> > needed.
>   Yes.
> 
> > O_DIRECT writes shouldn't do block_flush_device(), but an app may
> > still need a way to commit data for integrity.  So fsync() or
> > fdatasync() called after a series of O_DIRECT writes should call
> > block_flush_device() _even_ though there's no page-cache dirty data to
> > commit, and even if there's no inode change to commit.
>   Hmm, this is an interesting point. You're right that we currently miss
> the flushes and we probably need some dirty inode flag like needs_flush or
> so.

Proposal (both together):

  1. per-device-queue flag needs_flush.

     Set on write queued, clear on flush queued.  When clear, flushes
     are discarded instead of being queued.  Waiting on the discarded
     flush waits instead for the last flush which was queued, if it's
     still in flight.  So the queue will also track that last flush.

  2. per-inode flag needs_flush.

     Set on write queued from this file (writeback), cleared on flush
     sent from this file (i.e. the thing fsync/fdatasync/O_SYNC should
     be calling).  As above, flushes aren't sent from this file when
     this flag is clear, and waiting on a discarded flush waits
     instead on the last flush sent for this file, if it's still in
     flight.  So the file will track that last flush command in
     addition to needs_flush.

Implement both.  The first doee right thing optimising away
unnecessary journal/tree-log barriers.  The second further optimises
individual files.

You *could* have a needs_flush bit per page, to tune it further, in
the same way that fsync_range() and O_DIRECT invalidations etc. are
getting better at working with ranges, but that may be pointless
overengineering (I've no idea).

> > Since you want to avoid issuing two device flushes in a row (they're
> > not free), and a journalling fs may issue one separately, as Joel says
> > a filesystem could override this.
>   Yes, journalling filesystems usually take care themselves.
> 
> > But I suspect it would be better to keep the generic call to
> > block_flush_device() from fsync(), and at the block layer discard
> > duplicate flushes that have no writes in between.

>   Hmm, probably this won't be too hard to implement. OTOH it won't catch
> those cases where some other process manages to squeeze in some writes
> between the two flushes. So I'm not sure if we really want to design things
> this way unless really necessary.

Let me put it this way.  ext3 is a journalling fs, and it does _not_
provide integrity with fsync() or fdatasync() in all cases, even with
barriers and data=ordered turned on.

We should have something which provides flushes generically, with the
possibility for the fs to override it with a smarter method when it
knows better.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamie Lokier - Jan. 21, 2009, 9:47 p.m.
Jan Kara wrote:
> On Tue 20-01-09 15:16:48, Joel Becker wrote:
> > On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> > >   we noted in our testing that ext2 (and it seems some other filesystems as
> > > well) don't flush disk's write caches on cases like fsync() or changing
> > > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > > by calling a filesystem callback from VFS at appropriate place as Andrew
> > > suggested. For ext2 what I did is enough (it just then fills in
> > > block_flush_device() as .flush_device callback) and I think it could be
> > > fine for other filesystems as well.
> > 
> > 	The only question I have is why this would be optional.  It
> > would seem that this would be the preferred default behavior for all
> > block filesystems.  We have the backing_dev_info and a way to override
> > the default if a filesystem needs something special.
>
>   The reason why I've decided for NOP to be the default is that
> filesystems doing proper journalling with barriers should not need
> this (as the barrier in the transaction commit already does the job
> for them).

No, that doesn't work.

fsync() doesn't always cause a transaction.  If there's no inode
change, there may not be a transaction.  Writing does not always dirty
mtime, if it's within mtime granularity.

For efficient fdatasync() you _never_ want a transaction if possible,
because it forces the disk head to seek between alternating regions of
the disk, two seeks per fsync().

So you can't rely on journalling transactions to flush.

>   Finally, I prefer maintainers of the filesystems themselves to decide
> whether their filesystem needs flushing and thus knowingly impose this
> performance penalty on them...

I say it should flush be default unless a filesystem hooks an
alternative strategy.  Certainly, it's silly to have the same code
duplicated in nearly every filesystem

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jamie Lokier - Jan. 21, 2009, 9:50 p.m.
Jan Kara wrote:
>   Finally, I prefer maintainers of the filesystems themselves to decide
> whether their filesystem needs flushing and thus knowingly impose this
> performance penalty on them...

It could of course be a generic mount option which works with all
filesystems.

Then *administrators* can decide whether their data needs flushing,
which is where the decision should lie...

I'd vote for enabling it by default, but it's ok if it isn't.  Apps
where users expect integrity can warn if they see the option is disabled.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joel Becker - Jan. 21, 2009, 10:03 p.m.
On Wed, Jan 21, 2009 at 01:55:37PM +0100, Jan Kara wrote:
> On Tue 20-01-09 15:16:48, Joel Becker wrote:
> > On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> > >   we noted in our testing that ext2 (and it seems some other filesystems as
> > > well) don't flush disk's write caches on cases like fsync() or changing
> > > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > > by calling a filesystem callback from VFS at appropriate place as Andrew
> > > suggested. For ext2 what I did is enough (it just then fills in
> > > block_flush_device() as .flush_device callback) and I think it could be
> > > fine for other filesystems as well.
> > 
> > 	The only question I have is why this would be optional.  It
> > would seem that this would be the preferred default behavior for all
> > block filesystems.  We have the backing_dev_info and a way to override
> > the default if a filesystem needs something special.
>   The reason why I've decided for NOP to be the default is that filesystems
> doing proper journalling with barriers should not need this (as the barrier
> in the transaction commit already does the job for them).  So these would
> have to override the operation to NOP which seems a bit silly. Also virtual
> filesystems without backing device would have to override this to NOP.

	You make a fair point about journaling filesystems - except, of
course, that they don't really use barriers; mount defaults or
device-mapper often preclude them.  So people with 'incorrect' barrier
configurations get no fsync() safety.
	Regarding "filesystems without a backing device", that's why I
said "we have backing_dev_info".  We can tell what the backing device
is; we should be able to determine that no flush is needed without
modifying those filesystems.

>   Finally, I prefer maintainers of the filesystems themselves to decide
> whether their filesystem needs flushing and thus knowingly impose this
> performance penalty on them...

	I understand what you're thinking here, but that way defaults to
an unsafe fsync().  Thus you're causing broken behavior in the hopes
that maintainers pay enough attention to fix the behavior.

Joel
Jamie Lokier - Jan. 21, 2009, 10:35 p.m.
Joel Becker wrote:
> 	You make a fair point about journaling filesystems - except, of
> course, that they don't really use barriers; mount defaults or
> device-mapper often preclude them.  So people with 'incorrect' barrier
> configurations get no fsync() safety.

I think maybe it's fair enough that if barrier=no fsync() safety
doesn't use barriers either.  Barriers mean it's safe on power loss -
on most disks and some RAID controllers.  No barriers is still useful
- it's maybe safe on system crash but not power loss, with some
performance gained.  So it's fair that it can be an admin decision.

Maybe a separate generic mount option for fsync safety would be good
though.  Interestingly, Windows is documented as letting the
application choose (limited by the constraints of the hardware), and
so is MacOSX.  That makes sense too.

> 	Regarding "filesystems without a backing device", that's why I
> said "we have backing_dev_info".  We can tell what the backing device
> is; we should be able to determine that no flush is needed without
> modifying those filesystems.
> 
> >   Finally, I prefer maintainers of the filesystems themselves to decide
> > whether their filesystem needs flushing and thus knowingly impose this
> > performance penalty on them...
> 
> 	I understand what you're thinking here, but that way defaults to
> an unsafe fsync().  Thus you're causing broken behavior in the hopes
> that maintainers pay enough attention to fix the behavior.

In this area, because the symptom of broken behaviour rarely shows up,
and when it does you don't know this is the culprit, it won't get
fixed passively.  As Nick says, we've had other fsync() bugs for ages
too, and it's hard to test if it's really correct, yet it's quite
important.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Jan. 21, 2009, 11:25 p.m.
On Wed, Jan 21, 2009 at 09:47:48PM +0000, Jamie Lokier wrote:
> Jan Kara wrote:
> > On Tue 20-01-09 15:16:48, Joel Becker wrote:
> > > On Tue, Jan 20, 2009 at 05:05:27PM +0100, Jan Kara wrote:
> > > >   we noted in our testing that ext2 (and it seems some other filesystems as
> > > > well) don't flush disk's write caches on cases like fsync() or changing
> > > > DIRSYNC directory. This is my attempt to solve the problem in a generic way
> > > > by calling a filesystem callback from VFS at appropriate place as Andrew
> > > > suggested. For ext2 what I did is enough (it just then fills in
> > > > block_flush_device() as .flush_device callback) and I think it could be
> > > > fine for other filesystems as well.
> > > 
> > > 	The only question I have is why this would be optional.  It
> > > would seem that this would be the preferred default behavior for all
> > > block filesystems.  We have the backing_dev_info and a way to override
> > > the default if a filesystem needs something special.
> >
> >   The reason why I've decided for NOP to be the default is that
> > filesystems doing proper journalling with barriers should not need
> > this (as the barrier in the transaction commit already does the job
> > for them).
> 
> No, that doesn't work.
> 
> fsync() doesn't always cause a transaction.  If there's no inode
> change, there may not be a transaction.  Writing does not always dirty
> mtime, if it's within mtime granularity.

If the inode is dirty and fsync does nothing, then that filesystem
is *broken*. If writing to the inode doesn't dirty it, then the
filesystem is broken. Fix the broken filesystem.

> For efficient fdatasync() you _never_ want a transaction if possible,
> because it forces the disk head to seek between alternating regions of
> the disk, two seeks per fsync().

If there is dirty metadata that is need to be logged or flushed,
then fdatasync() needs to do something. If it doesn't do it
correctly, then that *filesystem is broken*. Fix the broken
filesystem.

> So you can't rely on journalling transactions to flush.

The VFS doesn't even know about transactions....

> >   Finally, I prefer maintainers of the filesystems themselves to
> >   decide whether their filesystem needs flushing and thus
> >   knowingly impose this performance penalty on them...
> 
> I say it should flush be default unless a filesystem hooks an
> alternative strategy.  Certainly, it's silly to have the same code
> duplicated in nearly every filesystem

So write a *generic helper* for those filesystems that do the same
thing and hook it to their ->fsync method. Don't hard code it in the
VFS so other filesystem dev's have to come along afterwards and turn
it off.

Cheers,

Dave.
Jamie Lokier - Jan. 21, 2009, 11:55 p.m.
Dave Chinner wrote:
> If the inode is dirty and fsync does nothing, then that filesystem
> is *broken*. If writing to the inode doesn't dirty it, then the
> filesystem is broken. Fix the broken filesystem.

*Wrong*  Very, very wrong.

You do not write totally unchanged inode bytes just for the sake of
causing a NOP transaction to make the disk write the fsync as a
side-effect of a broken paradigm.  That's _three_ pointless I/Os (one
redundant barrier and two writes), and probably 50x slowdown in write
performance due to seeking.  Now who's filesystem is broken?

> > For efficient fdatasync() you _never_ want a transaction if possible,
> > because it forces the disk head to seek between alternating regions of
> > the disk, two seeks per fsync().
> 
> If there is dirty metadata that is need to be logged or flushed,
> then fdatasync() needs to do something. If it doesn't do it
> correctly, then that *filesystem is broken*. Fix the broken
> filesystem.

A series of a writes over existing data and fdatasync() should *never*
write to the transaction log, unless you mounted something like ext3
data=journal, which isn't usual.

There is no dirty metadata to write.  It is data only.  fdatasync()
*means* "do NOT write metadata that is not needed for data retrieval",
that's it's whole point.  A filesystem which keeps seeking to its
inode area _and_ its journal area _and_ the data area on every
fdatasync() is a poor design indeed.

> > So you can't rely on journalling transactions to flush.
> 
> The VFS doesn't even know about transactions....

Whoever brought them up said they can be relied on to flush writes
during fsync/fdatasync.  Just saying they can't, is all...

> > >   Finally, I prefer maintainers of the filesystems themselves to
> > >   decide whether their filesystem needs flushing and thus
> > >   knowingly impose this performance penalty on them...
> > 
> > I say it should flush be default unless a filesystem hooks an
> > alternative strategy.  Certainly, it's silly to have the same code
> > duplicated in nearly every filesystem
> 
> So write a *generic helper* for those filesystems that do the same
> thing and hook it to their ->fsync method. Don't hard code it in the
> VFS so other filesystem dev's have to come along afterwards and turn
> it off.

Are there any at the moment which would turn it off?
If so that's a fine idea.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Jan. 22, 2009, 1:21 a.m.
On Wed, Jan 21, 2009 at 11:55:31PM +0000, Jamie Lokier wrote:
> Dave Chinner wrote:
> > If the inode is dirty and fsync does nothing, then that filesystem
> > is *broken*. If writing to the inode doesn't dirty it, then the
> > filesystem is broken. Fix the broken filesystem.
> 
> *Wrong*  Very, very wrong.
> 
> You do not write totally unchanged inode bytes just for the sake of
> causing a NOP transaction to make the disk write the fsync as a
> side-effect of a broken paradigm.

Right, by definition, fsync shouldn't write unchanged inodes.

But I fail to see how that is even relevant to the above comment
I made about *dirty or modified inodes*.

> > > For efficient fdatasync() you _never_ want a transaction if possible,
> > > because it forces the disk head to seek between alternating regions of
> > > the disk, two seeks per fsync().
> > 
> > If there is dirty metadata that is need to be logged or flushed,
> > then fdatasync() needs to do something. If it doesn't do it
> > correctly, then that *filesystem is broken*. Fix the broken
> > filesystem.
> 
> A series of a writes over existing data and fdatasync() should *never*
> write to the transaction log, unless you mounted something like ext3
> data=journal, which isn't usual.

Yes, but that's a specific case, not the general case you first
raised. In this specific case, the filesystem can issue a device
flush instead of a transaction. However, only the filesystem knows
that this is the correct thing to do and so that is why the VFS
should not be implementing device flushes.

Remember - transaction != device flush - they are separate
operations and only on some filesystems does a transaction
imply a barrier/device flush.

> > > >   decide whether their filesystem needs flushing and thus
> > > >   knowingly impose this performance penalty on them...
> > > 
> > > I say it should flush be default unless a filesystem hooks an
> > > alternative strategy.  Certainly, it's silly to have the same code
> > > duplicated in nearly every filesystem
> > 
> > So write a *generic helper* for those filesystems that do the same
> > thing and hook it to their ->fsync method. Don't hard code it in the
> > VFS so other filesystem dev's have to come along afterwards and turn
> > it off.
> 
> Are there any at the moment which would turn it off?

XFS, for one. Probably btrfs, ext3 and ext4 would also need to turn
it off. Any other filesystem that supports barriers properly would
have to turn it off, too. However, I don't claim to have sufficient
expertise about those filesystems (except for XFS) to say for
certain what process is most optimal for sync or fsync for them.
Similarly, the VFS shouldn't be deciding that either...

Cheers,

Dave.
Jamie Lokier - Jan. 22, 2009, 3:03 a.m.
Dave Chinner wrote:
> On Wed, Jan 21, 2009 at 11:55:31PM +0000, Jamie Lokier wrote:
> > Dave Chinner wrote:
> > > If the inode is dirty and fsync does nothing, then that filesystem
> > > is *broken*. If writing to the inode doesn't dirty it, then the
> > > filesystem is broken. Fix the broken filesystem.
> > 
> > *Wrong*  Very, very wrong.
> > 
> > You do not write totally unchanged inode bytes just for the sake of
> > causing a NOP transaction to make the disk write the fsync as a
> > side-effect of a broken paradigm.
> 
> Right, by definition, fsync shouldn't write unchanged inodes.
> 
> But I fail to see how that is even relevant to the above comment
> I made about *dirty or modified inodes*.

The "above comment" is in reference to when data has been written with
write(), to be fsync'd, but the inode itself is not modified.  It is
proper behaviour not to write the inode to disk in that case.

> > > > For efficient fdatasync() you _never_ want a transaction if possible,
> > > > because it forces the disk head to seek between alternating regions of
> > > > the disk, two seeks per fsync().
> > > 
> > > If there is dirty metadata that is need to be logged or flushed,
> > > then fdatasync() needs to do something. If it doesn't do it
> > > correctly, then that *filesystem is broken*. Fix the broken
> > > filesystem.
> > 
> > A series of a writes over existing data and fdatasync() should *never*
> > write to the transaction log, unless you mounted something like ext3
> > data=journal, which isn't usual.
> 
> Yes, but that's a specific case, not the general case you first
> raised. In this specific case, the filesystem can issue a device
> flush instead of a transaction.

Erm... You said "the filesystem is broken because it doesn't flush
dirty metadata when you do fdatasync".  That's just wrong; it's not
_supposed_ to flush dirty metadata.

fdatasync() is the case it's most worth optimising, by the way.

> Remember - transaction != device flush - they are separate
> operations and only on some filesystems does a transaction
> imply a barrier/device flush.

Right, you're saying what I've just been saying.  Are we arguing with
each other or someone else? :-)

> However, only the filesystem knows that this is the correct thing to
> do and so that is why the VFS should not be implementing device
> flushes.

True.  But it's never _wrong_ to issue a device flush, following
completion of the filesystem sync method, only suboptimal sometimes.
It does no harm except to performance.

It may be worth erring on the side of caution and always doing so.
It's difficult to test this code really gives data integrity, yet its
important, and erring on the side of caution might have negligable
performance effect.

If we start distinguishing I/O flushes from I/O barriers, by the way,
an ordinary transaction isn't enough, because a transaction barrier
(using a pure I/O barrier) is insufficient for flushing.

> > > > >   decide whether their filesystem needs flushing and thus
> > > > >   knowingly impose this performance penalty on them...
> > > > 
> > > > I say it should flush be default unless a filesystem hooks an
> > > > alternative strategy.  Certainly, it's silly to have the same code
> > > > duplicated in nearly every filesystem
> > > 
> > > So write a *generic helper* for those filesystems that do the same
> > > thing and hook it to their ->fsync method. Don't hard code it in the
> > > VFS so other filesystem dev's have to come along afterwards and turn
> > > it off.
> > 
> > Are there any at the moment which would turn it off?
> 
> XFS, for one. Probably btrfs, ext3 and ext4 would also need to turn
> it off. Any other filesystem that supports barriers properly would
> have to turn it off, too. However, I don't claim to have sufficient
> expertise about those filesystems (except for XFS) to say for
> certain what process is most optimal for sync or fsync for them.
> Similarly, the VFS shouldn't be deciding that either...

I see your reasoning.

Filesystems which have a sync method, can call an appropriate block
flush helper.  Right now ext3 is broken in this department.

Older filesystems which don't have their own sync method will need the
block flush helper always, after data is generically flushed.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index b6e8b86..1be876c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -165,6 +165,17 @@  void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 	put_bh(bh);
 }
 
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+	int ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+	if (ret == -EOPNOTSUPP)
+		return 0;
+	return ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
 /*
  * Write out and wait upon all the dirty data associated with a block
  * device via its mapping.  Does not take the superblock lock.
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e5eaa62..fcfacb2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -412,6 +412,16 @@  __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 }
 
 /*
+ * Make filesystem flush backing device of the filesystem
+ */
+static int flush_backing_device(struct super_block *sb)
+{
+	if (sb->s_op->flush_device)
+		return sb->s_op->flush_device(sb);
+	return 0;
+}
+
+/*
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
@@ -557,6 +567,7 @@  void generic_sync_sb_inodes(struct super_block *sb,
 		}
 		spin_unlock(&inode_lock);
 		iput(old_inode);
+		flush_backing_device(sb);
 	} else
 		spin_unlock(&inode_lock);
 
@@ -752,6 +763,8 @@  int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	spin_lock(&inode_lock);
 	ret = __writeback_single_inode(inode, wbc);
 	spin_unlock(&inode_lock);
+	if (!ret && wbc->sync_mode == WB_SYNC_ALL)
+		ret = flush_backing_device(inode->i_sb);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
@@ -806,6 +819,9 @@  int generic_osync_inode(struct inode *inode, struct address_space *mapping, int
 	else
 		inode_sync_wait(inode);
 
+	if (!err)
+		err = flush_backing_device(inode->i_sb);
+
 	return err;
 }
 EXPORT_SYMBOL(generic_osync_inode);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd7ac79..c154cfd 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -238,6 +238,7 @@  int nobh_write_end(struct file *, struct address_space *,
 int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
 int nobh_writepage(struct page *page, get_block_t *get_block,
                         struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
 
 void buffer_init(void);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6022f44..c37f9f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1390,6 +1390,7 @@  struct super_operations {
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*clear_inode) (struct inode *);
 	void (*umount_begin) (struct super_block *);
+	int (*flush_device) (struct super_block *);
 
 	int (*show_options)(struct seq_file *, struct vfsmount *);
 	int (*show_stats)(struct seq_file *, struct vfsmount *);