Patchwork [v3,3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

login
register
mail settings
Submitter Andrew Lutomirski
Date Aug. 16, 2013, 11:22 p.m.
Message ID <ec267e95fd21891986373c7af1c72b4c8b507332.1376679411.git.luto@amacapital.net>
Download mbox | patch
Permalink /patch/267934/
State Superseded
Headers show

Comments

Andrew Lutomirski - Aug. 16, 2013, 11:22 p.m.
Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, all of which eventually call ->writepages.

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.

Filesystmes that defer cmtime updates should flush them on munmap or
exit.  Finding out that this happened through vm_ops is messy, so
add a new address space op for this.

It's not strictly necessary to call ->flush_cmtime after ->writepages,
but it simplifies the fs code.  As an optional optimization,
filesystems can call mapping_test_clear_cmtime themselves in
->writepages (as long as they're careful to scan all the pages first
-- the cmtime bit may not be set when ->writepages is entered).

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/fs.h        |  9 +++++++++
 include/linux/writeback.h |  1 +
 mm/mmap.c                 |  9 ++++++++-
 mm/page-writeback.c       | 26 ++++++++++++++++++++++++++
 4 files changed, 44 insertions(+), 1 deletion(-)
Dave Chinner - Aug. 20, 2013, 2:36 a.m.
On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> Filesystems that defer cmtime updates should update cmtime when any
> of these events happen after a write via a mapping:
> 
>  - The mapping is written back to disk.  This happens from all kinds
>    of places, all of which eventually call ->writepages.
> 
>  - munmap is called or the mapping is removed when the process exits
> 
>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>    time between an mmaped write and the subsequent msync call.
>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> 
> Filesystmes that defer cmtime updates should flush them on munmap or
> exit.  Finding out that this happened through vm_ops is messy, so
> add a new address space op for this.
> 
> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> but it simplifies the fs code.  As an optional optimization,
> filesystems can call mapping_test_clear_cmtime themselves in
> ->writepages (as long as they're careful to scan all the pages first
> -- the cmtime bit may not be set when ->writepages is entered).

.flush_cmtime is effectively a duplicate method.  We already have
.update_time to notify filesystems that they need to update the
timestamp in the inode transactionally.

Indeed:

> +	/*
> +	 * Userspace expects certain system calls to update cmtime if
> +	 * a file has been recently written using a shared vma.  In
> +	 * cases where cmtime may need to be updated but writepages is
> +	 * not called, this is called instead.  (Implementations
> +	 * should call mapping_test_clear_cmtime.)
> +	 */
> +	void (*flush_cmtime)(struct address_space *);

You say it can be implemented in the ->writepage(s) method, and all
filesystems provide ->writepage(s) in some form. Therefore I would
have thought it be best to simply require filesystems to check that
mapping flag during those methods and update the inode directly when
that is set?

Indeed, the way you've set up the infrastructure, we'll have to
rewrite the cmtime update code to enable writepages to update this
within some other transaction. Perhaps you should just implement it
that way first?

> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  		ret = mapping->a_ops->writepages(mapping, wbc);
>  	else
>  		ret = generic_writepages(mapping, wbc);
> +
> +	/*
> +	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
> +	 * mark the mapping for deferred cmtime update.  As an optimization,
> +	 * a filesystem can flush the update at the end of ->writepages
> +	 * (possibly avoiding a journal transaction, for example), but,
> +	 * for simplicity, let filesystems skip that part and just implement
> +	 * ->flush_cmtime.
> +	 */
> +	if (mapping->a_ops->flush_cmtime)
> +		mapping->a_ops->flush_cmtime(mapping);

And that's where you cannot call sb_pagefault_start/end....

Cheers,

Dave.
Andrew Lutomirski - Aug. 20, 2013, 3:28 a.m.
On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> Filesystems that defer cmtime updates should update cmtime when any
>> of these events happen after a write via a mapping:
>>
>>  - The mapping is written back to disk.  This happens from all kinds
>>    of places, all of which eventually call ->writepages.
>>
>>  - munmap is called or the mapping is removed when the process exits
>>
>>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>>    time between an mmaped write and the subsequent msync call.
>>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>>
>> Filesystmes that defer cmtime updates should flush them on munmap or
>> exit.  Finding out that this happened through vm_ops is messy, so
>> add a new address space op for this.
>>
>> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> but it simplifies the fs code.  As an optional optimization,
>> filesystems can call mapping_test_clear_cmtime themselves in
>> ->writepages (as long as they're careful to scan all the pages first
>> -- the cmtime bit may not be set when ->writepages is entered).
>
> .flush_cmtime is effectively a duplicate method.  We already have
> .update_time to notify filesystems that they need to update the
> timestamp in the inode transactionally.

.update_time is used for the atime update as well, and it relies on
the core code to update the in-memory timestamp first.  I used that
approach in v2, but it was (correctly, I think) pointed out that this
was a layering violation and that core code shouldn't be mucking with
the timestamps directly during writeback.

There was a recent effort to move most of the file_update_calls from
the core into .page_mkwrite, and I don't think anyone wants to undo
that.

>
> Indeed:
>
>> +     /*
>> +      * Userspace expects certain system calls to update cmtime if
>> +      * a file has been recently written using a shared vma.  In
>> +      * cases where cmtime may need to be updated but writepages is
>> +      * not called, this is called instead.  (Implementations
>> +      * should call mapping_test_clear_cmtime.)
>> +      */
>> +     void (*flush_cmtime)(struct address_space *);
>
> You say it can be implemented in the ->writepage(s) method, and all
> filesystems provide ->writepage(s) in some form. Therefore I would
> have thought it be best to simply require filesystems to check that
> mapping flag during those methods and update the inode directly when
> that is set?

The problem with only doing it in ->writepages is that calling
writepages from munmap and exit would probably hurt performance for no
particular gain.  So I need some kind of callback to say "update the
time, but don't write data."  The AS_CMTIME bit will still be set when
the ptes are removed.

I could require ->writepages *and* ->flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.

>
> Indeed, the way you've set up the infrastructure, we'll have to
> rewrite the cmtime update code to enable writepages to update this
> within some other transaction. Perhaps you should just implement it
> that way first?

This is already possible although not IMO necessary for correctness.
All that ext4 would need to do is to add something like:

if (mapping_test_clear_cmtime(mapping)) {
  update times within current transaction
}

somewhere inside the transaction in writepages.  There would probably
be room for some kind of generic helper to do everything in
inode_update_time_writable except for the actual mark_inode_dirty
part, but this still seems nasty from a locking perspective, and I'd
rather leave that optimization to an ext4 developer who wants to do
it.

I could simplify this a bit by moving the mapping_test_clear_cmtime
part from .flush_cmtime to its callers.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 20, 2013, 4:08 a.m.
On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> >> Filesystems that defer cmtime updates should update cmtime when any
> >> of these events happen after a write via a mapping:
> >>
> >>  - The mapping is written back to disk.  This happens from all kinds
> >>    of places, all of which eventually call ->writepages.
> >>
> >>  - munmap is called or the mapping is removed when the process exits
> >>
> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
> >>    time between an mmaped write and the subsequent msync call.
> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> >>
> >> Filesystmes that defer cmtime updates should flush them on munmap or
> >> exit.  Finding out that this happened through vm_ops is messy, so
> >> add a new address space op for this.
> >>
> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> >> but it simplifies the fs code.  As an optional optimization,
> >> filesystems can call mapping_test_clear_cmtime themselves in
> >> ->writepages (as long as they're careful to scan all the pages first
> >> -- the cmtime bit may not be set when ->writepages is entered).
> >
> > .flush_cmtime is effectively a duplicate method.  We already have
> > .update_time to notify filesystems that they need to update the
> > timestamp in the inode transactionally.
> 
> .update_time is used for the atime update as well, and it relies on
> the core code to update the in-memory timestamp first.

No, it doesn't. The caller of .update_time provides the timestamp to
that gets written into the relevant fields of the inode.

i.e. this code:

	now = current_fs_time(inode->i_sb);
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);

will update *only* the ctime and mtime of the inode to have a value
of @now. That's precisely what you want to do, yes?

Indeed, your .flush_cmtime function effectively does:

	flags = prepare_cmtime(inode, &now)
		{ *now = current_fs_time()
		  flags = S_CTIME|S_MTIME;
		}
	update_time(inode, now, flags);
		{
			inode->i_op->update_time(inode, now, flags)
			or
			mark_inode_dirty_sync()
		}

IOWs, you're adding a filesystem specific method to update a
timestamp that wraps around a generic method for updating a
timestamp. i.e.  .flush_cmtime is not necessary - you could just
call inode_update_time_writable() from generic_writepages() and it
would simply just work for all filesystems....

> I used that
> approach in v2, but it was (correctly, I think) pointed out that this
> was a layering violation and that core code shouldn't be mucking with
> the timestamps directly during writeback.

If calling a generic method to update the timestamp is a layering
violation, then why is replacing that with a filesystem specific
method of updating a timestamp not a layering violation?

> > Indeed:
> >
> >> +     /*
> >> +      * Userspace expects certain system calls to update cmtime if
> >> +      * a file has been recently written using a shared vma.  In
> >> +      * cases where cmtime may need to be updated but writepages is
> >> +      * not called, this is called instead.  (Implementations
> >> +      * should call mapping_test_clear_cmtime.)
> >> +      */
> >> +     void (*flush_cmtime)(struct address_space *);
> >
> > You say it can be implemented in the ->writepage(s) method, and all
> > filesystems provide ->writepage(s) in some form. Therefore I would
> > have thought it be best to simply require filesystems to check that
> > mapping flag during those methods and update the inode directly when
> > that is set?
> 
> The problem with only doing it in ->writepages is that calling
> writepages from munmap and exit would probably hurt performance for no
> particular gain.  So I need some kind of callback to say "update the
> time, but don't write data."  The AS_CMTIME bit will still be set when
> the ptes are removed.

What's the point of updating the timestamp at unmap when it's going
to be changed again when the writeback occurs?

> I could require ->writepages *and* ->flush_cmtime to handle the time
> update, but that would complicate non-transactional filesystems.
> Those filesystems should just flush cmtime at the end of writepages.

do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.

And some filesystems provide a .writepages method that doesn't call
.writepage, so those filesystems will need to do the timestamp
update in those methods.

> > Indeed, the way you've set up the infrastructure, we'll have to
> > rewrite the cmtime update code to enable writepages to update this
> > within some other transaction. Perhaps you should just implement it
> > that way first?
> 
> This is already possible although not IMO necessary for correctness.
> All that ext4 would need to do is to add something like:
> 
> if (mapping_test_clear_cmtime(mapping)) {
>   update times within current transaction
> }

Where does it get the timestamps from? i.e. the update could call
prepare_update_cmtime() to do this, yes? And so having a wrapper
that does

	if (mapping_test_clear_cmtime(mapping))
		return prepare_update_cmtime(inode);
	return 0;

would work just fine for them, yes?

> somewhere inside the transaction in writepages.  There would probably
> be room for some kind of generic helper to do everything in
> inode_update_time_writable except for the actual mark_inode_dirty
> part, but this still seems nasty from a locking perspective, and I'd
> rather leave that optimization to an ext4 developer who wants to do
> it.

filesystems that implement .update_time don't need to mark the
struct inode dirty on timestamp updates. Similarly, filesystems that
use a writepages transaction to do the update don't either....

Cheers,

Dave.
Andrew Lutomirski - Aug. 20, 2013, 4:14 a.m.
On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> >> Filesystems that defer cmtime updates should update cmtime when any
>> >> of these events happen after a write via a mapping:
>> >>
>> >>  - The mapping is written back to disk.  This happens from all kinds
>> >>    of places, all of which eventually call ->writepages.
>> >>
>> >>  - munmap is called or the mapping is removed when the process exits
>> >>
>> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>> >>    time between an mmaped write and the subsequent msync call.
>> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>> >>
>> >> Filesystmes that defer cmtime updates should flush them on munmap or
>> >> exit.  Finding out that this happened through vm_ops is messy, so
>> >> add a new address space op for this.
>> >>
>> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> >> but it simplifies the fs code.  As an optional optimization,
>> >> filesystems can call mapping_test_clear_cmtime themselves in
>> >> ->writepages (as long as they're careful to scan all the pages first
>> >> -- the cmtime bit may not be set when ->writepages is entered).
>> >
>> > .flush_cmtime is effectively a duplicate method.  We already have
>> > .update_time to notify filesystems that they need to update the
>> > timestamp in the inode transactionally.
>>
>> .update_time is used for the atime update as well, and it relies on
>> the core code to update the in-memory timestamp first.
>
> No, it doesn't. The caller of .update_time provides the timestamp to
> that gets written into the relevant fields of the inode.
>
> i.e. this code:
>
>         now = current_fs_time(inode->i_sb);
>         if (inode->i_op->update_time)
>                 return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);
>
> will update *only* the ctime and mtime of the inode to have a value
> of @now. That's precisely what you want to do, yes?
>
> Indeed, your .flush_cmtime function effectively does:
>
>         flags = prepare_cmtime(inode, &now)
>                 { *now = current_fs_time()
>                   flags = S_CTIME|S_MTIME;
>                 }
>         update_time(inode, now, flags);
>                 {
>                         inode->i_op->update_time(inode, now, flags)
>                         or
>                         mark_inode_dirty_sync()
>                 }
>
> IOWs, you're adding a filesystem specific method to update a
> timestamp that wraps around a generic method for updating a
> timestamp. i.e.  .flush_cmtime is not necessary - you could just
> call inode_update_time_writable() from generic_writepages() and it
> would simply just work for all filesystems....

OK, I'll try that out.

>
>> I used that
>> approach in v2, but it was (correctly, I think) pointed out that this
>> was a layering violation and that core code shouldn't be mucking with
>> the timestamps directly during writeback.
>
> If calling a generic method to update the timestamp is a layering
> violation, then why is replacing that with a filesystem specific
> method of updating a timestamp not a layering violation?
>
>> > Indeed:
>> >
>> >> +     /*
>> >> +      * Userspace expects certain system calls to update cmtime if
>> >> +      * a file has been recently written using a shared vma.  In
>> >> +      * cases where cmtime may need to be updated but writepages is
>> >> +      * not called, this is called instead.  (Implementations
>> >> +      * should call mapping_test_clear_cmtime.)
>> >> +      */
>> >> +     void (*flush_cmtime)(struct address_space *);
>> >
>> > You say it can be implemented in the ->writepage(s) method, and all
>> > filesystems provide ->writepage(s) in some form. Therefore I would
>> > have thought it be best to simply require filesystems to check that
>> > mapping flag during those methods and update the inode directly when
>> > that is set?
>>
>> The problem with only doing it in ->writepages is that calling
>> writepages from munmap and exit would probably hurt performance for no
>> particular gain.  So I need some kind of callback to say "update the
>> time, but don't write data."  The AS_CMTIME bit will still be set when
>> the ptes are removed.
>
> What's the point of updating the timestamp at unmap when it's going
> to be changed again when the writeback occurs?

To avoid breaking make and similar tools.  Suppose that an output file
already exists but is stale.  Some program gets called to update it,
and it opens it for write, mmaps it, updates it, calls munmap, and
exits.  Its parent will expect the timestamp to have been updated, but
writeback may not have happened.

>
>> I could require ->writepages *and* ->flush_cmtime to handle the time
>> update, but that would complicate non-transactional filesystems.
>> Those filesystems should just flush cmtime at the end of writepages.
>
> do_writepages() is the wrong place to do such updates - we can get
> writeback directly through .writepage, so the time updates need to
> be in .writepage. That first .writepage call will clear the bit on
> the mapping, so it's only done on the first call to .writepage on
> the given mapping.

Last time I checked, all the paths that actually needed the timestamp
update went through .writepages.  I'll double-check.

>
> And some filesystems provide a .writepages method that doesn't call
> .writepage, so those filesystems will need to do the timestamp
> update in those methods.
>
>> > Indeed, the way you've set up the infrastructure, we'll have to
>> > rewrite the cmtime update code to enable writepages to update this
>> > within some other transaction. Perhaps you should just implement it
>> > that way first?
>>
>> This is already possible although not IMO necessary for correctness.
>> All that ext4 would need to do is to add something like:
>>
>> if (mapping_test_clear_cmtime(mapping)) {
>>   update times within current transaction
>> }
>
> Where does it get the timestamps from? i.e. the update could call
> prepare_update_cmtime() to do this, yes? And so having a wrapper
> that does
>
>         if (mapping_test_clear_cmtime(mapping))
>                 return prepare_update_cmtime(inode);
>         return 0;
>
> would work just fine for them, yes?
>
>> somewhere inside the transaction in writepages.  There would probably
>> be room for some kind of generic helper to do everything in
>> inode_update_time_writable except for the actual mark_inode_dirty
>> part, but this still seems nasty from a locking perspective, and I'd
>> rather leave that optimization to an ext4 developer who wants to do
>> it.
>
> filesystems that implement .update_time don't need to mark the
> struct inode dirty on timestamp updates. Similarly, filesystems that
> use a writepages transaction to do the update don't either....

Fair enough.  I'll spin a v4 and see how it looks.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Aug. 20, 2013, 4 p.m.
On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> update, but that would complicate non-transactional filesystems.
> >> Those filesystems should just flush cmtime at the end of writepages.
> >
> > do_writepages() is the wrong place to do such updates - we can get
> > writeback directly through .writepage, so the time updates need to
> > be in .writepage. That first .writepage call will clear the bit on
> > the mapping, so it's only done on the first call to .writepage on
> > the given mapping.
> 
> Last time I checked, all the paths that actually needed the timestamp
> update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp update
should be handled there as well. Otherwise all pages in a mapping can be
cleaned without timestamp being updated.

Which btw made me realize that even your scheme doesn't completely make
sure timestamp is updated after mmap write - if you have pages 0 and 1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
is called. We write the page 0, writeprotect it, update timestamps. But
page 1 is still writeable so writes to it won't set CMTIME flag, neither
update the timestamp... Not that I think this can be reasonably solved but
it is a food for thought.

								Honza
Andrew Lutomirski - Aug. 20, 2013, 4:42 p.m.
On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> update, but that would complicate non-transactional filesystems.
>> >> Those filesystems should just flush cmtime at the end of writepages.
>> >
>> > do_writepages() is the wrong place to do such updates - we can get
>> > writeback directly through .writepage, so the time updates need to
>> > be in .writepage. That first .writepage call will clear the bit on
>> > the mapping, so it's only done on the first call to .writepage on
>> > the given mapping.
>>
>> Last time I checked, all the paths that actually needed the timestamp
>> update went through .writepages.  I'll double-check.
>   kswapd can call just .writepage to do the writeout so timestamp update
> should be handled there as well. Otherwise all pages in a mapping can be
> cleaned without timestamp being updated.

OK, I'll fix that.

>
> Which btw made me realize that even your scheme doesn't completely make
> sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> is called. We write the page 0, writeprotect it, update timestamps. But
> page 1 is still writeable so writes to it won't set CMTIME flag, neither
> update the timestamp... Not that I think this can be reasonably solved but
> it is a food for thought.

This should already work.  AS_CMTIME is set when the pte goes from
dirty to clean, not when the pte goes from wp to writable.  So
whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
be set and a subsequent writepages call will update the timestamp.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Lutomirski - Aug. 20, 2013, 7:27 p.m.
On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>>> >> update, but that would complicate non-transactional filesystems.
>>> >> Those filesystems should just flush cmtime at the end of writepages.
>>> >
>>> > do_writepages() is the wrong place to do such updates - we can get
>>> > writeback directly through .writepage, so the time updates need to
>>> > be in .writepage. That first .writepage call will clear the bit on
>>> > the mapping, so it's only done on the first call to .writepage on
>>> > the given mapping.
>>>
>>> Last time I checked, all the paths that actually needed the timestamp
>>> update went through .writepages.  I'll double-check.
>>   kswapd can call just .writepage to do the writeout so timestamp update
>> should be handled there as well. Otherwise all pages in a mapping can be
>> cleaned without timestamp being updated.
>
> OK, I'll fix that.

This is a bit ugly.  mpage_writepages and generic_writepages both call
->writepage.  If writepage starts checking cmtime, then writepages
will do multiple timestamp updates on filesystems that use this code.

I can modify vmscan and migrate to flush cmtime -- they seem to be the
only callers of ->writepage that aren't themselves called from
->writepages.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 20, 2013, 9:48 p.m.
On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> update, but that would complicate non-transactional filesystems.
> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >
> >> > do_writepages() is the wrong place to do such updates - we can get
> >> > writeback directly through .writepage, so the time updates need to
> >> > be in .writepage. That first .writepage call will clear the bit on
> >> > the mapping, so it's only done on the first call to .writepage on
> >> > the given mapping.
> >>
> >> Last time I checked, all the paths that actually needed the timestamp
> >> update went through .writepages.  I'll double-check.
> >   kswapd can call just .writepage to do the writeout so timestamp update
> > should be handled there as well. Otherwise all pages in a mapping can be
> > cleaned without timestamp being updated.
> 
> OK, I'll fix that.
> 
> >
> > Which btw made me realize that even your scheme doesn't completely make
> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> > is called. We write the page 0, writeprotect it, update timestamps. But
> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> > update the timestamp... Not that I think this can be reasonably solved but
> > it is a food for thought.
> 
> This should already work.  AS_CMTIME is set when the pte goes from
> dirty to clean, not when the pte goes from wp to writable.  So
> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> be set and a subsequent writepages call will update the timestamp.

Oh, I missed that - I thought you were setting AS_CMTIME during
.page_mkwrite.

Setting it in clear_page_dirty_for_io() is too late for filesystems
to include it in their existing transactions during .writepage, (at
least for XFs and ext4) because they do their delayed allocation
transactions before changing page state....

Cheers,

Dave.
Andrew Lutomirski - Aug. 20, 2013, 9:54 p.m.
On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >
>> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> > writeback directly through .writepage, so the time updates need to
>> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> > the mapping, so it's only done on the first call to .writepage on
>> >> > the given mapping.
>> >>
>> >> Last time I checked, all the paths that actually needed the timestamp
>> >> update went through .writepages.  I'll double-check.
>> >   kswapd can call just .writepage to do the writeout so timestamp update
>> > should be handled there as well. Otherwise all pages in a mapping can be
>> > cleaned without timestamp being updated.
>>
>> OK, I'll fix that.
>>
>> >
>> > Which btw made me realize that even your scheme doesn't completely make
>> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> > is called. We write the page 0, writeprotect it, update timestamps. But
>> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> > update the timestamp... Not that I think this can be reasonably solved but
>> > it is a food for thought.
>>
>> This should already work.  AS_CMTIME is set when the pte goes from
>> dirty to clean, not when the pte goes from wp to writable.  So
>> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> be set and a subsequent writepages call will update the timestamp.
>
> Oh, I missed that - I thought you were setting AS_CMTIME during
> .page_mkwrite.
>
> Setting it in clear_page_dirty_for_io() is too late for filesystems
> to include it in their existing transactions during .writepage, (at
> least for XFs and ext4) because they do their delayed allocation
> transactions before changing page state....

Couldn't it go between mpage_map_and_submit_extent and
ext4_journal_stop in ext4_writepages?

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner - Aug. 20, 2013, 10:43 p.m.
On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >
> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> > the given mapping.
> >> >>
> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> update went through .writepages.  I'll double-check.
> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> > cleaned without timestamp being updated.
> >>
> >> OK, I'll fix that.
> >>
> >> >
> >> > Which btw made me realize that even your scheme doesn't completely make
> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> > it is a food for thought.
> >>
> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> be set and a subsequent writepages call will update the timestamp.
> >
> > Oh, I missed that - I thought you were setting AS_CMTIME during
> > .page_mkwrite.
> >
> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> > to include it in their existing transactions during .writepage, (at
> > least for XFs and ext4) because they do their delayed allocation
> > transactions before changing page state....
> 
> Couldn't it go between mpage_map_and_submit_extent and
> ext4_journal_stop in ext4_writepages?

Maybe - I'm not an ext4 expert - but even if you can make it work
for ext4 in some way, that doesn't mean it is possible for any other
filesystem to use the same method. You're adding code to generic,
non-filesystem specific code paths and so the solutions need to be
generic rather not tied to how a specific filesystem is implemented.

Cheers,

Dave.
Andrew Lutomirski - Aug. 21, 2013, 12:47 a.m.
On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >> >
>> >> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> >> > writeback directly through .writepage, so the time updates need to
>> >> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> >> > the mapping, so it's only done on the first call to .writepage on
>> >> >> > the given mapping.
>> >> >>
>> >> >> Last time I checked, all the paths that actually needed the timestamp
>> >> >> update went through .writepages.  I'll double-check.
>> >> >   kswapd can call just .writepage to do the writeout so timestamp update
>> >> > should be handled there as well. Otherwise all pages in a mapping can be
>> >> > cleaned without timestamp being updated.
>> >>
>> >> OK, I'll fix that.
>> >>
>> >> >
>> >> > Which btw made me realize that even your scheme doesn't completely make
>> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> >> > is called. We write the page 0, writeprotect it, update timestamps. But
>> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> >> > update the timestamp... Not that I think this can be reasonably solved but
>> >> > it is a food for thought.
>> >>
>> >> This should already work.  AS_CMTIME is set when the pte goes from
>> >> dirty to clean, not when the pte goes from wp to writable.  So
>> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> >> be set and a subsequent writepages call will update the timestamp.
>> >
>> > Oh, I missed that - I thought you were setting AS_CMTIME during
>> > .page_mkwrite.
>> >
>> > Setting it in clear_page_dirty_for_io() is too late for filesystems
>> > to include it in their existing transactions during .writepage, (at
>> > least for XFs and ext4) because they do their delayed allocation
>> > transactions before changing page state....
>>
>> Couldn't it go between mpage_map_and_submit_extent and
>> ext4_journal_stop in ext4_writepages?
>
> Maybe - I'm not an ext4 expert - but even if you can make it work
> for ext4 in some way, that doesn't mean it is possible for any other
> filesystem to use the same method. You're adding code to generic,
> non-filesystem specific code paths and so the solutions need to be
> generic rather not tied to how a specific filesystem is implemented.
>

I don't see the problem for xfs or btrfs either.

xfs uses generic_writepages, which already does the right thing.  (xfs
with my updated patches passes my tests.)  xfs_vm_writepage calls
xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
called.  At that point (I presume), it would still be possible to add
metadata to a transaction (assuming there's a transaction open -- I
don't have a clue here).  Even if this is too late, xfs_vm_writepage
could call page_mkwrite to for AS_CMTIME to be set if needed.
page_mkwrite will be fast if the page isn't mmapped.  What am I
missing?

btrfs seems to do much the same thing.

--Andy


> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner - Aug. 21, 2013, 1:33 a.m.
On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >> >
> >> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> >> > the given mapping.
> >> >> >>
> >> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> >> update went through .writepages.  I'll double-check.
> >> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> >> > cleaned without timestamp being updated.
> >> >>
> >> >> OK, I'll fix that.
> >> >>
> >> >> >
> >> >> > Which btw made me realize that even your scheme doesn't completely make
> >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> >> > it is a food for thought.
> >> >>
> >> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> >> be set and a subsequent writepages call will update the timestamp.
> >> >
> >> > Oh, I missed that - I thought you were setting AS_CMTIME during
> >> > .page_mkwrite.
> >> >
> >> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> >> > to include it in their existing transactions during .writepage, (at
> >> > least for XFs and ext4) because they do their delayed allocation
> >> > transactions before changing page state....
> >>
> >> Couldn't it go between mpage_map_and_submit_extent and
> >> ext4_journal_stop in ext4_writepages?
> >
> > Maybe - I'm not an ext4 expert - but even if you can make it work
> > for ext4 in some way, that doesn't mean it is possible for any other
> > filesystem to use the same method. You're adding code to generic,
> > non-filesystem specific code paths and so the solutions need to be
> > generic rather not tied to how a specific filesystem is implemented.
> >
> 
> I don't see the problem for xfs or btrfs either.
> 
> xfs uses generic_writepages, which already does the right thing.  (xfs
> with my updated patches passes my tests.)  xfs_vm_writepage calls
> xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
> called.  At that point (I presume), it would still be possible to add
> metadata to a transaction (assuming there's a transaction open -- I
> don't have a clue here). 

That's my point - there isn't a transaction in XFS at this point,
and so if we gather that flag from clear_page_dirty_for_io() we'd
need a second transaction and therefore the optimisation you want
filesystems to use to mitigate the additional overhead can't be done
for all commonly used filesystems.

> Even if this is too late, xfs_vm_writepage
> could call page_mkwrite to for AS_CMTIME to be set if needed.
> page_mkwrite will be fast if the page isn't mmapped.  What am I
> missing?

That it leads to different behaviour for different filesystems.

i.e. page_mkwrite on page A sets the flag. writeback on a range that
doesn't include page A occurs, sees the flag, clears it after
updating the timestamp. Some time later writeback on page A occurs,
no timestamp update occurs.

The behaviour needs to be consistent across filesystems.

Cheers,

Dave.

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f224155 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,15 @@  struct address_space_operations {
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
 
+	/*
+	 * Userspace expects certain system calls to update cmtime if
+	 * a file has been recently written using a shared vma.  In
+	 * cases where cmtime may need to be updated but writepages is
+	 * not called, this is called instead.  (Implementations
+	 * should call mapping_test_clear_cmtime.)
+	 */
+	void (*flush_cmtime)(struct address_space *);
+
 	/* Set a page dirty.  Return true if this dirtied it */
 	int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..f6e8261 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@  typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
+void generic_flush_cmtime(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..7ed7700 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@ 
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,14 @@  static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
+	if (vma->vm_file) {
+		if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			if (mapping->a_ops && mapping->a_ops->flush_cmtime)
+				mapping->a_ops->flush_cmtime(mapping);
+		}
 		fput(vma->vm_file);
+	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..9ab8c9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1928,6 +1928,18 @@  int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/*
+	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
+	 * mark the mapping for deferred cmtime update.  As an optimization,
+	 * a filesystem can flush the update at the end of ->writepages
+	 * (possibly avoiding a journal transaction, for example), but,
+	 * for simplicity, let filesystems skip that part and just implement
+	 * ->flush_cmtime.
+	 */
+	if (mapping->a_ops->flush_cmtime)
+		mapping->a_ops->flush_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1970,6 +1982,20 @@  int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+/**
+ * generic_flush_cmtime - perform a deferred cmtime update if needed
+ * @mapping: address space structure
+ *
+ * This is a library function, which implements the flush_cmtime()
+ * address_space_operation.
+ */
+void generic_flush_cmtime(struct address_space *mapping)
+{
+	if (mapping_test_clear_cmtime(mapping))
+		inode_update_time_writable(mapping->host);
+}
+EXPORT_SYMBOL(generic_flush_cmtime);
+
 /*
  * For address_spaces which do not use buffers nor write back.
  */