
ext4: Fix mmap write protection for data=journal mode

Message ID 20201027132751.29858-1-jack@suse.cz
State Awaiting Upstream
Series ext4: Fix mmap write protection for data=journal mode

Commit Message

Jan Kara Oct. 27, 2020, 1:27 p.m. UTC
Commit afb585a97f81 ("ext4: data=journal: write-protect pages on
j_submit_inode_data_buffers()") added calls to ext4_jbd2_inode_add_write()
to track inode ranges whose mappings need to get write-protected during
transaction commits. However, the added calls use the wrong start of the
range (0 instead of the page offset), so the write protection is not
necessarily effective. Use the correct range start to fix the problem.

Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Mauricio, I think this could be the reason for occasional test failures you
were still seeing. Can you try whether this patch fixes those for you? Thanks!
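
To illustrate the range semantics (a minimal sketch, not code from the tree;
the helper name below is made up): ext4_jbd2_inode_add_write() records a byte
range [start, start + len) on the inode, and only that range gets
write-protected at commit time, so passing 0 covers just the first 'len' bytes
of the file rather than the page being journalled.

static int track_journalled_page(handle_t *handle, struct inode *inode,
				 struct page *page, unsigned int len)
{
	/* Before the fix: only bytes [0, len) of the file were tracked. */
	/* return ext4_jbd2_inode_add_write(handle, inode, 0, len); */

	/* After the fix: track the byte range actually backed by this page. */
	return ext4_jbd2_inode_add_write(handle, inode,
					 page_offset(page), len);
}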

Comments

Mauricio Faria de Oliveira Oct. 27, 2020, 4:10 p.m. UTC | #1
Hey Jan,

On Tue, Oct 27, 2020 at 10:27 AM Jan Kara <jack@suse.cz> wrote:
>
> Commit afb585a97f81 "ext4: data=journal: write-protect pages on
> j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write()
> to track inode ranges whose mappings need to get write-protected during
> transaction commits. However the added calls use wrong start of a range
> (0 instead of page offset) and so write protection is not necessarily
> effective. Use correct range start to fix the problem.
>
> Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> Mauricio, I think this could be the reason for occasional test failures you
> were still seeing. Can you try whether this patch fixes those for you? Thanks!
>

Thanks! Nice catch. Sure, I'll give it a try and follow up.

Just as an FYI, I've been working on debugging and instrumentation to
identify the corner cases where the page has not been write-protected,
but progress has been slower due to other work items.

cheers,
Mauricio

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 03c2253005f0..f4a599c6dcde 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1918,7 +1918,7 @@ static int __ext4_journalled_writepage(struct page *page,
>         }
>         if (ret == 0)
>                 ret = err;
> -       err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
> +       err = ext4_jbd2_inode_add_write(handle, inode, page_offset(page), len);
>         if (ret == 0)
>                 ret = err;
>         EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> @@ -6157,7 +6157,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>                         if (ext4_walk_page_buffers(handle, page_buffers(page),
>                                         0, len, NULL, write_end_fn))
>                                 goto out_error;
> -                       if (ext4_jbd2_inode_add_write(handle, inode, 0, len))
> +                       if (ext4_jbd2_inode_add_write(handle, inode,
> +                                                     page_offset(page), len))
>                                 goto out_error;
>                         ext4_set_inode_state(inode, EXT4_STATE_JDATA);
>                 } else {
> --
> 2.16.4
>
Theodore Ts'o Oct. 28, 2020, 3:56 a.m. UTC | #2
On Tue, Oct 27, 2020 at 02:27:51PM +0100, Jan Kara wrote:
> Commit afb585a97f81 "ext4: data=journal: write-protect pages on
> j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write()
> to track inode ranges whose mappings need to get write-protected during
> transaction commits. However the added calls use wrong start of a range
> (0 instead of page offset) and so write protection is not necessarily
> effective. Use correct range start to fix the problem.
> 
> Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted
Mauricio Faria de Oliveira Nov. 20, 2020, 3:27 p.m. UTC | #3
Hey Jan,

On Tue, Oct 27, 2020 at 1:10 PM Mauricio Faria de Oliveira
<mfo@canonical.com> wrote:
> On Tue, Oct 27, 2020 at 10:27 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Commit afb585a97f81 "ext4: data=journal: write-protect pages on
> > j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write()
> > to track inode ranges whose mappings need to get write-protected during
> > transaction commits. However the added calls use wrong start of a range
> > (0 instead of page offset) and so write protection is not necessarily
> > effective. Use correct range start to fix the problem.
> >
> > Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/inode.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > Mauricio, I think this could be the reason for occasional test failures you
> > were still seeing. Can you try whether this patch fixes those for you? Thanks!
> >
>
> Thanks! Nice catch. Sure, I'll give it a try and follow up.

TL;DR:

1) Thanks! The patch fixed almost 100% of the checksum failures.
2) I can send a debug patch to verify buffer checksums before write-out.
3) The remaining, rare checksum failures seem to be due to a race between
    the commit-time write-protection and page writeback, related to
    PageChecked() and the clearing of the pagecache dirty tag used for
    write-protection.
4) Test-result statistics confirm that the occurrence of checksum failures
    is now really low.

...

Sorry for the delay in following up on this.

Thanks for the patch! The results with v5.10-rc4 are almost 100% clean:

Journal checksum failures are now detected only on _very rare_ occasions,
with _zero_ recovery failures across 1187 stress-ng/crash/reboot/mount loops
overnight.
(Previously I'd get 3-5 out of 10 runs.)

I plan to send the debug patch used to verify the buffer checksum in the tag
before write-out (it catches, in advance, the checksum failures that would make
recovery fail), if you think it might be useful. I thought of putting it under
CONFIG_JBD2_DEBUG.

...

The remaining checksum changes due to write-protect failures _seem_ to be
a race between our write-protection via write_cache_pages() and writeback/sync,
but I'm not entirely sure, as it's been hard to pin down the timing/steps for
both threads. The idea is:

Our write_cache_pages() during commit, in
ext4_journalled_submit_inode_data_buffers(),
depends on PAGECACHE_TAG_DIRTY being set, so that pagevec_lookup_range_tag()
puts such pages in the array to be write-protected with
clear_page_dirty_for_io().
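
Roughly, that dependency looks like this (a hand-written sketch, not the
actual ext4/jbd2 code; the *_sketch names are made up):

static int writeprotect_callback_sketch(struct page *page,
					struct writeback_control *wbc,
					void *data)
{
	/*
	 * By the time we get here, write_cache_pages() has already called
	 * clear_page_dirty_for_io() on the page, which is what actually
	 * write-protects it (page_mkclean).
	 */
	return 0;
}

static int submit_inode_data_buffers_sketch(struct address_space *mapping)
{
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_ALL,
		.nr_to_write	= LONG_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
	};

	/*
	 * write_cache_pages() walks the pages tagged PAGECACHE_TAG_DIRTY
	 * (tagged lookup), write-protects each with
	 * clear_page_dirty_for_io(), and only then calls the callback
	 * above. If another writeback path clears the dirty tag first,
	 * the page is never visited here.
	 */
	return write_cache_pages(mapping, &wbc, writeprotect_callback_sketch,
				 NULL);
}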

With a debug patch that dump_stack()s the callers of
__test_set_page_writeback(), which can xas_clear_mark() that tag _while_ the
page is still being processed by our call to write_cache_pages(),
we see this: wb_workfn() -> ext4_writepage() -> ext4_bio_write_page(),

i.e., _not_ going through ext4_journalled_writepage(), which knows about and
waits on the journal.
The other leg of ext4_writepage() _doesn't_, and thus can clear the tag and
prevent write-protection while the journal commit / our write_cache_pages()
is running.

Since the switch to either leg is PageChecked(), I guess this might be due to
clearing it right away in ext4_journalled_writepage(), before write-protection.

If that is the issue, perhaps the clearing should be done in our writepage
callback or finish callback, but I haven't thought much about it yet, pending
confirmation that this is actually the root cause.
(It also seems there would be a problem if that bit is set again while we
process and clear it.)
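
For reference, the split I mean looks roughly like this (simplified from my
reading of ext4_writepage(); not a verbatim copy, and the i_size handling is
omitted):

static int ext4_writepage_sketch(struct page *page,
				 struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;
	unsigned int len = PAGE_SIZE;	/* ignoring i_size clamping here */

	if (PageChecked(page) && ext4_should_journal_data(inode)) {
		/*
		 * Journalled leg: __ext4_journalled_writepage() clears
		 * PageChecked() and cooperates with the journal commit.
		 */
		return __ext4_journalled_writepage(page, len);
	}

	/*
	 * Non-journalled leg: submits the page via ext4_bio_write_page()
	 * (through an io_submit context), setting PageWriteback() and
	 * clearing the dirty tag without waiting on the journal commit.
	 */
	return 0;	/* ext4_bio_write_page() path elided */
}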

...

And just details in the testing methodology, for documentation purposes.

The reproducer for the checksum errors is 'stress-ng --mmap $((4 *
$(nproc))) --mmap-file' on data=journal.
There's a test/wait/crash/reboot script that runs it and crashes the machine
after $RANDOM <= 120 seconds.

This used to _often_ hit a journal recovery failure at mount time on boot,
which prevents mounting/booting.
Now that the checksum errors occur much less often, there have been
_zero_ recovery failures.

With the debug patch that checks for and reports checksum changes, the kernel
logs now show such occurrences along with the (re)boot markers, e.g.:

[    0.000000] Linux version 5.10.0-rc4.rechecksum ...
[   80.534910] TESTCASE: checksum changed (tag3: 0x2d57bbdf, calc: 0xe0b4ddcb)
[    0.000000] Linux version 5.10.0-rc4.rechecksum ...

So it's relatively easy to convert those into numbers/statistics with
shell glue.

There was a total of 1187 reboots (i.e., stress-ng runs); 1131 had no
checksum errors and only 56 did.

The average number of reboots/runs between detected checksum errors is 20
(min 1, max 140, mean 16).
The number of checksum errors per reboot/run _that has errors_ is 1 in 95% of
cases, with a max of 2 in the remaining 5%.

So this confirms it's really rare now, to the point that it's even less likely
to happen at the very moment the crash is triggered, and thus jbd2 recovery
can finish correctly:

[    1.492521] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
[    1.493906] EXT4-fs (vda1): write access will be enabled during recovery
[    1.826665] EXT4-fs (vda1): orphan cleanup on readonly fs
[    1.832006] EXT4-fs (vda1): 16 orphan inodes deleted
[    1.833050] EXT4-fs (vda1): recovery complete
[    1.837041] EXT4-fs (vda1): mounted filesystem with journalled data mode. Opts: (null)

cheers,




--
Mauricio Faria de Oliveira
Theodore Ts'o Nov. 20, 2020, 6:10 p.m. UTC | #4
Mauricio,

Thanks for your work in finding the corner cases for data=journal.  If
you don't mind me asking, however --- what are the use cases for you
where data=journal is the preferred mode for ext4?  There are a lot of
advanced features which don't work in data=journal mode.  This
includes things like dioread_nolock (now the default), delayed
allocation, and other optimizations.

It used to be that data=journal pretty nicely fell out of how ext4
worked "normally".  These days, data=journal is becoming more and
more of an exception case that requires special handling.  And to be
honest, every so often there has been discussion about whether the
maintenance overhead of data=journal has been worth it.  So
far, we just don't bother making data=journal work with things like
delayed allocation, and one of the reasons why we've kept it around
is because it's a unique mode that none of the other Linux file
systems have.

It would be useful, though, to understand the use cases where
you (or your customers) are finding data=journal useful, so we can
better optimize for them.  And if there are enough people
who care about it --- and clearly, you've invested so much effort that
you definitely fall into that category :-) --- then maybe there's a
business case for investing more into data=journal and trying to make
it something which is easier to maintain and can work with things like
delayed allocation.

Thanks,

					- Ted
Jan Kara Nov. 23, 2020, 9:34 a.m. UTC | #5
Hi Mauricio!

On Fri 20-11-20 12:27:56, Mauricio Faria de Oliveira wrote:
> On Tue, Oct 27, 2020 at 1:10 PM Mauricio Faria de Oliveira
> <mfo@canonical.com> wrote:
> > On Tue, Oct 27, 2020 at 10:27 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Commit afb585a97f81 "ext4: data=journal: write-protect pages on
> > > j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write()
> > > to track inode ranges whose mappings need to get write-protected during
> > > transaction commits. However the added calls use wrong start of a range
> > > (0 instead of page offset) and so write protection is not necessarily
> > > effective. Use correct range start to fix the problem.
> > >
> > > Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()")
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/ext4/inode.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > Mauricio, I think this could be the reason for occasional test failures you
> > > were still seeing. Can you try whether this patch fixes those for you? Thanks!
> > >
> >
> > Thanks! Nice catch. Sure, I'll give it a try and follow up.
> 
> TL;DR:
> 
> 1) Thanks! The patch fixed almost 100% of the checksum failures.
> 2) I can send a debug patch to verify buffer checksums before write-out.
> 3) The remaining, rare checksum failures seem to be due to
>     a race between commit/write-protect and page writeback
>     related to PageChecked(), clearing pagecache dirty tag used to
> write-protect.
> 4) Test results statistics confirm that the occurrence of checksum
> failures is really low.

Glad to hear that!

> Thanks for the patch! The results with v5.10-rc4 are almost 100%:
> 
> There are now _very rare_ occasions of journal checksum failures detected; with
> _zero_ recovery failures in stress-ng/crash/reboot/mount in 1187 loops
> overnight.
> (Previously I'd get 3-5 out of 10.)
> 
> I plan to send the debug patch used to verify the buffer checksum in the tag
> before write-out (catches the checksum failures that fail recovery in advance),
> if you think it might be useful. I thought of it under CONFIG_JBD2_DEBUG.
> 
> ...
> 
> The remaining checksum changes due to write-protect failures _seem_ to be
> a race between our write-protect with write_cache_pages() and writeback/sync.
> But I'm not exactly sure as it's been hard to find the timing/steps
> for both threads.
> The idea is,
> 
> Our write_cache_pages() during commit /
> ext4_journalled_submit_inode_data_buffers()
> depends on PAGECACHE_TAG_DIRTY being set, for pagevec_lookup_range_tag()
> to put such pages in the array to be write-protected with
> clear_page_dirty_for_io().
> 
> With a debug patch to dump_stack() callers of
> __test_set_page_writeback(), that can
> xas_clear_mark() it, _while_ the page is still going in our call to
> write_cache_pages(),
> we see this: wb_workfn() -> ext4_writepage() -> ext4_bio_write_page(),

I guess there was something between wb_workfn() and ext4_writepage(),
wasn't there? There should be write_cache_pages()...

> i.e., _not_ going through ext4_journalled_writepage(), which
> knows/waits on journal.
> The other leg of ext4_writepage()  _doesn't_, and thus can clear the
> tag and prevent
> write-protect while the journal commit / our write_cache_pages() is running.

So I don't think this is quite it. If there are two writebacks racing,
either of them will writeprotect the page which is enough for commit to be
safe (as the mapped write will then trigger ext4_page_mkwrite() which will
wait for the end of running commit in jbd2_journal_get_write_access()). Or
am I missing something? But there must be still some race as you can still
see occasional checksum changes... So I must be missing something.

> Since the switch to either leg is PageChecked(), I guess this might be
> due to clearing it right away in ext4_journalled_writepage(), before
> write-protection.

The write-protection happens in the clear_page_dirty_for_io() call, so that's
before our ->writepage() callback is even called.

								Honza
Mauricio Faria de Oliveira Nov. 23, 2020, 11:52 p.m. UTC | #6
Hey Ted,

On Fri, Nov 20, 2020 at 3:10 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Mauricio,
>
> Thanks for your work in finding the corner cases for data=journal.  If

Thank you and Jan for your suggestions and guidance to fix them.

> you don't mind me asking, however --- what are the use cases for you
> where data=journal is the preferred mode for ext4?  There are a lot of
> advanced features for data=journal mode which don't work.  This
> includes things like dioread_nolock (now the default), delayed
> allocation, and other optimizations.
>
> It used to be that data=journal pretty nicely fell out of how ext4
> worked in "normally".  These days, data=journal is becoming more and
> more an exception case that requires special handling.  And to be
> honest, every so often there has been discussion about whether the
> maintenance overhead of data=journal has been worth keeping it.  So
> far, we just don't bother making data=journal work with things like
> delayed allocation, and one of ther reasons why we've kept it around
> is because it's a unique mode that none of the Linux file systems
> have.
>
> It would be useful, though, to understand what are the use cases where
> you (or your customers) are finding data=journal useful, so we can
> better optimize for their use case.  And if there are enough people
> who care about it --- and clearly, you've invested so much effort that
> you definitely fall into that category :-) --- then maybe there's a
> business case for investing more into data=journal and trying to make
> it something which is easier to maintain and can work with things like
> delayed allocation.
>

The main reason a customer used data=journal in some cases
is that it provided them more data consistency on power loss.

Specifically, it prevented some consistency-check errors in
applications where it's unfortunately not always possible
to request/guarantee that sync() is called (not ext4's fault).

We've previously laid out the disadvantages of data=journal,
including the discussion about its future/maintenance upstream
and the risks associated with less exposure/testing (e.g., this bug! :),
in order to assist them with decision-making and possibly switching.

It seems that data=ordered with nodelalloc and a smaller commit=
interval could help them, but that's still at the experimental stage.

So, it seems this use case is more about data integrity than features,
and data=journal does seem to help with that despite the applications'
behavior.

I'm glad that ext4 still provides such a mode, but it's understandable that
the maintenance, features, and special handling lead to the discussion
about no longer keeping it.

Thanks for reaching out to understand the use cases from users,
and for being willing to factor that into development/maintenance.

There are also a few bits about it from Andreas in the v1 series [1].

Hope this helps!
Mauricio

[1] https://lore.kernel.org/linux-ext4/C9FEDED5-CDEE-449F-AE11-64BB56A42277@dilger.ca/

> Thanks,
>
>                                         - Ted
Mauricio Faria de Oliveira Nov. 24, 2020, 12:03 a.m. UTC | #7
Hey Jan,

On Mon, Nov 23, 2020 at 6:34 AM Jan Kara <jack@suse.cz> wrote:
> > ...
> >
> > The remaining checksum changes due to write-protect failures _seem_ to be
> > a race between our write-protect with write_cache_pages() and writeback/sync.
> > But I'm not exactly sure as it's been hard to find the timing/steps
> > for both threads.
> > The idea is,
> >
> > Our write_cache_pages() during commit /
> > ext4_journalled_submit_inode_data_buffers()
> > depends on PAGECACHE_TAG_DIRTY being set, for pagevec_lookup_range_tag()
> > to put such pages in the array to be write-protected with
> > clear_page_dirty_for_io().
> >
> > With a debug patch to dump_stack() callers of
> > __test_set_page_writeback(), that can
> > xas_clear_mark() it, _while_ the page is still going in our call to
> > write_cache_pages(),
> > we see this: wb_workfn() -> ext4_writepage() -> ext4_bio_write_page(),
>
> I guess there was something between wb_workfn() and ext4_writepage(),
> wasn't there? There should be write_cache_pages()...
>

Yes, you're right, I oversimplified the function path between the first two. :)
(I'll use '...' in the future to make that clearer.)

> > i.e., _not_ going through ext4_journalled_writepage(), which
> > knows/waits on journal.
> > The other leg of ext4_writepage()  _doesn't_, and thus can clear the
> > tag and prevent
> > write-protect while the journal commit / our write_cache_pages() is running.
>
> So I don't think this is quite it. If there are two writebacks racing,
> either of them will writeprotect the page which is enough for commit to be
> safe (as the mapped write will then trigger ext4_page_mkwrite() which will
> wait for the end of running commit in jbd2_journal_get_write_access()). Or
> am I missing something? But there must be still some race as you can still
> see occasional checksum changes... So I must be missing something.
>

Actually you're right, I missed that -- either writeback will
writeprotect the page.
There's still something else going on, but it's me, not you, who missed
something. :)
Thanks for pointing that out.

> > Since the switch to either leg is PageChecked(), I guess this might be
> > due to clearing it right away in ext4_journalled_writepage(), before
> > write-protection.
>
> The write-protection happens in the clear_page_dirty_for_io() call, so that's
> before our ->writepage() callback is even called.
>

Right, I meant write-protection by the other/racing writeback thread.
But never mind that, as this doesn't seem to be the corner case anymore.

Thanks!

>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03c2253005f0..f4a599c6dcde 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1918,7 +1918,7 @@  static int __ext4_journalled_writepage(struct page *page,
 	}
 	if (ret == 0)
 		ret = err;
-	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
+	err = ext4_jbd2_inode_add_write(handle, inode, page_offset(page), len);
 	if (ret == 0)
 		ret = err;
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -6157,7 +6157,8 @@  vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 			if (ext4_walk_page_buffers(handle, page_buffers(page),
 					0, len, NULL, write_end_fn))
 				goto out_error;
-			if (ext4_jbd2_inode_add_write(handle, inode, 0, len))
+			if (ext4_jbd2_inode_add_write(handle, inode,
+						      page_offset(page), len))
 				goto out_error;
 			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 		} else {