ext4: fix data integrity sync in ordered mode

Message ID 004201cf645b$430004f0$c9000ed0$@samsung.com
State Superseded, archived

Commit Message

Namjae Jeon April 30, 2014, 10:02 a.m. UTC
When we perform a data integrity sync, we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.
Later we check for this tag in write_cache_pages_da, create a
struct mpage_da_data containing contiguously indexed pages tagged with this
tag, and sync these pages with a call to mpage_da_map_and_submit.
This is done in a while loop until all the PAGECACHE_TAG_TOWRITE pages
are synced. We also do a journal start and stop in each iteration.
journal_stop can initiate a journal commit, which calls ext4_writepage,
which in turn calls ext4_bio_write_page even for delayed or unwritten
buffers. When ext4_bio_write_page is called for such buffers, it does not
sync them, but it still clears the PAGECACHE_TAG_TOWRITE tag of the
corresponding page, so these pages are skipped by the currently running
data integrity sync. We end up with dirty pages although the sync has
completed.

This can cause data loss when the sync call is followed by a
truncate_pagecache call, which is exactly the case in collapse_range.
(It causes the generic/127 failure in xfstests.)

Cc: stable@vger.kernel.org
Cc: Jan Kara <jack@suse.de>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
---
 fs/ext4/inode.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
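
The race described in the commit message can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: struct
toy_page and every toy_* function are hypothetical stand-ins invented for
illustration, not the kernel's struct page, radix tree tags, or ext4
functions. It shows a page that loses its TOWRITE mark during a
commit-time partial write and is then skipped by the sync loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page state; NOT the kernel's struct page (illustrative only). */
struct toy_page {
	bool dirty;        /* page still holds dirty data */
	bool towrite;      /* models PAGECACHE_TAG_TOWRITE */
	bool has_delalloc; /* some buffers are delayed or unwritten */
};

/* Models the tagging step at the start of an integrity sync. */
static void toy_tag_for_writeback(struct toy_page *pages, size_t n)
{
	assert(pages != NULL);
	for (size_t i = 0; i < n; i++)
		if (pages[i].dirty)
			pages[i].towrite = true;
}

/*
 * Models the commit-time write of one page: starting writeback clears
 * the TOWRITE mark, but delayed buffers are not written, so a page
 * that has them stays dirty.
 */
static void toy_commit_writepage(struct toy_page *p)
{
	p->towrite = false;
	if (!p->has_delalloc)
		p->dirty = false;
}

/*
 * Models the sync loop: only pages still marked TOWRITE are written
 * out completely. Returns the number of pages written.
 */
static size_t toy_sync_tagged(struct toy_page *pages, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].towrite)
			continue; /* mark was lost: page is skipped */
		pages[i].has_delalloc = false;
		pages[i].dirty = false;
		pages[i].towrite = false;
		written++;
	}
	return written;
}

/* Counts pages that are still dirty after the sync loop. */
static size_t toy_count_dirty(const struct toy_page *pages, size_t n)
{
	size_t dirty = 0;

	for (size_t i = 0; i < n; i++)
		if (pages[i].dirty)
			dirty++;
	return dirty;
}
```

In this model, calling toy_commit_writepage() on a tagged page with delayed
buffers between toy_tag_for_writeback() and toy_sync_tagged() leaves that
page dirty after the sync, which corresponds to the situation the commit
message describes.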

Comments

Jan Kara April 30, 2014, 4:01 p.m. UTC | #1
Hello,

On Wed 30-04-14 19:02:14, Namjae Jeon wrote:
> When we perform a data integrity sync we tag all the dirty pages with
> PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages.
> Later we check for this tag in write_cache_pages_da and creates a
> struct mpage_da_data containing contiguously indexed pages tagged with this
> tag and sync these pages with a call to mpage_da_map_and_submit.
> This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages
> are synced. We also do journal start and stop in each iteration.
> journal_stop could initiate journal commit which would call ext4_writepage
> which in turn will call ext4_bio_write_page even for delayed OR unwritten
> buffers. When ext4_bio_write_page is called for such buffers, even though it
> does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding
> page and hence these pages are also not synced by the currently running data
> integrity sync. We will end up with dirty pages although sync is completed.
> 
> This could cause a potential data loss when the sync call is followed by a
> truncate_pagecache call, which is exactly the case in collapse_range.
> (It will cause generic/127 failure in xfstests)
  This is well spotted. Thanks for finding this bug. See my comment below
regarding the fix.

> Cc: stable@vger.kernel.org
> Cc: Jan kara <jack@suse.de>
> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
> ---
>  fs/ext4/inode.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b1dc334..bd85712 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1865,12 +1865,19 @@ static int ext4_writepage(struct page *page,
>  	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
>  				   ext4_bh_delay_or_unwritten)) {
>  		redirty_page_for_writepage(wbc, page);
> -		if (current->flags & PF_MEMALLOC) {
> +		if ((current->flags & PF_MEMALLOC) || 
> +		     radix_tree_tag_get(&page->mapping->page_tree,
> +					page->index, PAGECACHE_TAG_TOWRITE)) {
  I don't think your fix is correct. journal_submit_inode_data_buffers()
uses WB_SYNC_ALL mode to write the pages and thus all the pages you'll see
in ext4_writepage() are going to have the TOWRITE tag set. And even if that
wasn't the case, you'll have problems when blocksize < pagesize, because in
data=ordered mode we want to write out allocated (mapped) blocks in the page
to avoid exposure of uninitialized data after a crash (e.g. in case we have
allocated some blocks in the current transaction but not yet finished
writing them out and there are other blocks underlying the page which
aren't allocated yet). Fixing this isn't easy, I'm afraid.

What we could do is create a variant of set_page_writeback() which
doesn't clear the TOWRITE tag and use that in ext4_bio_write_page() if we
are writing out just some buffers in a page and leaving other dirty buffers
behind. It would have the downside that we would be leaving TOWRITE tagged
pages behind in the case where we don't actually race with other writeback,
but I don't see that causing any real problems.

								Honza

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b1dc334..bd85712 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1865,12 +1865,19 @@  static int ext4_writepage(struct page *page,
 	if (ext4_walk_page_buffers(NULL, page_bufs, 0, len, NULL,
 				   ext4_bh_delay_or_unwritten)) {
 		redirty_page_for_writepage(wbc, page);
-		if (current->flags & PF_MEMALLOC) {
+		if ((current->flags & PF_MEMALLOC) || 
+		     radix_tree_tag_get(&page->mapping->page_tree,
+					page->index, PAGECACHE_TAG_TOWRITE)) {
 			/*
 			 * For memory cleaning there's no point in writing only
 			 * some buffers. So just bail out. Warn if we came here
 			 * from direct reclaim.
-			 */
+			 * We should also bail out when a journal commit
+			 * happens during an integrity sync operation, because
+			 * calling ext4_bio_write_page in this case will clear
+			 * PAGECACHE_TAG_TOWRITE and we could end up with
+			 * dirty pages even after completion of a sync call.
+			 */
 			WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD))
 							== PF_MEMALLOC);
 			unlock_page(page);