[C,1/1] mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
diff mbox series

Message ID 20190415163210.17463-2-mfo@canonical.com
State New
Headers show
Series
  • Fix write()/fsync() deadlock in write_cache_pages()
Related show

Commit Message

Mauricio Faria de Oliveira April 15, 2019, 4:32 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

BugLink: https://bugs.launchpad.net/bugs/1824827

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback		Process 1		Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
			write()
			locks page 1
			dirties page 1
			locks page 2
			dirties page 1
			fsync()
			....
			xfs_vm_writepages
			write_cache_pages
			  start index 0
			  find page 1 towrite
			  lock Page 1
			  ->writepage
			    page 1 writeback
			    page 1 clean
			    page 1 added to bio
			  find page 2 towrite
			  lock Page 2
			  page 2 is writeback
			  <blocks>
						write()
						locks page 1
						dirties page 1
						fsync()
						....
						xfs_vm_writepages
						write_cache_pages
						  start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
						  find page 1 towrite
						  lock Page 1
						  page 1 is writeback
						  <blocks>

    lock Page 1
    <blocks>

DEADLOCK because:

	- process 1 needs page 2 writeback to complete to make
	  enough progress to issue IO pending for page 1
	- writeback needs page 1 writeback to complete so process 2
	  can progress and unlock the page it is blocked on, then it
	  can issue the IO pending for page 2
	- process 2 can't make progress until process 1 issues IO
	  for page 1

The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages
in a structure controlled from above write_cache_pages().  The
write_cache_pages() caller needs to be able to submit these pages for IO
before write_cache_pages restarts writeback at mapping index 0 to avoid
wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there
is no problem with range_cyclic going back to mapping index 0.

However:
	mpage_writepages() has a private bio context,
	exofs_writepages() has page_collect
	fuse_writepages() has fuse_fill_wb_data
	nfs_writepages() has nfs_pageio_descriptor
	xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback
in their private structures until write_cache_pages() returns, and hence
they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
at the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much.  I'll leave it for the ext4 developers
to determine if their code is actually has this deadlock and how to fix it
if it has.

There's a few ways I can see avoid this deadlock.  There's probably more,
but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offset from starving background writeback of
rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on
performance as going back to the start of the file implies an
immediate seek. We'll have exactly the same number of seeks if we
switch writeback to another inode, and then come back to this one
later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the
retry loop up into the wcp caller means we can issue IO on the
pending pages before calling wcp again, and so avoid locking or
waiting on pages in the wrong order. I'm not convinced we need to do
this given that we get the same thing from #2 on the next writeback
call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait
inversion problem, just prevents it from becoming a deadlock
situation. I'd prefer we fix the inversion, not sweep it under the
carpet like this.

#3a is really an optimisation that just so happens to include the
band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement
solution #2

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7)
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 mm/page-writeback.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

Comments

Stefan Bader April 16, 2019, 8:34 a.m. UTC | #1
On 15.04.19 18:32, Mauricio Faria de Oliveira wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1824827
> 
> We've recently seen a workload on XFS filesystems with a repeatable
> deadlock between background writeback and a multi-process application
> doing concurrent writes and fsyncs to a small range of a file.
> 
> range_cyclic
> writeback		Process 1		Process 2
> 
> xfs_vm_writepages
>   write_cache_pages
>     writeback_index = 2
>     cycled = 0
>     ....
>     find page 2 dirty
>     lock Page 2
>     ->writepage
>       page 2 writeback
>       page 2 clean
>       page 2 added to bio
>     no more pages
> 			write()
> 			locks page 1
> 			dirties page 1
> 			locks page 2
> 			dirties page 1
> 			fsync()
> 			....
> 			xfs_vm_writepages
> 			write_cache_pages
> 			  start index 0
> 			  find page 1 towrite
> 			  lock Page 1
> 			  ->writepage
> 			    page 1 writeback
> 			    page 1 clean
> 			    page 1 added to bio
> 			  find page 2 towrite
> 			  lock Page 2
> 			  page 2 is writeback
> 			  <blocks>
> 						write()
> 						locks page 1
> 						dirties page 1
> 						fsync()
> 						....
> 						xfs_vm_writepages
> 						write_cache_pages
> 						  start index 0
> 
>     !done && !cycled
>       sets index to 0, restarts lookup
>     find page 1 dirty
> 						  find page 1 towrite
> 						  lock Page 1
> 						  page 1 is writeback
> 						  <blocks>
> 
>     lock Page 1
>     <blocks>
> 
> DEADLOCK because:
> 
> 	- process 1 needs page 2 writeback to complete to make
> 	  enough progress to issue IO pending for page 1
> 	- writeback needs page 1 writeback to complete so process 2
> 	  can progress and unlock the page it is blocked on, then it
> 	  can issue the IO pending for page 2
> 	- process 2 can't make progress until process 1 issues IO
> 	  for page 1
> 
> The underlying cause of the problem here is that range_cyclic writeback is
> processing pages in descending index order as we hold higher index pages
> in a structure controlled from above write_cache_pages().  The
> write_cache_pages() caller needs to be able to submit these pages for IO
> before write_cache_pages restarts writeback at mapping index 0 to avoid
> wcp inverting the page lock/writeback wait order.
> 
> generic_writepages() is not susceptible to this bug as it has no private
> context held across write_cache_pages() - filesystems using this
> infrastructure always submit pages in ->writepage immediately and so there
> is no problem with range_cyclic going back to mapping index 0.
> 
> However:
> 	mpage_writepages() has a private bio context,
> 	exofs_writepages() has page_collect
> 	fuse_writepages() has fuse_fill_wb_data
> 	nfs_writepages() has nfs_pageio_descriptor
> 	xfs_vm_writepages() has xfs_writepage_ctx
> 
> All of these ->writepages implementations can hold pages under writeback
> in their private structures until write_cache_pages() returns, and hence
> they are all susceptible to this deadlock.
> 
> Also worth noting is that ext4 has it's own bastardised version of
> write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
> at the code long enough to understand that it has a similar retry loop for
> range_cyclic writeback reaching the end of the file and then promptly ran
> away before my eyes bled too much.  I'll leave it for the ext4 developers
> to determine if their code is actually has this deadlock and how to fix it
> if it has.
> 
> There's a few ways I can see avoid this deadlock.  There's probably more,
> but these are the first I've though of:
> 
> 1. get rid of range_cyclic altogether
> 
> 2. range_cyclic always stops at EOF, and we start again from
> writeback index 0 on the next call into write_cache_pages()
> 
> 2a. wcp also returns EAGAIN to ->writepages implementations to
> indicate range cyclic has hit EOF. writepages implementations can
> then flush the current context and call wpc again to continue. i.e.
> lift the retry into the ->writepages implementation
> 
> 3. range_cyclic uses trylock_page() rather than lock_page(), and it
> skips pages it can't lock without blocking. It will already do this
> for pages under writeback, so this seems like a no-brainer
> 
> 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
> blocking as per pages under writeback.
> 
> I don't think #1 is an option - range_cyclic prevents frequently
> dirtied lower file offset from starving background writeback of
> rarely touched higher file offsets.
> 
> #2 is simple, and I don't think it will have any impact on
> performance as going back to the start of the file implies an
> immediate seek. We'll have exactly the same number of seeks if we
> switch writeback to another inode, and then come back to this one
> later and restart from index 0.
> 
> #2a is pretty much "status quo without the deadlock". Moving the
> retry loop up into the wcp caller means we can issue IO on the
> pending pages before calling wcp again, and so avoid locking or
> waiting on pages in the wrong order. I'm not convinced we need to do
> this given that we get the same thing from #2 on the next writeback
> call from the writeback infrastructure.
> 
> #3 is really just a band-aid - it doesn't fix the access/wait
> inversion problem, just prevents it from becoming a deadlock
> situation. I'd prefer we fix the inversion, not sweep it under the
> carpet like this.
> 
> #3a is really an optimisation that just so happens to include the
> band-aid fix of #3.
> 
> So it seems that the simplest way to fix this issue is to implement
> solution #2
> 
> Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.de>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (cherry picked from commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7)
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---

Good testing coverage and specific testcase for the issue.

>  mm/page-writeback.c | 33 +++++++++++++++------------------
>  1 file changed, 15 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 337c6afb3345..c10e70ed3515 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2150,6 +2150,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback);
>   * not miss some pages (e.g., because some other process has cleared TOWRITE
>   * tag we set). The rule we follow is that TOWRITE tag can be cleared only
>   * by the process clearing the DIRTY tag (and submitting the page for IO).
> + *
> + * To avoid deadlocks between range_cyclic writeback and callers that hold
> + * pages in PageWriteback to aggregate IO until write_cache_pages() returns,
> + * we do not loop back to the start of the file. Doing so causes a page
> + * lock/page writeback access order inversion - we should only ever lock
> + * multiple pages in ascending page->index order, and looping back to the start
> + * of the file violates that rule and causes deadlocks.
>   */
>  int write_cache_pages(struct address_space *mapping,
>  		      struct writeback_control *wbc, writepage_t writepage,
> @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping,
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> -	int cycled;
>  	int range_whole = 0;
>  	int tag;
>  
> @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping,
>  	if (wbc->range_cyclic) {
>  		writeback_index = mapping->writeback_index; /* prev offset */
>  		index = writeback_index;
> -		if (index == 0)
> -			cycled = 1;
> -		else
> -			cycled = 0;
>  		end = -1;
>  	} else {
>  		index = wbc->range_start >> PAGE_SHIFT;
>  		end = wbc->range_end >> PAGE_SHIFT;
>  		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>  			range_whole = 1;
> -		cycled = 1; /* ignore range_cyclic tests */
>  	}
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>  		tag = PAGECACHE_TAG_TOWRITE;
>  	else
>  		tag = PAGECACHE_TAG_DIRTY;
> -retry:
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>  		tag_pages_for_writeback(mapping, index, end);
>  	done_index = index;
> @@ -2273,17 +2273,14 @@ int write_cache_pages(struct address_space *mapping,
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> -	if (!cycled && !done) {
> -		/*
> -		 * range_cyclic:
> -		 * We hit the last page and there is more work to be done: wrap
> -		 * back to the start of the file
> -		 */
> -		cycled = 1;
> -		index = 0;
> -		end = writeback_index - 1;
> -		goto retry;
> -	}
> +
> +	/*
> +	 * If we hit the last page and there is more work to be done: wrap
> +	 * back the index back to the start of the file for the next
> +	 * time we are called.
> +	 */
> +	if (wbc->range_cyclic && !done)
> +		done_index = 0;
>  	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
>  		mapping->writeback_index = done_index;
>  
>
Kleber Souza April 17, 2019, 2:12 p.m. UTC | #2
On 4/15/19 6:32 PM, Mauricio Faria de Oliveira wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> BugLink: https://bugs.launchpad.net/bugs/1824827
>
> We've recently seen a workload on XFS filesystems with a repeatable
> deadlock between background writeback and a multi-process application
> doing concurrent writes and fsyncs to a small range of a file.
>
> range_cyclic
> writeback		Process 1		Process 2
>
> xfs_vm_writepages
>   write_cache_pages
>     writeback_index = 2
>     cycled = 0
>     ....
>     find page 2 dirty
>     lock Page 2
>     ->writepage
>       page 2 writeback
>       page 2 clean
>       page 2 added to bio
>     no more pages
> 			write()
> 			locks page 1
> 			dirties page 1
> 			locks page 2
> 			dirties page 1
> 			fsync()
> 			....
> 			xfs_vm_writepages
> 			write_cache_pages
> 			  start index 0
> 			  find page 1 towrite
> 			  lock Page 1
> 			  ->writepage
> 			    page 1 writeback
> 			    page 1 clean
> 			    page 1 added to bio
> 			  find page 2 towrite
> 			  lock Page 2
> 			  page 2 is writeback
> 			  <blocks>
> 						write()
> 						locks page 1
> 						dirties page 1
> 						fsync()
> 						....
> 						xfs_vm_writepages
> 						write_cache_pages
> 						  start index 0
>
>     !done && !cycled
>       sets index to 0, restarts lookup
>     find page 1 dirty
> 						  find page 1 towrite
> 						  lock Page 1
> 						  page 1 is writeback
> 						  <blocks>
>
>     lock Page 1
>     <blocks>
>
> DEADLOCK because:
>
> 	- process 1 needs page 2 writeback to complete to make
> 	  enough progress to issue IO pending for page 1
> 	- writeback needs page 1 writeback to complete so process 2
> 	  can progress and unlock the page it is blocked on, then it
> 	  can issue the IO pending for page 2
> 	- process 2 can't make progress until process 1 issues IO
> 	  for page 1
>
> The underlying cause of the problem here is that range_cyclic writeback is
> processing pages in descending index order as we hold higher index pages
> in a structure controlled from above write_cache_pages().  The
> write_cache_pages() caller needs to be able to submit these pages for IO
> before write_cache_pages restarts writeback at mapping index 0 to avoid
> wcp inverting the page lock/writeback wait order.
>
> generic_writepages() is not susceptible to this bug as it has no private
> context held across write_cache_pages() - filesystems using this
> infrastructure always submit pages in ->writepage immediately and so there
> is no problem with range_cyclic going back to mapping index 0.
>
> However:
> 	mpage_writepages() has a private bio context,
> 	exofs_writepages() has page_collect
> 	fuse_writepages() has fuse_fill_wb_data
> 	nfs_writepages() has nfs_pageio_descriptor
> 	xfs_vm_writepages() has xfs_writepage_ctx
>
> All of these ->writepages implementations can hold pages under writeback
> in their private structures until write_cache_pages() returns, and hence
> they are all susceptible to this deadlock.
>
> Also worth noting is that ext4 has it's own bastardised version of
> write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
> at the code long enough to understand that it has a similar retry loop for
> range_cyclic writeback reaching the end of the file and then promptly ran
> away before my eyes bled too much.  I'll leave it for the ext4 developers
> to determine if their code is actually has this deadlock and how to fix it
> if it has.
>
> There's a few ways I can see avoid this deadlock.  There's probably more,
> but these are the first I've though of:
>
> 1. get rid of range_cyclic altogether
>
> 2. range_cyclic always stops at EOF, and we start again from
> writeback index 0 on the next call into write_cache_pages()
>
> 2a. wcp also returns EAGAIN to ->writepages implementations to
> indicate range cyclic has hit EOF. writepages implementations can
> then flush the current context and call wpc again to continue. i.e.
> lift the retry into the ->writepages implementation
>
> 3. range_cyclic uses trylock_page() rather than lock_page(), and it
> skips pages it can't lock without blocking. It will already do this
> for pages under writeback, so this seems like a no-brainer
>
> 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
> blocking as per pages under writeback.
>
> I don't think #1 is an option - range_cyclic prevents frequently
> dirtied lower file offset from starving background writeback of
> rarely touched higher file offsets.
>
> #2 is simple, and I don't think it will have any impact on
> performance as going back to the start of the file implies an
> immediate seek. We'll have exactly the same number of seeks if we
> switch writeback to another inode, and then come back to this one
> later and restart from index 0.
>
> #2a is pretty much "status quo without the deadlock". Moving the
> retry loop up into the wcp caller means we can issue IO on the
> pending pages before calling wcp again, and so avoid locking or
> waiting on pages in the wrong order. I'm not convinced we need to do
> this given that we get the same thing from #2 on the next writeback
> call from the writeback infrastructure.
>
> #3 is really just a band-aid - it doesn't fix the access/wait
> inversion problem, just prevents it from becoming a deadlock
> situation. I'd prefer we fix the inversion, not sweep it under the
> carpet like this.
>
> #3a is really an optimisation that just so happens to include the
> band-aid fix of #3.
>
> So it seems that the simplest way to fix this issue is to implement
> solution #2
>
> Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.de>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (cherry picked from commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7)
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>

Clean cherry-pick, the fix has been stress-tested and has a way to reproduce.

Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>

> ---
>  mm/page-writeback.c | 33 +++++++++++++++------------------
>  1 file changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 337c6afb3345..c10e70ed3515 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2150,6 +2150,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback);
>   * not miss some pages (e.g., because some other process has cleared TOWRITE
>   * tag we set). The rule we follow is that TOWRITE tag can be cleared only
>   * by the process clearing the DIRTY tag (and submitting the page for IO).
> + *
> + * To avoid deadlocks between range_cyclic writeback and callers that hold
> + * pages in PageWriteback to aggregate IO until write_cache_pages() returns,
> + * we do not loop back to the start of the file. Doing so causes a page
> + * lock/page writeback access order inversion - we should only ever lock
> + * multiple pages in ascending page->index order, and looping back to the start
> + * of the file violates that rule and causes deadlocks.
>   */
>  int write_cache_pages(struct address_space *mapping,
>  		      struct writeback_control *wbc, writepage_t writepage,
> @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping,
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> -	int cycled;
>  	int range_whole = 0;
>  	int tag;
>  
> @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping,
>  	if (wbc->range_cyclic) {
>  		writeback_index = mapping->writeback_index; /* prev offset */
>  		index = writeback_index;
> -		if (index == 0)
> -			cycled = 1;
> -		else
> -			cycled = 0;
>  		end = -1;
>  	} else {
>  		index = wbc->range_start >> PAGE_SHIFT;
>  		end = wbc->range_end >> PAGE_SHIFT;
>  		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
>  			range_whole = 1;
> -		cycled = 1; /* ignore range_cyclic tests */
>  	}
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>  		tag = PAGECACHE_TAG_TOWRITE;
>  	else
>  		tag = PAGECACHE_TAG_DIRTY;
> -retry:
>  	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>  		tag_pages_for_writeback(mapping, index, end);
>  	done_index = index;
> @@ -2273,17 +2273,14 @@ int write_cache_pages(struct address_space *mapping,
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> -	if (!cycled && !done) {
> -		/*
> -		 * range_cyclic:
> -		 * We hit the last page and there is more work to be done: wrap
> -		 * back to the start of the file
> -		 */
> -		cycled = 1;
> -		index = 0;
> -		end = writeback_index - 1;
> -		goto retry;
> -	}
> +
> +	/*
> +	 * If we hit the last page and there is more work to be done: wrap
> +	 * back the index back to the start of the file for the next
> +	 * time we are called.
> +	 */
> +	if (wbc->range_cyclic && !done)
> +		done_index = 0;
>  	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
>  		mapping->writeback_index = done_index;
>

Patch
diff mbox series

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 337c6afb3345..c10e70ed3515 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2150,6 +2150,13 @@  EXPORT_SYMBOL(tag_pages_for_writeback);
  * not miss some pages (e.g., because some other process has cleared TOWRITE
  * tag we set). The rule we follow is that TOWRITE tag can be cleared only
  * by the process clearing the DIRTY tag (and submitting the page for IO).
+ *
+ * To avoid deadlocks between range_cyclic writeback and callers that hold
+ * pages in PageWriteback to aggregate IO until write_cache_pages() returns,
+ * we do not loop back to the start of the file. Doing so causes a page
+ * lock/page writeback access order inversion - we should only ever lock
+ * multiple pages in ascending page->index order, and looping back to the start
+ * of the file violates that rule and causes deadlocks.
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -2163,7 +2170,6 @@  int write_cache_pages(struct address_space *mapping,
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
-	int cycled;
 	int range_whole = 0;
 	int tag;
 
@@ -2171,23 +2177,17 @@  int write_cache_pages(struct address_space *mapping,
 	if (wbc->range_cyclic) {
 		writeback_index = mapping->writeback_index; /* prev offset */
 		index = writeback_index;
-		if (index == 0)
-			cycled = 1;
-		else
-			cycled = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_SHIFT;
 		end = wbc->range_end >> PAGE_SHIFT;
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
-		cycled = 1; /* ignore range_cyclic tests */
 	}
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
-retry:
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
@@ -2273,17 +2273,14 @@  int write_cache_pages(struct address_space *mapping,
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!cycled && !done) {
-		/*
-		 * range_cyclic:
-		 * We hit the last page and there is more work to be done: wrap
-		 * back to the start of the file
-		 */
-		cycled = 1;
-		index = 0;
-		end = writeback_index - 1;
-		goto retry;
-	}
+
+	/*
+	 * If we hit the last page and there is more work to be done: wrap
+	 * back the index back to the start of the file for the next
+	 * time we are called.
+	 */
+	if (wbc->range_cyclic && !done)
+		done_index = 0;
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = done_index;