Message ID | 20190415163210.17463-2-mfo@canonical.com |
---|---|
State | New |
Headers | show |
Series | Fix write()/fsync() deadlock in write_cache_pages() | expand |
On 15.04.19 18:32, Mauricio Faria de Oliveira wrote: > From: Dave Chinner <dchinner@redhat.com> > > BugLink: https://bugs.launchpad.net/bugs/1824827 > > We've recently seen a workload on XFS filesystems with a repeatable > deadlock between background writeback and a multi-process application > doing concurrent writes and fsyncs to a small range of a file. > > range_cyclic > writeback Process 1 Process 2 > > xfs_vm_writepages > write_cache_pages > writeback_index = 2 > cycled = 0 > .... > find page 2 dirty > lock Page 2 > ->writepage > page 2 writeback > page 2 clean > page 2 added to bio > no more pages > write() > locks page 1 > dirties page 1 > locks page 2 > dirties page 1 > fsync() > .... > xfs_vm_writepages > write_cache_pages > start index 0 > find page 1 towrite > lock Page 1 > ->writepage > page 1 writeback > page 1 clean > page 1 added to bio > find page 2 towrite > lock Page 2 > page 2 is writeback > <blocks> > write() > locks page 1 > dirties page 1 > fsync() > .... > xfs_vm_writepages > write_cache_pages > start index 0 > > !done && !cycled > sets index to 0, restarts lookup > find page 1 dirty > find page 1 towrite > lock Page 1 > page 1 is writeback > <blocks> > > lock Page 1 > <blocks> > > DEADLOCK because: > > - process 1 needs page 2 writeback to complete to make > enough progress to issue IO pending for page 1 > - writeback needs page 1 writeback to complete so process 2 > can progress and unlock the page it is blocked on, then it > can issue the IO pending for page 2 > - process 2 can't make progress until process 1 issues IO > for page 1 > > The underlying cause of the problem here is that range_cyclic writeback is > processing pages in descending index order as we hold higher index pages > in a structure controlled from above write_cache_pages(). The > write_cache_pages() caller needs to be able to submit these pages for IO > before write_cache_pages restarts writeback at mapping index 0 to avoid > wcp inverting the page lock/writeback wait order. > > generic_writepages() is not susceptible to this bug as it has no private > context held across write_cache_pages() - filesystems using this > infrastructure always submit pages in ->writepage immediately and so there > is no problem with range_cyclic going back to mapping index 0. > > However: > mpage_writepages() has a private bio context, > exofs_writepages() has page_collect > fuse_writepages() has fuse_fill_wb_data > nfs_writepages() has nfs_pageio_descriptor > xfs_vm_writepages() has xfs_writepage_ctx > > All of these ->writepages implementations can hold pages under writeback > in their private structures until write_cache_pages() returns, and hence > they are all susceptible to this deadlock. > > Also worth noting is that ext4 has it's own bastardised version of > write_cache_pages() and so it /may/ have an equivalent deadlock. I looked > at the code long enough to understand that it has a similar retry loop for > range_cyclic writeback reaching the end of the file and then promptly ran > away before my eyes bled too much. I'll leave it for the ext4 developers > to determine if their code is actually has this deadlock and how to fix it > if it has. > > There's a few ways I can see avoid this deadlock. There's probably more, > but these are the first I've though of: > > 1. get rid of range_cyclic altogether > > 2. range_cyclic always stops at EOF, and we start again from > writeback index 0 on the next call into write_cache_pages() > > 2a. wcp also returns EAGAIN to ->writepages implementations to > indicate range cyclic has hit EOF. writepages implementations can > then flush the current context and call wpc again to continue. i.e. > lift the retry into the ->writepages implementation > > 3. range_cyclic uses trylock_page() rather than lock_page(), and it > skips pages it can't lock without blocking. It will already do this > for pages under writeback, so this seems like a no-brainer > > 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid > blocking as per pages under writeback. > > I don't think #1 is an option - range_cyclic prevents frequently > dirtied lower file offset from starving background writeback of > rarely touched higher file offsets. > > #2 is simple, and I don't think it will have any impact on > performance as going back to the start of the file implies an > immediate seek. We'll have exactly the same number of seeks if we > switch writeback to another inode, and then come back to this one > later and restart from index 0. > > #2a is pretty much "status quo without the deadlock". Moving the > retry loop up into the wcp caller means we can issue IO on the > pending pages before calling wcp again, and so avoid locking or > waiting on pages in the wrong order. I'm not convinced we need to do > this given that we get the same thing from #2 on the next writeback > call from the writeback infrastructure. > > #3 is really just a band-aid - it doesn't fix the access/wait > inversion problem, just prevents it from becoming a deadlock > situation. I'd prefer we fix the inversion, not sweep it under the > carpet like this. > > #3a is really an optimisation that just so happens to include the > band-aid fix of #3. > > So it seems that the simplest way to fix this issue is to implement > solution #2 > > Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com > Signed-off-by: Dave Chinner <dchinner@redhat.com> > Reviewed-by: Jan Kara <jack@suse.de> > Cc: Nicholas Piggin <npiggin@gmail.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > (cherry picked from commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7) > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> > --- Good testing coverage and specific testcase for the issue. > mm/page-writeback.c | 33 +++++++++++++++------------------ > 1 file changed, 15 insertions(+), 18 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 337c6afb3345..c10e70ed3515 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2150,6 +2150,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback); > * not miss some pages (e.g., because some other process has cleared TOWRITE > * tag we set). The rule we follow is that TOWRITE tag can be cleared only > * by the process clearing the DIRTY tag (and submitting the page for IO). > + * > + * To avoid deadlocks between range_cyclic writeback and callers that hold > + * pages in PageWriteback to aggregate IO until write_cache_pages() returns, > + * we do not loop back to the start of the file. Doing so causes a page > + * lock/page writeback access order inversion - we should only ever lock > + * multiple pages in ascending page->index order, and looping back to the start > + * of the file violates that rule and causes deadlocks. > */ > int write_cache_pages(struct address_space *mapping, > struct writeback_control *wbc, writepage_t writepage, > @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping, > pgoff_t index; > pgoff_t end; /* Inclusive */ > pgoff_t done_index; > - int cycled; > int range_whole = 0; > int tag; > > @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping, > if (wbc->range_cyclic) { > writeback_index = mapping->writeback_index; /* prev offset */ > index = writeback_index; > - if (index == 0) > - cycled = 1; > - else > - cycled = 0; > end = -1; > } else { > index = wbc->range_start >> PAGE_SHIFT; > end = wbc->range_end >> PAGE_SHIFT; > if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) > range_whole = 1; > - cycled = 1; /* ignore range_cyclic tests */ > } > if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) > tag = PAGECACHE_TAG_TOWRITE; > else > tag = PAGECACHE_TAG_DIRTY; > -retry: > if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) > tag_pages_for_writeback(mapping, index, end); > done_index = index; > @@ -2273,17 +2273,14 @@ int write_cache_pages(struct address_space *mapping, > pagevec_release(&pvec); > cond_resched(); > } > - if (!cycled && !done) { > - /* > - * range_cyclic: > - * We hit the last page and there is more work to be done: wrap > - * back to the start of the file > - */ > - cycled = 1; > - index = 0; > - end = writeback_index - 1; > - goto retry; > - } > + > + /* > + * If we hit the last page and there is more work to be done: wrap > + * back the index back to the start of the file for the next > + * time we are called. > + */ > + if (wbc->range_cyclic && !done) > + done_index = 0; > if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) > mapping->writeback_index = done_index; > >
On 4/15/19 6:32 PM, Mauricio Faria de Oliveira wrote: > From: Dave Chinner <dchinner@redhat.com> > > BugLink: https://bugs.launchpad.net/bugs/1824827 > > We've recently seen a workload on XFS filesystems with a repeatable > deadlock between background writeback and a multi-process application > doing concurrent writes and fsyncs to a small range of a file. > > range_cyclic > writeback Process 1 Process 2 > > xfs_vm_writepages > write_cache_pages > writeback_index = 2 > cycled = 0 > .... > find page 2 dirty > lock Page 2 > ->writepage > page 2 writeback > page 2 clean > page 2 added to bio > no more pages > write() > locks page 1 > dirties page 1 > locks page 2 > dirties page 1 > fsync() > .... > xfs_vm_writepages > write_cache_pages > start index 0 > find page 1 towrite > lock Page 1 > ->writepage > page 1 writeback > page 1 clean > page 1 added to bio > find page 2 towrite > lock Page 2 > page 2 is writeback > <blocks> > write() > locks page 1 > dirties page 1 > fsync() > .... > xfs_vm_writepages > write_cache_pages > start index 0 > > !done && !cycled > sets index to 0, restarts lookup > find page 1 dirty > find page 1 towrite > lock Page 1 > page 1 is writeback > <blocks> > > lock Page 1 > <blocks> > > DEADLOCK because: > > - process 1 needs page 2 writeback to complete to make > enough progress to issue IO pending for page 1 > - writeback needs page 1 writeback to complete so process 2 > can progress and unlock the page it is blocked on, then it > can issue the IO pending for page 2 > - process 2 can't make progress until process 1 issues IO > for page 1 > > The underlying cause of the problem here is that range_cyclic writeback is > processing pages in descending index order as we hold higher index pages > in a structure controlled from above write_cache_pages(). The > write_cache_pages() caller needs to be able to submit these pages for IO > before write_cache_pages restarts writeback at mapping index 0 to avoid > wcp inverting the page lock/writeback wait order. > > generic_writepages() is not susceptible to this bug as it has no private > context held across write_cache_pages() - filesystems using this > infrastructure always submit pages in ->writepage immediately and so there > is no problem with range_cyclic going back to mapping index 0. > > However: > mpage_writepages() has a private bio context, > exofs_writepages() has page_collect > fuse_writepages() has fuse_fill_wb_data > nfs_writepages() has nfs_pageio_descriptor > xfs_vm_writepages() has xfs_writepage_ctx > > All of these ->writepages implementations can hold pages under writeback > in their private structures until write_cache_pages() returns, and hence > they are all susceptible to this deadlock. > > Also worth noting is that ext4 has it's own bastardised version of > write_cache_pages() and so it /may/ have an equivalent deadlock. I looked > at the code long enough to understand that it has a similar retry loop for > range_cyclic writeback reaching the end of the file and then promptly ran > away before my eyes bled too much. I'll leave it for the ext4 developers > to determine if their code is actually has this deadlock and how to fix it > if it has. > > There's a few ways I can see avoid this deadlock. There's probably more, > but these are the first I've though of: > > 1. get rid of range_cyclic altogether > > 2. range_cyclic always stops at EOF, and we start again from > writeback index 0 on the next call into write_cache_pages() > > 2a. wcp also returns EAGAIN to ->writepages implementations to > indicate range cyclic has hit EOF. writepages implementations can > then flush the current context and call wpc again to continue. i.e. > lift the retry into the ->writepages implementation > > 3. range_cyclic uses trylock_page() rather than lock_page(), and it > skips pages it can't lock without blocking. It will already do this > for pages under writeback, so this seems like a no-brainer > > 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid > blocking as per pages under writeback. > > I don't think #1 is an option - range_cyclic prevents frequently > dirtied lower file offset from starving background writeback of > rarely touched higher file offsets. > > #2 is simple, and I don't think it will have any impact on > performance as going back to the start of the file implies an > immediate seek. We'll have exactly the same number of seeks if we > switch writeback to another inode, and then come back to this one > later and restart from index 0. > > #2a is pretty much "status quo without the deadlock". Moving the > retry loop up into the wcp caller means we can issue IO on the > pending pages before calling wcp again, and so avoid locking or > waiting on pages in the wrong order. I'm not convinced we need to do > this given that we get the same thing from #2 on the next writeback > call from the writeback infrastructure. > > #3 is really just a band-aid - it doesn't fix the access/wait > inversion problem, just prevents it from becoming a deadlock > situation. I'd prefer we fix the inversion, not sweep it under the > carpet like this. > > #3a is really an optimisation that just so happens to include the > band-aid fix of #3. > > So it seems that the simplest way to fix this issue is to implement > solution #2 > > Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com > Signed-off-by: Dave Chinner <dchinner@redhat.com> > Reviewed-by: Jan Kara <jack@suse.de> > Cc: Nicholas Piggin <npiggin@gmail.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > (cherry picked from commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7) > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Clean cherry-pick, the fix has been stress-tested and has a way to reproduce. Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> > --- > mm/page-writeback.c | 33 +++++++++++++++------------------ > 1 file changed, 15 insertions(+), 18 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 337c6afb3345..c10e70ed3515 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2150,6 +2150,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback); > * not miss some pages (e.g., because some other process has cleared TOWRITE > * tag we set). The rule we follow is that TOWRITE tag can be cleared only > * by the process clearing the DIRTY tag (and submitting the page for IO). > + * > + * To avoid deadlocks between range_cyclic writeback and callers that hold > + * pages in PageWriteback to aggregate IO until write_cache_pages() returns, > + * we do not loop back to the start of the file. Doing so causes a page > + * lock/page writeback access order inversion - we should only ever lock > + * multiple pages in ascending page->index order, and looping back to the start > + * of the file violates that rule and causes deadlocks. > */ > int write_cache_pages(struct address_space *mapping, > struct writeback_control *wbc, writepage_t writepage, > @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping, > pgoff_t index; > pgoff_t end; /* Inclusive */ > pgoff_t done_index; > - int cycled; > int range_whole = 0; > int tag; > > @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping, > if (wbc->range_cyclic) { > writeback_index = mapping->writeback_index; /* prev offset */ > index = writeback_index; > - if (index == 0) > - cycled = 1; > - else > - cycled = 0; > end = -1; > } else { > index = wbc->range_start >> PAGE_SHIFT; > end = wbc->range_end >> PAGE_SHIFT; > if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) > range_whole = 1; > - cycled = 1; /* ignore range_cyclic tests */ > } > if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) > tag = PAGECACHE_TAG_TOWRITE; > else > tag = PAGECACHE_TAG_DIRTY; > -retry: > if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) > tag_pages_for_writeback(mapping, index, end); > done_index = index; > @@ -2273,17 +2273,14 @@ int write_cache_pages(struct address_space *mapping, > pagevec_release(&pvec); > cond_resched(); > } > - if (!cycled && !done) { > - /* > - * range_cyclic: > - * We hit the last page and there is more work to be done: wrap > - * back to the start of the file > - */ > - cycled = 1; > - index = 0; > - end = writeback_index - 1; > - goto retry; > - } > + > + /* > + * If we hit the last page and there is more work to be done: wrap > + * back the index back to the start of the file for the next > + * time we are called. > + */ > + if (wbc->range_cyclic && !done) > + done_index = 0; > if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) > mapping->writeback_index = done_index; >
diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 337c6afb3345..c10e70ed3515 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2150,6 +2150,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback); * not miss some pages (e.g., because some other process has cleared TOWRITE * tag we set). The rule we follow is that TOWRITE tag can be cleared only * by the process clearing the DIRTY tag (and submitting the page for IO). + * + * To avoid deadlocks between range_cyclic writeback and callers that hold + * pages in PageWriteback to aggregate IO until write_cache_pages() returns, + * we do not loop back to the start of the file. Doing so causes a page + * lock/page writeback access order inversion - we should only ever lock + * multiple pages in ascending page->index order, and looping back to the start + * of the file violates that rule and causes deadlocks. */ int write_cache_pages(struct address_space *mapping, struct writeback_control *wbc, writepage_t writepage, @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping, pgoff_t index; pgoff_t end; /* Inclusive */ pgoff_t done_index; - int cycled; int range_whole = 0; int tag; @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mapping, if (wbc->range_cyclic) { writeback_index = mapping->writeback_index; /* prev offset */ index = writeback_index; - if (index == 0) - cycled = 1; - else - cycled = 0; end = -1; } else { index = wbc->range_start >> PAGE_SHIFT; end = wbc->range_end >> PAGE_SHIFT; if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; - cycled = 1; /* ignore range_cyclic tests */ } if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) tag = PAGECACHE_TAG_TOWRITE; else tag = PAGECACHE_TAG_DIRTY; -retry: if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) tag_pages_for_writeback(mapping, index, end); done_index = index; @@ -2273,17 +2273,14 @@ int write_cache_pages(struct address_space *mapping, pagevec_release(&pvec); cond_resched(); } - if (!cycled && !done) { - /* - * range_cyclic: - * We hit the last page and there is more work to be done: wrap - * back to the start of the file - */ - cycled = 1; - index = 0; - end = writeback_index - 1; - goto retry; - } + + /* + * If we hit the last page and there is more work to be done: wrap + * back the index back to the start of the file for the next + * time we are called. + */ + if (wbc->range_cyclic && !done) + done_index = 0; if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) mapping->writeback_index = done_index;