Patchwork fix for slow unmount on ext4 - bug 543617 on lp

Submitter Stefan Bader
Date June 2, 2010, 1:09 p.m.
Message ID <1275484147-26044-1-git-send-email-stefan.bader@canonical.com>
Permalink /patch/54369/
State Rejected
Delegated to: Stefan Bader

Comments

Stefan Bader - June 2, 2010, 1:09 p.m.
This is a followup on a previous post which basically consists of the
revert of 5e1941884c700b7b97bcad52c2d8212ac56a7ebc
 "UBUNTU: SAUCE: sync before umount to reduce time taken by ext4 umount"
and then applying a patch from bugzilla
 "writeback: fix __sync_filesystem(sb, 0) on umount"

However the second patch never went upstream, while the following two
(backported to 2.6.32.y) made it. Sadly, when asking Jens Axboe about
forwarding those to stable, I got the following response:

Jens Axboe wrote:
> > I would send the backports to stable but wanted to check with you
> > beforehand. Would you be ok with a stable submission of these two?

> I would have said yes a few days ago, but Christoph claims that the
> patches make xfs test hang. Unfortunately I'm a bit between jobs and
> don't have proper testing equipment right now, so I had no other option
> than revert exactly those two commits...

Now the question is whether we want to go ahead (because we need to
do the revert as that causes other problems) with the upstream version
that might cause xfs problems (assuming xfs is not our main fs) or
take the patch from bugzilla which might have the same problem or others
yet unknown.

-Stefan

----

From: Dmitry Monakhov <dmonakhov@openvz.org>

BugLink: http://launchpad.net/bugs/543617

__sync_filesystem(sb, 0) no longer works on umount because sb cannot
be pinned: the s_umount sem is downed for write and s_root is NULL.
In fact umount is a special case similar to WB_SYNC_ALL, where
sb is pinned already.

BADCOMMIT: 03ba3782e8dcc5b0e1efe440d33084f066e38cae

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
(cherry-picked from https://bugzilla.kernel.org/attachment.cgi?id=26224)

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
---
 fs/fs-writeback.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)
Stefan Bader - June 7, 2010, 12:10 p.m.
Ok, so upstream seems to have officially reverted the changes for now. Given
that the changes in that area seem to have unexpected side effects, we do not
really know that the other patch would not do the same. When we reverted the
work-around we got huge sync times. IIRC the side effects of that patch were not
as bad timewise (2 seconds?). So maybe, for now, we stay with what we got and
wait for the proper upstream solution?

Stefan

On 06/02/2010 03:09 PM, Stefan Bader wrote:
> This is a followup on a previous post which basically consists of the
> revert of 5e1941884c700b7b97bcad52c2d8212ac56a7ebc
>  "UBUNTU: SAUCE: sync before umount to reduce time taken by ext4 umount"
> and then applying a patch from bugzilla
>  "writeback: fix __sync_filesystem(sb, 0) on umount"
> 
> However the second patch never went upstream, while the following two
> (backported to 2.6.32.y) made it. Sadly, when asking Jens Axboe about
> forwarding those to stable, I got the following response:
> 
> Jens Axboe wrote:
>>> I would send the backports to stable but wanted to check with you
>>> beforehand. Would you be ok with a stable submission of these two?
> 
>> I would have said yes a few days ago, but Christoph claims that the
>> patches make xfs test hang. Unfortunately I'm a bit between jobs and
>> don't have proper testing equipment right now, so I had no other option
>> than revert exactly those two commits...
> 
> Now the question is whether we want to go ahead (because we need to
> do the revert as that causes other problems) with the upstream version
> that might cause xfs problems (assuming xfs is not our main fs) or
> take the patch from bugzilla which might have the same problem or others
> yet unknown.
> 
> -Stefan
> 
> ----
> 
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> BugLink: http://launchpad.net/bugs/543617
> 
> __sync_filesystem(sb, 0) no longer works on umount because sb cannot
> be pinned: the s_umount sem is downed for write and s_root is NULL.
> In fact umount is a special case similar to WB_SYNC_ALL, where
> sb is pinned already.
> 
> BADCOMMIT: 03ba3782e8dcc5b0e1efe440d33084f066e38cae
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> (cherry-picked from https://bugzilla.kernel.org/attachment.cgi?id=26224)
> 
> Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
> ---
>  fs/fs-writeback.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 4102f20..937a8ae 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -583,9 +583,9 @@ static int pin_sb_for_writeback(struct writeback_control *wbc,
>  		unpin_sb_for_writeback(psb);
>  
>  	/*
> -	 * Caller must already hold the ref for this
> +	 * Caller must already hold the ref for integrity sync, and on umount
>  	 */
> -	if (wbc->sync_mode == WB_SYNC_ALL) {
> +	if (wbc->sync_mode == WB_SYNC_ALL || !sb->s_root) {
>  		WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  		return 0;
>  	}
> @@ -597,6 +597,7 @@ static int pin_sb_for_writeback(struct writeback_control *wbc,
>  			spin_unlock(&sb_lock);
>  			goto pinned;
>  		}
> +		WARN_ON(1);
>  		/*
>  		 * umounted, drop rwsem again and fall through to failure
>  		 */
> 
>
Kees Cook - June 9, 2010, 7:06 p.m.
Hi Stefan,

On Mon, Jun 07, 2010 at 02:10:43PM +0200, Stefan Bader wrote:
> Ok, so upstream seems to have officially reverted the changes for now. Given
> that the changes in that area seem to have unexpected side effects and we do not
> really know the other patch would not do the same. When reverting the
> work-around we got huge sync times. IIRC the side effects of that patch were not
> as bad timewise (2 seconds?). So maybe for now, we stay with what we got and
> wait for the proper upstream solution?

Upstream gave up on fixing the problem?  Argh.

Huge sync times in Lucid?  When what was changed?

I think we need to revert 5e1941884c700b7b97bcad52c2d8212ac56a7ebc even
if there isn't a good fix for the umount slow-down since it solves the
slow umount only under certain conditions, and makes other more common
umount scenarios slower.

-Kees
Stefan Bader - June 10, 2010, 6:57 a.m.
On 06/09/2010 09:06 PM, Kees Cook wrote:
> Hi Stefan,
> 
> On Mon, Jun 07, 2010 at 02:10:43PM +0200, Stefan Bader wrote:
>> Ok, so upstream seems to have officially reverted the changes for now. Given
>> that the changes in that area seem to have unexpected side effects and we do not
>> really know the other patch would not do the same. When reverting the
>> work-around we got huge sync times. IIRC the side effects of that patch were not
>> as bad timewise (2 seconds?). So maybe for now, we stay with what we got and
>> wait for the proper upstream solution?
> 
> Upstream gave up on fixing the problem?  Argh.
> 

No, they have not given up, but they currently need to come up with a better solution.

> Huge sync times in Lucid?  When what was changed?
> 

Err, wasn't that the reason for the patch below in the first place? Maybe my
description was not completely accurate. It took a very long time to unmount. I
might be wrong, but in my memory that was as bad as 30 minutes in bad cases.
The reason is that the sync is done synchronously.

> I think we need to revert 5e1941884c700b7b97bcad52c2d8212ac56a7ebc even
> if there isn't a good fix for the umount slow-down since it solves the
> slow umount only under certain conditions, and makes other more common
> umount scenarios slower.
> 

The question would be how much slower. If it is possible to live with the
current status a little longer, then I'd rather wait for the correct upstream
fix than revert this patch and see one regression fixed by another one coming
back.

Stefan

> -Kees
>
Kees Cook - June 10, 2010, 7:13 p.m.
On Thu, Jun 10, 2010 at 08:57:55AM +0200, Stefan Bader wrote:
> Err, wasn't that the reason for the patch below in the first place? Maybe my
> description was not completely accurate. It was very long time to unmount. I
> might be wrong but in my memory that was as bad as 30m in bad cases.
> For which the reason is that sync is done syncronously.

Without the forced-sync-on-umount work-around patch:
- umount of a single filesystem with barrier support on an idle system can
  take upwards of 15 minutes.
- umount of a single filesystem with barrier support on a busy system can
  take upwards of 4 hours.
- umount of everything else is fast and normal (less than a hundredth of
  a second to umount a bind mount), regardless of system IO load.

With patch:
- umount of a single filesystem with barrier support on an idle system
  will take as long as a normal sync and umount (i.e. fast and normal).
- umount of everything else on an idle system takes as long as a sync
  and umount (raises trivial bind umounts to at least 1 second).
- umount of everything else on a busy system takes as long as a sync
  of the entire system and umount of the filesystem (which can sometimes
  take upwards of 4 hours due to the original problem of dirty umounting
  of filesystems with barrier support).

The addition of the sync workaround means that busy systems will have
umount blocked for as long as _all_ sync activity takes on the given
system.  I'll leave it up to you, but for me in most cases, the patch
made my system significantly worse than without it.

-Kees
Daniel J Blueman - June 12, 2010, 2:18 p.m.
On Wed, Jun 9, 2010 at 8:06 PM, Kees Cook <kees.cook@canonical.com> wrote:
> Hi Stefan,
>
> On Mon, Jun 07, 2010 at 02:10:43PM +0200, Stefan Bader wrote:
>> Ok, so upstream seems to have officially reverted the changes for now. Given
>> that the changes in that area seem to have unexpected side effects and we do not
>> really know the other patch would not do the same. When reverting the
>> work-around we got huge sync times. IIRC the side effects of that patch were not
>> as bad timewise (2 seconds?). So maybe for now, we stay with what we got and
>> wait for the proper upstream solution?
>
> Upstream gave up on fixing the problem?  Argh.
>
> Huge sync times in Lucid?  When what was changed?
>
> I think we need to revert 5e1941884c700b7b97bcad52c2d8212ac56a7ebc even
> if there isn't a good fix for the umount slow-down since it solves the
> slow umount only under certain conditions, and makes other more common
> umount scenarios slower.

Dave Chinner just committed a ratified fix [1] for this upstream,
which may be a good cherry-pick.

Dan

--- [1]

commit d87815cb2090e07b0b0b2d73dc9740706e92c80c
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Jun 9 10:37:20 2010 +1000

    writeback: limit write_cache_pages integrity scanning to current EOF

    sync can currently take a really long time if a concurrent writer is
    extending a file. The problem is that the dirty pages on the address
    space grow in the same direction as write_cache_pages scans, so if
    the writer keeps ahead of writeback, the writeback will not
    terminate until the writer stops adding dirty pages.

    For a data integrity sync, we only need to write the pages dirty at
    the time we start the writeback, so we can stop scanning once we get
    to the page that was at the end of the file at the time the scan
    started.

    This will prevent operations like copying a large file preventing
    sync from completing as it will not write back pages that were
    dirtied after the sync was started. This does not impact the
    existing integrity guarantees, as any dirty page (old or new)
    within the EOF range at the start of the scan will still be
    captured.

    This patch will not prevent sync from blocking on large writes into
    holes. That requires more complex intervention while this patch only
    addresses the common append-case of this sync holdoff.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stefan Bader - June 14, 2010, 7:58 a.m.
On 06/12/2010 04:18 PM, Daniel J Blueman wrote:
> On Wed, Jun 9, 2010 at 8:06 PM, Kees Cook <kees.cook@canonical.com> wrote:
>> Hi Stefan,
>>
>> On Mon, Jun 07, 2010 at 02:10:43PM +0200, Stefan Bader wrote:
>>> Ok, so upstream seems to have officially reverted the changes for now. Given
>>> that the changes in that area seem to have unexpected side effects and we do not
>>> really know the other patch would not do the same. When reverting the
>>> work-around we got huge sync times. IIRC the side effects of that patch were not
>>> as bad timewise (2 seconds?). So maybe for now, we stay with what we got and
>>> wait for the proper upstream solution?
>>
>> Upstream gave up on fixing the problem?  Argh.
>>
>> Huge sync times in Lucid?  When what was changed?
>>
>> I think we need to revert 5e1941884c700b7b97bcad52c2d8212ac56a7ebc even
>> if there isn't a good fix for the umount slow-down since it solves the
>> slow umount only under certain conditions, and makes other more common
>> umount scenarios slower.
> 
> Dave Chinner just committed a ratified fix [1] for this upstream,
> which may be a good cherry-pick.
> 
> Dan

Hi Daniel,

this looks like an additional corner case, judging from the description (I have
not yet had time to look at the actual patch). The case we are facing is that
the sync on umount used to be done as an asynchronous pass, followed by a
synchronous one. This changed so that the sync on umount is done only
synchronously, which waits on every single page.
But the patch you pointed to is definitely something to keep in mind as well.

Cheers,
Stefan
> 
> --- [1]
> 
> commit d87815cb2090e07b0b0b2d73dc9740706e92c80c
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Wed Jun 9 10:37:20 2010 +1000
> 
>     writeback: limit write_cache_pages integrity scanning to current EOF
> 
>     sync can currently take a really long time if a concurrent writer is
>     extending a file. The problem is that the dirty pages on the address
>     space grow in the same direction as write_cache_pages scans, so if
>     the writer keeps ahead of writeback, the writeback will not
>     terminate until the writer stops adding dirty pages.
> 
>     For a data integrity sync, we only need to write the pages dirty at
>     the time we start the writeback, so we can stop scanning once we get
>     to the page that was at the end of the file at the time the scan
>     started.
> 
>     This will prevent operations like copying a large file preventing
>     sync from completing as it will not write back pages that were
>     dirtied after the sync was started. This does not impact the
>     existing integrity guarantees, as any dirty page (old or new)
>     within the EOF range at the start of the scan will still be
>     captured.
> 
>     This patch will not prevent sync from blocking on large writes into
>     holes. That requires more complex intervention while this patch only
>     addresses the common append-case of this sync holdoff.
> 
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stefan Bader - Aug. 3, 2010, 4:49 p.m.
The patches discussed here were causing regressions and have been reverted
upstream. By now there is a new set (which is even bigger). I am waiting for
some feedback on a backport I did for 2.6.32.y from upstream.

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4102f20..937a8ae 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -583,9 +583,9 @@  static int pin_sb_for_writeback(struct writeback_control *wbc,
 		unpin_sb_for_writeback(psb);
 
 	/*
-	 * Caller must already hold the ref for this
+	 * Caller must already hold the ref for integrity sync, and on umount
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (wbc->sync_mode == WB_SYNC_ALL || !sb->s_root) {
 		WARN_ON(!rwsem_is_locked(&sb->s_umount));
 		return 0;
 	}
@@ -597,6 +597,7 @@  static int pin_sb_for_writeback(struct writeback_control *wbc,
 			spin_unlock(&sb_lock);
 			goto pinned;
 		}
+		WARN_ON(1);
 		/*
 		 * umounted, drop rwsem again and fall through to failure
 		 */