[v3,4/6] ext4: change lru to round-robin in extent status tree shrinker

Message ID 20140903033738.GB2504@thunk.org
State New, archived

Commit Message

Theodore Ts'o Sept. 3, 2014, 3:37 a.m. UTC
On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:51, Zheng Liu wrote:
>   This comment is not directly related to this patch, but looking into
> the code made me think about it. It seems ugly to call __es_shrink() from
> the internals of ext4_es_insert_extent(). Also, thinking about the locking
> implications makes me shudder a bit, and finally, this may make the
> pressure on the extent cache artificially bigger because the MM subsystem
> is not aware of the shrinking you do here. I would prefer to leave the
> shrinking to the slab subsystem itself.

If we fail the allocation, we only try to free at most one extent, so
I don't think it's going to confuse the slab subsystem that much; it's
the equivalent of freeing an entry and then allocating it again.

> Now, the GFP_ATOMIC allocation we use for the extent cache makes things
> hard for the slab subsystem, and actually we could fairly easily use
> GFP_NOFS. We can just allocate the structure before grabbing i_es_lock
> with a GFP_NOFS allocation, and in case we don't need the structure, we
> can just free it again. It may introduce some overhead from unnecessary
> alloc/free, but things get simpler that way (no need for that locked_ei
> argument for __es_shrink(), no need for internal calls to __es_shrink()
> from within the filesystem).
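
In other words, something like this (a sketch only; the condition and
helper names below are hypothetical, while ext4_es_cachep and i_es_lock
are the real ones):

	struct extent_status *es;

	/* reserve the node with GFP_NOFS while no spinlocks are held */
	es = kmem_cache_alloc(ext4_es_cachep, GFP_NOFS);
	if (!es)
		return -ENOMEM;

	write_lock(&EXT4_I(inode)->i_es_lock);
	if (range_already_covered)			/* hypothetical */
		kmem_cache_free(ext4_es_cachep, es);	/* unused, free it */
	else
		es_link_preallocated_node(es);		/* hypothetical */
	write_unlock(&EXT4_I(inode)->i_es_lock);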

The tricky bit is that even __es_remove_extent() can require a memory
allocation, and in the worst case, it's possible that
ext4_es_insert_extent() can require *two* allocations.  For example,
if you start with a single large extent and then need to insert a
subregion with a different set of flags into the already existing
extent, you end up with three extents where you started with one.

And in some cases, no allocation is required at all....

One thing that helps is that so long as we haven't done something
critical, such as erasing a delalloc region, we can always release the
write lock, retry the allocation with GFP_NOFS, and then try the
operation again.
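
Roughly, the retry dance would look like this (a simplified sketch with
variables as in extents_status.c, not the actual code; the "something
critical" test is elided):

retry:
	write_lock(&ei->i_es_lock);
	err = __es_insert_extent(inode, &newes);  /* GFP_ATOMIC inside */
	write_unlock(&ei->i_es_lock);
	if (err == -ENOMEM) {
		/*
		 * Nothing destructive has happened yet, so we may block:
		 * warm up the slab with a GFP_NOFS allocation, give the
		 * object back, and redo the whole operation.
		 */
		es = kmem_cache_alloc(ext4_es_cachep, GFP_NOFS);
		if (!es)
			return -ENOMEM;
		kmem_cache_free(ext4_es_cachep, es);
		goto retry;
	}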

So we may need to think a bit about the best way to improve this,
although it is a separate topic from making the shrinker less
heavyweight.

>   Nothing seems to prevent reclaim from freeing the inode after we drop
> s_es_lock, so we could end up using freed memory. I don't think we want to
> pin the inode here by grabbing a refcount, since we don't want to deal with
> iput() in the shrinker (that could mean having to delete the inode from
> shrinker context). But what we could do is grab ei->i_es_lock before
> dropping s_es_lock. Since ext4_es_remove_extent() called from
> ext4_clear_inode() always grabs i_es_lock, we are protected from the inode
> being freed while we hold that lock. But please add comments about this to
> both __es_shrink() and ext4_es_remove_extent().

Something like this should work, yes?

						- Ted

Comments

Jan Kara Sept. 3, 2014, 3:31 p.m. UTC | #1
On Tue 02-09-14 23:37:38, Ted Tso wrote:
> On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> > On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> >   This comment is not directly related to this patch, but looking into
> > the code made me think about it. It seems ugly to call __es_shrink()
> > from the internals of ext4_es_insert_extent(). Also, thinking about the
> > locking implications makes me shudder a bit, and finally, this may make
> > the pressure on the extent cache artificially bigger because the MM
> > subsystem is not aware of the shrinking you do here. I would prefer to
> > leave the shrinking to the slab subsystem itself.
> 
> If we fail the allocation, we only try to free at most one extent, so
> I don't think it's going to confuse the slab subsystem that much; it's
> the equivalent of freeing an entry and then allocating it again.
> 
> > Now, the GFP_ATOMIC allocation we use for the extent cache makes things
> > hard for the slab subsystem, and actually we could fairly easily use
> > GFP_NOFS. We can just allocate the structure before grabbing i_es_lock
> > with a GFP_NOFS allocation, and in case we don't need the structure, we
> > can just free it again. It may introduce some overhead from unnecessary
> > alloc/free, but things get simpler that way (no need for that locked_ei
> > argument for __es_shrink(), no need for internal calls to __es_shrink()
> > from within the filesystem).
> 
> The tricky bit is that even __es_remove_extent() can require a memory
> allocation, and in the worst case, it's possible that
> ext4_es_insert_extent() can require *two* allocations.  For example,
> if you start with a single large extent and then need to insert a
> subregion with a different set of flags into the already existing
> extent, you end up with three extents where you started with one.
  Right, I didn't realize that.

> And in some cases, no allocation is required at all....
> 
> One thing that helps is that so long as we haven't done something
> critical, such as erasing a delalloc region, we can always release the
> write lock, retry the allocation with GFP_NOFS, and then try the
> operation again.
  Yeah, maybe we could use mempools for this. It should make the code less
clumsy.
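
  For reference, a mempool version could look like this (a sketch; the
pool name and reserve size are made up, while mempool_create_slab_pool()
and friends are the real API):

	#include <linux/mempool.h>

	static mempool_t *ext4_es_pool;		/* hypothetical name */

	/* at init time: guarantee a reserve of 16 extent_status objects */
	ext4_es_pool = mempool_create_slab_pool(16, ext4_es_cachep);

	/* allocations then dip into the reserve instead of failing */
	es = mempool_alloc(ext4_es_pool, GFP_NOFS);
	...
	mempool_free(es, ext4_es_pool);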

> So we may need to think a bit about the best way to improve this,
> although it is a separate topic from making the shrinker less
> heavyweight.
  Agreed, it's a separate topic.

> >   Nothing seems to prevent reclaim from freeing the inode after we drop
> > s_es_lock. So we could use freed memory. I don't think we want to pin the
> > inode here by grabbing a refcount since we don't want to deal with iput()
> > in the shrinker (that could mean having to delete the inode from shrinker
> > context). But what we could do it to grab ei->i_es_lock before dropping
> > s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
> > always grabs i_es_lock, we are protected from inode being freed while we
> > hold that lock. But please add comments about this both to the
> > __es_shrink() and ext4_es_remove_extent().
> 
> Something like this should work, yes?
  Yes, this should work. I would just add a comment to
ext4_es_remove_extent() about the fact that ext4_clear_inode() requires
grabbing i_es_lock, so that we don't do some clever optimization in the
future and break these lifetime rules...

Also one question:

> -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> -			nr_skipped++;
> -			spin_lock(&sbi->s_es_lock);
>  			__ext4_es_list_add(sbi, ei);
> +			if (spin_is_contended(&sbi->s_es_lock)) {
> +				spin_unlock(&sbi->s_es_lock);
> +				spin_lock(&sbi->s_es_lock);
> +			}
  Why not cond_resched_lock(&sbi->s_es_lock)?
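
  For reference, cond_resched_lock() is roughly the unlock/relock dance
above plus a voluntary reschedule; a simplified expansion:

	if (need_resched() || spin_needbreak(&sbi->s_es_lock)) {
		spin_unlock(&sbi->s_es_lock);
		cond_resched();		/* may actually schedule here */
		spin_lock(&sbi->s_es_lock);
	}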


								Honza
Theodore Ts'o Sept. 3, 2014, 8 p.m. UTC | #2
On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> Also one question:
> 
> > -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > -			nr_skipped++;
> > -			spin_lock(&sbi->s_es_lock);
> >  			__ext4_es_list_add(sbi, ei);
> > +			if (spin_is_contended(&sbi->s_es_lock)) {
> > +				spin_unlock(&sbi->s_es_lock);
> > +				spin_lock(&sbi->s_es_lock);
> > +			}
>   Why not cond_resched_lock(&sbi->s_es_lock)?

I didn't think we were allowed to reschedule or sleep while in
shrinker context?

					- Ted
Jan Kara Sept. 3, 2014, 10:14 p.m. UTC | #3
On Wed 03-09-14 16:00:39, Ted Tso wrote:
> On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> > Also one question:
> > 
> > > -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > > -			nr_skipped++;
> > > -			spin_lock(&sbi->s_es_lock);
> > >  			__ext4_es_list_add(sbi, ei);
> > > +			if (spin_is_contended(&sbi->s_es_lock)) {
> > > +				spin_unlock(&sbi->s_es_lock);
> > > +				spin_lock(&sbi->s_es_lock);
> > > +			}
> >   Why not cond_resched_lock(&sbi->s_es_lock)?
> 
> I didn't think we were allowed to reschedule or sleep while in
> shrinker context?
  I believe we are allowed to sleep in the shrinker if the appropriate gfp
flags are set (__GFP_WAIT), and we enter the extent cache shrinker only if
__GFP_FS is set, which guarantees __GFP_WAIT.

								Honza
Theodore Ts'o Sept. 3, 2014, 10:38 p.m. UTC | #4
On Thu, Sep 04, 2014 at 12:14:02AM +0200, Jan Kara wrote:
> > I didn't think we were allowed to reschedule or sleep while in
> > shrinker context?
>   I believe we are allowed to sleep in the shrinker if the appropriate gfp
> flags are set (__GFP_WAIT), and we enter the extent cache shrinker only if
> __GFP_FS is set, which guarantees __GFP_WAIT.

I must be missing something.  How is this guaranteed?

I can see how we can determine what gfp_mask was used in the
allocation which triggered the shrinker callback, by checking
shrink->gfp_mask, but I don't see anything that guarantees that the
extent cache shrinker is only entered if __GFP_FS is set.

I guess we could add something like:

	if ((shrink->gfp_mask & __GFP_WAIT) == 0)
		return 0;

to the beginning of ext4_es_scan(), but we're not doing that at the
moment.
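
Note that in the current shrinker API the gfp mask is carried in struct
shrink_control rather than struct shrinker, so the guard would presumably
sit at the top of ext4_es_scan() like this (a sketch, not tested):

	static unsigned long ext4_es_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
	{
		/* bail out if the allocation that woke us cannot block */
		if (!(sc->gfp_mask & __GFP_WAIT))
			return SHRINK_STOP;
		/* ... normal scan path ... */
	}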

					- Ted
Jan Kara Sept. 4, 2014, 7:15 a.m. UTC | #5
On Wed 03-09-14 18:38:05, Ted Tso wrote:
> On Thu, Sep 04, 2014 at 12:14:02AM +0200, Jan Kara wrote:
> > > I didn't think we were allowed to reschedule or sleep while in
> > > shrinker context?
> >   I believe we are allowed to sleep in the shrinker if the appropriate
> > gfp flags are set (__GFP_WAIT), and we enter the extent cache shrinker
> > only if __GFP_FS is set, which guarantees __GFP_WAIT.
> 
> I must be missing something.  How is this guaranteed?
> 
> I can see how we can determine what gfp_mask was used in the
> allocation which triggered the shrinker callback, by checking
> shrink->gfp_mask, but I don't see anything that guarantees that the
> extent cache shrinker is only entered if __GFP_FS is set.
> 
> I guess we could add something like:
> 
> 	if ((shrink->gfp_mask & __GFP_WAIT) == 0)
> 		return 0;
> 
> to the beginning of ext4_es_scan(), but we're not doing that at the
> moment.
  Ah, sorry. I was mistaken and thought we checked for __GFP_FS in
ext4_es_scan(), but we don't, and we don't need to. But thinking about it
again - if we're always going to scan at most nr_to_scan cache entries,
there's probably no need to reduce s_es_lock latency by playing with
spin_is_contended(), right?

								Honza
Dave Chinner Sept. 4, 2014, 7:50 a.m. UTC | #6
On Wed, Sep 03, 2014 at 04:00:39PM -0400, Theodore Ts'o wrote:
> On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> > Also one question:
> > 
> > > -		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > > -			nr_skipped++;
> > > -			spin_lock(&sbi->s_es_lock);
> > >  			__ext4_es_list_add(sbi, ei);
> > > +			if (spin_is_contended(&sbi->s_es_lock)) {
> > > +				spin_unlock(&sbi->s_es_lock);
> > > +				spin_lock(&sbi->s_es_lock);
> > > +			}
> >   Why not cond_resched_lock(&sbi->s_es_lock)?
> 
> I didn't think we were allowed to reschedule or sleep while in
> shrinker context?

You are allowed to block in shrinkers if there is no possibility of
deadlock; i.e., that's what the __GFP_FS flag check in filesystem
shrinkers is for - so that they only run in GFP_KERNEL context, and
not in GFP_NOFS/GFP_NOIO/GFP_ATOMIC contexts where blocking reclaim
can cause deadlocks...
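
For reference, the flag composition that makes this work, from
include/linux/gfp.h:

	#define GFP_ATOMIC	(__GFP_HIGH)	/* no __GFP_WAIT, no __GFP_FS */
	#define GFP_NOIO	(__GFP_WAIT)
	#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
	#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)

so __GFP_FS only ever appears together with __GFP_WAIT.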

Cheers,

Dave.
Theodore Ts'o Sept. 4, 2014, 3:44 p.m. UTC | #7
On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
>   Ah, sorry. I was mistaken and thought we checked for __GFP_FS in
> ext4_es_scan(), but we don't, and we don't need to. But thinking about it
> again - if we're always going to scan at most nr_to_scan cache entries,
> there's probably no need to reduce s_es_lock latency by playing with
> spin_is_contended(), right?

I'm more generally worried about contention on s_es_lock, since it's a
file-system-wide spinlock that is grabbed whenever we need to add or
remove an inode from the es_list.  So if someone were to run the AIM7
benchmark on a large-core-count machine with an ext4 file system mounted
on a ramdisk, this lock would likely show up.

Now, this might not be a realistic scenario, but it's a common way to
test fs scalability without having a super-expensive RAID array, so
it shows up quite often in FAST papers from the last couple of years,
for example.

So my thinking was that if we do run into contention, the shrinker
thread should always yield, since if it gets slowed down slightly,
there's no harm done.  Hmmm.... OTOH, the extra cache line bounce
could potentially be worse, so maybe it would be better to let the
shrinker thread do its thing and then get out of there.

					- Ted
Jan Kara Sept. 8, 2014, 3:47 p.m. UTC | #8
On Thu 04-09-14 11:44:59, Ted Tso wrote:
> On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
> >   Ah, sorry. I was mistaken and thought we checked for __GFP_FS in
> > ext4_es_scan(), but we don't, and we don't need to. But thinking about
> > it again - if we're always going to scan at most nr_to_scan cache
> > entries, there's probably no need to reduce s_es_lock latency by playing
> > with spin_is_contended(), right?
> 
> I'm more generally worried about contention on s_es_lock, since it's a
> file-system-wide spinlock that is grabbed whenever we need to add or
> remove an inode from the es_list.  So if someone were to run the AIM7
> benchmark on a large-core-count machine with an ext4 file system mounted
> on a ramdisk, this lock would likely show up.
> 
> Now, this might not be a realistic scenario, but it's a common way to
> test fs scalability without having a super-expensive RAID array, so
> it shows up quite often in FAST papers from the last couple of years,
> for example.
> 
> So my thinking was that if we do run into contention, the shrinker
> thread should always yield, since if it gets slowed down slightly,
> there's no harm done.  Hmmm.... OTOH, the extra cache line bounce
> could potentially be worse, so maybe it would be better to let the
> shrinker thread do its thing and then get out of there.
  Yeah. I think cache-line bouncing limits scalability in a similar way the
spinlock itself does, so there's no big win in shortening the lock hold
times. If someone is concerned about the scalability of our extent cache
LRU, we could use a fancier LRU implementation like the one in
mm/list_lru.c that is used for other fs objects. But I would see that as a
separate step, and only once someone can show a benefit...
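
  A sketch of what that could look like (the per-sb field is hypothetical,
while the list_lru calls are the real API):

	#include <linux/list_lru.h>

	struct list_lru	s_es_lru;	/* hypothetical field in ext4_sb_info */

	list_lru_init(&sbi->s_es_lru);
	list_lru_add(&sbi->s_es_lru, &ei->i_es_list);	/* on first cached extent */
	list_lru_del(&sbi->s_es_lru, &ei->i_es_list);	/* when the cache empties */

The shrinker would then use list_lru_count() and list_lru_walk(), which
shard the list per NUMA node with internal locking, instead of walking one
list under a global s_es_lock.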

								Honza

Patch

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 25da1bf..4768f7f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -981,32 +981,27 @@  retry:
 
 		list_del_init(&ei->i_es_list);
 		sbi->s_es_nr_inode--;
-		spin_unlock(&sbi->s_es_lock);
+		if (ei->i_es_shk_nr == 0)
+			continue;
 
 		/*
 		 * Normally we try hard to avoid shrinking precached inodes,
 		 * but we will as a last resort.
 		 */
-		if (!retried && ext4_test_inode_state(&ei->vfs_inode,
-						EXT4_STATE_EXT_PRECACHED)) {
+		if ((!retried && ext4_test_inode_state(&ei->vfs_inode,
+				       EXT4_STATE_EXT_PRECACHED)) ||
+		    ei == locked_ei ||
+		    !write_trylock(&ei->i_es_lock)) {
 			nr_skipped++;
-			spin_lock(&sbi->s_es_lock);
-			__ext4_es_list_add(sbi, ei);
-			continue;
-		}
-
-		if (ei->i_es_shk_nr == 0) {
-			spin_lock(&sbi->s_es_lock);
-			continue;
-		}
-
-		if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
-			nr_skipped++;
-			spin_lock(&sbi->s_es_lock);
 			__ext4_es_list_add(sbi, ei);
+			if (spin_is_contended(&sbi->s_es_lock)) {
+				spin_unlock(&sbi->s_es_lock);
+				spin_lock(&sbi->s_es_lock);
+			}
 			continue;
 		}
-
+		/* we only release s_es_lock once we have i_es_lock */
+		spin_unlock(&sbi->s_es_lock);
 		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
 		write_unlock(&ei->i_es_lock);