Patchwork fs: allow for fs-specific objects to be pruned as part of pruning inodes

login
register
mail settings
Submitter Theodore Ts'o
Date Jan. 23, 2013, 6:06 a.m.
Message ID <1358921168-30921-1-git-send-email-tytso@mit.edu>
Download mbox | patch
Permalink /patch/214798/
State Rejected
Headers show

Comments

Theodore Ts'o - Jan. 23, 2013, 6:06 a.m.
The VFS's prune_super() function allows for the file system to prune
file-system specific objects.  Ext4 would like to use this to prune
parts of the inode's extent cache.  The object lifetime rules used by
ext4 is somewhat different from the those of the dentry and inode in
the VFS.  Ext4's extent cache objects can be pruned without removing
the inode; however if an inode is pruned, all of the extent cache
objects associated with the inode are immediately removed.

To accomodate this rule, we measure the number of fs-specific objects
before the dentry and inodes are pruned, and then measure them again
afterwards.  If the number of fs-specific objects have decreased, we
credit that decrease as part of the shrink operation, so that we do
not end up removing too many fs-specific objects.

In the case where fs-specific objects are not removed when inodes are
removed, this will not change the behavior of prune_super() in any
appreciable way.  (Currently the only other user of this facility is
XFS, and this change should not affect XFS's usage of this facility
for this reason.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/super.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
Jan Kara - Jan. 23, 2013, 10:52 a.m.
On Wed 23-01-13 01:06:08, Ted Tso wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.
> 
> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)
  The patch looks good to me. So you can add:
Reviewed-by: Jan Kara <jack@suse.cz>

I've also added Dave to CC.

								Honza

> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/super.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Carlos Maiolino - Jan. 23, 2013, 1:06 p.m.
Looks good to me:

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

On Wed, Jan 23, 2013 at 01:06:08AM -0500, Theodore Ts'o wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.
> 
> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/super.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 12f1237..fb57bd2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -80,6 +80,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  	if (sc->nr_to_scan) {
>  		int	dentries;
>  		int	inodes;
> +		int	fs_to_scan = 0;
>  
>  		/* proportion the scan between the caches */
>  		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
> @@ -87,7 +88,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
>  							total_objects;
>  		if (fs_objects)
> -			fs_objects = (sc->nr_to_scan * fs_objects) /
> +			fs_to_scan = (sc->nr_to_scan * fs_objects) /
>  							total_objects;
>  		/*
>  		 * prune the dcache first as the icache is pinned by it, then
> @@ -96,8 +97,23 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
>  		prune_dcache_sb(sb, dentries);
>  		prune_icache_sb(sb, inodes);
>  
> -		if (fs_objects && sb->s_op->free_cached_objects) {
> -			sb->s_op->free_cached_objects(sb, fs_objects);
> +		/*
> +		 * If as a result of pruning the icache, we released some
> +		 * of the fs_objects, give credit to the fact and
> +		 * reduce the number of fs objects that we should try
> +		 * to release.
> +		 */
> +		if (fs_to_scan) {
> +			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
> +
> +			if (fs_objects_now < fs_objects)
> +				fs_to_scan -= fs_objects - fs_objects_now;
> +			if (fs_to_scan < 0)
> +				fs_to_scan = 0;
> +		}
> +
> +		if (fs_to_scan && sb->s_op->free_cached_objects) {
> +			sb->s_op->free_cached_objects(sb, fs_to_scan);
>  			fs_objects = sb->s_op->nr_cached_objects(sb);
>  		}
>  		total_objects = sb->s_nr_dentry_unused +
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Jan. 23, 2013, 1:32 p.m.
On Wed, Jan 23, 2013 at 01:06:08AM -0500, Theodore Ts'o wrote:
> The VFS's prune_super() function allows for the file system to prune
> file-system specific objects.  Ext4 would like to use this to prune
> parts of the inode's extent cache.  The object lifetime rules used by
> ext4 is somewhat different from the those of the dentry and inode in
> the VFS.  Ext4's extent cache objects can be pruned without removing
> the inode; however if an inode is pruned, all of the extent cache
> objects associated with the inode are immediately removed.
> 
> To accomodate this rule, we measure the number of fs-specific objects
> before the dentry and inodes are pruned, and then measure them again
> afterwards.  If the number of fs-specific objects have decreased, we
> credit that decrease as part of the shrink operation, so that we do
> not end up removing too many fs-specific objects.

Doesn't work. Shrinkers run concurrently on the same context, so a
shrinker can be running on multiple CPUs and hence "interfere" with
each other. i.e. A shrinker call on CPU 2 could see a reduction in
a cache as a result of the same shrinker running on CPU 1 in the
same context, and that would mean the shrinker on CPU 2 doesn't do
the work it was asked (and needs) to do to reclaim memory.

It also seems somewhat incompatible with the proposed memcg/NUMA
aware shrinker infrastructure, where the shrinker has much more
fine-grained context for operation than the current infrastructure.
This seems to assume that there is global context relationship
between inode cache and the fs specific cache.

Also, the superblock shrinker is designed around a direct 1:1:1
dependency relationship between the superblock dentry, inode and "fs
cache" objects. i.e. dentry pins inode pins fs cache object.  It is
designed to keep a direct balance of the three caches by ensuring
they get scanned in amounts directly proportional to the relative
differences in their object counts.  That can't be done with
separate shrinkers, hence the use of the superblock shrinker to
define the dependent relationship between the caches.

i.e. the relationship between the VFS inode and the XFS inode is
that they are the same object, but the "XFS inode" lives for some
time after the VFS is done with it. IOWs, the VFS inode reclaim does
not free any memory, so the reclaim work needs to be transferred
directly to the "fscache shrinker" to free the inode objects.

e.g.  if we have a batch of 100 objects to scan and al the caches
are of equal sizes, 33 dentries are scanned, 33 VFS indoes are
scanned, and 33 XFS inodes are scanned, The result is that for every
100 objects scanned, we'll free 33 dentries and 33 inodes. And if
the caches are out of balance, the biasing of the reclaim towards
different caches will pull them back into even proportions of object
counts.  i.e. the proportioning is very carefully balanced around
maintaining the fixed relationship between the different types
objects...

In your proposed use case, the ext4 extent cache size has no direct
relationship to the size of the VFS inode cache - the can both
change size independently and not impact the balance of the system
as long as the hot objects are kept in their respective caches when
under memory pressure.

When the cache size proportion varies with workload, you want
separate shrinkers for the caches so that the memory pressure and
number of active objects the workload generates determines the cache
size. That's exactly what I'd say is necessary for an extent cache -
it will balloon out massively larger than the inode cache when you
have fragmented files, but if you have well formed files, it will
stay relatively small. However, the number of inodes doesn't change,
and hence what we have here is the optimal cache size proportions
change with workload...

i.e. the superblock fscache shrinker callout is the wrong thing to
use here asit doesn't model the relationship between objects at all
well. A separate shrinker instance for the extent cache is a much
better match....

> In the case where fs-specific objects are not removed when inodes are
> removed, this will not change the behavior of prune_super() in any
> appreciable way.  (Currently the only other user of this facility is
> XFS, and this change should not affect XFS's usage of this facility
> for this reason.)

It can change behaviour is surprisingly subtle, nasty ways.

The XFS superblock shrinker is what provides that memory reclaim
rate throttling for XFS, similar to the way the page cache throttles
writes to the rate at which dirty data can be written to disk. In
effect, it throttles the rate of cache allocation to the rate at
which we clean and free inodes and hence maintains a good system
balance.

The shrinker is also designed to prevent overloading and contention
in the case of concurent execution on large node count machines. It
prevents different CPUs/nodes executing reclaim on the same caches
and hence contending on locks. This also further throttles the rate
of reclaim by blocking shrinkers until they can do the work they
were asked to do by reclaim. IOWs, there are several layers of
throttling in the XFS shrinker, and the system balance is dependent
on the XFS inode shrinker being called appropriately.

With these two cache counters, XFS is guaranteed to see different
inode counts as shrinker reclaim always happens concurrently, not to
mention there is also background inode reclaim also running
concurrently. The result of decreasing counts (as will happen
frequently) is that the cwthis shrinker-based reclaim will not be
run, and hence inode reclaim will not get throttled to the rate at
which we can clean inodes or relcaim them without contention.

The result is that the system becomes unstable and unpredictable
under memory pressure, especially under workloads that dirty a lot
of metadata. Think "random OOM conditions" and "my system stopped
responding for minutes" type of issues....

Like I said, surprisingly subtle and nasty....

Cheers,

Dave.
Theodore Ts'o - Jan. 23, 2013, 4:34 p.m.
On Thu, Jan 24, 2013 at 12:32:31AM +1100, Dave Chinner wrote:
> Doesn't work. Shrinkers run concurrently on the same context, so a
> shrinker can be running on multiple CPUs and hence "interfere" with
> each other. i.e. A shrinker call on CPU 2 could see a reduction in
> a cache as a result of the same shrinker running on CPU 1 in the
> same context, and that would mean the shrinker on CPU 2 doesn't do
> the work it was asked (and needs) to do to reclaim memory.

Hmm, I had assumed that a fs would only have a single prune_super()
running at a time.  So you're telling me that was a bad assumption....

> It also seems somewhat incompatible with the proposed memcg/NUMA
> aware shrinker infrastructure, where the shrinker has much more
> fine-grained context for operation than the current infrastructure.
> This seems to assume that there is global context relationship
> between inode cache and the fs specific cache.

Can you point me at a mail archive with this proposed memcg-aware
shrinker?  I was noticing that that at time moment we're not doing any
shrinking at all on a per-memcg basis, and was reflecting on what a
mess that could cause....  I agree that's a problem that needs fixing,
although it seems fundamentally, hard, especially given that we
currently account for memcg memory usage on a per-page basis, and a
single object owned by a different memcg could prevent a page which
was originally allocated (and hence charged) to the first memcg....

> In your proposed use case, the ext4 extent cache size has no direct
> relationship to the size of the VFS inode cache - the can both
> change size independently and not impact the balance of the system
> as long as the hot objects are kept in their respective caches when
> under memory pressure.
> 
> i.e. the superblock fscache shrinker callout is the wrong thing to
> use here asit doesn't model the relationship between objects at all
> well. A separate shrinker instance for the extent cache is a much
> better match....

Yeah, that was Zheng's original implementation.  My concern was that
could cause the extent cache to get charged twice.  It would get hit
one time when we shrank the number of inodes, since the extent cache
currently does not have a lifetime independent of inodes (rather they
are linked to the inode via a tree structure), and then if we had a
separate extent cache shrinker, they would get reduced a second time.

The reason why we need the second shrinker, of course, is because of
the issue you raised; we could have some files which are heavily
fragmented, and hence would have many more extent cache objects, and
so we can't just rely on shrinking the inode cache to keep the growth
of the extent caches in check in a high memory pressure situation.

Hmm....  this is going to require more thought.  Do you have any
sugestions about what might be a better strategy?

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Jan. 23, 2013, 11:35 p.m.
On Wed, Jan 23, 2013 at 11:34:21AM -0500, Theodore Ts'o wrote:
> On Thu, Jan 24, 2013 at 12:32:31AM +1100, Dave Chinner wrote:
> > Doesn't work. Shrinkers run concurrently on the same context, so a
> > shrinker can be running on multiple CPUs and hence "interfere" with
> > each other. i.e. A shrinker call on CPU 2 could see a reduction in
> > a cache as a result of the same shrinker running on CPU 1 in the
> > same context, and that would mean the shrinker on CPU 2 doesn't do
> > the work it was asked (and needs) to do to reclaim memory.
> 
> Hmm, I had assumed that a fs would only have a single prune_super()
> running at a time.  So you're telling me that was a bad assumption....

Yes.

> > It also seems somewhat incompatible with the proposed memcg/NUMA
> > aware shrinker infrastructure, where the shrinker has much more
> > fine-grained context for operation than the current infrastructure.
> > This seems to assume that there is global context relationship
> > between inode cache and the fs specific cache.
> 
> Can you point me at a mail archive with this proposed memcg-aware
> shrinker?  I was noticing that that at time moment we're not doing any
> shrinking at all on a per-memcg basis, and was reflecting on what a
> mess that could cause....  I agree that's a problem that needs fixing,
> although it seems fundamentally, hard, especially given that we
> currently account for memcg memory usage on a per-page basis, and a
> single object owned by a different memcg could prevent a page which
> was originally allocated (and hence charged) to the first memcg....

http://oss.sgi.com/archives/xfs/2012-11/msg00643.html

The posting is for numa aware LRUs and shrinkers, and the discussion
follows on how to build memcg awareness on top of that generic
LRU/shrinker infrastructure

> > In your proposed use case, the ext4 extent cache size has no direct
> > relationship to the size of the VFS inode cache - the can both
> > change size independently and not impact the balance of the system
> > as long as the hot objects are kept in their respective caches when
> > under memory pressure.
> > 
> > i.e. the superblock fscache shrinker callout is the wrong thing to
> > use here asit doesn't model the relationship between objects at all
> > well. A separate shrinker instance for the extent cache is a much
> > better match....
> 
> Yeah, that was Zheng's original implementation.  My concern was that
> could cause the extent cache to get charged twice.  It would get hit
> one time when we shrank the number of inodes, since the extent cache
> currently does not have a lifetime independent of inodes (rather they
> are linked to the inode via a tree structure), and then if we had a
> separate extent cache shrinker, they would get reduced a second time.

The decision of how much to shrink a cache is made at the time the
shrinker is invoked, not for each call to the shrinker function. The
number to scan from each cache is based on a fixed value, and hence
all caches are put under the same pressure. The amount of objects to
scan is therefore dependent on the relative difference in the number
of objects in each cache. Hence if we remove objects from cache B
while scanning cache A, the shrinker for cache B will see less
objects in the cache and apply less pressure (i.e. scan less).

However, what you have to consider is that the micro-level
behaviour of a single shrinker call is not important. Shrinkers
often run at thousands of scan cycles per second, and so it's the
macro-level behaviour that results from the interactions of multiple
shrinkers that determines the system balance under memory pressure.
Design and tune for macro-level behaviour, not what seems right for
a single shrinker scan call...

> The reason why we need the second shrinker, of course, is because of
> the issue you raised; we could have some files which are heavily
> fragmented, and hence would have many more extent cache objects, and
> so we can't just rely on shrinking the inode cache to keep the growth
> of the extent caches in check in a high memory pressure situation.
> 
> Hmm....  this is going to require more thought.  Do you have any
> sugestions about what might be a better strategy?

In general, the shrinker mechanism balances separate caches pretty
well, so I'd just use a standard shrinker first. Observe the
behaviour under different workloads to see if the standard cache
balancing causes problems. If you see obvious high level imbalances
or performance problems then you need to start considering "special"
solutions.

The coarse knob the shrinkers have to affect this balance is the
"seeks" parameter of the shrinker. That tells the shrinker the
relative cost of replacing the object in the cache, and so has a
high level bias on the pressure the infrastructure places on the
cache. What you need to decide is whether the cost of replacing
objects is more or less expensive than the cost of replaing an inode
in cache, and bias from there.

The filesystem caches also have another "big hammer" knob in the
form of the /proc/sys/vm/vfs_cache_pressure sysctl. This makes the
caches look larger or smaller w.r.t. the page cache and hence biases
reclaim towards or away from the VFS caches. YOu can use this method
in individual shrinkers to cause the shrinker infrastructure to have
different reclaim characterisitics. Hence if you don't want to
reclaim from a cache, then just tell the shrinker it's size is
zero. (FWIW, the changed API in the above patch set makes this
biasing technique much easier and more reliable.)

I guess what I'm trying to say is just use a standard, stand-alone
shrinker and see how it behaves under real world conditions before
trying anything fancy. Often they "just work". :)

Cheers,

Dave.
Andrew Morton - Jan. 24, 2013, 8:58 a.m.
On Thu, 24 Jan 2013 00:32:31 +1100 Dave Chinner <david@fromorbit.com> wrote:

> Also, the superblock shrinker is designed around a direct 1:1:1
> dependency relationship between the superblock dentry, inode and "fs
> cache" objects. i.e. dentry pins inode pins fs cache object.  It is
> designed to keep a direct balance of the three caches by ensuring
> they get scanned in amounts directly proportional to the relative
> differences in their object counts.  That can't be done with
> separate shrinkers, hence the use of the superblock shrinker to
> define the dependent relationship between the caches.

I was staring at the code and at the 0e1fdafd9 changelog trying to work
out why prune_super() does its weird shrinker-in-a-shrinker thing.  And
failing.

IOW it needs a code comment, please.  Ideally one which explains *why*
"It is designed to keep a direct balance of the three caches...".  What
would go wrong if the fs were to just register its own shrinker in the
expected manner?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Jan. 24, 2013, 8:03 p.m.
On Thu, Jan 24, 2013 at 12:58:16AM -0800, Andrew Morton wrote:
> On Thu, 24 Jan 2013 00:32:31 +1100 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Also, the superblock shrinker is designed around a direct 1:1:1
> > dependency relationship between the superblock dentry, inode and "fs
> > cache" objects. i.e. dentry pins inode pins fs cache object.  It is
> > designed to keep a direct balance of the three caches by ensuring
> > they get scanned in amounts directly proportional to the relative
> > differences in their object counts.  That can't be done with
> > separate shrinkers, hence the use of the superblock shrinker to
> > define the dependent relationship between the caches.
> 
> I was staring at the code and at the 0e1fdafd9 changelog trying to work
> out why prune_super() does its weird shrinker-in-a-shrinker thing.  And
> failing.

commit 8daaa83145ef1f0a146680618328dbbd0fa76939
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jul 8 14:14:46 2011 +1000

    xfs: make use of new shrinker callout for the inode cache
    
    Convert the inode reclaim shrinker to use the new per-sb shrinker
    operations. This allows much bigger reclaim batches to be used, and
    allows the XFS inode cache to be shrunk in proportion with the VFS
    dentry and inode caches. This avoids the problem of the VFS caches
    being shrunk significantly before the XFS inode cache is shrunk
    resulting in imbalances in the caches during reclaim.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Basically, it's to allow filesystems with their own inode cache
management to shrink the cache in direct proportion to the VFS inode
cache so that we don't get situations where the VFS caches get
shrunk to nothing while the filesystem inode cache stays large (and
hence no memory is reclaimed) and vice-versa. The reclaim imbalance
problem ended up so bad I could OOM machines simply by running a
metadata intensive workload....

> IOW it needs a code comment, please.  Ideally one which explains *why*
> "It is designed to keep a direct balance of the three caches...".  What
> would go wrong if the fs were to just register its own shrinker in the
> expected manner?

Nothing, unless there is a direct reclaim relationship between the
filesystem cache and the VFS caches and then independent shrinkers
cannot maintain the balanced relationship between the dependent
caches.  XFS uses normal shrinkers for it's other caches that don't
have a direct 1:1 relationship to the VFS inode and dentry caches,
and that works just fine.

As it is, if you just want to keep random caches at the same object
count as the inodes and dentries, then the fs callout will work just
fine. But if you want something that does not have an equivalent
object count relationship, then a separate shrinker is what is
needed.

Cheers,

Dave.

Patch

diff --git a/fs/super.c b/fs/super.c
index 12f1237..fb57bd2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -80,6 +80,7 @@  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan) {
 		int	dentries;
 		int	inodes;
+		int	fs_to_scan = 0;
 
 		/* proportion the scan between the caches */
 		dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
@@ -87,7 +88,7 @@  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
 							total_objects;
 		if (fs_objects)
-			fs_objects = (sc->nr_to_scan * fs_objects) /
+			fs_to_scan = (sc->nr_to_scan * fs_objects) /
 							total_objects;
 		/*
 		 * prune the dcache first as the icache is pinned by it, then
@@ -96,8 +97,23 @@  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 		prune_dcache_sb(sb, dentries);
 		prune_icache_sb(sb, inodes);
 
-		if (fs_objects && sb->s_op->free_cached_objects) {
-			sb->s_op->free_cached_objects(sb, fs_objects);
+		/*
+		 * If as a result of pruning the icache, we released some
+		 * of the fs_objects, give credit to the fact and
+		 * reduce the number of fs objects that we should try
+		 * to release.
+		 */
+		if (fs_to_scan) {
+			int fs_objects_now = sb->s_op->nr_cached_objects(sb);
+
+			if (fs_objects_now < fs_objects)
+				fs_to_scan -= fs_objects - fs_objects_now;
+			if (fs_to_scan < 0)
+				fs_to_scan = 0;
+		}
+
+		if (fs_to_scan && sb->s_op->free_cached_objects) {
+			sb->s_op->free_cached_objects(sb, fs_to_scan);
 			fs_objects = sb->s_op->nr_cached_objects(sb);
 		}
 		total_objects = sb->s_nr_dentry_unused +