
[RFC] JBD: release checkpoint journal heads through try_to_release_page when memory is exhausted

Message ID 20081023174101.85b59177.toshi.okajima@jp.fujitsu.com
State Superseded, archived

Commit Message

Toshiyuki Okajima Oct. 23, 2008, 8:41 a.m. UTC
Hi Andrew.

> > rather costly.  An alternative might be to implement a shrinker
> > callback function for the journal_head slab cache.  Did you consider
> > this?
> Yes.
> But an unused-list and counters are required to manage the shrink
> targets ("journal heads") if we implement a shrinker.
> I thought that comparatively large code changes to jbd would be
> necessary to accomplish it.

> However, I will try it.

I managed to build a shrinker callback function for the journal_head slab cache.
The code size is smaller than before, but the logic seems more complex
than before.
I haven't hit any trouble while running some light load tests on the
fixed kernel.
However, I think the system may hang if several journal_head shrinkers
run concurrently.
So I will try to build a more appropriate fix.

Please give me comments if you have a nicer idea.

------------------------------------------------------------------------------
Direct data blocks can be released through releasepage(), the member
function of their mapping (their mapping belongs to their filesystem
i-node).

Indirect data blocks (ext3), on the other hand, can only be released by
try_to_free_buffers(), because their mapping is the block device's, and
a block device does not provide its own member function to release a
page.
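
That dispatch happens in try_to_release_page(); roughly (simplified from
mm/filemap.c around 2.6.27, so details may differ):

	int try_to_release_page(struct page *page, gfp_t gfp_mask)
	{
		struct address_space * const mapping = page->mapping;

		BUG_ON(!PageLocked(page));
		if (PageWriteback(page))
			return 0;

		/* def_blk_aops has no releasepage(), so block device
		 * pages always take the generic fallback below. */
		if (mapping && mapping->a_ops->releasepage)
			return mapping->a_ops->releasepage(page, gfp_mask);
		return try_to_free_buffers(page);
	}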

But try_to_free_buffers() is a generic function which releases
buffer_heads, and it cannot release a buffer_head that carries private
data (such as a journal_head), because the buffer_head's reference
counter is then greater than 0.
Therefore a buffer_head cannot be released by try_to_free_buffers() even
when its private data could be released.
As a result, the oom-killer may trigger when system memory is exhausted,
even though a lot of private data could have been released.
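
(The blocker is the busy test in fs/buffer.c: journal_add_journal_head()
takes an extra b_count reference when it attaches a journal_head, so such
a buffer always looks busy.  Roughly, from fs/buffer.c:

	static inline int buffer_busy(struct buffer_head *bh)
	{
		return atomic_read(&bh->b_count) |
			(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
	}

drop_buffers(), called from try_to_free_buffers(), walks the page's
buffer ring and gives up as soon as buffer_busy() is true for any of the
buffers.)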

To solve this situation, a shrinker for journal_heads is required.
The shrinker was modeled on logic such as shrink_icache_memory.
In order to shrink journal_heads, it is necessary to maintain a list of
journal_heads awaiting checkpoint across all filesystems using jbd.

The new global list is updated at the following times:
- when a journal_head is added to a transaction's checkpoint list, it is
  also added to the global checkpoint list (the new list);
- when a journal_head is removed from a transaction's checkpoint list,
  it is also removed from the global checkpoint list;
- while the shrinker is running.

The shrinker scans only the requested number of journal_heads on the new
list and releases them where possible (see the sketch of the callback
contract below).

This makes the oom-killer less likely to trigger than before.
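
The callback follows the same contract as shrink_icache_memory() in
fs/inode.c (as of 2.6.27): called with nr == 0 it only reports an
estimate of freeable objects (scaled by a cache-pressure divisor); with
nr > 0 it scans that many entries first; and it must bail out when the
allocation is not allowed to recurse into the filesystem:

	static int shrink_icache_memory(int nr, gfp_t gfp_mask)
	{
		if (nr) {
			/*
			 * Nasty deadlock avoidance.  We may hold various FS
			 * locks, and we don't want to recurse into the FS
			 * that called us in clear_inode() and friends..
			 */
			if (!(gfp_mask & __GFP_FS))
				return -1;
			prune_icache(nr);
		}
		return (inodes_stat.nr_unused * 100) / sysctl_vfs_cache_pressure;
	}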

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/jbd/checkpoint.c          |   77 +++++++++++++++++++++++++++++++++++++++++++
 fs/jbd/journal.c             |    2 +
 include/linux/journal-head.h |    7 +++
 3 files changed, 86 insertions(+)


Comments

Andrew Morton Oct. 27, 2008, 9:26 p.m. UTC | #1
(added linux-fsdevel)

On Thu, 23 Oct 2008 17:41:01 +0900
Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:

> Hi Andrew.
> 
> > > rather costly.  An alternative might be to implement a shrinker
> > > callback function for the journal_head slab cache.  Did you consider
> > > this?
> > Yes.
> > But an unused-list and counters are required to manage the shrink
> > targets ("journal heads") if we implement a shrinker.
> > I thought that comparatively large code changes to jbd would be
> > necessary to accomplish it.
> 
> > However, I will try it.
> 
> I managed to build a shrinker callback function for the journal_head slab cache.
> The code size is smaller than before, but the logic seems more complex
> than before.
> I haven't hit any trouble while running some light load tests on the
> fixed kernel.
> However, I think the system may hang if several journal_head shrinkers
> run concurrently.
> So I will try to build a more appropriate fix.

yeah, that's not very pretty either, is it?

> Please give me comments if you have a nicer idea.

Stepping back a bit...

The basic problem is, I believe, that some client of the blockdev
(ext3) is adding metadata to the blockdev's data structures
(buffer_heads) but we have no means by which the blockdev code can call
back into that client requesting that the metadata be released, yes?

We can fix the problem which you've identified by adding a means for
the blockdev code (def_blk_aops.releasepage()) to call back into ext3,
yes?

If so, how do we do that?

I seem to recall that there's code somewhere in the tree which does
things like taking a copy of bdev->address_space_operations and
reinstalling that, and overwriting selected fields, and then arranging
somehow for the old value to be reinstalled when the client releases
the blockdev.  That's plain nasty.

Perhaps what we could do is to add a new

	blkdev_register_releasepage(struct block_device *,
					int (*)(struct page *, gfp_t))

function and call that from within ext3 initialisation.  (This could be
a block_device_operations entry, but is there any point in doing that?)


Within blkdev_register_releasepage(), record the address of that
function in the `struct block_device' (with what locking??) and then
implement def_blk_aops.releasepage(), which calls
bdev->registered_releasepage().  Set def_blk_aops.releasepage() to point
at try_to_free_buffers() to provide the default behaviour.

Then we'd need a blkdev_unregister_releasepage() which restores the old
value.  Or, better, make blkdev_register_releasepage()
return the old value and require that clients of the blockdev (ie:
ext3) restore the old value prior to releasing the blockdev.
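
A minimal sketch of that idea (purely hypothetical: the bd_releasepage
field and both functions below do not exist in the tree, and the locking
question is left open):

	typedef int (*releasepage_fn)(struct page *, gfp_t);

	/*
	 * Record the client's callback in a new struct block_device
	 * field, int (*bd_releasepage)(struct page *, gfp_t), and hand
	 * back the old value so the client can restore it before it
	 * releases the blockdev.
	 */
	releasepage_fn blkdev_register_releasepage(struct block_device *bdev,
						   releasepage_fn fn)
	{
		releasepage_fn old = bdev->bd_releasepage;

		bdev->bd_releasepage = fn;
		return old;
	}

	/* installed as def_blk_aops.releasepage() */
	static int blkdev_releasepage(struct page *page, gfp_t gfp_mask)
	{
		struct block_device *bdev = I_BDEV(page->mapping->host);

		if (bdev->bd_releasepage)
			return bdev->bd_releasepage(page, gfp_mask);
		return try_to_free_buffers(page);	/* default behaviour */
	}

ext3 would then call blkdev_register_releasepage() during mount and
restore the returned value before releasing the blockdev at unmount.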



Or something along these lines, anyway..

Toshiyuki Okajima Oct. 28, 2008, 2:46 a.m. UTC | #2
Hi Andrew.
Thank you for your useful comment.

Andrew Morton wrote:
 > (added linux-fsdevel)
 >
 > On Thu, 23 Oct 2008 17:41:01 +0900
 > Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
<SNIP>
 > > I managed to build a shrinker callback function for the journal_head slab cache.
 > > The code size is smaller than before, but the logic seems more complex
 > > than before.
 > > I haven't hit any trouble while running some light load tests on the
 > > fixed kernel.
 > > However, I think the system may hang if several journal_head shrinkers
 > > run concurrently.
 > > So I will try to build a more appropriate fix.
 >
 > yeah, that's not very pretty either, is it?
Yes...
I realized that fixing this only within buffer_head or jbd requires more
complex code.

 > > Please give me comments if you have a nicer idea.
 > Stepping back a bit...
 >
 > The basic problem is, I believe, that some client of the blockdev
 > (ext3) is adding metadata to the blockdev's data structures
 > (buffer_heads) but we have no means by which the blockdev code can call
 > back into that client requesting that the metadata be released, yes?
Yes.

 > We can fix the problem which you've identified by adding a means for
 > the blockdev code (def_blk_aops.releasepage()) to call back into ext3,
 > yes?
Yes.

At first, I tried to fix it using only a filesystem-side approach:
  - ver.1: fixing it in buffer_head
  - ver.2: adding a shrinker for journal_heads (to release buffer_heads)
But these approaches led to complex code.
So we should fix the essence of the problem instead.

 > If so, how do we do that?
 >
 > I seem to recall that there's code somewhere in the tree which does
 > things like taking a copy of bdev->address_space_operations and
 > reinstalling that, and overwriting selected fields, and then arranging
 > somehow for the old value to be reinstalled when the client releases
 > the blockdev.  That's plain nasty.
uh-huh.

I will try to fix this problem again using your approach,
blkdev_register_releasepage()/blkdev_unregister_releasepage().

Best Regards,
Toshiyuki Okajima


Patch

diff -Nurp linux-2.6.27.1.org/fs/jbd/checkpoint.c linux-2.6.27.1/fs/jbd/checkpoint.c
--- linux-2.6.27.1.org/fs/jbd/checkpoint.c	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/fs/jbd/checkpoint.c	2008-10-23 15:07:14.000000000 +0900
@@ -24,6 +24,14 @@ 
 #include <linux/slab.h>
 
 /*
+ * Used for shrinking journal_heads whose I/O has completed
+ */
+static DEFINE_SPINLOCK(jbd_global_lock);
+static LIST_HEAD(jbd_checkpoint_list);
+static int jbd_jh_cache_pressure = 10;
+static int jbd_nr_checkpoint_jh = 0;
+
+/*
  * Unlink a buffer from a transaction checkpoint list.
  *
  * Called with j_list_lock held.
@@ -595,6 +603,10 @@  int __journal_remove_checkpoint(struct j
 
 	__buffer_unlink(jh);
 	jh->b_cp_transaction = NULL;
+	spin_lock(&jbd_global_lock);
+	list_del_init(&jh->b_checkpoint_list);
+	jbd_nr_checkpoint_jh--;
+	spin_unlock(&jbd_global_lock);
 
 	if (transaction->t_checkpoint_list != NULL ||
 	    transaction->t_checkpoint_io_list != NULL)
@@ -655,8 +667,73 @@  void __journal_insert_checkpoint(struct 
 		jh->b_cpnext->b_cpprev = jh;
 	}
 	transaction->t_checkpoint_list = jh;
+	spin_lock(&jbd_global_lock);
+	list_add(&jh->b_checkpoint_list, &jbd_checkpoint_list);
+	jbd_nr_checkpoint_jh++;
+	spin_unlock(&jbd_global_lock);
+}
+
+static void try_to_free_cp_buf(journal_t *journal, transaction_t *transaction, struct journal_head *jh)
+{
+	transaction_t *transaction2;
+
+	spin_lock(&journal->j_list_lock);
+	if (!list_empty(&jh->b_checkpoint_list)) {
+		transaction2 = jh->b_cp_transaction;
+		BUG_ON(transaction2 == NULL);
+		if (transaction == transaction2) {
+			jbd_lock_bh_state(jh2bh(jh));
+			__try_to_free_cp_buf(jh);
+		}
+	}
+	spin_unlock(&journal->j_list_lock);
 }
 
+static void prune_jbd_jhcache(int nr)
+{
+	struct journal_head *jh;
+	struct list_head *tmp;
+	journal_t *journal;
+	transaction_t *transaction;
+
+	BUG_ON(nr < 0);
+	for (; nr; nr--) {
+		spin_lock(&jbd_global_lock);
+		if ((tmp = jbd_checkpoint_list.prev) == &jbd_checkpoint_list) {
+			spin_unlock(&jbd_global_lock);
+			break;
+		}
+		list_move(tmp, &jbd_checkpoint_list);
+		jh = list_entry(tmp, struct journal_head, b_checkpoint_list);
+	/* Pin the jh so it cannot be freed while we work on it */
+		journal_grab_journal_head(jh2bh(jh));
+		transaction = jh->b_cp_transaction;
+		BUG_ON(transaction == NULL);
+		journal = transaction->t_journal;
+		spin_unlock(&jbd_global_lock);
+	/* Release the jh from its checkpoint list if possible */
+		try_to_free_cp_buf(journal, transaction, jh);
+	/* Drop the reference taken above (the jh may be freed here) */
+		journal_put_journal_head(jh);
+		cond_resched();
+	}
+}
+
+static int shrink_jbd_jhcache_memory(int nr, gfp_t gfp_mask)
+{
+	if (nr) {
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
+		prune_jbd_jhcache(nr);
+	}
+	return (jbd_nr_checkpoint_jh * 100) / jbd_jh_cache_pressure;
+}
+
+struct shrinker jbd_jh_shrinker = {
+	.shrink = shrink_jbd_jhcache_memory,
+	.seeks = DEFAULT_SEEKS,
+};
+
 /*
  * We've finished with this transaction structure: adios...
  *
diff -Nurp linux-2.6.27.1.org/fs/jbd/journal.c linux-2.6.27.1/fs/jbd/journal.c
--- linux-2.6.27.1.org/fs/jbd/journal.c	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/fs/jbd/journal.c	2008-10-23 15:00:44.000000000 +0900
@@ -1890,6 +1890,7 @@  static inline void jbd_remove_debugfs_en
 
 #endif
 
+extern struct shrinker jbd_jh_shrinker;
 struct kmem_cache *jbd_handle_cache;
 
 static int __init journal_init_handle_cache(void)
@@ -1903,6 +1904,7 @@  static int __init journal_init_handle_ca
 		printk(KERN_EMERG "JBD: failed to create handle cache\n");
 		return -ENOMEM;
 	}
+	register_shrinker(&jbd_jh_shrinker);
 	return 0;
 }
 
diff -Nurp linux-2.6.27.1.org/include/linux/journal-head.h linux-2.6.27.1/include/linux/journal-head.h
--- linux-2.6.27.1.org/include/linux/journal-head.h	2008-10-16 08:02:53.000000000 +0900
+++ linux-2.6.27.1/include/linux/journal-head.h	2008-10-23 15:00:44.000000000 +0900
@@ -87,6 +87,13 @@  struct journal_head {
 	 * [j_list_lock]
 	 */
 	struct journal_head *b_cpnext, *b_cpprev;
+
+	/*
+	 * Link into the global checkpoint journal_head list
+	 * (across all filesystems using jbd), for the shrinker.
+	 * [jbd_global_lock]
+	 */
+	struct list_head b_checkpoint_list;
 };
 
 #endif		/* JOURNAL_HEAD_H_INCLUDED */