diff mbox

[RFC] JBD: release checkpoint journal heads through try_to_release_page when the memory is exhausted

Message ID 20081017.223716.147444348.00960188@stratos.soft.fujitsu.com
State Superseded, archived
Headers show

Commit Message

Toshiyuki Okajima Oct. 17, 2008, 1:37 p.m. UTC
Hi.

I found the situation where OOM-Killer happens easily.
I will inform you of it. 
I tried to fix this problem to make OOM-Killer not happen easily as much as 
possible.
As a result, I made a reference patch to fix it. 

Any comments are welcome.
(The comments for making much simpler or epoch-making approach are 
very welcome.)

------------------------------------------------------------------------------

If the following is satisfied, OOM-Killer happens easily.
(1) A quarter of a summation of each total log size of all filesystems which 
   use jbd exceeds the memory size of Normal Zone.
(2) We commit a huge number of data which include many metadata to each 
   filesystem and then we stop committing data to them. 
    For example, a process creates many files whose size are huge and 
   which have a huge number of indirect blocks. Then all processes stop I/O 
   to all filesystems which use jbd.
(3) After (2), we request to get a big size memory.
(NOTE: A oom-killer can happen easily on a system whose architecture is x86. 
Because a x86 system can have only a small Normal Zone of less than 1GB.)

The reason is that jbd does not positively release journal heads(jh-s)
  even if there are many jh-s which can be released.

Releasing jh-s is only executed at the following timing:
- if free log space becomes a quarter of the total log size 
  (log_do_checkpoint())
- if a transaction begins to commit (journal_cleanup_checkpoint_list() 
 which is called by journal_commit_transaction())
(NOTE: A jh-s which corresponds to buffer heads (bh-s) which is a direct block 
      can be released at journal_try_to_free_buffers() which is called 
      by try_to_release_page())   

Therefore,  if we let filesystems do above (2), jh-s remains because 
new transaction isn't generated. 
However, when the system memory is exhausted, try_to_release_page() can be 
called, but it cannot release bh-s which are metadata (indirect blocks 
and so on).  
Because the mapping to the page is owned by a block device not a filesystem 
(ext3).

If the mapping is owned by a block device, try_to_release_page() calls 
try_to_free_buffers(). It can release generic bh, but cannot release the bh 
which is referring by the jh. Because the reference counter of the bh is 
larger than 0.
Therefore it is necessary to release the jh before the bh is released.

To achieve it, I added a new member function into buffer head structure.
The function releases the bh which correspond to a page whose mapping
is block device. And the release target of the bh has private data 
(journal head).
The function resembles journal_try_to_free_buffers().
Then I changed try_to_release_page(), which calls try_to_free_buffers()
after the new function.

As a result, I think it becomes difficult for oom-killer to happen 
than before because try_to_free_buffers() via try_to_release_page() 
which is called when the system memory is exhausted can release bh-s. 

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/buffer.c                 |   23 ++++++++++++++++++++++-
 fs/jbd/journal.c            |    7 +++++++
 fs/jbd/transaction.c        |   39 +++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |    7 +++++++
 include/linux/jbd.h         |    1 +
 5 files changed, 76 insertions(+), 1 deletion(-) 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andrew Morton Oct. 20, 2008, 11:02 p.m. UTC | #1
On Fri, 17 Oct 2008 22:37:16 +0900 (JST)
Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:

> Hi.
> 
> I found the situation where OOM-Killer happens easily.
> I will inform you of it. 
> I tried to fix this problem to make OOM-Killer not happen easily as much as 
> possible.
> As a result, I made a reference patch to fix it. 
> 
> Any comments are welcome.
> (The comments for making much simpler or epoch-making approach are 
> very welcome.)
> 
> ------------------------------------------------------------------------------
> 
> If the following is satisfied, OOM-Killer happens easily.
> (1) A quarter of a summation of each total log size of all filesystems which 
>    use jbd exceeds the memory size of Normal Zone.
> (2) We commit a huge number of data which include many metadata to each 
>    filesystem and then we stop committing data to them. 
>     For example, a process creates many files whose size are huge and 
>    which have a huge number of indirect blocks. Then all processes stop I/O 
>    to all filesystems which use jbd.
> (3) After (2), we request to get a big size memory.
> (NOTE: A oom-killer can happen easily on a system whose architecture is x86. 
> Because a x86 system can have only a small Normal Zone of less than 1GB.)
> 
> The reason is that jbd does not positively release journal heads(jh-s)
>   even if there are many jh-s which can be released.
> 
> Releasing jh-s is only executed at the following timing:
> - if free log space becomes a quarter of the total log size 
>   (log_do_checkpoint())
> - if a transaction begins to commit (journal_cleanup_checkpoint_list() 
>  which is called by journal_commit_transaction())
> (NOTE: A jh-s which corresponds to buffer heads (bh-s) which is a direct block 
>       can be released at journal_try_to_free_buffers() which is called 
>       by try_to_release_page())   
> 
> Therefore,  if we let filesystems do above (2), jh-s remains because 
> new transaction isn't generated. 
> However, when the system memory is exhausted, try_to_release_page() can be 
> called, but it cannot release bh-s which are metadata (indirect blocks 
> and so on).  
> Because the mapping to the page is owned by a block device not a filesystem 
> (ext3).
> 
> If the mapping is owned by a block device, try_to_release_page() calls 
> try_to_free_buffers(). It can release generic bh, but cannot release the bh 
> which is referring by the jh. Because the reference counter of the bh is 
> larger than 0.
> Therefore it is necessary to release the jh before the bh is released.
> 
> To achieve it, I added a new member function into buffer head structure.
> The function releases the bh which correspond to a page whose mapping
> is block device. And the release target of the bh has private data 
> (journal head).
> The function resembles journal_try_to_free_buffers().
> Then I changed try_to_release_page(), which calls try_to_free_buffers()
> after the new function.
> 
> As a result, I think it becomes difficult for oom-killer to happen 
> than before because try_to_free_buffers() via try_to_release_page() 
> which is called when the system memory is exhausted can release bh-s. 
> 

OK.

> ---
>  fs/buffer.c                 |   23 ++++++++++++++++++++++-
>  fs/jbd/journal.c            |    7 +++++++
>  fs/jbd/transaction.c        |   39 +++++++++++++++++++++++++++++++++++++++
>  include/linux/buffer_head.h |    7 +++++++
>  include/linux/jbd.h         |    1 +
>  5 files changed, 76 insertions(+), 1 deletion(-) 

The patch is fairly complex, and increasing the buffer_head size can be
rather costly.  An alternative might be to implement a shrinker
callback function for the journal_head slab cache.  Did you consider
this?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima Oct. 21, 2008, 2:49 a.m. UTC | #2
Hi Andrew.
Thank you for your comment.

Andrew Morton wrote:
> On Fri, 17 Oct 2008 22:37:16 +0900 (JST)
> Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
> 
>> Hi.
>>
>> I found the situation where OOM-Killer happens easily.
>> I will inform you of it. 
>> I tried to fix this problem to make OOM-Killer not happen easily as much as 
>> possible.
<SNIP>
> OK.
> 
>> ---
>>  fs/buffer.c                 |   23 ++++++++++++++++++++++-
>>  fs/jbd/journal.c            |    7 +++++++
>>  fs/jbd/transaction.c        |   39 +++++++++++++++++++++++++++++++++++++++
>>  include/linux/buffer_head.h |    7 +++++++
>>  include/linux/jbd.h         |    1 +
>>  5 files changed, 76 insertions(+), 1 deletion(-) 
> 
> The patch is fairly complex, and increasing the buffer_head size can be
Yes.
Applying this fix causes the buffer_head size to increase.
The increase of the buffer_head size changes into 60 bytes from 56 bytes
on x86 system.
As a result, the maximum number of buffer heads of one slab changes
into 63 from 64.
(The increase of the size is less than 2%.)
Therefore I think this change influences system performance hardly.
And I rather want to add a new member because I think it is useful for
not only this fix but also the future.

> rather costly.  An alternative might be to implement a shrinker
> callback function for the journal_head slab cache.  Did you consider
> this?
Yes.
But the unused-list and counters are required by managing the shrink targets
("journal head") if we implement a shrinker.
I thought that comparatively big code changes were necessary for jbd
to accomplish it.

However I will try it.

Best Regards,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -Nurp linux-2.6.27-rc8.org/fs/buffer.c linux-2.6.27-rc8/fs/buffer.c
--- linux-2.6.27-rc8.org/fs/buffer.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/buffer.c	2008-10-17 08:59:37.000000000 +0900
@@ -3111,7 +3111,7 @@  failed:
 	return 0;
 }
 
-int try_to_free_buffers(struct page *page)
+static int __try_to_free_buffers(struct page *page)
 {
 	struct address_space * const mapping = page->mapping;
 	struct buffer_head *buffers_to_free = NULL;
@@ -3158,6 +3158,26 @@  out:
 	}
 	return ret;
 }
+
+int try_to_free_buffers(struct page *page)
+{
+	struct address_space * const mapping = page->mapping;
+	struct buffer_head *head, *bh;
+	const struct buffer_head_operations *bops;
+		
+	if (mapping == NULL) 
+		return __try_to_free_buffers(page);
+	bh = head = page_buffers(page);
+	do {
+		if ((bops = bh->b_ops) != NULL
+			&& !(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)))
+			&& atomic_read(&bh->b_count) > 0)
+			if (bops->release_buffer != NULL)
+				bops->release_buffer(bh);
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return __try_to_free_buffers(page);
+}
 EXPORT_SYMBOL(try_to_free_buffers);
 
 void block_sync_page(struct page *page)
@@ -3234,6 +3254,7 @@  struct buffer_head *alloc_buffer_head(gf
 {
 	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
 	if (ret) {
+		ret->b_ops = NULL;
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
 		get_cpu_var(bh_accounting).nr++;
 		recalc_bh_state();
diff -Nurp linux-2.6.27-rc8.org/fs/jbd/journal.c linux-2.6.27-rc8/fs/jbd/journal.c
--- linux-2.6.27-rc8.org/fs/jbd/journal.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/jbd/journal.c	2008-10-17 08:59:37.000000000 +0900
@@ -79,11 +79,16 @@  EXPORT_SYMBOL(journal_wipe);
 EXPORT_SYMBOL(journal_blocks_per_page);
 EXPORT_SYMBOL(journal_invalidatepage);
 EXPORT_SYMBOL(journal_try_to_free_buffers);
+EXPORT_SYMBOL(journal_try_to_free_one_buffer);
 EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
 
+static const struct buffer_head_operations jbd_bh_ops = {
+	.release_buffer = journal_try_to_free_one_buffer
+};
+
 /*
  * Helper function used to manage commit timeouts
  */
@@ -1749,6 +1754,7 @@  repeat:
 		set_buffer_jbd(bh);
 		bh->b_private = jh;
 		jh->b_bh = bh;
+		bh->b_ops = &jbd_bh_ops;
 		get_bh(bh);
 		BUFFER_TRACE(bh, "added journal_head");
 	}
@@ -1803,6 +1809,7 @@  static void __journal_remove_journal_hea
 						__func__);
 				jbd_free(jh->b_committed_data, bh->b_size);
 			}
+			bh->b_ops = NULL;
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
 			clear_buffer_jbd(bh);
diff -Nurp linux-2.6.27-rc8.org/fs/jbd/transaction.c linux-2.6.27-rc8/fs/jbd/transaction.c
--- linux-2.6.27-rc8.org/fs/jbd/transaction.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/jbd/transaction.c	2008-10-17 09:11:49.000000000 +0900
@@ -1648,6 +1648,45 @@  out:
 	return;
 }
 
+void journal_try_to_free_one_buffer(struct buffer_head *bh)
+{
+	journal_t *journal;
+	struct journal_head *jh;
+	const struct buffer_head_operations *bops;
+
+	jh = journal_grab_journal_head(bh);
+	if (!jh)
+		return;
+	if (!jbd_trylock_bh_state(bh)) {
+		journal_put_journal_head(jh);
+		return;	
+	}
+	if ((bops = bh->b_ops) == NULL || bops->release_buffer == NULL)
+		goto skip;
+	if (buffer_locked(bh) || buffer_dirty(bh))
+		goto skip;
+	if (jh->b_next_transaction != NULL)
+		goto skip;
+	journal = (jh->b_cp_transaction != NULL ? jh->b_cp_transaction->t_journal: NULL);
+	if (journal != NULL) {
+		if (spin_trylock(&journal->j_list_lock)) {
+			if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
+				/* written-back checkpointed metadata buffer */
+				if (jh->b_jlist == BJ_None) {
+					JBUFFER_TRACE(jh, "remove from checkpoint list");
+					__journal_remove_checkpoint(jh);
+					journal_remove_journal_head(bh);
+					__brelse(bh);
+				}
+			}
+			spin_unlock(&journal->j_list_lock);
+		}
+	}
+skip:
+	journal_put_journal_head(jh);
+	jbd_unlock_bh_state(bh);
+}
+
 /*
  * journal_try_to_free_buffers() could race with journal_commit_transaction()
  * The latter might still hold the a count on buffers when inspecting
diff -Nurp linux-2.6.27-rc8.org/include/linux/buffer_head.h linux-2.6.27-rc8/include/linux/buffer_head.h
--- linux-2.6.27-rc8.org/include/linux/buffer_head.h	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/include/linux/buffer_head.h	2008-10-17 08:59:37.000000000 +0900
@@ -43,6 +43,10 @@  enum bh_state_bits {
 
 #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
 
+struct buffer_head_operations {
+	void (*release_buffer)(struct buffer_head*);
+};
+
 struct page;
 struct buffer_head;
 struct address_space;
@@ -73,6 +77,9 @@  struct buffer_head {
 	struct address_space *b_assoc_map;	/* mapping this buffer is
 						   associated with */
 	atomic_t b_count;		/* users using this buffer_head */
+	const struct buffer_head_operations *b_ops; 
+					/* for a private buffer head
+					   to do special work */
 };
 
 /*
diff -Nurp linux-2.6.27-rc8.org/include/linux/jbd.h linux-2.6.27-rc8/include/linux/jbd.h
--- linux-2.6.27-rc8.org/include/linux/jbd.h	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/include/linux/jbd.h	2008-10-17 08:59:37.000000000 +0900
@@ -890,6 +890,7 @@  extern void	 journal_sync_buffer (struct
 extern void	 journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
+extern void	 journal_try_to_free_one_buffer(struct buffer_head *);
 extern int	 journal_stop(handle_t *);
 extern int	 journal_flush (journal_t *);
 extern void	 journal_lock_updates (journal_t *);