[1/3] add releasepage hooks to block devices which can be used by file systems

Message ID 1230995358-24013-2-git-send-email-tytso@mit.edu
State Accepted, archived

Commit Message

Theodore Ts'o Jan. 3, 2009, 3:09 p.m. UTC
Implement blkdev_releasepage() to release the buffer_heads and pages
after we release private data belonging to a mounted filesystem.

Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/block_dev.c     |   15 +++++++++++++++
 fs/super.c         |    2 ++
 include/linux/fs.h |    2 ++
 3 files changed, 19 insertions(+), 0 deletions(-)
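The dispatch this patch adds can be modelled in userspace as a rough sketch. Every type and helper below (struct page, struct super_block, model_try_to_free_buffers(), model_fs_refuse()) is a simplified stand-in invented for illustration, not the kernel definition. Only the "call the filesystem hook if one is registered, otherwise fall back" logic follows the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; these are NOT the kernel definitions. */
typedef unsigned int gfp_t;
struct page { int has_buffers; };
struct super_block;
struct super_operations {
	int (*bdev_try_to_free_page)(struct super_block *, struct page *, gfp_t);
};
struct super_block { const struct super_operations *s_op; };

/* Stand-in for try_to_free_buffers(): drops the buffers and reports success. */
static int model_try_to_free_buffers(struct page *page)
{
	page->has_buffers = 0;
	return 1;
}

/*
 * Mirrors the dispatch in blkdev_releasepage(). super is NULL when no
 * filesystem is mounted on the block device.
 */
static int model_releasepage(struct super_block *super, struct page *page,
			     gfp_t wait)
{
	assert(page != NULL);

	if (super && super->s_op->bdev_try_to_free_page)
		return super->s_op->bdev_try_to_free_page(super, page, wait);

	return model_try_to_free_buffers(page);
}

/* Hypothetical filesystem hook that refuses to release the page. */
static int model_fs_refuse(struct super_block *sb, struct page *page,
			   gfp_t wait)
{
	(void)sb; (void)page; (void)wait;
	return 0;
}
```

With no superblock attached, model_releasepage() takes the generic path. With a superblock whose s_op points at model_fs_refuse(), the filesystem gets the final say and the page keeps its buffers.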

Comments

Toshiyuki Okajima Jan. 5, 2009, 8:16 a.m. UTC | #1
Hi Ted-san,

Theodore Ts'o wrote:
> Implement blkdev_releasepage() to release the buffer_heads and pages
> after we release private data belonging to a mounted filesystem.
>
> Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/block_dev.c     |   15 +++++++++++++++
>  fs/super.c         |    2 ++
>  include/linux/fs.h |    2 ++
>  3 files changed, 19 insertions(+), 0 deletions(-)

I was testing whether a kernel with your new patch applied runs without
trouble. Unfortunately, it hung, and I am now investigating the root cause.
After a brief investigation, I think the call to log_wait_commit() from
journal_try_to_free_buffers() may be causing it.
I will examine it in more detail.

Additional Info(Crash dump):
Backtrace:
crash> bt
PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
 #0 [f707dcbc] schedule at c06346a3
 #1 [f707dd34] log_wait_commit at f80904c1
 #2 [f707dd70] journal_try_to_free_buffers at f808c81f
 #3 [f707dd94] blkdev_releasepage at c04916cc
 #4 [f707dda4] try_to_release_page at c04526b1
 #5 [f707ddb0] shrink_page_list at c045b3d1
 #6 [f707de50] shrink_list at c045b72e
 #7 [f707def0] shrink_zone at c045bbc6
 #8 [f707df40] kswapd at c045c12c
 #9 [f707dfd8] kthread at c043612c
#10 [f707dfe4] kernel_thread_helper at c04045e1

Sleep time:
crash> ps -l | head -n 1
[5360808577593]  PID: 7995   TASK: c98b76d0  CPU: 1   COMMAND: "crtfile"
crash> ps -l 260
[3727586943566]  PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
crash> p (5360808577593 - 3727586943566)/1000000000
$4 = 1633
 ======> 1633 seconds
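
The same arithmetic as a small C helper, for reference. It assumes the bracketed values printed by "ps -l" are nanosecond timestamps, which is what the division by 1000000000 above implies. The helper name and the constant are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Whole seconds between two nanosecond timestamps; later_ns must not precede earlier_ns. */
static uint64_t elapsed_seconds(uint64_t later_ns, uint64_t earlier_ns)
{
	assert(later_ns >= earlier_ns);
	return (later_ns - earlier_ns) / SKETCH_NSEC_PER_SEC;
}
```

Called with 5360808577593 and 3727586943566, it yields 1633, so kswapd0 had not run for roughly 27 minutes.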

Best Regards,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 5, 2009, 4:05 p.m. UTC | #2
On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote:
> 
> I was confirming whether the kernel to which your new patch is
> applied can run without trouble. But unfortunately, I got a hangup
> problem. Now I am investigating the root cause.  After I
> investigated it for a little time, I think calling log_wait_commit()
> from journal_try_to_free_buffers() can cause it.  

Sounds like a deadlock caused by the fact that we're no longer masking
__GFP_WAIT, probably on journal->j_wait_done_commit.  Presumably the
system came under pressure during a commit operation, which makes
sense, and so we ended up with a deadlock between kjournald and
kswapd.  The fix is pretty simple; we just need to mask out the
__GFP_WAIT in the filesystem-specific callback, since this is a
restriction imposed by the filesystem's use of the jbd/jbd2 layer.
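
A minimal userspace sketch of that idea, not the actual follow-up patch: the flag values and every sketch_* name are hypothetical, and the journal routine is reduced to a stub that reports whether it would have had to sleep.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Hypothetical bit values, for illustration only. */
#define SKETCH_GFP_WAIT 0x10u
#define SKETCH_GFP_IO   0x40u
#define SKETCH_GFP_FS   0x80u

/* Last mask handed to the journal stand-in, kept so callers can inspect it. */
static gfp_t sketch_last_mask;

/*
 * Stand-in for the journal layer's "try to free buffers" routine.
 * Returns 0 if it would have to sleep waiting for a commit, 1 otherwise.
 */
static int sketch_journal_try_to_free(gfp_t gfp_mask)
{
	sketch_last_mask = gfp_mask;
	return (gfp_mask & SKETCH_GFP_WAIT) ? 0 : 1;
}

/* Hypothetical filesystem hook: clear the wait bit before calling down. */
static int sketch_fs_try_to_free_page(gfp_t wait)
{
	int ret = sketch_journal_try_to_free(wait & ~SKETCH_GFP_WAIT);

	/* The journal stand-in must never have seen the wait bit. */
	assert((sketch_last_mask & SKETCH_GFP_WAIT) == 0);
	return ret;
}
```

The point of masking in the hook rather than in the generic block layer is that the restriction belongs to the journalling filesystem, so other users of the hook keep the caller's original mask.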

	    	       	   		       - Ted
Toshiyuki Okajima Jan. 6, 2009, 4:07 a.m. UTC | #3
Ted-san,

Theodore Tso wrote:
 > On Mon, Jan 05, 2009 at 05:16:08PM +0900, Toshiyuki Okajima wrote:
 > > >
 > > > I was confirming whether the kernel to which your new patch is
 > > > applied can run without trouble. But unfortunately, I got a hangup
 > > > problem. Now I am investigating the root cause.  After I
 > > > investigated it for a little time, I think calling log_wait_commit()
 > > > from journal_try_to_free_buffers() can cause it.
 >
 > Sounds like a deadlock caused by the fact that we're no longer masking
 > __GFP_WAIT, probably on journal->j_wait_done_commit.  Presumably the
 > system came under pressure during a commit operation, which makes
 > sense, and so we ended up with a deadlock between kjournald and
 > kswapd.  The fix is pretty simple; we just need to mask out the
 > __GFP_WAIT in the filesystem-specific callback, since this is a
 > restriction imposed by the filesystem's use of the jbd/jbd2 layer.

Your analysis is correct.
I have investigated in detail and now understand the root cause.

The deadlock was caused by the following two processes:

(1) A process reclaiming memory
A memory reclaim path started by the memory allocator calls
journal_try_to_free_buffers(). To free more memory, it then calls
log_wait_commit() and waits for the committing transaction to finish.

(2) The kjournald process
kjournald is woken when Process (1) calls log_wait_commit().
It then calls journal_commit_transaction() to write all data buffers
to the filesystem and all metadata buffers to the journal.
Metadata buffers are written by journal_write_metadata_buffer(), which itself
needs a new buffer_head (more memory) in order to copy a buffer_head.
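
The circular wait between (1) and (2) can be expressed as a tiny wait-for graph. This is only a toy model under a simplifying assumption: each actor blocks on at most one other actor, and "kjournald waits for memory" is treated as "kjournald waits for the reclaimer". The enum and function names are invented for the sketch.

```c
#include <assert.h>

enum actor { RECLAIMER, KJOURNALD, NR_ACTORS, NOBODY = NR_ACTORS };

/*
 * waits_for[a] names the actor that a is blocked on, or NOBODY.
 * Returns 1 if following the chain from start leads back to start.
 */
static int has_wait_cycle(const enum actor waits_for[NR_ACTORS],
			  enum actor start)
{
	enum actor cur = start;
	int steps;

	assert(start < NR_ACTORS);

	for (steps = 0; steps < NR_ACTORS; steps++) {
		cur = waits_for[cur];
		if (cur == NOBODY)
			return 0;	/* chain ends: someone can make progress */
		if (cur == start)
			return 1;	/* came back around: deadlock */
	}
	return 0;
}
```

With { KJOURNALD, RECLAIMER } the reclaimer waits for the commit and kjournald waits for reclaim, so a cycle is reported. With { NOBODY, RECLAIMER }, modelling a reclaimer that no longer sleeps in log_wait_commit(), the cycle disappears.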

Detailed Information:
Process (1):
crash> bt 260
PID: 260    TASK: f71076d0  CPU: 1   COMMAND: "kswapd0"
  #0 [f707dcbc] schedule at c06346a3
  #1 [f707dd34] log_wait_commit at f80904c1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -> It wakes kjournald and
                                                waits for the commit.
  #2 [f707dd70] journal_try_to_free_buffers at f808c81f
  #3 [f707dd94] blkdev_releasepage at c04916cc
  #4 [f707dda4] try_to_release_page at c04526b1
  #5 [f707ddb0] shrink_page_list at c045b3d1
  #6 [f707de50] shrink_list at c045b72e
  #7 [f707def0] shrink_zone at c045bbc6
  #8 [f707df40] kswapd at c045c12c
  #9 [f707dfd8] kthread at c043612c
#10 [f707dfe4] kernel_thread_helper at c04045e1

journal structure: 0xccab1e00

Process (2) [kjournald]:
PID: 3170   TASK: f717b240  CPU: 1   COMMAND: "kjournald"
  #0 [c42b4cf4] schedule at c06346a3
  #1 [c42b4d6c] schedule_timeout at c06349ef
  #2 [c42b4d90] io_schedule_timeout at c0633e0f
  #3 [c42b4da0] congestion_wait at c045d7ee
  #4 [c42b4dc8] try_to_free_pages at c045c82a
  #5 [c42b4e2c] __alloc_pages_internal at c04579fc
  #6 [c42b4e70] cache_alloc_refill at c0471235
  #7 [c42b4ec0] kmem_cache_alloc at c0470fa8
  #8 [c42b4ed4] alloc_buffer_head at c048c06b
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-> It tries to allocate a buffer_head
                                               but cannot, because the memory
                                               reclaimers (including process (1))
                                               cannot make progress.
  #9 [c42b4edc] journal_write_metadata_buffer at f8090eb6
#10 [c42b4f10] journal_commit_transaction at f808df80
#11 [c42b4f98] kjournald at f809089d
#12 [c42b4fd8] kthread at c043612c
#13 [c42b4fe4] kernel_thread_helper at c04045e1

journal structure: 0xccab1e00

Additional Information:
kswapd is not the only process that can trigger the deadlock:
[1]
PID: 1800   TASK: f7379b60  CPU: 1   COMMAND: "rsyslogd"
  #0 [f61c3bfc] schedule at c06346a3
  #1 [f61c3c74] log_wait_commit at f80904c1
  #2 [f61c3cb0] journal_try_to_free_buffers at f808c81f
  #3 [f61c3cd4] blkdev_releasepage at c04916cc
  #4 [f61c3ce4] try_to_release_page at c04526b1
  #5 [f61c3cf0] shrink_page_list at c045b3d1
  #6 [f61c3d90] shrink_list at c045b72e
  #7 [f61c3e30] shrink_zone at c045bbc6
  #8 [f61c3e80] try_to_free_pages at c045c787
  #9 [f61c3ee4] __alloc_pages_internal at c04579fc
#10 [f61c3f28] __get_free_pages at c0457bac
#11 [f61c3f30] copy_process at c0425823
#12 [f61c3f68] do_fork at c042674b
#13 [f61c3fa4] sys_clone at c0402399
#14 [f61c3fb4] system_call at c0403893
     EAX: ffffffda  EBX: 003d0f00  ECX: b7fcd4b4  EDX: b7fcdbd8
     DS:  007b      ESI: b6fcb16c  ES:  007b      EDI: b7fcdbd8
     SS:  007b      ESP: b6fcb100  EBP: b6fcb198
     CS:  0073      EIP: 00d271f8  ERR: 00000078  EFLAGS: 00000296

[2]
PID: 1990   TASK: f70c6000  CPU: 0   COMMAND: "pcscd"
  #0 [f6078be0] schedule at c06346a3
  #1 [f6078c58] log_wait_commit at f80904c1
  #2 [f6078c94] journal_try_to_free_buffers at f808c81f
  #3 [f6078cb8] blkdev_releasepage at c04916cc
  #4 [f6078cc8] try_to_release_page at c04526b1
  #5 [f6078cd4] shrink_page_list at c045b3d1
  #6 [f6078d74] shrink_list at c045b72e
  #7 [f6078e14] shrink_zone at c045bbc6
  #8 [f6078e64] try_to_free_pages at c045c787
  #9 [f6078ec8] __alloc_pages_internal at c04579fc
#10 [f6078f0c] cache_alloc_refill at c0471235
#11 [f6078f5c] kmem_cache_alloc at c0470fa8
#12 [f6078f70] getname at c047b71c
#13 [f6078f88] do_sys_open at c04729d2
#14 [f6078fa0] sys_open at c0472ab6
#15 [f6078fb4] ia32_sysenter_target at c04037da
     EAX: 00000005  EBX: 006a2700  ECX: 00098800  EDX: 00000000
     DS:  007b      ESI: 006a2700  ES:  007b      EDI: 00000000
     SS:  007b      ESP: b801d0f8  EBP: b801d188
     CS:  0073      EIP: b803f424  ERR: 00000005  EFLAGS: 00000202
...

Regards,
Toshiyuki Okajima

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 99e0ae1..ef7d795 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1219,6 +1219,20 @@  static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+/*
+ * Try to release a page associated with block device when the system
+ * is under memory pressure.
+ */
+static int blkdev_releasepage(struct page *page, gfp_t wait)
+{
+	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+
+	if (super && super->s_op->bdev_try_to_free_page)
+		return super->s_op->bdev_try_to_free_page(super, page, wait);
+
+	return try_to_free_buffers(page);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
@@ -1226,6 +1240,7 @@  static const struct address_space_operations def_blk_aops = {
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
+	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 };
 
diff --git a/fs/super.c b/fs/super.c
index 400a760..d7e200d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -800,6 +800,7 @@  int get_sb_bdev(struct file_system_type *fs_type,
 		}
 
 		s->s_flags |= MS_ACTIVE;
+		bdev->bd_super = s;
 	}
 
 	return simple_set_mnt(mnt, s);
@@ -819,6 +820,7 @@  void kill_block_super(struct super_block *sb)
 	struct block_device *bdev = sb->s_bdev;
 	fmode_t mode = sb->s_mode;
 
+	bdev->bd_super = NULL;
 	generic_shutdown_super(sb);
 	sync_blockdev(bdev);
 	close_bdev_exclusive(bdev, mode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..911f812 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -553,6 +553,7 @@  struct address_space {
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */
+	struct super_block *	bd_super;
 	int			bd_openers;
 	struct mutex		bd_mutex;	/* open/close mutex */
 	struct semaphore	bd_mount_sem;
@@ -1375,6 +1376,7 @@  struct super_operations {
 	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
+	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
 };
 
 /*