
[v2,2/3] jbd2: Add extra parameter in start_this_handle() to control allocation flags.

Message ID 1306563657-4334-1-git-send-email-mkatiyar@gmail.com
State New, archived

Commit Message

Manish Katiyar May 28, 2011, 6:20 a.m. UTC
changes from v1 -> v2 :
*) Update start_this_handle() to take an extra parameter specifying whether
to retry the allocation or not.
*) Add jbd allocation flags so callers can control the transaction allocation
behavior. Callers can pass JBD2_TOPLEVEL if the allocation needs to be done using GFP_KERNEL.

Pass extra flags in journal routines to specify whether it is OK for the
journal transaction allocation to fail. Passing JBD2_FAIL_OK means the caller
can tolerate journal start failures and can handle ENOMEM.

Update ocfs2 and ext4 routines to pass JBD2_NO_FAIL for the updated journal
interface by default, to retain the existing behavior.

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
---
 fs/ext4/ext4_jbd2.h   |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/transaction.c |   44 ++++++++++++++++----------------------------
 fs/ocfs2/journal.c    |    8 ++++----
 include/linux/jbd2.h  |   13 +++++++++----
 5 files changed, 31 insertions(+), 38 deletions(-)

Comments

Jan Kara May 31, 2011, 11:22 a.m. UTC | #1
On Fri 27-05-11 23:20:57, Manish Katiyar wrote:
> changes from v1 -> v2 :
> *) Update start_this_handle to take extra parameter to specify whether
> to retry the allocation or not.
> *) Added jbd allocation flags for callers to control the transaction allocation
> behavior. Callers can pass JBD2_TOPLEVEL if allocation needs to be done using GFP_KERNEL.
  The above changelog should be below where (*) is.

Also - this is mainly for Ted: I've looked at where JBD2_TOPLEVEL could
actually be enabled and the result is: pretty much nowhere.

The problem is that with ext4, we need i_mutex in io completion path to
end page writeback. So we cannot do GFP_KERNEL allocation whenever we hold
i_mutex because mm might wait in direct reclaim for IO to complete and that
cannot happen until we release i_mutex. And pretty much every write path in
ext4 holds i_mutex.

So JBD2_TOPLEVEL looks like a useless exercise to me and I just wouldn't do
it.

> Pass extra flags in journal routines to specify whether it is OK for the
> journal transaction allocation to fail. Passing JBD2_FAIL_OK means the caller
> can tolerate journal start failures and can handle ENOMEM.
> 
> Update ocfs2 and ext4 routines to pass JBD2_NO_FAIL for the updated journal
> interface by default, to retain the existing behavior.
> 
> Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
> ---
>  fs/ext4/ext4_jbd2.h   |    2 +-
>  fs/ext4/super.c       |    2 +-
>  fs/jbd2/transaction.c |   44 ++++++++++++++++----------------------------
>  fs/ocfs2/journal.c    |    8 ++++----
>  include/linux/jbd2.h  |   13 +++++++++----
>  5 files changed, 31 insertions(+), 38 deletions(-)
  (*) HERE


> +/* JBD2 transaction allocation flags */
> +#define JBD2_NO_FAIL	0x00000001
> +#define JBD2_FAIL_OK	0x00000002
> +#define JBD2_TOPLEVEL	0x00000004
> +
  I guess there's no need for JBD2_FAIL_OK - if JBD2_NO_FAIL is not set, we can
fail. Otherwise the patch looks OK.

								Honza
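
The review point above (that JBD2_FAIL_OK is redundant once JBD2_NO_FAIL exists) can be illustrated with a small userspace sketch. It is not kernel code: the `sketch_*` names and the `enum sketch_gfp` stand-ins for GFP_NOFS / GFP_KERNEL are hypothetical, and only the two flag values are copied from the patch.

```c
#include <stdbool.h>

/* Flag values copied from the patch; JBD2_FAIL_OK is deliberately left out. */
#define JBD2_NO_FAIL	0x00000001
#define JBD2_TOPLEVEL	0x00000004

/* Hypothetical userspace stand-ins for the kernel's GFP_NOFS / GFP_KERNEL. */
enum sketch_gfp { SKETCH_GFP_NOFS, SKETCH_GFP_KERNEL };

/* Same decision as __jbd_flags_to_mask() in the patch. */
static enum sketch_gfp sketch_flags_to_mask(int jbd_alloc_flags)
{
	return (jbd_alloc_flags & JBD2_TOPLEVEL) ? SKETCH_GFP_KERNEL
						 : SKETCH_GFP_NOFS;
}

/*
 * True when a failed allocation has to be retried. "Failure is OK" is
 * simply the absence of JBD2_NO_FAIL, so no second flag is needed.
 */
static bool sketch_must_retry(int jbd_alloc_flags)
{
	return (jbd_alloc_flags & JBD2_NO_FAIL) != 0;
}
```

With this shape, a caller that can handle ENOMEM would pass 0 (or only JBD2_TOPLEVEL) instead of a dedicated JBD2_FAIL_OK bit.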
Theodore Ts'o May 31, 2011, 10:27 p.m. UTC | #2
On Tue, May 31, 2011 at 01:22:53PM +0200, Jan Kara wrote:
> 
> The problem is that with ext4, we need i_mutex in io completion path to
> end page writeback. So we cannot do GFP_KERNEL allocation whenever we hold
> i_mutex because mm might wait in direct reclaim for IO to complete and that
> cannot happen until we release i_mutex. 

OK, maybe I'm being dense, but I'm not seeing it.  I see where we need
i_mutex on the ext4_da_writepages() codepath, but that's never used
for direct reclaim.  Direct reclaim only calls ext4_writepage(), and
that doesn't seem to try to grab i_mutex as near as I can tell.  Am I
missing something?

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara June 2, 2011, 9:54 a.m. UTC | #3
On Tue 31-05-11 18:27:20, Ted Tso wrote:
> On Tue, May 31, 2011 at 01:22:53PM +0200, Jan Kara wrote:
> > 
> > The problem is that with ext4, we need i_mutex in io completion path to
> > end page writeback. So we cannot do GFP_KERNEL allocation whenever we hold
> > i_mutex because mm might wait in direct reclaim for IO to complete and that
> > cannot happen until we release i_mutex. 
> 
> OK, maybe I'm being dense, but I'm not seeing it.  I see where we need
> i_mutex on the ext4_da_writepages() codepath, but that's never used
> for direct reclaim.  Direct reclaim only calls ext4_writepage(), and
> that doesn't seem to try to grab i_mutex as near as I can tell.  Am I
> missing something?
  What happens is that direct reclaim sometimes does
wait_on_page_writeback() (e.g. shrink_page_list()) or it explicitly waits
for NR_WRITEBACK statistics to go below some threshold
(throttle_vm_writeout()). And that is deadlockable if we hold i_mutex while
doing this because we may need i_mutex to actually move the page from
PageWriteback state...

As I'm saying this, I've realized ext4 has this problem also with
stable-pages patches because there we can wait for PageWriteback in
grab_cache_page_write_begin() when we also hold i_mutex. So I think we'll
have to come up with a way to convert unwritten extents without having to
hold i_mutex. That's going to be interesting.

								Honza
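
The dependency being debated here can be modelled as a tiny wait-for graph. This is only a sketch of the argument, not of any kernel code path: the task names and the `sketch_*` helpers are hypothetical. The writer holds i_mutex and, through direct reclaim, waits for writeback to finish; whether that deadlocks depends on whether the I/O completion path itself needs i_mutex before it can end writeback, which is exactly the point in dispute.

```c
#include <stdbool.h>

/* Hypothetical model: each task waits on at most one other task (-1 = none). */
#define SKETCH_NTASKS 2
enum { SKETCH_WRITER = 0, SKETCH_IO_COMPLETION = 1 };

/* Follow the wait-for chain from 'start'; coming back to 'start' is a deadlock. */
static bool sketch_deadlocks(const int waits_for[SKETCH_NTASKS], int start)
{
	int cur = waits_for[start];
	int steps;

	for (steps = 0; cur != -1 && steps < SKETCH_NTASKS; steps++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}

/*
 * The writer (holding i_mutex) always waits on I/O completion here. The
 * completion path waits on the writer only if it needs i_mutex to move
 * the page out of PageWriteback state.
 */
static bool sketch_scenario_deadlocks(bool completion_needs_i_mutex)
{
	int waits_for[SKETCH_NTASKS];

	waits_for[SKETCH_WRITER] = SKETCH_IO_COMPLETION;
	waits_for[SKETCH_IO_COMPLETION] =
		completion_needs_i_mutex ? SKETCH_WRITER : -1;
	return sketch_deadlocks(waits_for, SKETCH_WRITER);
}
```

In this model the cycle exists only when `completion_needs_i_mutex` is true, so the safety of a GFP_KERNEL allocation under i_mutex reduces to that one question.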
Manish Katiyar June 6, 2011, 12:12 a.m. UTC | #4
On Thu, Jun 2, 2011 at 2:54 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 31-05-11 18:27:20, Ted Tso wrote:
>> On Tue, May 31, 2011 at 01:22:53PM +0200, Jan Kara wrote:
>> >
>> > The problem is that with ext4, we need i_mutex in io completion path to
>> > end page writeback. So we cannot do GFP_KERNEL allocation whenever we hold
>> > i_mutex because mm might wait in direct reclaim for IO to complete and that
>> > cannot happen until we release i_mutex.
>>
>> OK, maybe I'm being dense, but I'm not seeing it.  I see where we need
>> i_mutex on the ext4_da_writepages() codepath, but that's never used
>> for direct reclaim.  Direct reclaim only calls ext4_writepage(), and
>> that doesn't seem to try to grab i_mutex as near as I can tell.  Am I
>> missing something?
>  What happens is that direct reclaim sometimes does
> wait_on_page_writeback() (e.g. shrink_page_list()) or it explicitly waits
> for NR_WRITEBACK statistics to go below some threshold
> (throttle_vm_writeout()). And that is deadlockable if we hold i_mutex while
> doing this because we may need i_mutex to actually move the page from
> PageWriteback state...
>
> As I'm saying this, I've realized ext4 has this problem also with
> stable-pages patches because there we can wait for PageWriteback in
> grab_cache_page_write_begin() when we also hold i_mutex. So I think we'll
> have to come up with a way to convert unwritten extents without having to
> hold i_mutex. That's going to be interesting.

Hi Jan/Ted,

Does that mean I should remove the whole JBD2_TOPLEVEL thing from my
revised patch ? Or should I fix it as per your feedback in
the other patch ?
Theodore Ts'o June 6, 2011, 3:21 a.m. UTC | #5
On Thu, Jun 02, 2011 at 11:54:24AM +0200, Jan Kara wrote:
>   What happens is that direct reclaim sometimes does
> wait_on_page_writeback() (e.g. shrink_page_list()) or it explicitly waits
> for NR_WRITEBACK statistics to go below some threshold
> (throttle_vm_writeout()). And that is deadlockable if we hold i_mutex while
> doing this because we may need i_mutex to actually move the page from
> PageWriteback state...

We don't actually call set_page_writeback() until right before we
submit the page for writeback.  And we convert the unwritten extents
in a workqueue, which gets submitted after we call
end_page_writeback().  So I'm still not seeing a problem; sorry if I'm
being dense!

			    	      	  - Ted

> As I'm saying this, I've realized ext4 has this problem also with
> stable-pages patches because there we can wait for PageWriteback in
> grab_cache_page_write_begin() when we also hold i_mutex. So I think we'll
> have to come up with a way to convert unwritten extents without having to
> hold i_mutex. That's going to be interesting.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
Theodore Ts'o June 8, 2011, 2:10 p.m. UTC | #6
Ping?

				- Ted

On Sun, Jun 05, 2011 at 11:21:13PM -0400, Ted Ts'o wrote:
> On Thu, Jun 02, 2011 at 11:54:24AM +0200, Jan Kara wrote:
> >   What happens is that direct reclaim sometimes does
> > wait_on_page_writeback() (e.g. shrink_page_list()) or it explicitly waits
> > for NR_WRITEBACK statistics to go below some threshold
> > (throttle_vm_writeout()). And that is deadlockable if we hold i_mutex while
> > doing this because we may need i_mutex to actually move the page from
> > PageWriteback state...
> 
> We don't actually call set_page_writeback() until right before we
> submit the page for writeback.  And we convert the unwritten extents
> in a workqueue, which gets submitted after we call
> end_page_writeback().  So I'm still not seeing a problem; sorry if I'm
> being dense!
> 
> 			    	      	  - Ted
> 
> > As I'm saying this, I've realized ext4 has this problem also with
> > stable-pages patches because there we can wait for PageWriteback in
> > grab_cache_page_write_begin() when we also hold i_mutex. So I think we'll
> > have to come up with a way to convert unwritten extents without having to
> > hold i_mutex. That's going to be interesting.
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
Manish Katiyar June 17, 2011, 6:32 a.m. UTC | #7
On Wed, Jun 8, 2011 at 7:10 AM, Ted Ts'o <tytso@mit.edu> wrote:
> Ping?

Hi Jan,

Is there anything required from my side on this patch ? From your
other post it seems that holding i_mutex isn't a problem. Please
advise.
Jan Kara June 20, 2011, 2:32 p.m. UTC | #8
Hi,

On Thu 16-06-11 23:32:12, Manish Katiyar wrote:
> On Wed, Jun 8, 2011 at 7:10 AM, Ted Ts'o <tytso@mit.edu> wrote:
> > Ping?
> 
> Is there anything required from my side on this patch ? From your
> other post it seems that holding i_mutex isn't a problem. Please
> advise.
  OK, after my discussion with Ted today, i_mutex really shouldn't be an
issue so your patch should be fine. I just suggest you resend it to Ted to
remind yourself :).

								Honza
Manish Katiyar June 20, 2011, 2:40 p.m. UTC | #9
On Mon, Jun 20, 2011 at 7:32 AM, Jan Kara <jack@suse.cz> wrote:
>  Hi,
>
> On Thu 16-06-11 23:32:12, Manish Katiyar wrote:
>> On Wed, Jun 8, 2011 at 7:10 AM, Ted Ts'o <tytso@mit.edu> wrote:
>> > Ping?
>>
>> Is there anything required from my side on this patch ? From your
>> other post it seems that holding i_mutex isn't a problem. Please
>> advise.
>  OK, after my discussion with Ted today, i_mutex really shouldn't be an
> issue so your patch should be fine. I just suggest you resend it to Ted to
> remind yourself :).

Thanks a lot Jan, Will resend the series to Ted.
Theodore Ts'o June 20, 2011, 5:57 p.m. UTC | #10
On Mon, Jun 20, 2011 at 07:40:53AM -0700, Manish Katiyar wrote:
> >  OK, after my discussion with Ted today, i_mutex really shouldn't be an
> > issue so your patch should be fine. I just suggest you resend it to Ted to
> > remind yourself :).
> 
> Thanks a lot Jan, Will resend the series to Ted.

Thanks, but don't bother resending it unless you have any changes to
update.

I do keep track of patches using patchwork[1], so I haven't forgotten.  :-)

[1] http://patchwork.ozlabs.org/project/linux-ext4/list/

I'll be starting to go through patches for the next merge window this
week.

							- Ted


Manish Katiyar June 20, 2011, 6:08 p.m. UTC | #11
On Mon, Jun 20, 2011 at 10:57 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Mon, Jun 20, 2011 at 07:40:53AM -0700, Manish Katiyar wrote:
>> >  OK, after my discussion with Ted today, i_mutex really shouldn't be an
>> > issue so your patch should be fine. I just suggest you resend it to Ted to
>> > remind yourself :).
>>
>> Thanks a lot Jan, Will resend the series to Ted.
>
> Thanks, but don't bother resending it unless you have any changes to
> update.

Thanks Ted, ... I don't have any other changes.

Patch

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index bb85757..14e6599 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -220,7 +220,7 @@  static inline int ext4_journal_extend(handle_t *handle, int nblocks)
 static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_restart(handle, nblocks);
+		return jbd2_journal_restart(handle, nblocks, JBD2_NO_FAIL);
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d9937df..aa842f3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -295,7 +295,7 @@  handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
 		ext4_abort(sb, "Detected aborted journal");
 		return ERR_PTR(-EROFS);
 	}
-	return jbd2_journal_start(journal, nblocks);
+	return jbd2_journal_start(journal, nblocks, JBD2_NO_FAIL);
 }
 
 /*
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 3eec82d..7f53589 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -106,6 +106,11 @@  static inline void update_t_max_wait(transaction_t *transaction,
 #endif
 }
 
+static inline int __jbd_flags_to_mask(int jbd_alloc_flags)
+{
+	return (jbd_alloc_flags & JBD2_TOPLEVEL) ? GFP_KERNEL : GFP_NOFS;
+}
+
 /*
  * start_this_handle: Given a handle, deal with any locking or stalling
  * needed to make sure that there is enough journal space for the handle
@@ -114,13 +119,14 @@  static inline void update_t_max_wait(transaction_t *transaction,
  */
 
 static int start_this_handle(journal_t *journal, handle_t *handle,
-			     int gfp_mask)
+			     int jbd_alloc_flags)
 {
 	transaction_t	*transaction, *new_transaction = NULL;
 	tid_t		tid;
 	int		needed, need_to_start;
 	int		nblocks = handle->h_buffer_credits;
 	unsigned long ts = jiffies;
+	int gfp_mask = __jbd_flags_to_mask(jbd_alloc_flags);
 
 	if (nblocks > journal->j_max_transaction_buffers) {
 		printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n",
@@ -133,14 +139,7 @@  alloc_transaction:
 	if (!journal->j_running_transaction) {
 		new_transaction = kzalloc(sizeof(*new_transaction), gfp_mask);
 		if (!new_transaction) {
-			/*
-			 * If __GFP_FS is not present, then we may be
-			 * being called from inside the fs writeback
-			 * layer, so we MUST NOT fail.  Since
-			 * __GFP_NOFAIL is going away, we will arrange
-			 * to retry the allocation ourselves.
-			 */
-			if ((gfp_mask & __GFP_FS) == 0) {
+			if (jbd_alloc_flags & JBD2_NO_FAIL) {
 				congestion_wait(BLK_RW_ASYNC, HZ/50);
 				goto alloc_transaction;
 			}
@@ -308,6 +307,7 @@  static handle_t *new_handle(int nblocks)
  * handle_t *jbd2_journal_start() - Obtain a new handle.
  * @journal: Journal to start transaction on.
  * @nblocks: number of block buffer we might modify
+ * @jbd_alloc_flags:  jbd allocation flags for transaction.
  *
  * We make sure that the transaction can guarantee at least nblocks of
  * modified buffers in the log.  We block until the log can guarantee
@@ -319,7 +319,8 @@  static handle_t *new_handle(int nblocks)
  * Return a pointer to a newly allocated handle, or an ERR_PTR() value
  * on failure.
  */
-handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
+handle_t *jbd2_journal_start(journal_t *journal,
+			     int nblocks, int jbd_alloc_flags)
 {
 	handle_t *handle = journal_current_handle();
 	int err;
@@ -339,7 +340,7 @@  handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
 
 	current->journal_info = handle;
 
-	err = start_this_handle(journal, handle, gfp_mask);
+	err = start_this_handle(journal, handle, jbd_alloc_flags);
 	if (err < 0) {
 		jbd2_free_handle(handle);
 		current->journal_info = NULL;
@@ -347,13 +348,6 @@  handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
 	}
 	return handle;
 }
-EXPORT_SYMBOL(jbd2__journal_start);
-
-
-handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
-{
-	return jbd2__journal_start(journal, nblocks, GFP_NOFS);
-}
 EXPORT_SYMBOL(jbd2_journal_start);
 
 
@@ -442,7 +436,8 @@  out:
  * transaction capabable of guaranteeing the requested number of
  * credits.
  */
-int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
+int jbd2_journal_restart(handle_t *handle,
+			 int nblocks, int jbd_alloc_flags)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
@@ -478,16 +473,9 @@  int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
 
 	lock_map_release(&handle->h_lockdep_map);
 	handle->h_buffer_credits = nblocks;
-	ret = start_this_handle(journal, handle, gfp_mask);
+	ret = start_this_handle(journal, handle, jbd_alloc_flags);
 	return ret;
 }
-EXPORT_SYMBOL(jbd2__journal_restart);
-
-
-int jbd2_journal_restart(handle_t *handle, int nblocks)
-{
-	return jbd2__journal_restart(handle, nblocks, GFP_NOFS);
-}
 EXPORT_SYMBOL(jbd2_journal_restart);
 
 /**
@@ -1437,7 +1425,7 @@  int jbd2_journal_force_commit(journal_t *journal)
 	handle_t *handle;
 	int ret;
 
-	handle = jbd2_journal_start(journal, 1);
+	handle = jbd2_journal_start(journal, 1, JBD2_NO_FAIL);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 	} else {
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index b141a44..a784a64 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -353,11 +353,11 @@  handle_t *ocfs2_start_trans(struct ocfs2_super *osb, int max_buffs)
 
 	/* Nested transaction? Just return the handle... */
 	if (journal_current_handle())
-		return jbd2_journal_start(journal, max_buffs);
+		return jbd2_journal_start(journal, max_buffs, JBD2_NO_FAIL);
 
 	down_read(&osb->journal->j_trans_barrier);
 
-	handle = jbd2_journal_start(journal, max_buffs);
+	handle = jbd2_journal_start(journal, max_buffs, JBD2_NO_FAIL);
 	if (IS_ERR(handle)) {
 		up_read(&osb->journal->j_trans_barrier);
 
@@ -437,8 +437,8 @@  int ocfs2_extend_trans(handle_t *handle, int nblocks)
 
 	if (status > 0) {
 		trace_ocfs2_extend_trans_restart(old_nblocks + nblocks);
-		status = jbd2_journal_restart(handle,
-					      old_nblocks + nblocks);
+		status = jbd2_journal_restart(handle, old_nblocks + nblocks,
+					      JBD2_NO_FAIL);
 		if (status < 0) {
 			mlog_errno(status);
 			goto bail;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 4ecb7b1..c11181b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1106,10 +1106,10 @@  static inline handle_t *journal_current_handle(void)
  * Register buffer modifications against the current transaction.
  */
 
-extern handle_t *jbd2_journal_start(journal_t *, int nblocks);
-extern handle_t *jbd2__journal_start(journal_t *, int nblocks, int gfp_mask);
-extern int	 jbd2_journal_restart(handle_t *, int nblocks);
-extern int	 jbd2__journal_restart(handle_t *, int nblocks, int gfp_mask);
+extern handle_t *jbd2_journal_start(journal_t *,
+				    int nblocks, int jbd_alloc_flags);
+extern int	 jbd2_journal_restart(handle_t *,
+				      int nblocks, int jbd_alloc_flags);
 extern int	 jbd2_journal_extend (handle_t *, int nblocks);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
@@ -1322,6 +1322,11 @@  static inline int jbd_space_needed(journal_t *journal)
 
 extern int jbd_blocks_per_page(struct inode *inode);
 
+/* JBD2 transaction allocation flags */
+#define JBD2_NO_FAIL	0x00000001
+#define JBD2_FAIL_OK	0x00000002
+#define JBD2_TOPLEVEL	0x00000004
+
 #ifdef __KERNEL__
 
 #define buffer_trace_init(bh)	do {} while (0)
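
For readers who want to see the retry behaviour in isolation, the following is a minimal userspace sketch of the allocation loop that the patch changes in start_this_handle(). It assumes that the non-retry path reports -ENOMEM, which the hunk above does not show. `struct sketch_allocator`, `sketch_alloc()` and `sketch_alloc_transaction()` are hypothetical helpers that simulate a fixed number of allocation failures; the real code calls kzalloc() and waits in congestion_wait() between attempts.

```c
#include <errno.h>
#include <stdlib.h>

#define JBD2_NO_FAIL	0x00000001	/* value copied from the patch */

/* Hypothetical allocator that fails a fixed number of times, then succeeds. */
struct sketch_allocator {
	int failures_left;
	int attempts;
};

static void *sketch_alloc(struct sketch_allocator *a, size_t size)
{
	a->attempts++;
	if (a->failures_left > 0) {
		a->failures_left--;
		return NULL;
	}
	return calloc(1, size);
}

/*
 * Returns 0 and stores the allocation in *out. If the allocation fails and
 * JBD2_NO_FAIL is not set, returns -ENOMEM instead of retrying.
 */
static int sketch_alloc_transaction(struct sketch_allocator *a,
				    int jbd_alloc_flags, void **out)
{
	for (;;) {
		*out = sketch_alloc(a, 64);
		if (*out)
			return 0;
		if (!(jbd_alloc_flags & JBD2_NO_FAIL))
			return -ENOMEM;
		/* The kernel code waits in congestion_wait() before retrying. */
	}
}
```

A caller passing JBD2_NO_FAIL never sees an error from this path, while a caller passing 0 has to check for -ENOMEM after the first failed attempt.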