Patchwork [2/3] jbd2 : Fix journal start by passing a parameter to specify if the caller can deal with ENOMEM

login
register
mail settings
Submitter Manish Katiyar
Date May 25, 2011, 7:26 a.m.
Message ID <4DDCAF18.8030809@gmail.com>
Download mbox | patch
Permalink /patch/97306/
State Changes Requested
Headers show

Comments

Manish Katiyar - May 25, 2011, 7:26 a.m.
Pass extra bool parameter in journal routines to specify if its ok to
fail in the journal start. Passing true means caller is
ok with journal start failures and can handle ENOMEM. Update ocfs2 and ext4
routines to pass false for the updated journal interface by default to
retain the existing behavior.

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h   |    2 +-
 fs/ext4/super.c       |    2 +-
 fs/jbd2/transaction.c |   26 +++++++-------------------
 fs/ocfs2/journal.c    |    6 +++---
 include/linux/jbd2.h  |    6 ++----
 5 files changed, 14 insertions(+), 28 deletions(-)
Jan Kara - May 25, 2011, 7:44 a.m.
On Wed 25-05-11 00:26:16, Manish Katiyar wrote:
> Pass extra bool parameter in journal routines to specify if its ok to
> fail in the journal start. Passing true means caller is
> ok with journal start failures and can handle ENOMEM. Update ocfs2 and ext4
> routines to pass false for the updated journal interface by default to
> retain the existing behavior.
> 
> Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
> Acked-by: Jan Kara <jack@suse.cz>
...
> @@ -318,7 +320,7 @@ static handle_t *new_handle(int nblocks)
>   *
>   * Return a pointer to a newly allocated handle, or NULL on failure
>   */
> -handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
> +handle_t *jbd2_journal_start(journal_t *journal, int nblocks, bool errok)
>  {
>  	handle_t *handle = journal_current_handle();
>  	int err;
> @@ -338,7 +340,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
> 
>  	current->journal_info = handle;
> 
> -	err = start_this_handle(journal, handle, gfp_mask);
> +	err = start_this_handle(journal, handle, GFP_NOFS);
>  	if (err < 0) {
>  		jbd2_free_handle(handle);
>  		current->journal_info = NULL;
  This patch is OK as such but as I wrote in a reply to a later patch in
the series, you also need to pass 'errok' to start_this_handle() instead of
gfp_mask as well. Then in start_this_handle() you can use 'errok' to decide
whether we should retry the allocation or not instead of gfp_mask. And just
call kzalloc in start_this_handle() with GFP_NOFS... Thanks.

								Honza
Manish Katiyar - May 25, 2011, 7:47 a.m.
On 05/25/2011 12:44 AM, Jan Kara wrote:
> This patch is OK as such but as I wrote in a reply to a later patch in
> the series, you also need to pass 'errok' to start_this_handle() instead of
> gfp_mask as well. Then in start_this_handle() you can use 'errok' to decide
> whether we should retry the allocation or not instead of gfp_mask. And just
> call kzalloc in start_this_handle() with GFP_NOFS... Thanks.

Hi Jan,

   ok.. I will do it as a separate patch.
Jan Kara - May 25, 2011, 8:13 a.m.
On Wed 25-05-11 00:47:06, Manish Katiyar wrote:
> On 05/25/2011 12:44 AM, Jan Kara wrote:
> > This patch is OK as such but as I wrote in a reply to a later patch in
> > the series, you also need to pass 'errok' to start_this_handle() instead of
> > gfp_mask as well. Then in start_this_handle() you can use 'errok' to decide
> > whether we should retry the allocation or not instead of gfp_mask. And just
> > call kzalloc in start_this_handle() with GFP_NOFS... Thanks.
> 
> Hi Jan,
> 
>    ok.. I will do it as a separate patch.
  Well, patch 2/3 does not really make too much sense without it (errok
parameter isn't used) so there's no reason to do it as a separate patch.
Just add it to this patch please.

Also Ted asked you to send patches in one mail thread - that is the second
and further patches are a reply to the first patch. Git makes this easy to
do via 'git send-email' but it's not so easy to use for development when
you need to change patches in the middle of the series. But TopGit
(http://repo.or.cz/w/topgit.git?a=blob;f=README) can make this task easier
for you (or I personaly just use two shell scripts on top of git itself).

Finally, the patch header like:
Hi Jan,

Posting this as a new mail for Ted's convenience. I have incorporated your
feedback from the last patch
(http://comments.gmane.org/gmane.comp.file-systems.ext4/24547) and updated
the patch.



a) Add a new API ext4_journal_start_failok for callers who are
  ok to handle ENOMEM cases.
b) Update ext4 routines with the new API which can fail.


Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
---
Isn't really useful. You don't really want the greeing and the first
paragraph to end up in git history forever, do you? And Ted doesn't want to
edit emails by hand to remove unnecessary stuff either. So it should rather
have looked like:
a) Add a new API ext4_journal_start_failok for callers who are
  ok to handle ENOMEM cases.
b) Update ext4 routines with the new API which can fail.
---
Posting this as a new mail for Ted's convenience. I have incorporated Jan's
feedback from the last patch
(http://comments.gmane.org/gmane.comp.file-systems.ext4/24547) and updated
the patch.
<diffstat here>

Then, one can just run 'git am' on the mailbox and it will automatically
remove stuff below --- and include the patch into git. Much faster when you
have to handle tons of patches...

								Honza
Theodore Ts'o - May 26, 2011, 2:22 a.m.
On Wed, May 25, 2011 at 10:13:33AM +0200, Jan Kara wrote:
> > 
> >    ok.. I will do it as a separate patch.
>   Well, patch 2/3 does not really make too much sense without it (errok
> parameter isn't used) so there's no reason to do it as a separate patch.
> Just add it to this patch please.

Agreed; right now this whole patch series is a no-op, since errok
isn't getting used for anything.  So fixing errok so it's passed to
start_this_handle() seems to be more in the category of "fix the
patch" more than anything else.

One more thing; perhaps we should be passing in a integer so we can
pass in a flag word.  That way you don't need to have a fail_ok
variant.  It's a lot more obvious if you have a call:

	  handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK);

What we can also do is this:

	  handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK | JBD2_TOPLEVEL);

What JBD2_TOPLEVEL means is that caller is from a top-level file
system function, such as ext4_symlink() or ext4_chmod(), such that
start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
is needed for any function that might get called by the direct reclaim
path (i.e., the writepage() function).  But for the top-level
symlink() or chmod() function, it's actually OK to allocate memory
using GFP_KERNEL, since there's no potential recursion problem.

      		  		   	     	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - May 26, 2011, 4:07 a.m.
On May 25, 2011, at 20:22, Ted Ts'o wrote:
> On Wed, May 25, 2011 at 10:13:33AM +0200, Jan Kara wrote:
>>> 
>>>  ok.. I will do it as a separate patch.
>> Well, patch 2/3 does not really make too much sense without it (errok
>> parameter isn't used) so there's no reason to do it as a separate patch.
>> Just add it to this patch please.
> 
> Agreed; right now this whole patch series is a no-op, since errok
> isn't getting used for anything.  So fixing errok so it's passed to
> start_this_handle() seems to be more in the category of "fix the
> patch" more than anything else.
> 
> One more thing; perhaps we should be passing in a integer so we can
> pass in a flag word.  That way you don't need to have a fail_ok
> variant.  It's a lot more obvious if you have a call:
> 
> 	  handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK);
> 
> What we can also do is this:
> 
> 	  handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK | JBD2_TOPLEVEL);
> 
> What JBD2_TOPLEVEL means is that caller is from a top-level file
> system function, such as ext4_symlink() or ext4_chmod(), such that
> start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> is needed for any function that might get called by the direct reclaim
> path (i.e., the writepage() function).  But for the top-level
> symlink() or chmod() function, it's actually OK to allocate memory
> using GFP_KERNEL, since there's no potential recursion problem.

At this point, why not just pass GFP_KERNEL or GFP_NOFS directly, optionally with __GFP_NOFAIL?


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Manish Katiyar - May 26, 2011, 5:29 a.m.
On Wed, May 25, 2011 at 7:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Wed, May 25, 2011 at 10:13:33AM +0200, Jan Kara wrote:
>> >
>> >    ok.. I will do it as a separate patch.
>>   Well, patch 2/3 does not really make too much sense without it (errok
>> parameter isn't used) so there's no reason to do it as a separate patch.
>> Just add it to this patch please.
>
> Agreed; right now this whole patch series is a no-op, since errok
> isn't getting used for anything.  So fixing errok so it's passed to
> start_this_handle() seems to be more in the category of "fix the
> patch" more than anything else.
>
> One more thing; perhaps we should be passing in a integer so we can
> pass in a flag word.  That way you don't need to have a fail_ok
> variant.  It's a lot more obvious if you have a call:
>
>          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK);
>
> What we can also do is this:
>
>          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK | JBD2_TOPLEVEL);
>
> What JBD2_TOPLEVEL means is that caller is from a top-level file
> system function, such as ext4_symlink() or ext4_chmod(), such that
> start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> is needed for any function that might get called by the direct reclaim
> path (i.e., the writepage() function).  But for the top-level
> symlink() or chmod() function, it's actually OK to allocate memory
> using GFP_KERNEL, since there's no potential recursion problem.

Hi Ted/Jan,

Will it be more desirable to have the patch fixed as JBD2_FAIL_OK as
you suggested above ? I have already
fixed the patch to pass errok to start_this_handle() and then retry or
not based on that, and was planning to send
it. But if this needs to be fixed this way, then I will rework it then send.

Thanks -
Manish

>
>                                                       - Ted
>
Theodore Ts'o - May 26, 2011, 2:05 p.m.
On Wed, May 25, 2011 at 10:07:20PM -0600, Andreas Dilger wrote:
> > What JBD2_TOPLEVEL means is that caller is from a top-level file
> > system function, such as ext4_symlink() or ext4_chmod(), such that
> > start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> > is needed for any function that might get called by the direct reclaim
> > path (i.e., the writepage() function).  But for the top-level
> > symlink() or chmod() function, it's actually OK to allocate memory
> > using GFP_KERNEL, since there's no potential recursion problem.
> 
> At this point, why not just pass GFP_KERNEL or GFP_NOFS directly,
> optionally with __GFP_NOFAIL?

Well, __GFP_NOFAIL is going away.  (At least there are a number of mm
hackers, including akpm, who really want it to go away).

We could key off of GFP_NOFS, but GFP_NOFS doesn't mean loop and make
sure that we don't fail.  The two concepts are in fact orthogonal;
it's just at the moment that there are most places which are called in
the fs writeback path which also can't fail.  

But just to give one example, ext4_bio_write_page() is an example of a
function that allocates memory GFP_NOFS, but can fail with ENOMEM,
because its caller, mpage_da_submit_io() in fs/ext4/inode.c is
designed to cope with failure in a way that doesn't cause data loss.
(We leave the page dirty, unlock it and back out of the writeback
code, and later the bdi flusher threads will retry the writepages
request.)

    	      	      	       	    	    	  - Ted

P.S.  That means that there are calls to ext4_journal_start() in the
ext4_da_writepages() code path that probably could be converted to
ext4_journal_start_failok() --- or to ext4_journal_start(inode,
nblocks, JBD2_FAIL_OK) per my suggestion --- once we have fully
audited the error return paths.

P.P.S.  Something that might be worth doing is having a sysfs tunable
that causes ext4_journal_start() to return an ENOMEM failure on a per
file system basis with a probability specified by the sysfs tunable.
This would allow us to actually _test_ the error handling to make sure
sane things happen....  

What I'd probably do is define a new flag, JBD2_WRITEBACK_TEST, and
use it to make the ext4_journal_start() functions that are allowed to
probabilistically fail (since the retry should be happening at a
higher level), and then run a stress test with the syfs tunable
enabled.  Since the flag would only cause ext4_journal_start()
functions that should have automatic fallbacks, the stress test would
be able to run to completion even though 10% of the
ext4_journal_start(... JBD2_WRITEBACK_TEST) calls were failing.
That's another example of why using a flag bitfield instead of a bool
is much more powerful.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - May 26, 2011, 2:49 p.m.
On Thu 26-05-11 10:05:58, Ted Tso wrote:
> On Wed, May 25, 2011 at 10:07:20PM -0600, Andreas Dilger wrote:
> > > What JBD2_TOPLEVEL means is that caller is from a top-level file
> > > system function, such as ext4_symlink() or ext4_chmod(), such that
> > > start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> > > is needed for any function that might get called by the direct reclaim
> > > path (i.e., the writepage() function).  But for the top-level
> > > symlink() or chmod() function, it's actually OK to allocate memory
> > > using GFP_KERNEL, since there's no potential recursion problem.
> > 
> > At this point, why not just pass GFP_KERNEL or GFP_NOFS directly,
> > optionally with __GFP_NOFAIL?
> 
> Well, __GFP_NOFAIL is going away.  (At least there are a number of mm
> hackers, including akpm, who really want it to go away).
> 
> We could key off of GFP_NOFS, but GFP_NOFS doesn't mean loop and make
> sure that we don't fail.  The two concepts are in fact orthogonal;
> it's just at the moment that there are most places which are called in
> the fs writeback path which also can't fail.  
  Exactly.

> But just to give one example, ext4_bio_write_page() is an example of a
> function that allocates memory GFP_NOFS, but can fail with ENOMEM,
> because its caller, mpage_da_submit_io() in fs/ext4/inode.c is
> designed to cope with failure in a way that doesn't cause data loss.
> (We leave the page dirty, unlock it and back out of the writeback
> code, and later the bdi flusher threads will retry the writepages
> request.)
> 
>     	      	      	       	    	    	  - Ted
> 
> P.S.  That means that there are calls to ext4_journal_start() in the
> ext4_da_writepages() code path that probably could be converted to
> ext4_journal_start_failok() --- or to ext4_journal_start(inode,
> nblocks, JBD2_FAIL_OK) per my suggestion --- once we have fully
> audited the error return paths.
> 
> P.P.S.  Something that might be worth doing is having a sysfs tunable
> that causes ext4_journal_start() to return an ENOMEM failure on a per
> file system basis with a probability specified by the sysfs tunable.
> This would allow us to actually _test_ the error handling to make sure
> sane things happen....  
  No need to do this. If you make JBD2 use a separate slab for transaction
structures (trivial and makes some sense anyway), you can use
fault-injection framework to do exactly what you describe above (see
Documentation/fault-injection/fault-injection.txt and look for failslab).

> What I'd probably do is define a new flag, JBD2_WRITEBACK_TEST, and
> use it to make the ext4_journal_start() functions that are allowed to
> probabilistically fail (since the retry should be happening at a
> higher level), and then run a stress test with the syfs tunable
> enabled.  Since the flag would only cause ext4_journal_start()
> functions that should have automatic fallbacks, the stress test would
> be able to run to completion even though 10% of the
> ext4_journal_start(... JBD2_WRITEBACK_TEST) calls were failing.
> That's another example of why using a flag bitfield instead of a bool
> is much more powerful.
  But if we just fail all transaction allocations with say 10% probability,
it should work as well, shouldn't it? We'd just retry those allocations
whose failure we cannot handle and eventually succeed. Or do I miss
something?
								Honza
Jan Kara - May 26, 2011, 2:51 p.m.
Hi Manish,

On Wed 25-05-11 22:29:59, Manish Katiyar wrote:
> On Wed, May 25, 2011 at 7:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
> > On Wed, May 25, 2011 at 10:13:33AM +0200, Jan Kara wrote:
> >> >
> >> >    ok.. I will do it as a separate patch.
> >>   Well, patch 2/3 does not really make too much sense without it (errok
> >> parameter isn't used) so there's no reason to do it as a separate patch.
> >> Just add it to this patch please.
> >
> > Agreed; right now this whole patch series is a no-op, since errok
> > isn't getting used for anything.  So fixing errok so it's passed to
> > start_this_handle() seems to be more in the category of "fix the
> > patch" more than anything else.
> >
> > One more thing; perhaps we should be passing in a integer so we can
> > pass in a flag word.  That way you don't need to have a fail_ok
> > variant.  It's a lot more obvious if you have a call:
> >
> >          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK);
> >
> > What we can also do is this:
> >
> >          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK | JBD2_TOPLEVEL);
> >
> > What JBD2_TOPLEVEL means is that caller is from a top-level file
> > system function, such as ext4_symlink() or ext4_chmod(), such that
> > start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> > is needed for any function that might get called by the direct reclaim
> > path (i.e., the writepage() function).  But for the top-level
> > symlink() or chmod() function, it's actually OK to allocate memory
> > using GFP_KERNEL, since there's no potential recursion problem.
> 
> Will it be more desirable to have the patch fixed as JBD2_FAIL_OK as
> you suggested above ? I have already
> fixed the patch to pass errok to start_this_handle() and then retry or
> not based on that, and was planning to send
> it. But if this needs to be fixed this way, then I will rework it then send.
  Well, I don't care much and Ted seems to like flags more. As he is the
maintainer, I guess just follow his advice ;).

								Honza
Theodore Ts'o - May 26, 2011, 3:08 p.m.
On Thu, May 26, 2011 at 04:49:56PM +0200, Jan Kara wrote:
>   No need to do this. If you make JBD2 use a separate slab for transaction
> structures (trivial and makes some sense anyway), you can use
> fault-injection framework to do exactly what you describe above (see
> Documentation/fault-injection/fault-injection.txt and look for failslab).

Thanks for pointing me at the fault-injection framework; it's not
something I've used before.  I'll have to take a look at it.

>   But if we just fail all transaction allocations with say 10% probability,
> it should work as well, shouldn't it? We'd just retry those allocations
> whose failure we cannot handle and eventually succeed. Or do I miss
> something?

The reason why I only wanted to fail the transactions relating to the
writeback path is because other failures will get reflected back to
userspace, and would thus change the behavior of the stress test.  (If
we used fsstress, it would cause fsstress to immediately stop and
fail, for example.)

That is the one thing that worries me a little about this patch series
in general.  If we suddenly start failing open() or rename() or
chmod() syscalls with ENOMEM in low memory situations, what of
programs that aren't doing adequate error checking?  Sure, other file
systems will do this, but the bulk of the users use ext3/ext4, and
remember how much kvetching and complaining when xfs was the first
file system to require user space applications to actually use fsync()
if they wanted their files to be safe after a power failure.

I worry that there are a lot of incompetently written editors out
there that aren't doing error checking, or worse yet, package managers
or other security-critical programs that aren't doing error checking,
and which won't notice when an syscall fails in a low-memory
situation, leading to either (a) user data loss (which the application
programers will lay at the feet of the file system developers, don't
doubt it), or (b) security holes.

I'm not sure there's a way to address this concern, and I'm going not
NACK'ing this patch series on that basis --- but I do worry that it
might not improve the situation by a whole lot, and may in fact cause
some problems, at the end of the day.

	      	    	      	   	 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - May 26, 2011, 3:37 p.m.
On Thu 26-05-11 11:08:46, Ted Tso wrote:
> On Thu, May 26, 2011 at 04:49:56PM +0200, Jan Kara wrote:
> >   But if we just fail all transaction allocations with say 10% probability,
> > it should work as well, shouldn't it? We'd just retry those allocations
> > whose failure we cannot handle and eventually succeed. Or do I miss
> > something?
> 
> The reason why I only wanted to fail the transactions relating to the
> writeback path is because other failures will get reflected back to
> userspace, and would thus change the behavior of the stress test.  (If
> we used fsstress, it would cause fsstress to immediately stop and
> fail, for example.)
  Ah, I see. OK.

> That is the one thing that worries me a little about this patch series
> in general.  If we suddenly start failing open() or rename() or
> chmod() syscalls with ENOMEM in low memory situations, what of
> programs that aren't doing adequate error checking?  Sure, other file
> systems will do this, but the bulk of the users use ext3/ext4, and
> remember how much kvetching and complaining when xfs was the first
> file system to require user space applications to actually use fsync()
> if they wanted their files to be safe after a power failure.
  Yeah, I know and it's painful to fight these fights. But ultimately, I
belive, it results in better / faster code so that's a good thing.

> I worry that there are a lot of incompetently written editors out
> there that aren't doing error checking, or worse yet, package managers
> or other security-critical programs that aren't doing error checking,
> and which won't notice when an syscall fails in a low-memory
> situation, leading to either (a) user data loss (which the application
> programers will lay at the feet of the file system developers, don't
> doubt it), or (b) security holes.
  But OTOH it allows good applications to do something (if nothing at least
displaying a dialog with error) instead of just being stuck in the kernel
which is a good thing IMHO. So I believe it is a move in the right
direction (although I agree there will be probably people bitching about it
;).

								Honza
Manish Katiyar - May 27, 2011, 4:11 a.m.
On Thu, May 26, 2011 at 8:08 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Thu, May 26, 2011 at 04:49:56PM +0200, Jan Kara wrote:
>>   No need to do this. If you make JBD2 use a separate slab for transaction
>> structures (trivial and makes some sense anyway), you can use
>> fault-injection framework to do exactly what you describe above (see
>> Documentation/fault-injection/fault-injection.txt and look for failslab).
>
> Thanks for pointing me at the fault-injection framework; it's not
> something I've used before.  I'll have to take a look at it.
>
>>   But if we just fail all transaction allocations with say 10% probability,
>> it should work as well, shouldn't it? We'd just retry those allocations
>> whose failure we cannot handle and eventually succeed. Or do I miss
>> something?
>
> The reason why I only wanted to fail the transactions relating to the
> writeback path is because other failures will get reflected back to
> userspace, and would thus change the behavior of the stress test.  (If
> we used fsstress, it would cause fsstress to immediately stop and
> fail, for example.)
>
> That is the one thing that worries me a little about this patch series
> in general.  If we suddenly start failing open() or rename() or
> chmod() syscalls with ENOMEM in low memory situations, what of
> programs that aren't doing adequate error checking?  Sure, other file
> systems will do this, but the bulk of the users use ext3/ext4, and
> remember how much kvetching and complaining when xfs was the first
> file system to require user space applications to actually use fsync()
> if they wanted their files to be safe after a power failure.
>
> I worry that there are a lot of incompetently written editors out
> there that aren't doing error checking, or worse yet, package managers
> or other security-critical programs that aren't doing error checking,
> and which won't notice when an syscall fails in a low-memory
> situation, leading to either (a) user data loss (which the application
> programers will lay at the feet of the file system developers, don't
> doubt it), or (b) security holes.
>
> I'm not sure there's a way to address this concern, and I'm going not
> NACK'ing this patch series on that basis --- but I do worry that it
> might not improve the situation by a whole lot, and may in fact cause
> some problems, at the end of the day.

If there are applications which don't expect this behaviour or can't handle such
cases, we can always add a sysfs flag to turn of this behavior completely and
always retry like old behavior. This may be ugly, but would help till
ppl start realising
this behavior and writing applications appropriately.
Manish Katiyar - May 28, 2011, 6:16 a.m.
On Wed, May 25, 2011 at 7:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Wed, May 25, 2011 at 10:13:33AM +0200, Jan Kara wrote:
>> >
>> >    ok.. I will do it as a separate patch.
>>   Well, patch 2/3 does not really make too much sense without it (errok
>> parameter isn't used) so there's no reason to do it as a separate patch.
>> Just add it to this patch please.
>
> Agreed; right now this whole patch series is a no-op, since errok
> isn't getting used for anything.  So fixing errok so it's passed to
> start_this_handle() seems to be more in the category of "fix the
> patch" more than anything else.
>
> One more thing; perhaps we should be passing in a integer so we can
> pass in a flag word.  That way you don't need to have a fail_ok
> variant.  It's a lot more obvious if you have a call:
>
>          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK);
>
> What we can also do is this:
>
>          handle = ext4_journal_start(inode, 1, JBD2_FAIL_OK | JBD2_TOPLEVEL);
>
> What JBD2_TOPLEVEL means is that caller is from a top-level file
> system function, such as ext4_symlink() or ext4_chmod(), such that
> start_this_handle() can use GFP_KERNEL instead of GFP_NOFS.  GFP_NOFS
> is needed for any function that might get called by the direct reclaim
> path (i.e., the writepage() function).  But for the top-level
> symlink() or chmod() function, it's actually OK to allocate memory
> using GFP_KERNEL, since there's no potential recursion problem.

Hi Ted,

Resending version 2 of the three patch series after updating them as
you suggested above.

Patch

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index d0f5353..0abee1b 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -225,7 +225,7 @@  static inline int ext4_journal_extend(handle_t *handle, int nblocks)
 static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_restart(handle, nblocks);
+		return jbd2_journal_restart(handle, nblocks, false);
 	return 0;
 }

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..4e4c17f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -279,7 +279,7 @@  handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
 		ext4_abort(sb, "Detected aborted journal");
 		return ERR_PTR(-EROFS);
 	}
-	return jbd2_journal_start(journal, nblocks);
+	return jbd2_journal_start(journal, nblocks, false);
 }

 /*
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..248a7b6 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -308,6 +308,8 @@  static handle_t *new_handle(int nblocks)
  * handle_t *jbd2_journal_start() - Obtain a new handle.
  * @journal: Journal to start transaction on.
  * @nblocks: number of block buffer we might modify
+ * @errok:  True if the callers can handle ENOMEM failures from
+ *	    this routine, false otherwise
  *
  * We make sure that the transaction can guarantee at least nblocks of
  * modified buffers in the log.  We block until the log can guarantee
@@ -318,7 +320,7 @@  static handle_t *new_handle(int nblocks)
  *
  * Return a pointer to a newly allocated handle, or NULL on failure
  */
-handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
+handle_t *jbd2_journal_start(journal_t *journal, int nblocks, bool errok)
 {
 	handle_t *handle = journal_current_handle();
 	int err;
@@ -338,7 +340,7 @@  handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)

 	current->journal_info = handle;

-	err = start_this_handle(journal, handle, gfp_mask);
+	err = start_this_handle(journal, handle, GFP_NOFS);
 	if (err < 0) {
 		jbd2_free_handle(handle);
 		current->journal_info = NULL;
@@ -346,13 +348,6 @@  handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int gfp_mask)
 	}
 	return handle;
 }
-EXPORT_SYMBOL(jbd2__journal_start);
-
-
-handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
-{
-	return jbd2__journal_start(journal, nblocks, GFP_NOFS);
-}
 EXPORT_SYMBOL(jbd2_journal_start);


@@ -441,7 +436,7 @@  out:
  * transaction capabable of guaranteeing the requested number of
  * credits.
  */
-int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
+int jbd2_journal_restart(handle_t *handle, int nblocks, bool errok)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
@@ -477,16 +472,9 @@  int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)

 	lock_map_release(&handle->h_lockdep_map);
 	handle->h_buffer_credits = nblocks;
-	ret = start_this_handle(journal, handle, gfp_mask);
+	ret = start_this_handle(journal, handle, GFP_NOFS);
 	return ret;
 }
-EXPORT_SYMBOL(jbd2__journal_restart);
-
-
-int jbd2_journal_restart(handle_t *handle, int nblocks)
-{
-	return jbd2__journal_restart(handle, nblocks, GFP_NOFS);
-}
 EXPORT_SYMBOL(jbd2_journal_restart);

 /**
@@ -1436,7 +1424,7 @@  int jbd2_journal_force_commit(journal_t *journal)
 	handle_t *handle;
 	int ret;

-	handle = jbd2_journal_start(journal, 1);
+	handle = jbd2_journal_start(journal, 1, false);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 	} else {
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 295d564..bff51c0 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -353,11 +353,11 @@  handle_t *ocfs2_start_trans(struct ocfs2_super *osb, int max_buffs)

 	/* Nested transaction? Just return the handle... */
 	if (journal_current_handle())
-		return jbd2_journal_start(journal, max_buffs);
+		return jbd2_journal_start(journal, max_buffs, false);

 	down_read(&osb->journal->j_trans_barrier);

-	handle = jbd2_journal_start(journal, max_buffs);
+	handle = jbd2_journal_start(journal, max_buffs, false);
 	if (IS_ERR(handle)) {
 		up_read(&osb->journal->j_trans_barrier);

@@ -438,7 +438,7 @@  int ocfs2_extend_trans(handle_t *handle, int nblocks)
 	if (status > 0) {
 		trace_ocfs2_extend_trans_restart(old_nblocks + nblocks);
 		status = jbd2_journal_restart(handle,
-					      old_nblocks + nblocks);
+					      old_nblocks + nblocks, false);
 		if (status < 0) {
 			mlog_errno(status);
 			goto bail;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..67f0f4f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1103,10 +1103,8 @@  static inline handle_t *journal_current_handle(void)
  * Register buffer modifications against the current transaction.
  */

-extern handle_t *jbd2_journal_start(journal_t *, int nblocks);
-extern handle_t *jbd2__journal_start(journal_t *, int nblocks, int gfp_mask);
-extern int	 jbd2_journal_restart(handle_t *, int nblocks);
-extern int	 jbd2__journal_restart(handle_t *, int nblocks, int gfp_mask);
+extern handle_t *jbd2_journal_start(journal_t *, int nblocks, bool errok);
+extern int	 jbd2_journal_restart(handle_t *, int nblocks, bool errok);
 extern int	 jbd2_journal_extend (handle_t *, int nblocks);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);