
[1/2] ext4/jbd2: fix io-barrier logic in case of external journal

Message ID 1268414810-17289-1-git-send-email-dmonakhov@openvz.org
State New, archived

Commit Message

Dmitry Monakhov March 12, 2010, 5:26 p.m. UTC
Currently journalled mode is still broken if the fs uses an external
journal. This is because all data is handled as metadata.
We must issue an io-barrier to the fs device even if the transaction
has only metadata blocks.

Also add a missing return value check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/fsync.c  |    2 +-
 fs/jbd2/commit.c |    3 +++
 2 files changed, 4 insertions(+), 1 deletions(-)

Comments

Theodore Ts'o March 22, 2010, 1:20 a.m. UTC | #1
On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>  start_journal_io:
> +			if (bufs)
> +				commit_transaction->t_flushed_data_blocks = 1;
> +

I'm not convinced this is right.

From your test case, the problem isn't because we have journaled
metadata blocks (which is what bufs counts), but because fsync()
depends on data blocks also getting flushed out to disk.

However, if we aren't closing the transaction because of fsync(), I
don't think we need to do a barrier in the case of an external
journal.  So instead of effectively unconditionally setting
t_flushed_data_blocks (since bufs is nearly always going to be
non-zero), I think the better fix is to test to see whether the journal
device != the fs data device in fsync(), and if so, start the
barrier operation there.
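
Roughly, that would keep the logic in ext4_sync_file() looking
something like this (a sketch of the existing fsync path only: the
data/inode write-out that the real function does first is omitted, and
the commit_tid derivation is an assumption; see also the patch hunk at
the end of this thread):

int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
	struct inode *inode = dentry->d_inode;
	struct ext4_inode_info *ei = EXT4_I(inode);
	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
	tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
	int ret = 0;

	if (jbd2_log_start_commit(journal, commit_tid)) {
		/*
		 * The commit only orders I/O on j_dev; with an external
		 * journal the data blocks live on a different device, so
		 * flush the fs device explicitly before waiting.
		 */
		if ((journal->j_fs_dev != journal->j_dev) &&
		    (journal->j_flags & JBD2_BARRIER))
			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
		ret = jbd2_log_wait_commit(journal, commit_tid);
	} else if (journal->j_flags & JBD2_BARRIER)
		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
	return ret;
}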

Do you agree?

Best regards,

						- Ted
Dmitry Monakhov March 22, 2010, 2:04 p.m. UTC | #2
tytso@mit.edu writes:

> On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>>  start_journal_io:
>> +			if (bufs)
>> +				commit_transaction->t_flushed_data_blocks = 1;
>> +
>
> I'm not convinced this is right.
>
> From your test case, the problem isn't because we have journaled
> metadata blocks (which is what bufs) counts, but because fsync()
> depends on data blocks also getting flushed out to disks.
>
> However, if we aren't closing the transaction because of fsync(), I
> don't think we need to do a barrier in the case of an external
> journal.  So instead of effectively unconditionally setting
> t_flushed_data_blocks (since bufs is nearly always going to be
> non-zero), I think the better fix is to test to see if the journal
> device != to the fs data device in fsync(), and if so, start the
> barrier operation there.
>
> Do you agree?
Yes.
BTW, would it be correct to update j_tail in
jbd2_journal_commit_transaction() to something more recent
if we have issued an io-barrier to j_fs_dev?
This would help to reduce journal recovery time, which may be really
painful on some slow devices.

I've taken a look at the async commit logic in fs/jbd2/commit.c:
void jbd2_journal_commit_transaction(journal_t *journal)
{
725:    /* Done it all: now write the commit record asynchronously. */
        if (JBD2_HAS_INCOMPAT_FEATURE(journal,
                                      JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
                err = journal_submit_commit_record(journal,
                                                   commit_transaction,
                                                   &cbh, crc32_sum);
                if (err)
                        __jbd2_journal_abort_hard(journal);
                if (journal->j_flags & JBD2_BARRIER)
                        blkdev_issue_flush(journal->j_dev, NULL);
<<< blkdev_issue_flush() waits for the barrier to complete by default, but
<<< in fact we don't have to wait for the barrier here. I've prepared a
<<< patch which adds flags to control blkdev_issue_flush()'s wait
<<< behavior, and this is the place for the no-wait variant.
...
855:    if (!err && !is_journal_aborted(journal))
                err = journal_wait_on_commit_record(journal, cbh);
}
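
For example, the async-commit call site above could then become
something like the following sketch (the flags-based
blkdev_issue_flush() signature is only the proposed, not-yet-merged
interface, and the flag names here are illustrative):

        if (journal->j_flags & JBD2_BARRIER)
                /* fire the barrier but do not block; the commit record
                 * is waited on later via journal_wait_on_commit_record() */
                blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL,
                                   BLKDEV_IFL_BARRIER /* no BLKDEV_IFL_WAIT */);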
Theodore Ts'o March 22, 2010, 3:03 p.m. UTC | #3
On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
> tytso@mit.edu writes:
> 
> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
> >>  start_journal_io:
> >> +			if (bufs)
> >> +				commit_transaction->t_flushed_data_blocks = 1;
> >> +
> >
> > I'm not convinced this is right.
> >
> > From your test case, the problem isn't because we have journaled
> > metadata blocks (which is what bufs) counts, but because fsync()
> > depends on data blocks also getting flushed out to disks.
> >
> > However, if we aren't closing the transaction because of fsync(), I
> > don't think we need to do a barrier in the case of an external
> > journal.  So instead of effectively unconditionally setting
> > t_flushed_data_blocks (since bufs is nearly always going to be
> > non-zero), I think the better fix is to test to see if the journal
> > device != to the fs data device in fsync(), and if so, start the
> > barrier operation there.
> >
> > Do you agree?
> Yes.

Just to be clear, since I realized I wrote fsync() when I should have
written fs/ext4/fsync.c, my proposal was to put this check in
ext4_sync_file().

> BTW Would it be correct to update j_tail in
> jbd2_journal_commit_transaction() to something more recent
> if we have issued an io-barrier to j_fs_dev?
> This will helps to reduce journal_recovery time which may be really
> painful in some slow devices.

Um, maybe.  We are already calling __jbd2_journal_clean_checkpoint_list(),
and the barrier operation *is* expensive.

On the other hand, updating the journal superblock on every sync is
another seek that would have to be made before the barrier operation,
and I'm a bit concerned that this seek would be noticeable.  If it is
noticeable, is it worth it to optimize for the uncommon case (a power
failure requiring a journal replay) when it might cost us something,
however small, on every single journal update?

Do we really think the journal replay time is a major pain point?  I
can think of optimizations that involve skipping writes that will get
updated later in future transactions, but it means complicating the
replay code, which has been stable for a long time, and it's not clear
to me that the costs are worth the benefits.

> I've take a look at async commit logic: fs/jbd2/commit.c
> void jbd2_journal_commit_transaction(journal_t *journal)
> {
> 725:    /* Done it all: now write the commit record asynchronously. */
>         if (JBD2_HAS_INCOMPAT_FEATURE(journal,
>                                       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
>                                       {
>                 err = journal_submit_commit_record(journal,
>                 commit_transaction,
>                                                  &cbh, crc32_sum);
>                 if (err)
>                         __jbd2_journal_abort_hard(journal);
>                 if (journal->j_flags & JBD2_BARRIER)
>                         blkdev_issue_flush(journal->j_dev, NULL);
> <<< blkdev_issue_flush is wait for barrier to complete by default, but
> <<< in fact we don't have to wait for barrier here. I've prepared a
> <<< patch wich add flags to control blkdev_issue_flush() wait
> <<< behavior, and this is the place for no-wait variant.

I think that's right, as long as we're confident that subsequent
writes won't get scheduled before the no-wait barrier.  If they did, it
would be a bug in the block I/O layer, so it should be OK.

      	       	      	    	- Ted
 
Dmitry Monakhov March 22, 2010, 4:14 p.m. UTC | #4
tytso@mit.edu writes:

> On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
>> tytso@mit.edu writes:
>> 
>> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>> >>  start_journal_io:
>> >> +			if (bufs)
>> >> +				commit_transaction->t_flushed_data_blocks = 1;
>> >> +
>> >
>> > I'm not convinced this is right.
>> >
>> > From your test case, the problem isn't because we have journaled
>> > metadata blocks (which is what bufs) counts, but because fsync()
>> > depends on data blocks also getting flushed out to disks.
>> >
>> > However, if we aren't closing the transaction because of fsync(), I
>> > don't think we need to do a barrier in the case of an external
>> > journal.  So instead of effectively unconditionally setting
>> > t_flushed_data_blocks (since bufs is nearly always going to be
>> > non-zero), I think the better fix is to test to see if the journal
>> > device != to the fs data device in fsync(), and if so, start the
>> > barrier operation there.
>> >
>> > Do you agree?
>> Yes.
>
> Just to be clear, since I realized I wrote fsync() when I should have
> written fs/ext4/fsync.c, my proposal was to put this check in
> ext4_sync_file().
This means that we will end up with two io-barriers in a row
in the external journal case:
ext4_sync_file()
  ->jbd2_log_start_commit()
     ->blkdev_issue_flush(j_fs_dev)
  /* it seems the following flush is mandatory to guarantee metadata
   * consistency */
  ->blkdev_issue_flush(j_fs_dev)

Maybe it would be better to pass some sort of barrier flag to
jbd2_log_start_commit(); this function would mark the requested
transaction (journal->j_commit_request) with ->t_flushed_data_blocks = 1,
so an io-barrier will be issued even if the transaction has only
metadata (roughly as sketched below, after the advantages list).
Advantages:
1) The approach works for all data modes.
2) Only one io-barrier is needed to guarantee data and
   metadata consistency.
3) The changes are not so complicated.
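
For illustration only, such a flagged start-commit might look roughly
like this (the helper name is hypothetical, not an existing jbd2 call;
it reuses the t_flushed_data_blocks flag that the commit code already
checks before flushing j_fs_dev):

static int jbd2_log_start_commit_with_barrier(journal_t *journal, tid_t tid)
{
	int ret = 0;

	spin_lock(&journal->j_state_lock);
	if (journal->j_running_transaction &&
	    journal->j_running_transaction->t_tid == tid) {
		/* commit code will see this and flush j_fs_dev as well */
		journal->j_running_transaction->t_flushed_data_blocks = 1;
		ret = __jbd2_log_start_commit(journal, tid);
	}
	spin_unlock(&journal->j_state_lock);
	return ret;
}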

I've already started to work on the patch but was surprised by the
commit logic.
->__jbd2_log_start_commit(target)
   journal->j_commit_request = target
   ->wake_up(&journal->j_wait_commit)
->kjournald2
   transaction = journal->j_running_transaction
   ->jbd2_journal_commit_transaction(journal);
     commit_transaction = journal->j_running_transaction
     ...
     /* So j_commit_request is used only for comparison, but we actually
        commit the journal->j_running_transaction->t_tid transaction
        instead of j_commit_request. Why?
      */
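
For reference, the request side being discussed looks roughly like this
(a paraphrase of __jbd2_log_start_commit() around this kernel version;
details may differ):

int __jbd2_log_start_commit(journal_t *journal, tid_t target)
{
	/* have we already requested (or committed) something at least
	 * as recent as `target'? */
	if (!tid_geq(journal->j_commit_request, target)) {
		/* record the request and wake kjournald2; the commit
		 * itself is NOT done in the caller's context */
		journal->j_commit_request = target;
		wake_up(&journal->j_wait_commit);
		return 1;
	}
	return 0;
}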
>
>> BTW Would it be correct to update j_tail in
>> jbd2_journal_commit_transaction() to something more recent
>> if we have issued an io-barrier to j_fs_dev?
>> This will helps to reduce journal_recovery time which may be really
>> painful in some slow devices.
>
> Um, maybe.  We are already calling __jbd2_journal_clean_checkpoint_list(),
> and the barrier operation *is* expensive.
>
> On the other hand, updating the journal superblock on every sync is
> another seek that would have to made before the barrier operation, and
> I'm a bit concerned that this seek would be noticeable.  If it is
> noticeable, is it worth it to optimize for the uncommon case (a power
> failure requiring a journal replay) when it might cost us something,
> however, small, on every single journal update?
>
> Do we really think the journal replay time is really something that is
> a major pain point.  I can think of optimizations that involve
> skipping writes that will get updated later in future transactions,
> but it means complicating the replay code, which has been stable for a
> long time, and it's not clear to me that the costs are worth the
> benefits.
Never mind, those were just my thoughts. The price of a broken recovery
stage is too expensive to pay for such rare cases.
>
>> I've take a look at async commit logic: fs/jbd2/commit.c
>> void jbd2_journal_commit_transaction(journal_t *journal)
>> {
>> 725:    /* Done it all: now write the commit record asynchronously. */
>>         if (JBD2_HAS_INCOMPAT_FEATURE(journal,
>>                                       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
>>                                       {
>>                 err = journal_submit_commit_record(journal,
>>                 commit_transaction,
>>                                                  &cbh, crc32_sum);
>>                 if (err)
>>                         __jbd2_journal_abort_hard(journal);
>>                 if (journal->j_flags & JBD2_BARRIER)
>>                         blkdev_issue_flush(journal->j_dev, NULL);
>> <<< blkdev_issue_flush is wait for barrier to complete by default, but
>> <<< in fact we don't have to wait for barrier here. I've prepared a
>> <<< patch wich add flags to control blkdev_issue_flush() wait
>> <<< behavior, and this is the place for no-wait variant.
>
> I think that's right, as long as we're confident that subsequent
> writes won't get scheduled before the no-wait barrier.  If it did, it
> would be a bug in the block I/O layer, so it should be OK.
>
>       	       	      	    	- Ted
>  
Theodore Ts'o March 22, 2010, 8:22 p.m. UTC | #5
On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote:
> >
> > Just to be clear, since I realized I wrote fsync() when I should have
> > written fs/ext4/fsync.c, my proposal was to put this check in
> > ext4_sync_file().
> This means that we will end up with two io-barriers in a row
> for external journal case:
> ext4_sync_file()
>   ->jbd2_log_start_commit()
>      ->blkdev_issue_flush(j_fs_dev)
>   /* seems following flush is mandatory to guarantee the metadata
>    * consistency */
>   ->blkdev_issue_flush(j_fs_dev)

Well, not if we only issue a barrier in the external journal case...
but I agree the logic may start getting ugly.

> I've already started to work with the patch but was surprised with
> commit logic. 
> ->__jbd2_log_start_commit(target)
>    journal->j_commit_request = target
>    ->wake_up(&journal->j_wait_commit)
> ->kjournald2
>    transaction = journal->j_running_transaction
>    ->jbd2_journal_commit_transaction(journal);
>      commit_transaction = journal->j_running_transaction
>      ...
>      /* So j_commit_request is used only for comparison. But actually
>         committing journal->j_running_transaction->t_tid transaction
>         instead of j_commit_request. Why?
>       */

Yeah, I don't think jbd2_log_start_commit() has the semantics that are
currently documented.  We will return 0 if we wake up the commit
thread, yes --- but since we only check j_commit_request, and it's a
geq test, it might very well be the case that the transaction was
committed long ago, and so we end up kicking off a transaction commit
when one is not needed.  Or, it may be that a transaction is just
already being committed (and in fact is just about to be completed),
and so the wake_up() call is a no-op, and so we don't get a barrier
when one is needed (and expected) by ext4_sync_file().

We need to look at this very closely, but I think
jbd2_log_start_commit() needs to check to see whether there is a
current committing transaction, and whether it is the one that has
been requested as a target.  If so, and a barrier is
requested, I think we need to have jbd2_log_start_commit() do the
barrier.  There is a risk of having a double barrier in that case, but
modifying the flag on the currently committing transaction has the
danger of being lost by kjournald --- or I guess alternatively we
could have kjournald check that flag while holding j_state_lock, which
might be better.    

If the currently running transaction is the requested target, then
that's easy; we can set the flag, and then wake up the j_wait_commit
wait queue.  

In any case, log_start_commit() doesn't currently distinguish
whether the targeted commit is the running or the committing
transaction when it returns 0, and I think that's a problem.

	    	    	       	     - Ted
Jan Kara March 30, 2010, 1:47 a.m. UTC | #6
> tytso@mit.edu writes:
> 
> > On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
> >> tytso@mit.edu writes:
> >> 
> >> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
> >> >>  start_journal_io:
> >> >> +			if (bufs)
> >> >> +				commit_transaction->t_flushed_data_blocks = 1;
> >> >> +
> >> >
> >> > I'm not convinced this is right.
> >> >
> >> > From your test case, the problem isn't because we have journaled
> >> > metadata blocks (which is what bufs) counts, but because fsync()
> >> > depends on data blocks also getting flushed out to disks.
> >> >
> >> > However, if we aren't closing the transaction because of fsync(), I
> >> > don't think we need to do a barrier in the case of an external
> >> > journal.  So instead of effectively unconditionally setting
> >> > t_flushed_data_blocks (since bufs is nearly always going to be
> >> > non-zero), I think the better fix is to test to see if the journal
> >> > device != to the fs data device in fsync(), and if so, start the
> >> > barrier operation there.
> >> >
> >> > Do you agree?
> >> Yes.
> >
> > Just to be clear, since I realized I wrote fsync() when I should have
> > written fs/ext4/fsync.c, my proposal was to put this check in
> > ext4_sync_file().
> This means that we will end up with two io-barriers in a row
> for external journal case:
> ext4_sync_file()
>   ->jbd2_log_start_commit()
>      ->blkdev_issue_flush(j_fs_dev)
>   /* seems following flush is mandatory to guarantee the metadata
>    * consistency */
>   ->blkdev_issue_flush(j_fs_dev)
> 
> may be it will be better to pass some sort of barrier flag to
> jbd2_log_start_commit() this function mark journal->j_commit_request
> with ->t_flushed_data_blocks = 1, so io-carrier will be issued
> even if transaction has only metadata
> Advantages:
> 1) approach will works for all data modes
> 2) only one io-barrier is needed to guarantee the data and
>    metadata consiscency.
> 3) changes not so complicated.
> 
> I've already started to work with the patch but was surprised with
> commit logic. 
> ->__jbd2_log_start_commit(target)
>    journal->j_commit_request = target
>    ->wake_up(&journal->j_wait_commit)
> ->kjournald2
>    transaction = journal->j_running_transaction
>    ->jbd2_journal_commit_transaction(journal);
>      commit_transaction = journal->j_running_transaction
>      ...
>      /* So j_commit_request is used only for comparison. But actually
>         committing journal->j_running_transaction->t_tid transaction
>         instead of j_commit_request. Why?
>       */
  Remember: There is always at most one running transaction and at most one
transaction in the committing state - all the JBD2 and ext4 code heavily relies
on this. So in this case, because kjournald isn't committing any transaction,
we know that there is no committing transaction and at most one running
transaction whose tid = journal->j_commit_sequence + 1. So actually the only
transaction we can commit is the running one, and it does not even make sense
to try committing another one (the check in log_start_commit actually makes
sure that j_commit_request cannot be set to a lower value than it is, and the
code in kjournald2 makes sure that j_commit_request >= j_commit_sequence).  So
I think the code works right (although I agree it's hard to read - it's old
legacy code ;).

									Honza
Jan Kara March 30, 2010, 2:14 a.m. UTC | #7
> On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote:
> > >
> > > Just to be clear, since I realized I wrote fsync() when I should have
> > > written fs/ext4/fsync.c, my proposal was to put this check in
> > > ext4_sync_file().
> > This means that we will end up with two io-barriers in a row
> > for external journal case:
> > ext4_sync_file()
> >   ->jbd2_log_start_commit()
> >      ->blkdev_issue_flush(j_fs_dev)
> >   /* seems following flush is mandatory to guarantee the metadata
> >    * consistency */
> >   ->blkdev_issue_flush(j_fs_dev)
> 
> Well, not if we only issue a barrier in the external barrier case....
> but I agree the logic may start getting ugly.
> 
> > I've already started to work with the patch but was surprised with
> > commit logic. 
> > ->__jbd2_log_start_commit(target)
> >    journal->j_commit_request = target
> >    ->wake_up(&journal->j_wait_commit)
> > ->kjournald2
> >    transaction = journal->j_running_transaction
> >    ->jbd2_journal_commit_transaction(journal);
> >      commit_transaction = journal->j_running_transaction
> >      ...
> >      /* So j_commit_request is used only for comparison. But actually
> >         committing journal->j_running_transaction->t_tid transaction
> >         instead of j_commit_request. Why?
> >       */
> 
> Yeah, I don't think jbd2_log_start_commit() has the semantics that are
> currently documented.  We will return 0 if we wake up the commit
> thread, yes --- but since we only check j_commit_request, and it's a
> geq test, it might very well be the case that the transaction was
> committed long ago, and so we end up kicking off a transaction commit
> when one is not needed.  Or, it may be that a transaction is just
> already being committed (and in fact is just about to be completed),
> and so the wake_up() call is a no-op, and so we don't get a barrier
> when one is needed (and expected) by ext4_sync_file().
> 
> We need to look at this very closely, but I think
> jbd2_log_start_commit() needs to check to see whether there is a
> current committing transaction, and whether it is the one which is
> that has been requested as a target.  If so, and a barrier is
> requested, I think we need to have jbd2_log_start_commit() do the
> barrier.  There is a risk of having a double barrier in that case, but
> modifying the flag on the currently committing transaction has the
> danger of being lost by kjournald --- or I guess alternatively we
> could have kjournald check that flag while holding j_state_lock, which
> might be better.
  But still, if you happen to set the flag after the commit code has checked
it, you have lost the race. Given the bug you describe below, I think
we should provide a new call from JBD2 that will do all the necessary
magic for a given TID - something like:
	data_barrier = 0;
	if (journal->j_commit_sequence > tid)
		data_barrier = 1;
	else if (journal->j_committing_transaction &&
		 journal->j_committing_transaction->t_tid == tid) {
		/* Here we could be more clever and set the barrier
		 * flag of the transaction if according to its state
		 * it has not yet issued data barrier */
		data_barrier = 1;
	} else {
		journal->j_running_transaction->has_data = 1;
		jbd2_log_start_commit(journal, tid);
	}
	jbd2_log_wait_commit(journal, tid);
	if (data_barrier)
		blkdev_issue_flush(journal->j_fs_dev, NULL);

  What do you think?

> If the currently running transaction is the requested target, then
> that's easy; we can set the flag, and then wake up the j_wait_commit
> wait queue.  
> 
> In any case, log_start_commit() doesn't currently distinguish between
> whether the targetted commit is the running or the committing
> transaction when it returns 0, and I think that's a problem.
  Yeah, I agree this is a problem and it could cause ext4_sync_file to
not wait properly. The simplest fix is to call jbd2_log_wait_commit
unconditionally. But I'd maybe go with the above approach.
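
As a minimal sketch, the unconditional wait in ext4_sync_file() would
amount to something like this (reusing the journal and commit_tid
variables already present there):

	jbd2_log_start_commit(journal, commit_tid);
	/* wait even if jbd2_log_start_commit() returned 0, so a commit
	 * that was already in flight still completes before we return */
	ret = jbd2_log_wait_commit(journal, commit_tid);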

								Honza

Patch

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 98bd140..eb52837 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -101,7 +101,7 @@  int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		    (journal->j_fs_dev != journal->j_dev) &&
 		    (journal->j_flags & JBD2_BARRIER))
 			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
-		jbd2_log_wait_commit(journal, commit_tid);
+		ret = jbd2_log_wait_commit(journal, commit_tid);
 	} else if (journal->j_flags & JBD2_BARRIER)
 		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	return ret;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 1bc74b6..2f62f1b 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -682,6 +682,9 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 			tag->t_flags |= cpu_to_be32(JBD2_FLAG_LAST_TAG);
 
 start_journal_io:
+			if (bufs)
+				commit_transaction->t_flushed_data_blocks = 1;
+
 			for (i = 0; i < bufs; i++) {
 				struct buffer_head *bh = wbuf[i];
 				/*