Message ID | 1268414810-17289-1-git-send-email-dmonakhov@openvz.org |
---|---|
State | New, archived |
Headers | show |
On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote: > start_journal_io: > + if (bufs) > + commit_transaction->t_flushed_data_blocks = 1; > + I'm not convinced this is right. From your test case, the problem isn't because we have journaled metadata blocks (which is what bufs) counts, but because fsync() depends on data blocks also getting flushed out to disks. However, if we aren't closing the transaction because of fsync(), I don't think we need to do a barrier in the case of an external journal. So instead of effectively unconditionally setting t_flushed_data_blocks (since bufs is nearly always going to be non-zero), I think the better fix is to test to see if the journal device != to the fs data device in fsync(), and if so, start the barrier operation there. Do you agree? Best regards, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
tytso@mit.edu writes: > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote: >> start_journal_io: >> + if (bufs) >> + commit_transaction->t_flushed_data_blocks = 1; >> + > > I'm not convinced this is right. > > From your test case, the problem isn't because we have journaled > metadata blocks (which is what bufs) counts, but because fsync() > depends on data blocks also getting flushed out to disks. > > However, if we aren't closing the transaction because of fsync(), I > don't think we need to do a barrier in the case of an external > journal. So instead of effectively unconditionally setting > t_flushed_data_blocks (since bufs is nearly always going to be > non-zero), I think the better fix is to test to see if the journal > device != to the fs data device in fsync(), and if so, start the > barrier operation there. > > Do you agree? Yes. BTW Would it be correct to update j_tail in jbd2_journal_commit_transaction() to something more recent if we have issued an io-barrier to j_fs_dev? This will helps to reduce journal_recovery time which may be really painful in some slow devices. I've take a look at async commit logic: fs/jbd2/commit.c void jbd2_journal_commit_transaction(journal_t *journal) { 725: /* Done it all: now write the commit record asynchronously. */ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) { err = journal_submit_commit_record(journal, commit_transaction, &cbh, crc32_sum); if (err) __jbd2_journal_abort_hard(journal); if (journal->j_flags & JBD2_BARRIER) blkdev_issue_flush(journal->j_dev, NULL); <<< blkdev_issue_flush is wait for barrier to complete by default, but <<< in fact we don't have to wait for barrier here. I've prepared a <<< patch wich add flags to control blkdev_issue_flush() wait <<< behavior, and this is the place for no-wait variant. ... 855: if (!err && !is_journal_aborted(journal)) err = journal_wait_on_commit_record(journal, cbh); } -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote: > tytso@mit.edu writes: > > > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote: > >> start_journal_io: > >> + if (bufs) > >> + commit_transaction->t_flushed_data_blocks = 1; > >> + > > > > I'm not convinced this is right. > > > > From your test case, the problem isn't because we have journaled > > metadata blocks (which is what bufs) counts, but because fsync() > > depends on data blocks also getting flushed out to disks. > > > > However, if we aren't closing the transaction because of fsync(), I > > don't think we need to do a barrier in the case of an external > > journal. So instead of effectively unconditionally setting > > t_flushed_data_blocks (since bufs is nearly always going to be > > non-zero), I think the better fix is to test to see if the journal > > device != to the fs data device in fsync(), and if so, start the > > barrier operation there. > > > > Do you agree? > Yes. Just to be clear, since I realized I wrote fsync() when I should have written fs/ext4/fsync.c, my proposal was to put this check in ext4_sync_file(). > BTW Would it be correct to update j_tail in > jbd2_journal_commit_transaction() to something more recent > if we have issued an io-barrier to j_fs_dev? > This will helps to reduce journal_recovery time which may be really > painful in some slow devices. Um, maybe. We are already calling __jbd2_journal_clean_checkpoint_list(), and the barrier operation *is* expensive. On the other hand, updating the journal superblock on every sync is another seek that would have to made before the barrier operation, and I'm a bit concerned that this seek would be noticeable. If it is noticeable, is it worth it to optimize for the uncommon case (a power failure requiring a journal replay) when it might cost us something, however, small, on every single journal update? Do we really think the journal replay time is really something that is a major pain point. I can think of optimizations that involve skipping writes that will get updated later in future transactions, but it means complicating the replay code, which has been stable for a long time, and it's not clear to me that the costs are worth the benefits. > I've take a look at async commit logic: fs/jbd2/commit.c > void jbd2_journal_commit_transaction(journal_t *journal) > { > 725: /* Done it all: now write the commit record asynchronously. */ > if (JBD2_HAS_INCOMPAT_FEATURE(journal, > JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) > { > err = journal_submit_commit_record(journal, > commit_transaction, > &cbh, crc32_sum); > if (err) > __jbd2_journal_abort_hard(journal); > if (journal->j_flags & JBD2_BARRIER) > blkdev_issue_flush(journal->j_dev, NULL); > <<< blkdev_issue_flush is wait for barrier to complete by default, but > <<< in fact we don't have to wait for barrier here. I've prepared a > <<< patch wich add flags to control blkdev_issue_flush() wait > <<< behavior, and this is the place for no-wait variant. I think that's right, as long as we're confident that subsequent writes won't get scheduled before the no-wait barrier. If it did, it would be a bug in the block I/O layer, so it should be OK. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
tytso@mit.edu writes: > On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote: >> tytso@mit.edu writes: >> >> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote: >> >> start_journal_io: >> >> + if (bufs) >> >> + commit_transaction->t_flushed_data_blocks = 1; >> >> + >> > >> > I'm not convinced this is right. >> > >> > From your test case, the problem isn't because we have journaled >> > metadata blocks (which is what bufs) counts, but because fsync() >> > depends on data blocks also getting flushed out to disks. >> > >> > However, if we aren't closing the transaction because of fsync(), I >> > don't think we need to do a barrier in the case of an external >> > journal. So instead of effectively unconditionally setting >> > t_flushed_data_blocks (since bufs is nearly always going to be >> > non-zero), I think the better fix is to test to see if the journal >> > device != to the fs data device in fsync(), and if so, start the >> > barrier operation there. >> > >> > Do you agree? >> Yes. > > Just to be clear, since I realized I wrote fsync() when I should have > written fs/ext4/fsync.c, my proposal was to put this check in > ext4_sync_file(). This means that we will end up with two io-barriers in a row for external journal case: ext4_sync_file() ->jbd2_log_start_commit() ->blkdev_issue_flush(j_fs_dev) /* seems following flush is mandatory to guarantee the metadata * consistency */ ->blkdev_issue_flush(j_fs_dev) may be it will be better to pass some sort of barrier flag to jbd2_log_start_commit() this function mark journal->j_commit_request with ->t_flushed_data_blocks = 1, so io-carrier will be issued even if transaction has only metadata Advantages: 1) approach will works for all data modes 2) only one io-barrier is needed to guarantee the data and metadata consiscency. 3) changes not so complicated. I've already started to work with the patch but was surprised with commit logic. ->__jbd2_log_start_commit(target) journal->j_commit_request = target ->wake_up(&journal->j_wait_commit) ->kjournald2 transaction = journal->j_running_transaction ->jbd2_journal_commit_transaction(journal); commit_transaction = journal->j_running_transaction ... /* So j_commit_request is used only for comparison. But actually committing journal->j_running_transaction->t_tid transaction instead of j_commit_request. Why? */ > >> BTW Would it be correct to update j_tail in >> jbd2_journal_commit_transaction() to something more recent >> if we have issued an io-barrier to j_fs_dev? >> This will helps to reduce journal_recovery time which may be really >> painful in some slow devices. > > Um, maybe. We are already calling __jbd2_journal_clean_checkpoint_list(), > and the barrier operation *is* expensive. > > On the other hand, updating the journal superblock on every sync is > another seek that would have to made before the barrier operation, and > I'm a bit concerned that this seek would be noticeable. If it is > noticeable, is it worth it to optimize for the uncommon case (a power > failure requiring a journal replay) when it might cost us something, > however, small, on every single journal update? > > Do we really think the journal replay time is really something that is > a major pain point. I can think of optimizations that involve > skipping writes that will get updated later in future transactions, > but it means complicating the replay code, which has been stable for a > long time, and it's not clear to me that the costs are worth the > benefits. Never mind. It was just my thoughts. The price of broken recovery stage is to expansive to fix such rare cases. > >> I've take a look at async commit logic: fs/jbd2/commit.c >> void jbd2_journal_commit_transaction(journal_t *journal) >> { >> 725: /* Done it all: now write the commit record asynchronously. */ >> if (JBD2_HAS_INCOMPAT_FEATURE(journal, >> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) >> { >> err = journal_submit_commit_record(journal, >> commit_transaction, >> &cbh, crc32_sum); >> if (err) >> __jbd2_journal_abort_hard(journal); >> if (journal->j_flags & JBD2_BARRIER) >> blkdev_issue_flush(journal->j_dev, NULL); >> <<< blkdev_issue_flush is wait for barrier to complete by default, but >> <<< in fact we don't have to wait for barrier here. I've prepared a >> <<< patch wich add flags to control blkdev_issue_flush() wait >> <<< behavior, and this is the place for no-wait variant. > > I think that's right, as long as we're confident that subsequent > writes won't get scheduled before the no-wait barrier. If it did, it > would be a bug in the block I/O layer, so it should be OK. > > - Ted > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote: > > > > Just to be clear, since I realized I wrote fsync() when I should have > > written fs/ext4/fsync.c, my proposal was to put this check in > > ext4_sync_file(). > This means that we will end up with two io-barriers in a row > for external journal case: > ext4_sync_file() > ->jbd2_log_start_commit() > ->blkdev_issue_flush(j_fs_dev) > /* seems following flush is mandatory to guarantee the metadata > * consistency */ > ->blkdev_issue_flush(j_fs_dev) Well, not if we only issue a barrier in the external barrier case.... but I agree the logic may start getting ugly. > I've already started to work with the patch but was surprised with > commit logic. > ->__jbd2_log_start_commit(target) > journal->j_commit_request = target > ->wake_up(&journal->j_wait_commit) > ->kjournald2 > transaction = journal->j_running_transaction > ->jbd2_journal_commit_transaction(journal); > commit_transaction = journal->j_running_transaction > ... > /* So j_commit_request is used only for comparison. But actually > committing journal->j_running_transaction->t_tid transaction > instead of j_commit_request. Why? > */ Yeah, I don't think jbd2_log_start_commit() has the semantics that are currently documented. We will return 0 if we wake up the commit thread, yes --- but since we only check j_commit_request, and it's a geq test, it might very well be the case that the transaction was committed long ago, and so we end up kicking off a transaction commit when one is not needed. Or, it may be that a transaction is just already being committed (and in fact is just about to be completed), and so the wake_up() call is a no-op, and so we don't get a barrier when one is needed (and expected) by ext4_sync_file(). We need to look at this very closely, but I think jbd2_log_start_commit() needs to check to see whether there is a current committing transaction, and whether it is the one which is that has been requested as a target. If so, and a barrier is requested, I think we need to have jbd2_log_start_commit() do the barrier. There is a risk of having a double barrier in that case, but modifying the flag on the currently committing transaction has the danger of being lost by kjournald --- or I guess alternatively we could have kjournald check that flag while holding j_state_lock, which might be better. If the currently running transaction is the requested target, then that's easy; we can set the flag, and then wake up the j_wait_commit wait queue. In any case, log_start_commit() doesn't currently distinguish between whether the targetted commit is the running or the committing transaction when it returns 0, and I think that's a problem. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> tytso@mit.edu writes: > > > On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote: > >> tytso@mit.edu writes: > >> > >> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote: > >> >> start_journal_io: > >> >> + if (bufs) > >> >> + commit_transaction->t_flushed_data_blocks = 1; > >> >> + > >> > > >> > I'm not convinced this is right. > >> > > >> > From your test case, the problem isn't because we have journaled > >> > metadata blocks (which is what bufs) counts, but because fsync() > >> > depends on data blocks also getting flushed out to disks. > >> > > >> > However, if we aren't closing the transaction because of fsync(), I > >> > don't think we need to do a barrier in the case of an external > >> > journal. So instead of effectively unconditionally setting > >> > t_flushed_data_blocks (since bufs is nearly always going to be > >> > non-zero), I think the better fix is to test to see if the journal > >> > device != to the fs data device in fsync(), and if so, start the > >> > barrier operation there. > >> > > >> > Do you agree? > >> Yes. > > > > Just to be clear, since I realized I wrote fsync() when I should have > > written fs/ext4/fsync.c, my proposal was to put this check in > > ext4_sync_file(). > This means that we will end up with two io-barriers in a row > for external journal case: > ext4_sync_file() > ->jbd2_log_start_commit() > ->blkdev_issue_flush(j_fs_dev) > /* seems following flush is mandatory to guarantee the metadata > * consistency */ > ->blkdev_issue_flush(j_fs_dev) > > may be it will be better to pass some sort of barrier flag to > jbd2_log_start_commit() this function mark journal->j_commit_request > with ->t_flushed_data_blocks = 1, so io-carrier will be issued > even if transaction has only metadata > Advantages: > 1) approach will works for all data modes > 2) only one io-barrier is needed to guarantee the data and > metadata consiscency. > 3) changes not so complicated. > > I've already started to work with the patch but was surprised with > commit logic. > ->__jbd2_log_start_commit(target) > journal->j_commit_request = target > ->wake_up(&journal->j_wait_commit) > ->kjournald2 > transaction = journal->j_running_transaction > ->jbd2_journal_commit_transaction(journal); > commit_transaction = journal->j_running_transaction > ... > /* So j_commit_request is used only for comparison. But actually > committing journal->j_running_transaction->t_tid transaction > instead of j_commit_request. Why? > */ Remember: There is always at most one running transaction and at most one transaction in the committing state - all the JBD2 and ext4 code heavily relies on this. So in this case, because kjournald isn't committing any transaction, we know that there is no committing transaction and at most one running transaction whose tid = journal->j_commit_sequence + 1. So actually the only trasaction we can commit is the running one and it does not even make sence to try committing another one (the check in log_start_commit actually makes sure that j_commit_request cannot be set to a lower value than it is and the code in kjournald2 makes sure that j_commit_request >= j_commit_sequence. So I think the code works right (although I agree it's hard to read - it's an old legacy ;). Honza
> On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote: > > > > > > Just to be clear, since I realized I wrote fsync() when I should have > > > written fs/ext4/fsync.c, my proposal was to put this check in > > > ext4_sync_file(). > > This means that we will end up with two io-barriers in a row > > for external journal case: > > ext4_sync_file() > > ->jbd2_log_start_commit() > > ->blkdev_issue_flush(j_fs_dev) > > /* seems following flush is mandatory to guarantee the metadata > > * consistency */ > > ->blkdev_issue_flush(j_fs_dev) > > Well, not if we only issue a barrier in the external barrier case.... > but I agree the logic may start getting ugly. > > > I've already started to work with the patch but was surprised with > > commit logic. > > ->__jbd2_log_start_commit(target) > > journal->j_commit_request = target > > ->wake_up(&journal->j_wait_commit) > > ->kjournald2 > > transaction = journal->j_running_transaction > > ->jbd2_journal_commit_transaction(journal); > > commit_transaction = journal->j_running_transaction > > ... > > /* So j_commit_request is used only for comparison. But actually > > committing journal->j_running_transaction->t_tid transaction > > instead of j_commit_request. Why? > > */ > > Yeah, I don't think jbd2_log_start_commit() has the semantics that are > currently documented. We will return 0 if we wake up the commit > thread, yes --- but since we only check j_commit_request, and it's a > geq test, it might very well be the case that the transaction was > committed long ago, and so we end up kicking off a transaction commit > when one is not needed. Or, it may be that a transaction is just > already being committed (and in fact is just about to be completed), > and so the wake_up() call is a no-op, and so we don't get a barrier > when one is needed (and expected) by ext4_sync_file(). > > We need to look at this very closely, but I think > jbd2_log_start_commit() needs to check to see whether there is a > current committing transaction, and whether it is the one which is > that has been requested as a target. If so, and a barrier is > requested, I think we need to have jbd2_log_start_commit() do the > barrier. There is a risk of having a double barrier in that case, but > modifying the flag on the currently committing transaction has the > danger of being lost by kjournald --- or I guess alternatively we > could have kjournald check that flag while holding j_state_lock, which > might be better. But still if you happen to set the flag after commit code has checked it, you have lost the race. Given the bug you describe below, I think we should provide a new call from JBD2 that will do all the necessary magic for given TID - something like: data_barrier = 0; if (journal->j_commit_sequence > tid) data_barrier = 1; else if (journal->j_committing_transaction && journal->j_committing_transaction->t_tid == tid) { /* Here we could be more clever and set the barrier * flag of the transaction if according to its state * it has not yet issued data barrier */ data_barrier = 1; } else { journal->j_running_transaction->has_data = 1; jbd2_log_start_commit(journal, tid); } jbd2_log_wait_commit(journal, tid); if (data_barrier) blk_dev_issue_flush(journal->j_fs_dev); What do you think? > If the currently running transaction is the requested target, then > that's easy; we can set the flag, and then wake up the j_wait_commit > wait queue. > > In any case, log_start_commit() doesn't currently distinguish between > whether the targetted commit is the running or the committing > transaction when it returns 0, and I think that's a problem. Yeah, I agree this is a problem and it could cause ext4_sync_file not properly wait. The simplest fix is to call jbd2_log_wait_commit unconditionally. But I'd maybe go with the above approach. Honza
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 98bd140..eb52837 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -101,7 +101,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) (journal->j_fs_dev != journal->j_dev) && (journal->j_flags & JBD2_BARRIER)) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); - jbd2_log_wait_commit(journal, commit_tid); + ret = jbd2_log_wait_commit(journal, commit_tid); } else if (journal->j_flags & JBD2_BARRIER) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); return ret; diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 1bc74b6..2f62f1b 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -682,6 +682,9 @@ void jbd2_journal_commit_transaction(journal_t *journal) tag->t_flags |= cpu_to_be32(JBD2_FLAG_LAST_TAG); start_journal_io: + if (bufs) + commit_transaction->t_flushed_data_blocks = 1; + for (i = 0; i < bufs; i++) { struct buffer_head *bh = wbuf[i]; /*
Currently journalled mode is still broken if fs use external journal. This is because all data handled as metadata. We must to issue io-barrier to fs_device even if transaction has only metadata blocks. Also add missed return value check. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/fsync.c | 2 +- fs/jbd2/commit.c | 3 +++ 2 files changed, 4 insertions(+), 1 deletions(-)