diff mbox series

[RFC] Fix jbd2 to stop waking up sleeping disks on sync

Message ID 20240227212546.110340-1-phill@thesusis.net
State Rejected
Headers show
Series [RFC] Fix jbd2 to stop waking up sleeping disks on sync | expand

Commit Message

Phillip Susi Feb. 27, 2024, 9:25 p.m. UTC
I noticed that every time I sync ( which happens automatically when
you suspend to ram ), ext4 issues a flush to the block device, even
though there have been no writes to flush.  This appears to be because
jbd2_trans_will_send_data_barrier() returns a 0 when no transaction
has been started.  The intent appears to be that a transaction that
has completed should return 0, and that when there is NO transaction,
it should return a 1, but the tests were in the wrong order, leading
to the 0 to be returned before checking for the absense of a
transaction at all.  Reversing the order allows my disk to remain in
runtime_pm when syncing.

I *think* this is correct, but I'm not very familliar with jbd2, so it
may have unintended consequences.  What do you think?
---
 fs/jbd2/journal.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Theodore Ts'o Feb. 29, 2024, 8:07 a.m. UTC | #1
On Tue, Feb 27, 2024 at 04:25:46PM -0500, Phillip Susi wrote:
> I noticed that every time I sync ( which happens automatically when
> you suspend to ram ), ext4 issues a flush to the block device, even
> though there have been no writes to flush.  This appears to be because
> jbd2_trans_will_send_data_barrier() returns a 0 when no transaction
> has been started.  The intent appears to be that a transaction that
> has completed should return 0, and that when there is NO transaction,
> it should return a 1, but the tests were in the wrong order, leading
> to the 0 to be returned before checking for the absense of a
> transaction at all.  Reversing the order allows my disk to remain in
> runtime_pm when syncing.
> 
> I *think* this is correct, but I'm not very familliar with jbd2, so it
> may have unintended consequences.  What do you think?

Yeah, this change is going to problems.  The basic idea here is if
when we request that a transaction to commit, will it issue a a
commit?  If so, then fsync(2) doesn't need to issue a barrier (i.e., a
cache flush command).

So for example, if a database does an overwriting write to a file
block which is already allocated, and then follows it up with a
fdatasync(2), there won't be any need to make any metadata changes as
part of writing out the changed block.  Hence, we won't need to start
a new jbd2 transaction, and in that case, current transaction has
already commited, so the jbd2 layer won't need to do anything, and so
it won't have issued a commit.  In that case,
jbd2_trans_will_send_data_barrier() needs to return false, so that
fdatasync(2) will actually issue a cache flush command.

The patch you've proposed will cause fdatasync(2) to not issue a
barrier, which could lead to the write to the database file getting
lost after a power fail event, which would make the database
adminisrtator very sad.

Cheers,

					- Ted
Phillip Susi Feb. 29, 2024, 2:23 p.m. UTC | #2
"Theodore Tso" <tytso@mit.edu> writes:

> Yeah, this change is going to problems.  The basic idea here is if
> when we request that a transaction to commit, will it issue a a
> commit?  If so, then fsync(2) doesn't need to issue a barrier (i.e., a
> cache flush command).
>
> So for example, if a database does an overwriting write to a file
> block which is already allocated, and then follows it up with a
> fdatasync(2), there won't be any need to make any metadata changes as
> part of writing out the changed block.  Hence, we won't need to start
> a new jbd2 transaction, and in that case, current transaction has
> already commited, so the jbd2 layer won't need to do anything, and so
> it won't have issued a commit.  In that case,
> jbd2_trans_will_send_data_barrier() needs to return false, so that
> fdatasync(2) will actually issue a cache flush command.
>
> The patch you've proposed will cause fdatasync(2) to not issue a
> barrier, which could lead to the write to the database file getting
> lost after a power fail event, which would make the database
> adminisrtator very sad.

So because no metadata changed, jbd2 will not issue a barrier to end the
transaction?  How can we fix this then?  Is there some way that jbd2 can
know whether file data has been written, and thus, issue the barrier to
close the transaction?
Theodore Ts'o Feb. 29, 2024, 2:58 p.m. UTC | #3
On Thu, Feb 29, 2024 at 09:23:35AM -0500, Phillip Susi wrote:
> 
> So because no metadata changed, jbd2 will not issue a barrier to end the
> transaction?  How can we fix this then?  Is there some way that jbd2 can
> know whether file data has been written, and thus, issue the barrier to
> close the transaction?

Because no metadata changed, jbd2 will not even *start* a (jbd2)
transaction as a result of that write (overwrite) to an already
allocated data block..  Since it didn't start a jbd2 transaction for
this file system operation, there's no reason to force a jbd2
transaction to close.

(Note: this could because there *is* no currently existing open
transaction, or there might be a currently open transaction, but it's
not relevent to the activity associated with the file descriptor being
fsync'ed.)

This is a critical performance optimization, because for many database
workloads, which are overwriting existing blocks and using
fdatasync(2), there is no reason to force a jbd2 transaction commit
for every single fdatasync(2) issued by the database.  However, we
still need to send a cache flush operation so that the data block is
safely persistend onto stable storage.

Cheers,

						- Ted
Phillip Susi Feb. 29, 2024, 8:19 p.m. UTC | #4
"Theodore Ts'o" <tytso@mit.edu> writes:

> Because no metadata changed, jbd2 will not even *start* a (jbd2)
> transaction as a result of that write (overwrite) to an already
> allocated data block..  Since it didn't start a jbd2 transaction for
> this file system operation, there's no reason to force a jbd2
> transaction to close.
>
> (Note: this could because there *is* no currently existing open
> transaction, or there might be a currently open transaction, but it's
> not relevent to the activity associated with the file descriptor being
> fsync'ed.)
>
> This is a critical performance optimization, because for many database
> workloads, which are overwriting existing blocks and using
> fdatasync(2), there is no reason to force a jbd2 transaction commit
> for every single fdatasync(2) issued by the database.  However, we
> still need to send a cache flush operation so that the data block is
> safely persistend onto stable storage.

So maybe what ext4's sync_fs needs to know is whether ANY writes have
been done since the last transaction committed?  Is there a way to know
that?  As long as NOTHING has been written since the last commit, then
there is no need to issue a flush.
diff mbox series

Patch

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b6c114c11b97..be13dae767be 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -632,14 +632,16 @@  int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid)
 	if (!(journal->j_flags & JBD2_BARRIER))
 		return 0;
 	read_lock(&journal->j_state_lock);
-	/* Transaction already committed? */
-	if (tid_geq(journal->j_commit_sequence, tid))
-		goto out;
 	commit_trans = journal->j_committing_transaction;
 	if (!commit_trans || commit_trans->t_tid != tid) {
 		ret = 1;
 		goto out;
 	}
+	/* Transaction already committed? */
+	if (tid_geq(journal->j_commit_sequence, tid))
+	{
+		goto out;
+	}
 	/*
 	 * Transaction is being committed and we already proceeded to
 	 * submitting a flush to fs partition?