diff mbox

imporve jbd2 fsync batching

Message ID 20081104161024.GC28058@unused.rdu.redhat.com
State Accepted, archived
Headers show

Commit Message

Josef Bacik Nov. 4, 2008, 4:10 p.m. UTC
Hello,

This is the jbd2 version of the jbd fsync batching patch.

This patch removes the static sleep time in favor of a more self optimizing
approach where we measure the average amount of time it takes to commit a
transaction to disk and the ammount of time a transaction has been running.  If
somebody does a sync write or an fsync() traditionally we would sleep for 1
jiffies, which depending on the value of HZ could be a significant amount of
time compared to how long it takes to commit a transaction to the underlying
storage.  With this patch instead of sleeping for a jiffie, we check to see if
the amount of time this transaction has been running is less than the average
commit time, and if it is we sleep for the delta using schedule_hrtimeout to
give us a higher precision sleep time.  This greatly benefits high end storage
where you could end up sleeping for longer than it takes to commit the
transaction and therefore sitting idle instead of allowing the transaction to be
committed by keeping the sleep time to a minimum so you are sure to always be
doing something.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Nov. 4, 2008, 8:52 p.m. UTC | #1
Thanks!!

Once tiny spelling nit, which I'll fix up before I put it in the ext4
patch queue.

On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> @@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle)
>  	 * on IO anyway.  Speeds up many-threaded, many-dir operations
>  	 * by 30x or more...
>  	 *
> +	 * We try and optimize the sleep time against what the underlying disk
> +	 * can do, instead of having a static sleep time.  This is usefull for

s/usefull/useful/

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Leroy van Logchem Nov. 4, 2008, 10:15 p.m. UTC | #2
> > +	 * can do, instead of having a static sleep time.  This is usefull for
> 
> s/usefull/useful/

and:

+	 * join the transaction.  We acheive this by measuring how long it takes

s/acheive/achieve/
Andreas Dilger Nov. 5, 2008, 11:10 p.m. UTC | #3
On Nov 04, 2008  11:10 -0500, Josef Bacik wrote:
> somebody does a sync write or an fsync() traditionally we would sleep for 1
> jiffies, which depending on the value of HZ could be a significant amount of
> time compared to how long it takes to commit a transaction to the underlying
> storage.  With this patch instead of sleeping for a jiffie, we check to see if
> the amount of time this transaction has been running is less than the average
> commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> give us a higher precision sleep time.  This greatly benefits high end storage
> where you could end up sleeping for longer than it takes to commit the
> transaction and therefore sitting idle instead of allowing the transaction to
> be committed by keeping the sleep time to a minimum so you are sure to always
> be doing something.

There was no reply to my previous comments about making the maximum sleep
time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
on the jiffies value, which may change between 1ms and 10ms.

I don't object to this being included in ext4, but I suspect it could do
even better, or at least be more consistent than depending on the HZ value.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Nov. 6, 2008, 12:27 a.m. UTC | #4
On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.

I know the whole point is to make it be self-tuning, but I wonder if
we should have some knobs to control the min and max sleep times.

   	          	       	    	     	     - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ric Wheeler Nov. 6, 2008, 12:45 p.m. UTC | #5
Theodore Tso wrote:
> On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
>   
>> There was no reply to my previous comments about making the maximum sleep
>> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
>> on the jiffies value, which may change between 1ms and 10ms.
>>     
>
> I know the whole point is to make it be self-tuning, but I wonder if
> we should have some knobs to control the min and max sleep times.
>
>    	          	       	    	     	     - Ted
>   

I think that it could be useful to have a tunable knob for the max sleep 
time. It would certainly be capped at a much lower level for a RAM disk 
than it would be for a very large RAID volume...

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Nov. 6, 2008, 2:35 p.m. UTC | #6
On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> On Nov 04, 2008  11:10 -0500, Josef Bacik wrote:
> > somebody does a sync write or an fsync() traditionally we would sleep for 1
> > jiffies, which depending on the value of HZ could be a significant amount of
> > time compared to how long it takes to commit a transaction to the underlying
> > storage.  With this patch instead of sleeping for a jiffie, we check to see if
> > the amount of time this transaction has been running is less than the average
> > commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> > give us a higher precision sleep time.  This greatly benefits high end storage
> > where you could end up sleeping for longer than it takes to commit the
> > transaction and therefore sitting idle instead of allowing the transaction to
> > be committed by keeping the sleep time to a minimum so you are sure to always
> > be doing something.
> 
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.
> 
> I don't object to this being included in ext4, but I suspect it could do
> even better, or at least be more consistent than depending on the HZ value.
>

Sorry I missed that comment.  The reason I set the top to be 1 jiffies was
because thats what it used to be, so in an effort to try and keep regressions
from happening I just left it as the max.  I'm fine with setting the max to some
other more constant value.  Thanks,

Josef 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Nov. 25, 2008, 10:23 a.m. UTC | #7
On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> +
> +	/*
> +	 * weight the commit time higher than the average time so we don't
> +	 * react too strongly to vast changes in the commit time
> +	 */
> +	if (likely(journal->j_average_commit_time))
> +		journal->j_average_commit_time = (commit_time*3 +
> +				journal->j_average_commit_time) / 4;

Stupid question... if the goal is to not have the average commit time
not react too strongly to sudden and vast changes to the commit time,
would it be better to do this instead:

> +		journal->j_average_commit_time = (commit_time +
> +				journal->j_average_commit_time*3) / 4;


						    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Nov. 25, 2008, 10:41 p.m. UTC | #8
On Nov 25, 2008  05:23 -0500, Theodore Ts'o wrote:
> On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> > +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> > +
> > +	/*
> > +	 * weight the commit time higher than the average time so we don't
> > +	 * react too strongly to vast changes in the commit time
> > +	 */
> > +	if (likely(journal->j_average_commit_time))
> > +		journal->j_average_commit_time = (commit_time*3 +
> > +				journal->j_average_commit_time) / 4;
> 
> Stupid question... if the goal is to not have the average commit time
> not react too strongly to sudden and vast changes to the commit time,
> would it be better to do this instead:
> 
> > +		journal->j_average_commit_time = (commit_time +
> > +				journal->j_average_commit_time*3) / 4;

Actually, yes.  That is my fault, I'd suggested the original change to
Josef.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Nov. 26, 2008, 5:10 a.m. UTC | #9
On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > Stupid question... if the goal is to not have the average commit time
> > not react too strongly to sudden and vast changes to the commit time,
> > would it be better to do this instead:
> > 
> > > +		journal->j_average_commit_time = (commit_time +
> > > +				journal->j_average_commit_time*3) / 4;
> 
> Actually, yes.  That is my fault, I'd suggested the original change to
> Josef.

BTW, I've added a patch to display the average commit time it does
vary wildly, especially on a laptop hard drive.  While the system is
idle, and the occasional commits need to wait for the hard drive to
wake up, leads to a average commit time of around 80-140ms.  If the
disk is just getting lightly tickled, such that it never has a chance
to go to sleep, the average commit time can drop down to around
20-25ms.  If the hard drive is really busy, then the average commit
time can go up to 40-50ms.

Increasing the weight as described below will slow down the move to
these long-term averages, but at least for laptop or Green Star drives
with power savings enabled, the average commit time does seem to vary
in some non-inintuitive ways.  Of course, if we are capping the max
transaction time at 15ms, most of this will never be visible, but it
would probably be interesting to test out this patch on a fast SSD or
an expensive enterprise array to see whether are similar surprising
variations in the average commit time.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Nov. 26, 2008, 1:18 p.m. UTC | #10
On Wed, Nov 26, 2008 at 12:10:57AM -0500, Theodore Tso wrote:
> On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > > Stupid question... if the goal is to not have the average commit time
> > > not react too strongly to sudden and vast changes to the commit time,
> > > would it be better to do this instead:
> > > 
> > > > +		journal->j_average_commit_time = (commit_time +
> > > > +				journal->j_average_commit_time*3) / 4;
> > 
> > Actually, yes.  That is my fault, I'd suggested the original change to
> > Josef.
> 
> BTW, I've added a patch to display the average commit time it does
> vary wildly, especially on a laptop hard drive.  While the system is
> idle, and the occasional commits need to wait for the hard drive to
> wake up, leads to a average commit time of around 80-140ms.  If the
> disk is just getting lightly tickled, such that it never has a chance
> to go to sleep, the average commit time can drop down to around
> 20-25ms.  If the hard drive is really busy, then the average commit
> time can go up to 40-50ms.
> 
> Increasing the weight as described below will slow down the move to
> these long-term averages, but at least for laptop or Green Star drives
> with power savings enabled, the average commit time does seem to vary
> in some non-inintuitive ways.  Of course, if we are capping the max
> transaction time at 15ms, most of this will never be visible, but it
> would probably be interesting to test out this patch on a fast SSD or
> an expensive enterprise array to see whether are similar surprising
> variations in the average commit time.
>

I was printing out the commit time when I was testing this and with high end
storage the commit time didn't vary all that much.  With my SATA drive on a
normal running box it didn't vary all that much either, it would drop down to
like 1ms because of syslog, but other than that it would ramp up when there was
load and it would slowly come back down as the load went away.  Also do you
want a new patch with that fix, or will you just fix it up?  Thanks,

Josef 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8b119e1..3169db9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -332,6 +332,8 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 	int flags;
 	int err;
 	unsigned long long blocknr;
+	ktime_t start_time;
+	u64 commit_time;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
@@ -458,6 +460,7 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 	commit_transaction->t_state = T_FLUSH;
 	journal->j_committing_transaction = commit_transaction;
 	journal->j_running_transaction = NULL;
+	start_time = ktime_get();
 	commit_transaction->t_log_start = journal->j_head;
 	wake_up(&journal->j_wait_transaction_locked);
 	spin_unlock(&journal->j_state_lock);
@@ -972,6 +975,17 @@  restart_loop:
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+	/*
+	 * weight the commit time higher than the average time so we don't
+	 * react too strongly to vast changes in the commit time
+	 */
+	if (likely(journal->j_average_commit_time))
+		journal->j_average_commit_time = (commit_time*3 +
+				journal->j_average_commit_time) / 4;
+	else
+		journal->j_average_commit_time = commit_time;
 	spin_unlock(&journal->j_state_lock);
 
 	if (commit_transaction->t_checkpoint_list == NULL &&
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 39b7805..ef705f9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -25,6 +25,7 @@ 
 #include <linux/timer.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
+#include <linux/hrtimer.h>
 
 static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
 
@@ -48,6 +49,7 @@  jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
 {
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
+	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
@@ -1193,7 +1195,7 @@  int jbd2_journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
+	int err;
 	pid_t pid;
 
 	J_ASSERT(journal_current_handle() == handle);
@@ -1222,6 +1224,17 @@  int jbd2_journal_stop(handle_t *handle)
 	 * on IO anyway.  Speeds up many-threaded, many-dir operations
 	 * by 30x or more...
 	 *
+	 * We try and optimize the sleep time against what the underlying disk
+	 * can do, instead of having a static sleep time.  This is usefull for
+	 * the case where our storage is so fast that it is more optimal to go
+	 * ahead and force a flush and wait for the transaction to be committed
+	 * than it is to wait for an arbitrary amount of time for new writers to
+	 * join the transaction.  We acheive this by measuring how long it takes
+	 * to commit a transaction, and compare it with how long this
+	 * transaction has been running, and if run time < commit time then we
+	 * sleep for the delta and commit.  This greatly helps super fast disks
+	 * that would see slowdowns as more threads started doing fsyncs.
+	 *
 	 * But don't do this if this process was the most recent one to
 	 * perform a synchronous write.  We do this to detect the case where a
 	 * single process is doing a stream of sync writes.  No point in waiting
@@ -1229,11 +1242,26 @@  int jbd2_journal_stop(handle_t *handle)
 	 */
 	pid = current->pid;
 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		u64 commit_time, trans_time;
+
 		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+
+		spin_lock(&journal->j_state_lock);
+		commit_time = journal->j_average_commit_time;
+		spin_unlock(&journal->j_state_lock);
+
+		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+						   transaction->t_start_time));
+
+		commit_time = min_t(u64, commit_time,
+				    1000*jiffies_to_usecs(1));
+
+		if (trans_time < commit_time) {
+			ktime_t expires = ktime_add_ns(ktime_get(),
+						       commit_time);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+		}
 	}
 
 	current->journal_info = NULL;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index c7d106e..b8b8744 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -637,6 +637,11 @@  struct transaction_s
 	unsigned long		t_expires;
 
 	/*
+	 * When this transaction started, in nanoseconds [no locking]
+	 */
+	ktime_t			t_start_time;
+
+	/*
 	 * How many handles used this transaction? [t_handle_lock]
 	 */
 	int t_handle_count;
@@ -938,8 +943,18 @@  struct journal_s
 	struct buffer_head	**j_wbuf;
 	int			j_wbufsize;
 
+	/*
+	 * this is the pid of hte last person to run a synchronous operation
+	 * through the journal
+	 */
 	pid_t			j_last_sync_writer;
 
+	/*
+	 * the average amount of time in nanoseconds it takes to commit a
+	 * transaction to disk. [j_state_lock]
+	 */
+	u64			j_average_commit_time;
+
 	/* This function is called when a transaction is closed */
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);