Message ID | 20081104161024.GC28058@unused.rdu.redhat.com |
---|---|
State | Accepted, archived |
Headers | show |
Thanks!! Once tiny spelling nit, which I'll fix up before I put it in the ext4 patch queue. On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote: > @@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle) > * on IO anyway. Speeds up many-threaded, many-dir operations > * by 30x or more... > * > + * We try and optimize the sleep time against what the underlying disk > + * can do, instead of having a static sleep time. This is usefull for s/usefull/useful/ - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > + * can do, instead of having a static sleep time. This is usefull for > > s/usefull/useful/ and: + * join the transaction. We acheive this by measuring how long it takes s/acheive/achieve/
On Nov 04, 2008 11:10 -0500, Josef Bacik wrote: > somebody does a sync write or an fsync() traditionally we would sleep for 1 > jiffies, which depending on the value of HZ could be a significant amount of > time compared to how long it takes to commit a transaction to the underlying > storage. With this patch instead of sleeping for a jiffie, we check to see if > the amount of time this transaction has been running is less than the average > commit time, and if it is we sleep for the delta using schedule_hrtimeout to > give us a higher precision sleep time. This greatly benefits high end storage > where you could end up sleeping for longer than it takes to commit the > transaction and therefore sitting idle instead of allowing the transaction to > be committed by keeping the sleep time to a minimum so you are sure to always > be doing something. There was no reply to my previous comments about making the maximum sleep time be a fixed value (e.g. 15ms) instead of having it arbitrarily based on the jiffies value, which may change between 1ms and 10ms. I don't object to this being included in ext4, but I suspect it could do even better, or at least be more consistent than depending on the HZ value. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote: > There was no reply to my previous comments about making the maximum sleep > time be a fixed value (e.g. 15ms) instead of having it arbitrarily based > on the jiffies value, which may change between 1ms and 10ms. I know the whole point is to make it be self-tuning, but I wonder if we should have some knobs to control the min and max sleep times. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote: > >> There was no reply to my previous comments about making the maximum sleep >> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based >> on the jiffies value, which may change between 1ms and 10ms. >> > > I know the whole point is to make it be self-tuning, but I wonder if > we should have some knobs to control the min and max sleep times. > > - Ted > I think that it could be useful to have a tunable knob for the max sleep time. It would certainly be capped at a much lower level for a RAM disk than it would be for a very large RAID volume... Ric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote: > On Nov 04, 2008 11:10 -0500, Josef Bacik wrote: > > somebody does a sync write or an fsync() traditionally we would sleep for 1 > > jiffies, which depending on the value of HZ could be a significant amount of > > time compared to how long it takes to commit a transaction to the underlying > > storage. With this patch instead of sleeping for a jiffie, we check to see if > > the amount of time this transaction has been running is less than the average > > commit time, and if it is we sleep for the delta using schedule_hrtimeout to > > give us a higher precision sleep time. This greatly benefits high end storage > > where you could end up sleeping for longer than it takes to commit the > > transaction and therefore sitting idle instead of allowing the transaction to > > be committed by keeping the sleep time to a minimum so you are sure to always > > be doing something. > > There was no reply to my previous comments about making the maximum sleep > time be a fixed value (e.g. 15ms) instead of having it arbitrarily based > on the jiffies value, which may change between 1ms and 10ms. > > I don't object to this being included in ext4, but I suspect it could do > even better, or at least be more consistent than depending on the HZ value. > Sorry I missed that comment. The reason I set the top to be 1 jiffies was because thats what it used to be, so in an effort to try and keep regressions from happening I just left it as the max. I'm fine with setting the max to some other more constant value. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote: > + commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time)); > + > + /* > + * weight the commit time higher than the average time so we don't > + * react too strongly to vast changes in the commit time > + */ > + if (likely(journal->j_average_commit_time)) > + journal->j_average_commit_time = (commit_time*3 + > + journal->j_average_commit_time) / 4; Stupid question... if the goal is to not have the average commit time not react too strongly to sudden and vast changes to the commit time, would it be better to do this instead: > + journal->j_average_commit_time = (commit_time + > + journal->j_average_commit_time*3) / 4; - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Nov 25, 2008 05:23 -0500, Theodore Ts'o wrote: > On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote: > > + commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time)); > > + > > + /* > > + * weight the commit time higher than the average time so we don't > > + * react too strongly to vast changes in the commit time > > + */ > > + if (likely(journal->j_average_commit_time)) > > + journal->j_average_commit_time = (commit_time*3 + > > + journal->j_average_commit_time) / 4; > > Stupid question... if the goal is to not have the average commit time > not react too strongly to sudden and vast changes to the commit time, > would it be better to do this instead: > > > + journal->j_average_commit_time = (commit_time + > > + journal->j_average_commit_time*3) / 4; Actually, yes. That is my fault, I'd suggested the original change to Josef. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote: > > Stupid question... if the goal is to not have the average commit time > > not react too strongly to sudden and vast changes to the commit time, > > would it be better to do this instead: > > > > > + journal->j_average_commit_time = (commit_time + > > > + journal->j_average_commit_time*3) / 4; > > Actually, yes. That is my fault, I'd suggested the original change to > Josef. BTW, I've added a patch to display the average commit time it does vary wildly, especially on a laptop hard drive. While the system is idle, and the occasional commits need to wait for the hard drive to wake up, leads to a average commit time of around 80-140ms. If the disk is just getting lightly tickled, such that it never has a chance to go to sleep, the average commit time can drop down to around 20-25ms. If the hard drive is really busy, then the average commit time can go up to 40-50ms. Increasing the weight as described below will slow down the move to these long-term averages, but at least for laptop or Green Star drives with power savings enabled, the average commit time does seem to vary in some non-inintuitive ways. Of course, if we are capping the max transaction time at 15ms, most of this will never be visible, but it would probably be interesting to test out this patch on a fast SSD or an expensive enterprise array to see whether are similar surprising variations in the average commit time. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 26, 2008 at 12:10:57AM -0500, Theodore Tso wrote: > On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote: > > > Stupid question... if the goal is to not have the average commit time > > > not react too strongly to sudden and vast changes to the commit time, > > > would it be better to do this instead: > > > > > > > + journal->j_average_commit_time = (commit_time + > > > > + journal->j_average_commit_time*3) / 4; > > > > Actually, yes. That is my fault, I'd suggested the original change to > > Josef. > > BTW, I've added a patch to display the average commit time it does > vary wildly, especially on a laptop hard drive. While the system is > idle, and the occasional commits need to wait for the hard drive to > wake up, leads to a average commit time of around 80-140ms. If the > disk is just getting lightly tickled, such that it never has a chance > to go to sleep, the average commit time can drop down to around > 20-25ms. If the hard drive is really busy, then the average commit > time can go up to 40-50ms. > > Increasing the weight as described below will slow down the move to > these long-term averages, but at least for laptop or Green Star drives > with power savings enabled, the average commit time does seem to vary > in some non-inintuitive ways. Of course, if we are capping the max > transaction time at 15ms, most of this will never be visible, but it > would probably be interesting to test out this patch on a fast SSD or > an expensive enterprise array to see whether are similar surprising > variations in the average commit time. > I was printing out the commit time when I was testing this and with high end storage the commit time didn't vary all that much. With my SATA drive on a normal running box it didn't vary all that much either, it would drop down to like 1ms because of syslog, but other than that it would ramp up when there was load and it would slowly come back down as the load went away. Also do you want a new patch with that fix, or will you just fix it up? Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 8b119e1..3169db9 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -332,6 +332,8 @@ void jbd2_journal_commit_transaction(journal_t *journal) int flags; int err; unsigned long long blocknr; + ktime_t start_time; + u64 commit_time; char *tagp = NULL; journal_header_t *header; journal_block_tag_t *tag = NULL; @@ -458,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) commit_transaction->t_state = T_FLUSH; journal->j_committing_transaction = commit_transaction; journal->j_running_transaction = NULL; + start_time = ktime_get(); commit_transaction->t_log_start = journal->j_head; wake_up(&journal->j_wait_transaction_locked); spin_unlock(&journal->j_state_lock); @@ -972,6 +975,17 @@ restart_loop: J_ASSERT(commit_transaction == journal->j_committing_transaction); journal->j_commit_sequence = commit_transaction->t_tid; journal->j_committing_transaction = NULL; + commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time)); + + /* + * weight the commit time higher than the average time so we don't + * react too strongly to vast changes in the commit time + */ + if (likely(journal->j_average_commit_time)) + journal->j_average_commit_time = (commit_time*3 + + journal->j_average_commit_time) / 4; + else + journal->j_average_commit_time = commit_time; spin_unlock(&journal->j_state_lock); if (commit_transaction->t_checkpoint_list == NULL && diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 39b7805..ef705f9 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -25,6 +25,7 @@ #include <linux/timer.h> #include <linux/mm.h> #include <linux/highmem.h> +#include <linux/hrtimer.h> static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh); @@ -48,6 +49,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction) { transaction->t_journal = journal; transaction->t_state = T_RUNNING; + transaction->t_start_time = ktime_get(); transaction->t_tid = journal->j_transaction_sequence++; transaction->t_expires = jiffies + journal->j_commit_interval; spin_lock_init(&transaction->t_handle_lock); @@ -1193,7 +1195,7 @@ int jbd2_journal_stop(handle_t *handle) { transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; - int old_handle_count, err; + int err; pid_t pid; J_ASSERT(journal_current_handle() == handle); @@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle) * on IO anyway. Speeds up many-threaded, many-dir operations * by 30x or more... * + * We try and optimize the sleep time against what the underlying disk + * can do, instead of having a static sleep time. This is usefull for + * the case where our storage is so fast that it is more optimal to go + * ahead and force a flush and wait for the transaction to be committed + * than it is to wait for an arbitrary amount of time for new writers to + * join the transaction. We acheive this by measuring how long it takes + * to commit a transaction, and compare it with how long this + * transaction has been running, and if run time < commit time then we + * sleep for the delta and commit. This greatly helps super fast disks + * that would see slowdowns as more threads started doing fsyncs. + * * But don't do this if this process was the most recent one to * perform a synchronous write. We do this to detect the case where a * single process is doing a stream of sync writes. No point in waiting @@ -1229,11 +1242,26 @@ int jbd2_journal_stop(handle_t *handle) */ pid = current->pid; if (handle->h_sync && journal->j_last_sync_writer != pid) { + u64 commit_time, trans_time; + journal->j_last_sync_writer = pid; - do { - old_handle_count = transaction->t_handle_count; - schedule_timeout_uninterruptible(1); - } while (old_handle_count != transaction->t_handle_count); + + spin_lock(&journal->j_state_lock); + commit_time = journal->j_average_commit_time; + spin_unlock(&journal->j_state_lock); + + trans_time = ktime_to_ns(ktime_sub(ktime_get(), + transaction->t_start_time)); + + commit_time = min_t(u64, commit_time, + 1000*jiffies_to_usecs(1)); + + if (trans_time < commit_time) { + ktime_t expires = ktime_add_ns(ktime_get(), + commit_time); + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_hrtimeout(&expires, HRTIMER_MODE_ABS); + } } current->journal_info = NULL; diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index c7d106e..b8b8744 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -637,6 +637,11 @@ struct transaction_s unsigned long t_expires; /* + * When this transaction started, in nanoseconds [no locking] + */ + ktime_t t_start_time; + + /* * How many handles used this transaction? [t_handle_lock] */ int t_handle_count; @@ -938,8 +943,18 @@ struct journal_s struct buffer_head **j_wbuf; int j_wbufsize; + /* + * this is the pid of hte last person to run a synchronous operation + * through the journal + */ pid_t j_last_sync_writer; + /* + * the average amount of time in nanoseconds it takes to commit a + * transaction to disk. [j_state_lock] + */ + u64 j_average_commit_time; + /* This function is called when a transaction is closed */ void (*j_commit_callback)(journal_t *, transaction_t *);
Hello, This is the jbd2 version of the jbd fsync batching patch. This patch removes the static sleep time in favor of a more self optimizing approach where we measure the average amount of time it takes to commit a transaction to disk and the ammount of time a transaction has been running. If somebody does a sync write or an fsync() traditionally we would sleep for 1 jiffies, which depending on the value of HZ could be a significant amount of time compared to how long it takes to commit a transaction to the underlying storage. With this patch instead of sleeping for a jiffie, we check to see if the amount of time this transaction has been running is less than the average commit time, and if it is we sleep for the delta using schedule_hrtimeout to give us a higher precision sleep time. This greatly benefits high end storage where you could end up sleeping for longer than it takes to commit the transaction and therefore sitting idle instead of allowing the transaction to be committed by keeping the sleep time to a minimum so you are sure to always be doing something. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html