diff mbox

[RFC] jbd: don't wake kjournald unnecessarily

Message ID 50D0A1FD.7040203@redhat.com
State Not Applicable, archived
Headers show

Commit Message

Eric Sandeen Dec. 18, 2012, 5:03 p.m. UTC
Commit d9b0193 jbd: fix fsync() tid wraparound bug
changed the logic for whether __log_start_commit() should wake up
kjournald.

After backporting this to RHEL6, I had a report of a performance regression
on a large benchmark, and it was narrowed down to the change above.

I did a little investigation of jbd behavior while running xfstest
013, which just does a large fsstress run, and found that we were
waking up kjournald more often than before; specifically,
in the case where

	target == j_commit_request == journal->j_running_transaction

It seems to me that the wakeup is not needed if we already have
the right target on the commit request, so I tested with the
additional condition added in the patch below; this brought
performance back up to prior levels.

I also tested it with tid_t defined to a u8, to get frequent wraps.
If I back out the wraparound patch, it will easily provoke
the original ASSERT that prompted the prior commit.  With
the commit in place and the patch below, I survived running
fsstress for 10 hours without problems even with a frequently-wrapping
tid_t.

A couple questions remain:

With a u8 tid_t, the "else" clause from commit d9b0193 fires
frequently; I really think the underlying problem is that tid_geq()
etc does not properly handle wraparounds - if, say, target is 255
and j_commit_request is 0, we don't know if j_commit_request
is 255 tids behind, or 1 tid ahead.  I have to think about that
some more, unless it's obvious to someone else.

FWIW, some people have indeed seen that else clause fire upstream,
both in the case where j_commit_request is > 2^31 and the 
target is 0.

https://bugzilla.kernel.org/show_bug.cgi?id=46031
http://forums.debian.net/viewtopic.php?f=5&t=80741

Anyway, I think this patch helps on the "don't send extra wakeups"
side of things.  Does anyone see a problem with it?

If it looks ok, I'll send the same for jbd2.

Thanks,
-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Kara Dec. 19, 2012, 1:27 a.m. UTC | #1
On Tue 18-12-12 11:03:57, Eric Sandeen wrote:
> Commit d9b0193 jbd: fix fsync() tid wraparound bug
> changed the logic for whether __log_start_commit() should wake up
> kjournald.
> 
> After backporting this to RHEL6, I had a report of a performance regression
> on a large benchmark, and it was narrowed down to the change above.
  Strange. I wonder what really happened that those additional wakeups had
influence on performance. They should be pretty cheap.

> I did a little investigation of jbd behavior while running xfstest
> 013, which just does a large fsstress run, and found that we were
> waking up kjournald more often than before; specifically,
> in the case where
> 
> 	target == j_commit_request == journal->j_running_transaction
> 
> It seems to me that the wakeup is not needed if we already have
> the right target on the commit request, so I tested with the
> additional condition added in the patch below; this brought
> performance back up to prior levels.
  Correct.

> I also tested it with tid_t defined to a u8, to get frequent wraps.
> If I back out the wraparound patch, it will easily provoke
> the original ASSERT that prompted the prior commit.  With
> the commit in place and the patch below, I survived running
> fsstress for 10 hours without problems even with a frequently-wrapping
> tid_t.
  Thanks for throughout testing!

> A couple questions remain:
> 
> With a u8 tid_t, the "else" clause from commit d9b0193 fires
> frequently; I really think the underlying problem is that tid_geq()
> etc does not properly handle wraparounds - if, say, target is 255
> and j_commit_request is 0, we don't know if j_commit_request
> is 255 tids behind, or 1 tid ahead.  I have to think about that
> some more, unless it's obvious to someone else.
  Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
and if someone has stored away tid of a transaction he wants committed and
keeps it for a long time before using it, it can end up being anywhere
before / after current j_commit_request. The hope was that it takes long
enough to wrap around 32-bit tids. If this happens often in practice we may
have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
because of limited journal size).

> FWIW, some people have indeed seen that else clause fire upstream,
> both in the case where j_commit_request is > 2^31 and the 
> target is 0.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=46031
> http://forums.debian.net/viewtopic.php?f=5&t=80741
  This is actually curious. The fact that i_datasync_tid was 0 means that
either journal was not initialized during ext3_iget() or j_commit_sequence
was 0 during ext3_iget() - note that j_commit_sequence is initialized to
j_transaction_sequence in journal_reset()... Hum, but in a case when
ext3_load_journal() calls journal_wipe() and that finds j_tail != 0, we
call journal_skip_recovery(). That ends up setting j_transaction_sequence
to the last transaction in the log but j_commit_sequence is left at 0.
I see that explains how we could hit the warning. I think we should
initialize j_commit_sequence properly also when skipping recovery and that
will solve the problem.

BTW if we find j_tail == 0 in journal_wipe(), we skip setting
j_transaction_sequence to the last transaction in the journal.  So
j_transaction_sequence ends up being 0 but j_tail_sequence is set by
load_superblock() to sb->s_sequence so there's a mismatch between reality
and what j_tail_sequence claims to be the oldest transaction in the log.
That reminds me of the report where bogus journal replay corrupted a
filesystem...

I'll fix both these issues.

> Anyway, I think this patch helps on the "don't send extra wakeups"
> side of things.  Does anyone see a problem with it?
  The patch is fine. I'll queue it up.

								Honza
> =============
> [PATCH] jbd: don't wake kjournald unnecessarily
> 
> Don't send an extra wakeup to kjournald in the case where we
> already have the proper target in j_commit_request, i.e. that
> commit has already been requested for commit.
> 
> commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed
> the logic leading to a wakeup, but it caused some extra wakeups
> which were found to lead to a measurable performance regression.
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
>  
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index a286233..81cc7ea 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -446,7 +446,8 @@ int __log_start_commit(journal_t *journal, tid_t target)
>  	 * currently running transaction (if it exists).  Otherwise,
>  	 * the target tid must be an old one.
>  	 */
> -	if (journal->j_running_transaction &&
> +	if (journal->j_commit_request != target &&
> +	    journal->j_running_transaction &&
>  	    journal->j_running_transaction->t_tid == target) {
>  		/*
>  		 * We want a new commit: OK, mark the request and wakeup the
Jan Kara Dec. 19, 2012, 2:05 a.m. UTC | #2
On Wed 19-12-12 02:27:10, Jan Kara wrote:
> > With a u8 tid_t, the "else" clause from commit d9b0193 fires
> > frequently; I really think the underlying problem is that tid_geq()
> > etc does not properly handle wraparounds - if, say, target is 255
> > and j_commit_request is 0, we don't know if j_commit_request
> > is 255 tids behind, or 1 tid ahead.  I have to think about that
> > some more, unless it's obvious to someone else.
>   Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
> and if someone has stored away tid of a transaction he wants committed and
> keeps it for a long time before using it, it can end up being anywhere
> before / after current j_commit_request. The hope was that it takes long
> enough to wrap around 32-bit tids. If this happens often in practice we may
> have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
> because of limited journal size).
> 
> > FWIW, some people have indeed seen that else clause fire upstream,
> > both in the case where j_commit_request is > 2^31 and the 
> > target is 0.
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=46031
> > http://forums.debian.net/viewtopic.php?f=5&t=80741
>   This is actually curious. The fact that i_datasync_tid was 0 means that
> either journal was not initialized during ext3_iget() or j_commit_sequence
> was 0 during ext3_iget() - note that j_commit_sequence is initialized to
> j_transaction_sequence in journal_reset()... Hum, but in a case when
> ext3_load_journal() calls journal_wipe() and that finds j_tail != 0, we
> call journal_skip_recovery(). That ends up setting j_transaction_sequence
> to the last transaction in the log but j_commit_sequence is left at 0.
> I see that explains how we could hit the warning. I think we should
> initialize j_commit_sequence properly also when skipping recovery and that
> will solve the problem.
  Bah, I was wrong here. I misread ext3_journal_load(). We call
journal_load() after journal_wipe() and so j_transaction_sequence and
j_commit_sequence() are set properly... But then I don't see how
i_datasync_tid was zero (modulo the very unlikely event we happened to load
the inode just after wrapping tids).

								Honza
Eric Sandeen Dec. 19, 2012, 2:36 a.m. UTC | #3
On 12/18/12 7:27 PM, Jan Kara wrote:
> On Tue 18-12-12 11:03:57, Eric Sandeen wrote:
>> Commit d9b0193 jbd: fix fsync() tid wraparound bug
>> changed the logic for whether __log_start_commit() should wake up
>> kjournald.

...

>> Anyway, I think this patch helps on the "don't send extra wakeups"
>> side of things.  Does anyone see a problem with it?
>   The patch is fine. I'll queue it up.

Thanks for the review!  I'll get jbd2 on the list soon.

-Eric

> 								Honza
>> =============
>> [PATCH] jbd: don't wake kjournald unnecessarily
>>
>> Don't send an extra wakeup to kjournald in the case where we
>> already have the proper target in j_commit_request, i.e. that
>> commit has already been requested for commit.
>>
>> commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed
>> the logic leading to a wakeup, but it caused some extra wakeups
>> which were found to lead to a measurable performance regression.
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
>>  
>> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
>> index a286233..81cc7ea 100644
>> --- a/fs/jbd/journal.c
>> +++ b/fs/jbd/journal.c
>> @@ -446,7 +446,8 @@ int __log_start_commit(journal_t *journal, tid_t target)
>>  	 * currently running transaction (if it exists).  Otherwise,
>>  	 * the target tid must be an old one.
>>  	 */
>> -	if (journal->j_running_transaction &&
>> +	if (journal->j_commit_request != target &&
>> +	    journal->j_running_transaction &&
>>  	    journal->j_running_transaction->t_tid == target) {
>>  		/*
>>  		 * We want a new commit: OK, mark the request and wakeup the

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Dec. 19, 2012, 3:08 a.m. UTC | #4
On 12/18/12 8:05 PM, Jan Kara wrote:
> On Wed 19-12-12 02:27:10, Jan Kara wrote:
>>> With a u8 tid_t, the "else" clause from commit d9b0193 fires
>>> frequently; I really think the underlying problem is that tid_geq()
>>> etc does not properly handle wraparounds - if, say, target is 255
>>> and j_commit_request is 0, we don't know if j_commit_request
>>> is 255 tids behind, or 1 tid ahead.  I have to think about that
>>> some more, unless it's obvious to someone else.
>>   Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
>> and if someone has stored away tid of a transaction he wants committed and
>> keeps it for a long time before using it, it can end up being anywhere
>> before / after current j_commit_request. The hope was that it takes long
>> enough to wrap around 32-bit tids. If this happens often in practice we may
>> have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
>> because of limited journal size).

I was wondering if, since the tid_g*() functions only work if the
distance is half the unsigned int space, we can force a commit at some
point if j_transaction_sequence has gotten too far ahead?  I'm not sure
where or if that could be done...

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Dec. 19, 2012, 8:13 a.m. UTC | #5
On Tue 18-12-12 21:08:51, Eric Sandeen wrote:
> On 12/18/12 8:05 PM, Jan Kara wrote:
> > On Wed 19-12-12 02:27:10, Jan Kara wrote:
> >>> With a u8 tid_t, the "else" clause from commit d9b0193 fires
> >>> frequently; I really think the underlying problem is that tid_geq()
> >>> etc does not properly handle wraparounds - if, say, target is 255
> >>> and j_commit_request is 0, we don't know if j_commit_request
> >>> is 255 tids behind, or 1 tid ahead.  I have to think about that
> >>> some more, unless it's obvious to someone else.
> >>   Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
> >> and if someone has stored away tid of a transaction he wants committed and
> >> keeps it for a long time before using it, it can end up being anywhere
> >> before / after current j_commit_request. The hope was that it takes long
> >> enough to wrap around 32-bit tids. If this happens often in practice we may
> >> have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
> >> because of limited journal size).
> 
> I was wondering if, since the tid_g*() functions only work if the
> distance is half the unsigned int space, we can force a commit at some
> point if j_transaction_sequence has gotten too far ahead?  I'm not sure
> where or if that could be done...
  I don't quiete understand. If someone stores tid = transaction->t_tid and
in two weeks calls log_start_commit(tid), I don't see how any forcing of
commits could solve that tid may now look ahead of the log...

								Honza
Theodore Ts'o Dec. 19, 2012, 3:37 p.m. UTC | #6
On Wed, Dec 19, 2012 at 09:13:34AM +0100, Jan Kara wrote:
> > I was wondering if, since the tid_g*() functions only work if the
> > distance is half the unsigned int space, we can force a commit at some
> > point if j_transaction_sequence has gotten too far ahead?  I'm not sure
> > where or if that could be done...
>
>   I don't quiete understand. If someone stores tid = transaction->t_tid and
> in two weeks calls log_start_commit(tid), I don't see how any forcing of
> commits could solve that tid may now look ahead of the log...

Actually, what we can do is the reverse.  Right now when we modify an
inode, we stash the tid to indicate "we need to commit to at least
this tid".  The problem comes if we haven't modified the inode for a
long time, but then later when we issue an fsync for that inode, it's
after a tid wrap, so we trigger the warning.

What we could do is in cases were we aren't touching the inode note if
the tid value is obviously out of date, and set it to some value which
is one less than the current tid.  This avoids the wrap, and in cases
where we are releasing the inode and nothing else is left, it's a safe
thing to do.

The downside is that doing this would incur locking overhead, and it's
not clear it's worth it.  Now that we understand what's going on,
nothing bad is happening when the warning is triggered, so we could
just remove it.

If we use a 64-bit in-memory tid, that would help avoid cases where
the tid has wrapped and we get unlucky and hit the 1 in 2**32 case
where we trigger a commit where one is not needed.  Given that the
cost of an extra commit with probably 1 in 2**32 is pretty low, it's
probably not worth the overhead of using a 64-bit tid, though....

      	  	      	      - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Dec. 19, 2012, 3:46 p.m. UTC | #7
On 12/19/12 2:13 AM, Jan Kara wrote:
> On Tue 18-12-12 21:08:51, Eric Sandeen wrote:
>> On 12/18/12 8:05 PM, Jan Kara wrote:
>>> On Wed 19-12-12 02:27:10, Jan Kara wrote:
>>>>> With a u8 tid_t, the "else" clause from commit d9b0193 fires
>>>>> frequently; I really think the underlying problem is that tid_geq()
>>>>> etc does not properly handle wraparounds - if, say, target is 255
>>>>> and j_commit_request is 0, we don't know if j_commit_request
>>>>> is 255 tids behind, or 1 tid ahead.  I have to think about that
>>>>> some more, unless it's obvious to someone else.
>>>>   Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
>>>> and if someone has stored away tid of a transaction he wants committed and
>>>> keeps it for a long time before using it, it can end up being anywhere
>>>> before / after current j_commit_request. The hope was that it takes long
>>>> enough to wrap around 32-bit tids. If this happens often in practice we may
>>>> have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
>>>> because of limited journal size).
>>
>> I was wondering if, since the tid_g*() functions only work if the
>> distance is half the unsigned int space, we can force a commit at some
>> point if j_transaction_sequence has gotten too far ahead?  I'm not sure
>> where or if that could be done...
>   I don't quiete understand. If someone stores tid = transaction->t_tid and
> in two weeks calls log_start_commit(tid), I don't see how any forcing of
> commits could solve that tid may now look ahead of the log...

I'm probably missing something, but I was thinking we could compare
j_commit_sequence to j_transaction_sequence and force a commit up to at least
j_commit_sequence if it's too "stale" - but I'm only handwaving. :)

-Eric

> 								Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Dec. 19, 2012, 5:11 p.m. UTC | #8
On Wed 19-12-12 09:46:51, Eric Sandeen wrote:
> On 12/19/12 2:13 AM, Jan Kara wrote:
> > On Tue 18-12-12 21:08:51, Eric Sandeen wrote:
> >> On 12/18/12 8:05 PM, Jan Kara wrote:
> >>> On Wed 19-12-12 02:27:10, Jan Kara wrote:
> >>>>> With a u8 tid_t, the "else" clause from commit d9b0193 fires
> >>>>> frequently; I really think the underlying problem is that tid_geq()
> >>>>> etc does not properly handle wraparounds - if, say, target is 255
> >>>>> and j_commit_request is 0, we don't know if j_commit_request
> >>>>> is 255 tids behind, or 1 tid ahead.  I have to think about that
> >>>>> some more, unless it's obvious to someone else.
> >>>>   Well, there's no way to handle wraps better AFAICT. Tids eventually wrap
> >>>> and if someone has stored away tid of a transaction he wants committed and
> >>>> keeps it for a long time before using it, it can end up being anywhere
> >>>> before / after current j_commit_request. The hope was that it takes long
> >>>> enough to wrap around 32-bit tids. If this happens often in practice we may
> >>>> have to switch to 64-bit tids (in memory, on disk 32-bit tids are enough
> >>>> because of limited journal size).
> >>
> >> I was wondering if, since the tid_g*() functions only work if the
> >> distance is half the unsigned int space, we can force a commit at some
> >> point if j_transaction_sequence has gotten too far ahead?  I'm not sure
> >> where or if that could be done...
> >   I don't quiete understand. If someone stores tid = transaction->t_tid and
> > in two weeks calls log_start_commit(tid), I don't see how any forcing of
> > commits could solve that tid may now look ahead of the log...
> 
> I'm probably missing something, but I was thinking we could compare
> j_commit_sequence to j_transaction_sequence and force a commit up to at least
> j_commit_sequence if it's too "stale" - but I'm only handwaving. :)
  You are probably missing the fact that
j_transaction_sequence - 2 <= j_commit_sequence <= j_transaction_sequence.
  I.e., we have always one running transaction and at most one committing
transaction which is the previous one. j_commit_sequence is TID of the
transaction which successfully finished commit.

The warnings we are seeing are caused by TIDs stored in
EXT3_I(inode)->i_datasync_tid (and i_sync_tid). Those can get rather old
before they are used.

								Honza
Jan Kara Dec. 19, 2012, 5:14 p.m. UTC | #9
On Wed 19-12-12 10:37:25, Ted Tso wrote:
> On Wed, Dec 19, 2012 at 09:13:34AM +0100, Jan Kara wrote:
> > > I was wondering if, since the tid_g*() functions only work if the
> > > distance is half the unsigned int space, we can force a commit at some
> > > point if j_transaction_sequence has gotten too far ahead?  I'm not sure
> > > where or if that could be done...
> >
> >   I don't quiete understand. If someone stores tid = transaction->t_tid and
> > in two weeks calls log_start_commit(tid), I don't see how any forcing of
> > commits could solve that tid may now look ahead of the log...
> 
> Actually, what we can do is the reverse.  Right now when we modify an
> inode, we stash the tid to indicate "we need to commit to at least
> this tid".  The problem comes if we haven't modified the inode for a
> long time, but then later when we issue an fsync for that inode, it's
> after a tid wrap, so we trigger the warning.
> 
> What we could do is in cases were we aren't touching the inode note if
> the tid value is obviously out of date, and set it to some value which
> is one less than the current tid.  This avoids the wrap, and in cases
> where we are releasing the inode and nothing else is left, it's a safe
> thing to do.
> 
> The downside is that doing this would incur locking overhead, and it's
> not clear it's worth it.  Now that we understand what's going on,
> nothing bad is happening when the warning is triggered, so we could
> just remove it.
> 
> If we use a 64-bit in-memory tid, that would help avoid cases where
> the tid has wrapped and we get unlucky and hit the 1 in 2**32 case
> where we trigger a commit where one is not needed.  Given that the
> cost of an extra commit with probably 1 in 2**32 is pretty low, it's
> probably not worth the overhead of using a 64-bit tid, though....
  I agree. Just I'm still somewhat puzzled by those two reports pointed to
by Eric. In both cases stored tids were 0 and I cannot see how that happens
(well how it could happen in a reasonably likely way).

								Honza
Theodore Ts'o Dec. 19, 2012, 8:27 p.m. UTC | #10
On Wed, Dec 19, 2012 at 06:14:01PM +0100, Jan Kara wrote:
>   I agree. Just I'm still somewhat puzzled by those two reports pointed to
> by Eric. In both cases stored tids were 0 and I cannot see how that happens
> (well how it could happen in a reasonably likely way).

I'm also really puzzled about how Eric's patch makes a 10% different
on the AIM7 benchmark; as you've pointed out, that will just cause an
extra wakeup of the jbd/jbd2 thread, which should then quickly check
and decide to go back to sleep.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Dec. 19, 2012, 9:19 p.m. UTC | #11
On Dec 19, 2012, at 2:27 PM, "Theodore Ts'o" <tytso@mit.edu> wrote:

> On Wed, Dec 19, 2012 at 06:14:01PM +0100, Jan Kara wrote:
>>  I agree. Just I'm still somewhat puzzled by those two reports pointed to
>> by Eric. In both cases stored tids were 0 and I cannot see how that happens
>> (well how it could happen in a reasonably likely way).
> 
> I'm also really puzzled about how Eric's patch makes a 10% different
> on the AIM7 benchmark; as you've pointed out, that will just cause an
> extra wakeup of the jbd/jbd2 thread, which should then quickly check
> and decide to go back to sleep.

TBH I'm puzzled too, but that's what the perf guys tell me...

Eric

>                    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Dec. 21, 2012, 5:01 p.m. UTC | #12
On 12/19/12 2:27 PM, Theodore Ts'o wrote:
> On Wed, Dec 19, 2012 at 06:14:01PM +0100, Jan Kara wrote:
>>   I agree. Just I'm still somewhat puzzled by those two reports pointed to
>> by Eric. In both cases stored tids were 0 and I cannot see how that happens
>> (well how it could happen in a reasonably likely way).
> 
> I'm also really puzzled about how Eric's patch makes a 10% different
> on the AIM7 benchmark; as you've pointed out, that will just cause an
> extra wakeup of the jbd/jbd2 thread, which should then quickly check
> and decide to go back to sleep.
> 
> 					- Ted
> 

Ted, just to double check - is that some wondering aloud, or a NAK
of the original patch? :)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Dec. 21, 2012, 5:46 p.m. UTC | #13
On Fri, Dec 21, 2012 at 11:01:58AM -0600, Eric Sandeen wrote:
> > I'm also really puzzled about how Eric's patch makes a 10% different
> > on the AIM7 benchmark; as you've pointed out, that will just cause an
> > extra wakeup of the jbd/jbd2 thread, which should then quickly check
> > and decide to go back to sleep.
>
> Ted, just to double check - is that some wondering aloud, or a NAK
> of the original patch? :)

I'm still thinking....  Things that I don't understand worry me, since
there's a possibility there's more going on than we think.

Did you have a chance to have your perf people enable the the
jbd2_run_stats tracepoint, to see how the stats change with and
without the patch?

It would be interesting to see how the stats change --- in particular,
whether the number of blocks logged per transaction is changing,
and/or the number of blocks per transaction is changing.  It would
also be interesting to insert a tracepoint in kjournald so we can
track the number of times when kjournald is waking, but ends up *not*
triggering a commit due to the commit timeout firing or
j_commit_sequence != j_commit_request.  

I'll probably take the patch on the grounds that it's obvious, but if
you could get your perf folks to run the experiment, I'd really
appreciate it, just so we can understand what might be going on.
Perhaps there's an opportunity for further optimizations, or we'll
find that something unexpected that is evidence of a bug.  (Or maybe
it's just a bug in our understanding, but that's also good to get
fixed.  :-)

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Jan. 8, 2013, 7:19 p.m. UTC | #14
On 12/21/12 11:46 AM, Theodore Ts'o wrote:
> On Fri, Dec 21, 2012 at 11:01:58AM -0600, Eric Sandeen wrote:
>>> I'm also really puzzled about how Eric's patch makes a 10% different
>>> on the AIM7 benchmark; as you've pointed out, that will just cause an
>>> extra wakeup of the jbd/jbd2 thread, which should then quickly check
>>> and decide to go back to sleep.
>>
>> Ted, just to double check - is that some wondering aloud, or a NAK
>> of the original patch? :)
> 
> I'm still thinking....  Things that I don't understand worry me, since
> there's a possibility there's more going on than we think.
> 
> Did you have a chance to have your perf people enable the the
> jbd2_run_stats tracepoint, to see how the stats change with and
> without the patch?

Getting back to this; grabbing this over a whole AIM7 run would 
be huge.  I wonder if a few snapshots will be illustrative.
We can try.

> It would be interesting to see how the stats change --- in particular,
> whether the number of blocks logged per transaction is changing,
> and/or the number of blocks per transaction is changing.  It would
> also be interesting to insert a tracepoint in kjournald so we can
> track the number of times when kjournald is waking, but ends up *not*
> triggering a commit due to the commit timeout firing or
> j_commit_sequence != j_commit_request.  

Ok, I think I can do that w/ systemtap.

> I'll probably take the patch on the grounds that it's obvious, but if
> you could get your perf folks to run the experiment, I'd really
> appreciate it, just so we can understand what might be going on.
> Perhaps there's an opportunity for further optimizations, or we'll
> find that something unexpected that is evidence of a bug.  (Or maybe
> it's just a bug in our understanding, but that's also good to get
> fixed.  :-)

I'll see what I can do.

-Eric


> 						- Ted
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Jan. 11, 2013, 4:42 p.m. UTC | #15
On 12/21/12 11:46 AM, Theodore Ts'o wrote:
> On Fri, Dec 21, 2012 at 11:01:58AM -0600, Eric Sandeen wrote:
>>> I'm also really puzzled about how Eric's patch makes a 10% different
>>> on the AIM7 benchmark; as you've pointed out, that will just cause an
>>> extra wakeup of the jbd/jbd2 thread, which should then quickly check
>>> and decide to go back to sleep.
>>
>> Ted, just to double check - is that some wondering aloud, or a NAK
>> of the original patch? :)
> 
> I'm still thinking....  Things that I don't understand worry me, since
> there's a possibility there's more going on than we think.
> 
> Did you have a chance to have your perf people enable the the
> jbd2_run_stats tracepoint, to see how the stats change with and
> without the patch?

No tracepoint yet, but here's some data from the jbd2 info proc file
for a whole AIM7 run, averaged over all devices.

Prior to d9b0193 jbd: fix fsync() tid wraparound bug went in:

3387.93 transaction, each up to 8192 blocks
average:
102.661ms waiting for transaction
189ms running transaction
65.375ms transaction was being locked
17.8393ms flushing data (in ordered mode)
164.518ms logging transaction
3694.29us average transaction commit time
2090.05 handles per transaction
12.5893 blocks per transaction
13.5893 logged blocks per transaction

with d9b0193 in place, the benchmark was about 10% slower:

2857.96 transaction, each up to 8192 blocks
average:
108.482ms waiting for transaction
266.286ms running transaction
71.625ms transaction was being locked
2.76786ms flushing data (in ordered mode)
252.625ms logging transaction
5932.82us average transaction commit time
2551.21 handles per transaction
43.25 blocks per transaction
44.25 logged blocks per transaction

and with my wake changes:

3775.61 transaction, each up to 8192 blocks
average:
92.9286ms waiting for transaction
173.571ms running transaction
60.3036ms transaction was being locked
16.6964ms flushing data (in ordered mode)
149.464ms logging transaction
3849.07us average transaction commit time
1924.84 handles per transaction
13.3036 blocks per transaction
14.3036 logged blocks per transaction

TBH though, this is somewhat opposite of what I'd expect; I thought more wakes might mean smaller transactions - except the wakes were "pointless" - so I'm not quite sure what's going on yet.  We can certainly see the difference, though, and that my change gets us back to the prior behavior.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Jan. 11, 2013, 7:03 p.m. UTC | #16
On Fri 11-01-13 10:42:00, Eric Sandeen wrote:
> On 12/21/12 11:46 AM, Theodore Ts'o wrote:
> > On Fri, Dec 21, 2012 at 11:01:58AM -0600, Eric Sandeen wrote:
> >>> I'm also really puzzled about how Eric's patch makes a 10% different
> >>> on the AIM7 benchmark; as you've pointed out, that will just cause an
> >>> extra wakeup of the jbd/jbd2 thread, which should then quickly check
> >>> and decide to go back to sleep.
> >>
> >> Ted, just to double check - is that some wondering aloud, or a NAK
> >> of the original patch? :)
> > 
> > I'm still thinking....  Things that I don't understand worry me, since
> > there's a possibility there's more going on than we think.
> > 
> > Did you have a chance to have your perf people enable the the
> > jbd2_run_stats tracepoint, to see how the stats change with and
> > without the patch?
> 
> No tracepoint yet, but here's some data from the jbd2 info proc file
> for a whole AIM7 run, averaged over all devices.
> 
> Prior to d9b0193 jbd: fix fsync() tid wraparound bug went in:
> 
> 3387.93 transaction, each up to 8192 blocks
> average:
> 102.661ms waiting for transaction
> 189ms running transaction
> 65.375ms transaction was being locked
> 17.8393ms flushing data (in ordered mode)
> 164.518ms logging transaction
> 3694.29us average transaction commit time
> 2090.05 handles per transaction
> 12.5893 blocks per transaction
> 13.5893 logged blocks per transaction
> 
> with d9b0193 in place, the benchmark was about 10% slower:
> 
> 2857.96 transaction, each up to 8192 blocks
> average:
> 108.482ms waiting for transaction
> 266.286ms running transaction
> 71.625ms transaction was being locked
> 2.76786ms flushing data (in ordered mode)
> 252.625ms logging transaction
> 5932.82us average transaction commit time
> 2551.21 handles per transaction
> 43.25 blocks per transaction
> 44.25 logged blocks per transaction
> 
> and with my wake changes:
> 
> 3775.61 transaction, each up to 8192 blocks
> average:
> 92.9286ms waiting for transaction
> 173.571ms running transaction
> 60.3036ms transaction was being locked
> 16.6964ms flushing data (in ordered mode)
> 149.464ms logging transaction
> 3849.07us average transaction commit time
> 1924.84 handles per transaction
> 13.3036 blocks per transaction
> 14.3036 logged blocks per transaction
> 
> TBH though, this is somewhat opposite of what I'd expect; I thought more
> wakes might mean smaller transactions - except the wakes were "pointless"
> - so I'm not quite sure what's going on yet.  We can certainly see the
> difference, though, and that my change gets us back to the prior
> behavior.
  Yes, that's what I'd expect if the difference was really in IO. But
apparently the benchmark is CPU bound on the machine and so the higher
amount of work we do under j_state_lock (wake_up() has some small
cost after all - it disables interrupts and takes q->lock) results in
kjournald taking longer to wake and do its work. It might be interesting to
know about how many useless wakeups are we speaking here?

								Honza
Eric Sandeen Jan. 11, 2013, 7:06 p.m. UTC | #17
On 1/11/13 1:03 PM, Jan Kara wrote:
> On Fri 11-01-13 10:42:00, Eric Sandeen wrote:

...

>> TBH though, this is somewhat opposite of what I'd expect; I thought more
>> wakes might mean smaller transactions - except the wakes were "pointless"
>> - so I'm not quite sure what's going on yet.  We can certainly see the
>> difference, though, and that my change gets us back to the prior
>> behavior.
>   Yes, that's what I'd expect if the difference was really in IO. But
> apparently the benchmark is CPU bound on the machine and so the higher
> amount of work we do under j_state_lock (wake_up() has some small
> cost after all - it disables interrupts and takes q->lock) results in
> kjournald taking longer to wake and do its work. It might be interesting to
> know about how many useless wakeups are we speaking here?

Yeah, I want to get that next.

I think I'll need to instrument the module to get that, the fear is that
systemtap might have too much overhead (but I can try both ways).

Thanks,
-Eric

> 								Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

=============
[PATCH] jbd: don't wake kjournald unnecessarily

Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
commit has already been requested for commit.

commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
 
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index a286233..81cc7ea 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -446,7 +446,8 @@  int __log_start_commit(journal_t *journal, tid_t target)
 	 * currently running transaction (if it exists).  Otherwise,
 	 * the target tid must be an old one.
 	 */
-	if (journal->j_running_transaction &&
+	if (journal->j_commit_request != target &&
+	    journal->j_running_transaction &&
 	    journal->j_running_transaction->t_tid == target) {
 		/*
 		 * We want a new commit: OK, mark the request and wakeup the