
[RFC,v4] ext4: Coordinate fsync requests

Message ID 20100819021441.GM2109@tux1.beaverton.ibm.com
State Superseded, archived

Commit Message

Darrick J. Wong Aug. 19, 2010, 2:14 a.m. UTC
This patch attempts to coordinate barrier requests being sent in by fsync.
Instead of each fsync call initiating its own barrier, there's now a flag to
indicate whether (0) no barrier is ongoing, (1) we're delaying a short time to
collect other fsync threads, or (2) a barrier is actually in progress.

So, if someone calls ext4_sync_file and no barriers are in progress, the flag
shifts from 0->1 and the thread delays for a short time to see if there are
any other threads that are close behind in ext4_sync_file.  After that wait,
the state transitions to 2 and the barrier is issued.  Once that's done, the
state goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the
completion is signalled.  Instead of issuing a barrier themselves, they simply
wait for that first thread to do it for them.

In an earlier version of this patch, the "short time" was 500ms, adjustable by
a module parameter.  Now, it's a mount option, and the mount option has three
values: x = 0, which gets you the old behavior; x = -1, which uses the average
commit time for the delay period; and x > 0, which sets the delay time to
min(x, average commit time).
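
Concretely, the patch below picks the wait like this (sb here is the struct
ext4_sb_info; judging from the *1000 conversion against the nanosecond
average, x is interpreted in microseconds):

	if (sb->max_fsync_delay > 0 &&
	    (sb->max_fsync_delay * 1000) < sb->avg_barrier_time)
		expires = ktime_add_ns(ktime_get(), sb->max_fsync_delay * 1000);
	else
		expires = ktime_add_ns(ktime_get(), sb->avg_barrier_time);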

So I gave this patchset some wider testing in my lab, and came up with the
following spreadsheet:
https://spreadsheets.google.com/ccc?key=0AtDnBlSkyNjodDZ4OEdSZC01X2haVi1ncmpGOXpfNkE&hl=en&authkey=CMznqcgH

These results do not reflect Tejun/Christoph's latest barrier/flush semantic
changes because ... I haven't figured out the correct sequence of 2.6.35+,
Tejun's patchset, and hch's patchset.  With Tejun's original set of patches,
the performance actually decreased a little bit, so I haven't bothered to
upload those results while I try to put together a new tree.

The comma-separated numbers in the leftmost column indicate which options were
in effect during the run.  From left to right, they are (1) whether or not
directio is enabled; (2) the max_fsync_delay parameter (-1 is automatic
tuning, 0 is the behavior prior to my patch, and 500 is the v3 patch);
(3) whether or not Jan Kara's barrier generation counter is disabled; and (4)
whether or not my old dirty flag patch is disabled.  The old system behaviors
are in the *,0,1,1 rows, and the everything-on behavior should be in the
*,-1,0,0 rows.

From these results, it seems like a reasonable conclusion that enabling Jan
Kara's barrier generation counter patch (*,*,0,*) does increase the tps count.
It also appears that enabling the barrier coordination patch with the
automatic barrier delay tuning (*,-1,*,*) increases tps while
reducing barrier counts.  It's not so clear that my old dirty flag patch
(*,*,*,0) is doing much anymore.

The *,na,na,na rows reflect what happens if I turn barriers off entirely, so I
can compare the new numbers with the nobarrier behavior.

elm3c44_sas is a 36-spindle SAS storage array (RAID0).
elm3c44_ssd is 4 Intel SSDs tied together in a RAID0 via md.
elm3c65_sata is a single SATA disk.
elm3c71_sas is an 8-spindle SAS storage array (RAID0).
elm3c71_scsi is a 14-spindle SCSI storage array (RAID0).
elm3c75_ide is a single IDE laptop disk.

The storage arrays have battery-backed write cache.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 fs/ext4/ext4.h  |    7 +++++++
 fs/ext4/fsync.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext4/super.c |   21 ++++++++++++++++++++-
 3 files changed, 78 insertions(+), 5 deletions(-)


Comments

Darrick J. Wong Aug. 23, 2010, 6:31 p.m. UTC | #1
Hi all,

I retested the ext4 barrier mitigation patchset against a base of 2.6.36-rc1 +
Tejun's flush_fua tree + Christoph's patches to change FS barrier semantics,
and came up with this new spreadsheet:
http://bit.ly/bWpbsT

Here are the previous 2.6.35 results for convenience: http://bit.ly/c22grd

The machine configurations are the same as with the previous (2.6.35)
spreadsheet.  It appears to be the case that Tejun and Christoph's patches to
change barrier use into simpler cache flushes generally improve the speed of
the fsync-happy workload in buffered I/O mode ... if you have a bunch of
spinning disks.  Results for the SSD array (elm3c44) and the single disk
systems (elm3c65/elm3c75) decreased slightly.  For the case where direct I/O
was used, the patchset improved the results in nearly all cases.  The speed
with barriers on is getting closer to the speed with barriers off, thankfully!

Unfortunately, one thing that became /much/ less clear in these new results is
the impact of the other patch sets that we've been working on to make ext4
smarter with regards to barrier/flush use.  In most cases I don't really see
the fsync-delay patch having much effect for directio, and it seems to have
wild effects when buffered mode is used.  Jan Kara's barrier generation patch
still generally helps with directio loads.  I've also concluded that my really
old dirty-flag patchset from ages ago no longer has any effect.

What does everyone else think of these results?

--D
Darrick J. Wong Sept. 23, 2010, 11:25 p.m. UTC | #2
Hi all,

I just retested with 2.6.36-rc5 and the same set of patches as before
(flush_fua, fsync_coordination, etc) and have an even larger spreadsheet:
http://bit.ly/ahdhyk

This time, however, I instrumented the kernel to report the amount of time it
takes to complete the flush operation.  The test setups elm3a63, elm3c44_sas,
and elm3c71_sas are all arrays that have battery-backed write-back cache; it
should not be a huge shock that the average flush time generally stays under
8ms for these setups.  elm3c65 and elm3c75_ide are single SATA and IDE
disks (no write cache), and the other setups all feature md-raids backed by
SCSI disks (also no write cache).  The flush_times tab in the spreadsheet lists
average, max, and min sync times.

Turning to the ffsb scores, I can see some of the same results that I saw while
testing 2.6.36-rc1 a few weeks ago.  Now that I've had the time to look at how
the code works and evaluate a lot more setups, I think I can speculate further
about the cause of the regression that I see with the fsync coordination patch.
Because I'm testing the effects of varying the fsync_delay values, I've bolded
the highest score for each unique (directio, nojan, nodj) configuration, and it
appears that the configurations that win most often are fsync_delay=0, which corresponds to the
old fsync behavior (every caller issues a flush), and fsync_delay=-1 which
corresponds to a coordination delay equal to the average flush duration.

To try to find an explanation, I started looking for connections between fsync
delay values and average flush times.  I noticed that the setups with low (<
8ms) flush times exhibit better performance when fsync coordination is not
attempted, and the setups with higher flush times exhibit better performance
when fsync coordination happens.  This also is no surprise, as it seems
perfectly reasonable that the more time-consuming a flush is, the more
worthwhile it is to spend a little time coordinating those flushes across CPUs.

I think a reasonable next step would be to alter this patch so that
ext4_sync_file always measures the duration of the flushes that it issues, but
only enable the coordination steps if it detects the flushes taking more than
about 8ms.  One thing I don't know for sure is whether 8ms is an artifact of
the timer granularity (two jiffies, with HZ currently set to 250) or a
hardware property.
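
In sketch form, that check might look something like this (a hypothetical
helper, not part of the posted patch; it reuses the avg_barrier_time field
from the patch below, and the threshold constant is invented for
illustration):

	/* Only pay the coordination delay when flushes are expensive. */
	#define EXT4_FSYNC_COORD_THRESHOLD_NS	(8 * NSEC_PER_MSEC)

	static int ext4_should_coordinate_fsync(struct ext4_sb_info *sbi)
	{
		return sbi->avg_barrier_time > EXT4_FSYNC_COORD_THRESHOLD_NS;
	}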

As for safety testing, I've been running power-fail tests on the single-disk
systems with the same ffsb profile.  So far I've observed a lot of fsck
complaints about orphaned inodes being truncated ("Truncating orphaned inode
1352607 (uid=0, gid=0, mode=0100700, size=4096)") though this happens
regardless of whether I run with this 2.6.36 test kernel of mine or a plain
vanilla 2.6.35 configuration.  I've not seen any serious corruption yet.

So, what do people think of these latest results?

--D

On Mon, Aug 23, 2010 at 11:31:19AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> I retested the ext4 barrier mitigation patchset against a base of 2.6.36-rc1 +
> Tejun's flush_fua tree + Christoph's patches to change FS barrier semantics,
> and came up with this new spreadsheet:
> http://bit.ly/bWpbsT
> 
> Here are the previous 2.6.35 results for convenience: http://bit.ly/c22grd
> 
> The machine configurations are the same as with the previous (2.6.35)
> spreadsheet.  It appears to be the case that Tejun and Christoph's patches to
> change barrier use into simpler cache flushes generally improve the speed of
> the fsync-happy workload in buffered I/O mode ... if you have a bunch of
> spinning disks.  Results for the SSD array (elm3c44) and the single disk
> systems (elm3c65/elm3c75) decreased slightly.  For the case where direct I/O
> was used, the patchset improved the results in nearly all cases.  The speed
> with barriers on is getting closer to the speed with barriers off, thankfully!
> 
> Unfortunately, one thing that became /much/ less clear in these new results is
> the impact of the other patch sets that we've been working on to make ext4
> smarter with regards to barrier/flush use.  In most cases I don't really see
> the fsync-delay patch having much effect for directio, and it seems to have
> wild effects when buffered mode is used.  Jan Kara's barrier generation patch
> still generally helps with directio loads.  I've also concluded that my really
> old dirty-flag patchset from ages ago no longer has any effect.
> 
> What does everyone else think of these results?
> 
> --D
Andreas Dilger Sept. 24, 2010, 6:24 a.m. UTC | #3
On 2010-09-23, at 17:25, Darrick J. Wong wrote:
> To try to find an explanation, I started looking for connections between fsync delay values and average flush times.  I noticed that the setups with low (< 8ms) flush times exhibit better performance when fsync coordination is not attempted, and the setups with higher flush times exhibit better performance when fsync coordination happens.  This also is no surprise, as it seems perfectly reasonable that the more time consuming a flush is, the more desirous it is to spend a little time coordinating those flushes across CPUs.
> 
> I think a reasonable next step would be to alter this patch so that ext4_sync_file always measures the duration of the flushes that it issues, but only enable the coordination steps if it detects the flushes taking more than about 8ms.  One thing I don't know for sure is whether 8ms is a result of 2*HZ (currently set to 250) or if 8ms is a hardware property.

Note that the JBD/JBD2 code will already dynamically adjust the journal flush interval based on the delay seen when writing the journal commit block.  This was done to allow aggregating sync journal operations for slow devices, and allowing fast (no delay) sync on fast devices.  See jbd2_journal_stop() for details.
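
Roughly, the batching in jbd2_journal_stop() looks like this (paraphrased,
not a verbatim excerpt):

	if (handle->h_sync && journal->j_last_sync_writer != pid) {
		u64 commit_time, trans_time;

		journal->j_last_sync_writer = pid;
		commit_time = journal->j_average_commit_time;
		trans_time = ktime_to_ns(ktime_sub(ktime_get(),
						   transaction->t_start_time));

		/* clamp the wait between the configured min/max batch times */
		commit_time = max_t(u64, commit_time,
				    1000 * journal->j_min_batch_time);
		commit_time = min_t(u64, commit_time,
				    1000 * journal->j_max_batch_time);

		/* sleep only while other sync writers are likely to join */
		if (trans_time < commit_time) {
			ktime_t expires = ktime_add_ns(ktime_get(),
						       commit_time);
			set_current_state(TASK_UNINTERRUPTIBLE);
			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
		}
	}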

I think the best approach is to just depend on the journal to do this sync aggregation, if at all possible, otherwise use the same mechanism in ext3/4 for fsync operations that do not involve the journal (e.g. nojournal mode, data sync in writeback mode, etc).

Using any fixed threshold is the wrong approach, IMHO.

Cheers, Andreas





Ric Wheeler Sept. 24, 2010, 11:44 a.m. UTC | #4
On 09/24/2010 02:24 AM, Andreas Dilger wrote:
> On 2010-09-23, at 17:25, Darrick J. Wong wrote:
>> To try to find an explanation, I started looking for connections between fsync delay values and average flush times.  I noticed that the setups with low (<  8ms) flush times exhibit better performance when fsync coordination is not attempted, and the setups with higher flush times exhibit better performance when fsync coordination happens.  This also is no surprise, as it seems perfectly reasonable that the more time consuming a flush is, the more desirous it is to spend a little time coordinating those flushes across CPUs.
>>
>> I think a reasonable next step would be to alter this patch so that ext4_sync_file always measures the duration of the flushes that it issues, but only enable the coordination steps if it detects the flushes taking more than about 8ms.  One thing I don't know for sure is whether 8ms is a result of 2*HZ (currently set to 250) or if 8ms is a hardware property.
> Note that the JBD/JBD2 code will already dynamically adjust the journal flush interval based on the delay seen when writing the journal commit block.  This was done to allow aggregating sync journal operations for slow devices, and allowing fast (no delay) sync on fast devices.  See jbd2_journal_stop() for details.
>
> I think the best approach is to just depend on the journal to do this sync aggregation, if at all possible, otherwise use the same mechanism in ext3/4 for fsync operations that do not involve the journal (e.g. nojournal mode, data sync in writeback mode, etc).
>
> Using any fixed threshold is the wrong approach, IMHO.
>
> Cheers, Andreas
>
>
>
>
>

I agree - we started on that dynamic batching when we noticed that
single-threaded writes to an array went at something like 720 files/sec (using
fs_mark) and two-threaded writes dropped down to 230 files/sec. That was directly
attributed to the fixed (1 jiffie) wait we used to do.

Josef Bacik worked on the dynamic batching so we would not wait (sometimes much
too long!) trying to batch other fsyncs/flushes into a transaction when it was faster just
to dispatch them.

A related worry I have is that we have other places in the kernel that currently
wait way too long for our current classes of devices....

Thanks,

Ric

Darrick J. Wong Sept. 27, 2010, 11:01 p.m. UTC | #5
On Fri, Sep 24, 2010 at 12:24:04AM -0600, Andreas Dilger wrote:
> On 2010-09-23, at 17:25, Darrick J. Wong wrote:
> > To try to find an explanation, I started looking for connections between
> > fsync delay values and average flush times.  I noticed that the setups with
> > low (< 8ms) flush times exhibit better performance when fsync coordination
> > is not attempted, and the setups with higher flush times exhibit better
> > performance when fsync coordination happens.  This also is no surprise, as
> > it seems perfectly reasonable that the more time consuming a flush is, the
> > more desirous it is to spend a little time coordinating those flushes
> > across CPUs.
> > 
> > I think a reasonable next step would be to alter this patch so that
> > ext4_sync_file always measures the duration of the flushes that it issues,
> > but only enable the coordination steps if it detects the flushes taking
> > more than about 8ms.  One thing I don't know for sure is whether 8ms is a
> > result of 2*HZ (currently set to 250) or if 8ms is a hardware property.
> 
> Note that the JBD/JBD2 code will already dynamically adjust the journal flush
> interval based on the delay seen when writing the journal commit block.  This
> was done to allow aggregating sync journal operations for slow devices, and
> allowing fast (no delay) sync on fast devices.  See jbd2_journal_stop() for
> details.
> 
> I think the best approach is to just depend on the journal to do this sync
> aggregation, if at all possible, otherwise use the same mechanism in ext3/4
> for fsync operations that do not involve the journal (e.g. nojournal mode,
> data sync in writeback mode, etc).

I've been informed that there's confusion about how to interpret this
spreadsheet.  I'll first provide a few clarifications, then discuss Andreas'
suggestion, which I've coded up and given some light testing.

Zeroth, the kernel is 2.6.36-rc5 with a few patchsets applied:
 1. Tejun Heo's conversion of barriers to flush/fua.
 2. Jan Kara's barrier generation patch.
 3. My old patch to record if there's dirty data in the disk cache.
 4. My newer patch to implement fsync coordination in ext4.
 5. My newest patch which implements coordination via jbd2.

Patches 2, 3, 4, and 5 all have debugging toggles so I can quickly run
experiments.

First, the "fsync_delay_us" column records the behavior of my (latest) fsync
coordination patch.  The raw control values might be a bit confusing, so I
elaborated them a little more in the spreadsheet.  The "old fsync behavior"
entries use the current upstream semantics (no coordination, everyone issues
their own flush).  "jbd2 fsync" means coordination of fsyncs through jbd2 as
detailed below.  "use avg sync time" measures the average time it takes to
issue a flush command, and tells the first thread into ext4_sync_file to wait
that amount of time for other threads to catch up.
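
The average being used there is the per-superblock running average of measured
flush times that the patch maintains after each flush -- a 1/16-weight moving
average:

	duration = ktime_to_ns(ktime_sub(ktime_get(), before));
	if (likely(sb->avg_barrier_time))
		sb->avg_barrier_time = (duration + sb->avg_barrier_time * 15) / 16;
	else
		sb->avg_barrier_time = duration;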

Second, the "nojan" column is a control knob I added to Jan Kara's old barrier
generation patch so that I could measure its effects.  0 means always track
barrier generations and don't submit flushes for already-flushed data.  1 means
always issue flushes, regardless of generation counts.

Third, the "nodj" column is a control knob that controls my old
EXT4_STATE_DIRTY_DATA patch.  A zero here means that a flush will only be
triggered if ext4_writepage has written some dirty data and there hasn't been
a flush yet.  1 disables this logic.

Fourth, the bolded cells in the table represent the highest transactions per
second count across all fsync_delay_us values when holding the other four
control variables constant.  For example, let's take a look at
host=elm3a4,directio=0,nojan=0,nodj=0.  There are five fsync_delay_us values
(old, jbd2, avg, 1, 500) and five corresponding results (145.84, 184.06,
181.58, 152.39, 158.19).  184.06 is the highest, hence jbd2 wins and is in bold
face.

Background colors are used to group the rows by fsync_delay_us.

The barriers=0 results are, of course, the transactions per second count when
the fs is mounted with barrier support disabled.  This ought to provide a rough
idea of the upper performance limit of each piece of hardware.

------

As for Andreas' suggestion, he wants ext4 to use jbd2 as the coordination point for
all fsync calls.  I could be wrong, but I think that the following snippet
ought to do the trick:

handle_t *h;
tid_t target;

h = ext4_journal_start(inode, 0);	/* open an empty transaction */
ext4_journal_stop(h);
if (jbd2_journal_start_commit(journal, &target))
	jbd2_log_wait_commit(journal, target);

It looks as though this snippet effectively says "Send an empty transaction.
Then, if there are any live or committing transactions, wait for them to
finish", which sounds like what we want.  I figured this also means that the
nojan/nodj settings would not have any significant effect on the results, which
seems to be true (though nojan/nodj have had little effect under Tejun's
patchset).  So I coded up that patch and gave it a spin on my testing farm.
The results have been added to the 2.6.36-rc5 spreadsheet.  Though I have to
say, this seems like an awful lot of overhead just to issue a flush command.

Given a quick look around the jbd2 code, it seems that going through the journal
ought to have a higher overhead cost, which would negatively impact performance
on hardware that features low flush times.  This seems to be true for
elm3a63, elm3c44_sas, and elm3c71_sas in directio=1 mode, where we see rather
large regressions against fsync_delay=avg_sync_time.  Curiously, I saw a
dramatic increase in speed for the SSDs when directio=1, which probably relates
to the way SSDs perform writes.

Other than those regressions, the jbd2 fsync coordination is about as fast as
sending the flush directly from ext4.  Unfortunately, where there _are_
regressions they seem rather large, which makes this approach (as implemented,
anyway) less attractive.  Perhaps there is a better way to do it?

--D
Darrick J. Wong Oct. 8, 2010, 9:26 p.m. UTC | #6
On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
> 
> Other than those regressions, the jbd2 fsync coordination is about as fast as
> sending the flush directly from ext4.  Unfortunately, where there _are_
> regressions they seem rather large, which makes this approach (as implemented,
> anyway) less attractive.  Perhaps there is a better way to do it?

Hmm, not much chatter for two weeks.  Either I've confused everyone with the
humongous spreadsheet, or ... something?

I've performed some more extensive performance and safety testing with the
fsync coordination patch.  The results have been merged into the spreadsheet
that I linked to in the last email, though in general the results have not
really changed much at all.

I see two trends here when comparing the use of jbd2 to coordinate the
flushes vs. measuring and coordinating flushes directly in ext4.
The first is that for loads that most benefit from having any kind of fsync
coordination (i.e. storage with slow flushes), the jbd2 approach provides the
same or slightly better performance than the direct approach.  However, for
storage with fast flushes, the jbd2 approach seems to cause major slowdowns
even compared to not changing any code at all.  To me this would suggest that
ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
cost, because a huge performance regression isn't good.

Other people in my group have been running their own performance comparisons
between no-coordination, jbd2-coordination, and direct-coordination, and what
I'm hearing is that the direct-coordination mode is slightly faster than jbd2
coordination, though both are better than no coordination at all.  Happily, I
haven't seen an increase in fsck complaints in my poweroff testing.

Given the nearness of the merge window, perhaps we ought to discuss this on
Monday's ext4 call?  In the meantime I'll clean up the fsync coordination patch
so that it doesn't have so many debugging knobs and whistles.

Thanks,

--D
Ric Wheeler Oct. 8, 2010, 9:56 p.m. UTC | #7
On 10/08/2010 05:26 PM, Darrick J. Wong wrote:
> On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>> Other than those regressions, the jbd2 fsync coordination is about as fast as
>> sending the flush directly from ext4.  Unfortunately, where there _are_
>> regressions they seem rather large, which makes this approach (as implemented,
>> anyway) less attractive.  Perhaps there is a better way to do it?
> Hmm, not much chatter for two weeks.  Either I've confused everyone with the
> humongous spreadsheet, or ... something?
>
> I've performed some more extensive performance and safety testing with the
> fsync coordination patch.  The results have been merged into the spreadsheet
> that I linked to in the last email, though in general the results have not
> really changed much at all.
>
> I see two trends happening here with regards to comparing the use of jbd2 to
>> coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
> The first is that for loads that most benefit from having any kind of fsync
> coordination (i.e. storage with slow flushes), the jbd2 approach provides the
> same or slightly better performance than the direct approach.  However, for
> storage with fast flushes, the jbd2 approach seems to cause major slowdowns
> even compared to not changing any code at all.  To me this would suggest that
> ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
> cost, because a huge performance regression isn't good.
>
> Other people in my group have been running their own performance comparisons
> between no-coordination, jbd2-coordination, and direct-coordination, and what
>> I'm hearing is that the direct-coordination mode is slightly faster than jbd2
> coordination, though either are better than no coordination at all.  Happily, I
> haven't seen an increase in fsck complaints in my poweroff testing.
>
> Given the nearness of the merge window, perhaps we ought to discuss this on
> Monday's ext4 call?  In the meantime I'll clean up the fsync coordination patch
> so that it doesn't have so many debugging knobs and whistles.
>
> Thanks,
>
> --D

Hi Darrick,

We have been busily testing various combinations at Red Hat (we being not me 
:)), but here is one test that we used a long time back to validate the batching 
impact.

You need a slow, poky S-ATA drive - the slower it spins, the better.

A single fs_mark run against that drive should drive some modest number of 
files/sec with 1 thread:


[root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -d /test/foo

On my disk, I see:

      5          500        20480         31.8             6213

Now run with 4 threads to give the code a chance to coalesce.

On my box, I see it jump up:

      5         2000        20480        113.0            25092

And at 8 threads it jumps again:

      5         4000        20480        179.0            49480

This workload is very device-specific. On a very low-latency device (arrays, 
high performance SSD), the coalescing "wait" time could be slower than just 
dispatching the command. Ext3/4 work done by Josef a few years back was meant to 
use high res timers to dynamically adjust that wait to avoid slowing down.

Have we tested the combined patchset with this?

Thanks!

Ric






Theodore Ts'o Oct. 11, 2010, 2:33 p.m. UTC | #8
On Fri, Oct 08, 2010 at 02:26:06PM -0700, Darrick J. Wong wrote:
> Given the nearness of the merge window, perhaps we ought to discuss
> this on Monday's ext4 call?  In the meantime I'll clean up the fsync
> coordination patch so that it doesn't have so many debugging knobs
> and whistles.

Yes, we should definitely talk about this on today's call.

One of the things that concerns me is that, if I'm reading the
spreadsheet correctly, there are some columns that show a sizable
regression.  For example, on elm3c44_sas, comparing I3 (which is
what will be in 2.6.36 --- Jan's patch, but not yours, i.e. the
currently existing behaviour) with I10 (which is, as I understand
it, what you are recommending), I see an FFSB regression from
1,916.54 to 1,690.33.

That's roughly a 12% regression (1,690.33 is about 88% of
1,916.54).  I know that for other
hardware, there are improvements, but the fact that at least for some
hardware we're seeing such a big regression makes me worried that
we're missing something --- or maybe it's just me who is missing
something about your spreadsheet?

						- Ted
Darrick J. Wong Oct. 11, 2010, 8:20 p.m. UTC | #9
On Fri, Oct 08, 2010 at 05:56:12PM -0400, Ric Wheeler wrote:
>  On 10/08/2010 05:26 PM, Darrick J. Wong wrote:
>> On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>>> Other than those regressions, the jbd2 fsync coordination is about as fast as
>>> sending the flush directly from ext4.  Unfortunately, where there _are_
>>> regressions they seem rather large, which makes this approach (as implemented,
>>> anyway) less attractive.  Perhaps there is a better way to do it?
>> Hmm, not much chatter for two weeks.  Either I've confused everyone with the
>> humongous spreadsheet, or ... something?
>>
>> I've performed some more extensive performance and safety testing with the
>> fsync coordination patch.  The results have been merged into the spreadsheet
>> that I linked to in the last email, though in general the results have not
>> really changed much at all.
>>
>> I see two trends happening here with regards to comparing the use of jbd2 to
>> coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
>> The first is that for loads that most benefit from having any kind of fsync
>> coordination (i.e. storage with slow flushes), the jbd2 approach provides the
>> same or slightly better performance than the direct approach.  However, for
>> storage with fast flushes, the jbd2 approach seems to cause major slowdowns
>> even compared to not changing any code at all.  To me this would suggest that
>> ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
>> cost, because a huge performance regression isn't good.
>>
>> Other people in my group have been running their own performance comparisons
>> between no-coordination, jbd2-coordination, and direct-coordination, and what
>> I'm hearing is that the direct-coordination mode is slightly faster than jbd2
>> coordination, though either are better than no coordination at all.  Happily, I
>> haven't seen an increase in fsck complaints in my poweroff testing.
>>
>> Given the nearness of the merge window, perhaps we ought to discuss this on
>> Monday's ext4 call?  In the meantime I'll clean up the fsync coordination patch
>> so that it doesn't have so many debugging knobs and whistles.
>>
>> Thanks,
>>
>> --D
>
> Hi Darrick,
>
> We have been busily testing various combinations at Red Hat (we being not 
> me :)), but here is one test that we used a long time back to validate 
> the batching impact.
>
> You need a slow, poky S-ATA drive - the slower it spins, the better.
>
> A single fs_mark run against that drive should drive some modest number 
> of files/sec with 1 thread:
>
>
> [root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -d /test/foo
>
> On my disk, I see:
>
>      5          500        20480         31.8             6213
>
> Now run with 4 threads to give the code a chance to coalesce.
>
> On my box, I see it jump up:
>
>      5         2000        20480        113.0            25092
>
> And at 8 threads it jumps again:
>
>      5         4000        20480        179.0            49480
>
> This work load is very device specific. On a very low latency device 
> (arrays, high performance SSD), the coalescing "wait" time could be 
> slower than just dispatching the command. Ext3/4 work done by Josef a few 
> years back was meant to use high res timers to dynamically adjust that 
> wait to avoid slowing down.

Yeah, elm3c65 and elm3c75 in that spreadsheet are a new pokey SATA disk and a
really old IDE disk, which ought to represent the low-end case.  elm3c44-sas is
a midrange storage server... which doesn't like the patch so much.

--D

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..9e8dcc7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1143,6 +1143,13 @@  struct ext4_sb_info {
 
 	/* workqueue for dio unwritten */
 	struct workqueue_struct *dio_unwritten_wq;
+
+	/* fsync coordination */
+	spinlock_t barrier_flag_lock;
+	int in_barrier;
+	struct completion barrier_completion;
+	u64 avg_barrier_time;
+	int max_fsync_delay;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 592adf2..e164ccd 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -75,9 +75,12 @@  int ext4_sync_file(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	struct ext4_sb_info *sb = EXT4_SB(inode->i_sb);
+	journal_t *journal = sb->s_journal;
 	int ret;
 	tid_t commit_tid;
+	ktime_t before, expires;
+	u64 duration;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -130,8 +133,52 @@  int ext4_sync_file(struct file *file, int datasync)
 			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL,
 					NULL, BLKDEV_IFL_WAIT);
 		ret = jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
-		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
-			BLKDEV_IFL_WAIT);
+	} else if (journal->j_flags & JBD2_BARRIER) {
+		if (!sb->max_fsync_delay) {
+			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
+				BLKDEV_IFL_WAIT);
+			goto fast_out;
+		}
+again:
+		spin_lock(&sb->barrier_flag_lock);
+		if (sb->in_barrier == 2) {
+			spin_unlock(&sb->barrier_flag_lock);
+			ret = wait_for_completion_interruptible(&sb->barrier_completion);
+			goto again;
+		} else if (sb->in_barrier) {
+			spin_unlock(&sb->barrier_flag_lock);
+			ret = wait_for_completion_interruptible(&sb->barrier_completion);
+		} else {
+			sb->in_barrier = 1;
+			INIT_COMPLETION(sb->barrier_completion);
+			spin_unlock(&sb->barrier_flag_lock);
+
+			if (sb->max_fsync_delay > 0 && (sb->max_fsync_delay * 1000) < sb->avg_barrier_time)
+				expires = ktime_add_ns(ktime_get(), sb->max_fsync_delay * 1000);
+			else
+				expires = ktime_add_ns(ktime_get(), sb->avg_barrier_time);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+			spin_lock(&sb->barrier_flag_lock);
+			sb->in_barrier = 2;
+			spin_unlock(&sb->barrier_flag_lock);
+
+			before = ktime_get();
+			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL, BLKDEV_IFL_WAIT);
+			duration = ktime_to_ns(ktime_sub(ktime_get(), before));
+			if (likely(sb->avg_barrier_time))
+				sb->avg_barrier_time = (duration + sb->avg_barrier_time * 15) / 16;
+			else
+				sb->avg_barrier_time = duration;
+
+			complete_all(&sb->barrier_completion);
+
+			spin_lock(&sb->barrier_flag_lock);
+			sb->in_barrier = 0;
+			spin_unlock(&sb->barrier_flag_lock);
+		}
+	}
+fast_out:
 	return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..8d04e43 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -953,6 +953,7 @@  static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	if (!test_opt(sb, DELALLOC))
 		seq_puts(seq, ",nodelalloc");
 
+	seq_printf(seq, ",max_fsync_delay=%d", sbi->max_fsync_delay);
 
 	if (sbi->s_stripe)
 		seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
@@ -1160,7 +1161,7 @@  enum {
 	Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
-	Opt_discard, Opt_nodiscard,
+	Opt_discard, Opt_nodiscard, Opt_max_fsync_delay,
 };
 
 static const match_table_t tokens = {
@@ -1231,6 +1232,7 @@  static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_max_fsync_delay, "max_fsync_delay=%d"},
 	{Opt_err, NULL},
 };
 
@@ -1699,6 +1701,16 @@  set_qf_format:
 		case Opt_dioread_lock:
 			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
 			break;
+		case Opt_max_fsync_delay:
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				return 0;
+
+			sbi->max_fsync_delay = option;
+			ext4_msg(sb, KERN_INFO, "Maximum fsync delay %d\n", option);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
@@ -2562,6 +2574,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (!IS_EXT3_SB(sb))
 		set_opt(sbi->s_mount_opt, DELALLOC);
 
+	EXT4_SB(sb)->max_fsync_delay = -1;
+
 	if (!parse_options((char *) data, sb, &journal_devnum,
 			   &journal_ioprio, NULL, 0))
 		goto failed_mount;
@@ -3045,6 +3059,11 @@  no_journal:
 	ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
 		"Opts: %s", descr, orig_data);
 
+	EXT4_SB(sb)->in_barrier = 0;
+	EXT4_SB(sb)->avg_barrier_time = 0;
+	spin_lock_init(&EXT4_SB(sb)->barrier_flag_lock);
+	init_completion(&EXT4_SB(sb)->barrier_completion);
+
 	lock_kernel();
 	kfree(orig_data);
 	return 0;