Patchwork NULL pointer dereference in ext4_ext_remove_space on 3.5.1

Submitter: Theodore Ts'o
Date: Aug. 16, 2012, 3:25 p.m.
Message ID: <20120816152513.GA31346@thunk.org>
Permalink: /patch/178041/
State: Superseded

Comments

Theodore Ts'o - Aug. 16, 2012, 3:25 p.m.
On Thu, Aug 16, 2012 at 07:10:51PM +0800, Fengguang Wu wrote:
> 
> Here is the dmesg. BTW, it seems 3.5.0 doesn't have this issue.

Fengguang,

It sounds like you have a (at least fairly) reliable reproduction for
this problem?  Is it something you can share?  It would be good to get
this into our test suites, since it was _not_ something that was
caught by xfstests, apparently.

Can you see if this patch addresses it?  (The first two patch hunks
are the same debugging additions I had posted before.)

It looks like the responsible commit is 968dee7722: "ext4: fix hole
punch failure when depth is greater than 0".  I had thought this patch
was low risk if you weren't using the new punch ioctl, but it turns
out it did make a critical change in the non-punch (i.e., truncate)
code path, which is what the addition of "i = 0;" in the patch below
addresses.

Regards,

						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski - Aug. 16, 2012, 8:21 p.m.
This would probably be much more readable code if the 'i=0' init was
before path=kzalloc.

On Thu, Aug 16, 2012 at 8:25 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Aug 16, 2012 at 07:10:51PM +0800, Fengguang Wu wrote:
>>
>> Here is the dmesg. BTW, it seems 3.5.0 doesn't have this issue.
>
> Fengguang,
>
> It sounds like you have a (at least fairly) reliable reproduction for
> this problem?  Is it something you can share?  It would be good to get
> this into our test suites, since it was _not_ something that was
> caught by xfstests, apparently.
>
> Can you see if this patch addresses it?  (The first two patch hunks
> are the same debugging additions I had posted before.)
>
> It looks like the responsible commit is 968dee7722: "ext4: fix hole
> punch failure when depth is greater than 0".  I had thought this patch
> was low risk if you weren't using the new punch ioctl, but it turns
> out it did make a critical change in the non-punch (i.e., truncate)
> code path, which is what the addition of "i = 0;" in the patch below
> addresses.
>
> Regards,
>
>                                           - Ted
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 769151d..fa829dc 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2432,6 +2432,10 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
>
>         /* the header must be checked already in ext4_ext_remove_space() */
>         ext_debug("truncate since %u in leaf to %u\n", start, end);
> +       if (!path[depth].p_hdr && !path[depth].p_bh) {
> +               EXT4_ERROR_INODE(inode, "depth %d", depth);
> +               BUG_ON(1);
> +       }
>         if (!path[depth].p_hdr)
>                 path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
>         eh = path[depth].p_hdr;
> @@ -2730,6 +2734,10 @@ cont:
>                 /* this is index block */
>                 if (!path[i].p_hdr) {
>                         ext_debug("initialize header\n");
> +                       if (!path[i].p_hdr && !path[i].p_bh) {
> +                               EXT4_ERROR_INODE(inode, "i=%d", i);
> +                               BUG_ON(1);
> +                       }
>                         path[i].p_hdr = ext_block_hdr(path[i].p_bh);
>                 }
>
> @@ -2828,6 +2836,7 @@ out:
>         kfree(path);
>         if (err == -EAGAIN) {
>                 path = NULL;
> +               i = 0;
>                 goto again;
>         }
>         ext4_journal_stop(handle);
Theodore Ts'o - Aug. 16, 2012, 9:19 p.m.
On Thu, Aug 16, 2012 at 01:21:12PM -0700, Maciej Żenczykowski wrote:
> This would probably be much more readable code if the 'i=0' init was
> before path=kzalloc.

Good point, I agree.  I'll move the initialization so i gets
initialized in both branches of the if statement.

Maciej, you weren't able to reliably repro the crash, were you?  I'm
pretty sure this should fix the crash, but it would be really great to
confirm things.

I suspect creating a file system with a really small journal may make
it easier to reproduce, but I haven't had time to try to create a
reliable repro for this bug yet.

Thanks,

						- Ted
Maciej Żenczykowski - Aug. 16, 2012, 9:40 p.m.
> Maciej, you weren't able to reliably repro the crash, were you?  I'm
> pretty sure this should fix the crash, but it would be really great to
> confirm things.
>
> I suspect creating a file system with a really small journal may make
> it easier to reproduce, but I haven't had time to try to create a
> reliable repro for this bug yet.

This happened twice to me while moving data off of a ~1TB ext4 partition.
The data portion was on a stripe raid across 2 ~500GB drives, the
journal was on a relatively large partition (500MB?) on an SSD.
(crypto and lvm were also involved).
I've since emptied the partition and deleted even the raid array.

Both times it happened during rm, first time rm -rf of a directory
tree, second time during rm of a 250GB disk image generated by dd
(from a notebook drive).
Both rm's were manually run by me from a shell command line, and there
was pretty much nothing else happening on the machine at the time.

I'm not aware of there having been anything interesting (like:
holes/punch/sparseness, much r/w activity in the middle of files, etc)
on this filesystem, it was pretty much just a write-once data backup
that I had copied elsewhere and was deleting.  The 250GB disk image
was definitely just a sequentially written disk dump, and I think the
same thing holds true for the contents of the wiped directory tree
(although in many much smaller files).

I know i=1 in both cases (and disassembly pointed out the location
where the above debug patch is BUGing), but I don't think it's
possible to figure out what inode # it crashed on.

Perhaps just untarring a bunch of kernels onto an empty partition,
filling it up, then deleting those kernels should be sufficient to
repro this (untried).

Perhaps something like:
  create 1TB filesystem
  untar a thousand kernel source trees on to it
  create 20GB files of junk until it is full
  rm -rf everything on that filesystem


- Maciej
Theodore Ts'o - Aug. 16, 2012, 10:26 p.m.
On Thu, Aug 16, 2012 at 02:40:53PM -0700, Maciej Żenczykowski wrote:
> 
> This happened twice to me while moving data off of a ~1TB ext4 partition.
> The data portion was on a stripe raid across 2 ~500GB drives, the
> journal was on a relatively large partition (500MB?) on an SSD.
> (crypto and lvm were also involved).
> ...
> Perhaps just untarring a bunch of kernels onto an empty partition,
> filling it up, then deleting those kernels should be sufficient to
> repro this (untried).

Thanks, that's really helpful.   I can say that using a 4MB journal and
running fsstress is _not_ enough to trigger the problem.

Looking more closely at what might be needed to trigger the bug, 'i'
gets left uninitialized when err is set to -EAGAIN, and that happens
when ext4_ext_truncate_extend_restart() is unable to extend the
journal transaction.  But that also means we need to be deleting a
sufficiently large file that its blocks span multiple block
groups (which is why we need to extend the transaction, so we can
modify more bitmap blocks) at the point when there is no more room in
the journal, so we have to close the current transaction, and then
retry it again with a new journal handle in a new transaction.

So that implies that untarring a bunch of kernels probably won't be
sufficient, since the files will be too small.  What we probably will
need to do is to fill a large file system with lots of large files,
use a small journal, and then try to do an rm -rf.

						- Ted
Maciej Żenczykowski - Aug. 16, 2012, 10:44 p.m.
> Thanks, that's really helpful.   I can say that using a 4MB journal and
> running fsstress is _not_ enough to trigger the problem.
>
> Looking more closely at what might be needed to trigger the bug, 'i'
> gets left uninitialized when err is set to -EAGAIN, and that happens
> when ext4_ext_truncate_extend_restart() is unable to extend the
> journal transaction.  But that also means we need to be deleting a
> sufficiently large file that its blocks span multiple block
> groups (which is why we need to extend the transaction, so we can
> modify more bitmap blocks) at the point when there is no more room in
> the journal, so we have to close the current transaction, and then
> retry it again with a new journal handle in a new transaction.
>
> So that implies that untarring a bunch of kernels probably won't be
> sufficient, since the files will be too small.  What we probably will
> need to do is to fill a large file system with lots of large files,
> use a small journal, and then try to do an rm -rf.
>
>                                              - Ted

My suggestion of untarring kernels was to cause the big multi gigabyte
files created later on to be massively fragmented, and thus have tons
of extents and a relatively deep extent tree.
But maybe that's not needed to trigger this bug, if as you say, it is
caused by the absolute number of disk blocks being freed and not by
the size/depth/complexity of the extent tree.
My knowledge of the internals of ext4 is pretty much non-existent. ;-)
In this case I'm just an end user.
Wu Fengguang - Aug. 17, 2012, 6:01 a.m.
On Thu, Aug 16, 2012 at 11:25:13AM -0400, Theodore Ts'o wrote:
> On Thu, Aug 16, 2012 at 07:10:51PM +0800, Fengguang Wu wrote:
> > 
> > Here is the dmesg. BTW, it seems 3.5.0 doesn't have this issue.
> 
> Fengguang,
> 
> It sounds like you have a (at least fairly) reliable reproduction for
> this problem?  Is it something you can share?  It would be good to get

Right, it can be easily reproduced here. I'm running these writeback
performance tests:

        https://github.com/fengguang/writeback-tests

Which is basically doing N parallel dd writes to JBOD/RAID arrays on
various filesystems. It seems that the RAID test can reliably trigger
the problem.

> this into our test suites, since it was _not_ something that was
> caught by xfstests, apparently.
> 
> Can you see if this patch addresses it?  (The first two patch hunks
> are the same debugging additions I had posted before.)
> 
> It looks like the responsible commit is 968dee7722: "ext4: fix hole
> punch failure when depth is greater than 0".  I had thought this patch
> was low risk if you weren't using the new punch ioctl, but it turns
> out it did make a critical change in the non-punch (i.e., truncate)
> code path, which is what the addition of "i = 0;" in the patch below
> addresses.

Yes, I'm sure the patch fixed the bug. With the fix, the writeback
tests have run flawlessly for a dozen hours without any problem.

Thanks,
Fengguang
Wu Fengguang - Aug. 17, 2012, 6:09 a.m.
Ted,

I find ext4 write performance dropped by 3.3% on average in the
3.6-rc1 merge window. xfs and btrfs are fine.

Two machines are tested. The performance regression happens in the
lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
not see regression, which is equipped with HDD drives. I'll continue
to repeat the tests and report variations.

The 3.6.0-rc1+ kernel below is 3.6.0-rc1 plus the NULL dereference fix.

wfg@bee /export/writeback% ./compare -g ext4 lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  720.62        -1.5%       710.16  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  706.04        -0.0%       705.86  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  702.86        -0.2%       701.74  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  779.52        +6.5%       830.11  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  646.70        +4.9%       678.59  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  704.49        +2.6%       723.00  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  705.26        -1.2%       696.61  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  703.37        +0.1%       703.76  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  701.66        -0.1%       700.83  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  675.08       -10.5%       604.29  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  676.52        -2.7%       658.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  512.70        +4.0%       533.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  709.76       -15.7%       598.44  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  681.39        -2.1%       667.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  699.77       -19.2%       565.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  675.79        -1.9%       663.17  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  484.84        -7.4%       448.83  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  167.97       -38.7%       103.03  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  243.67        -9.1%       221.41  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  248.98       +12.2%       279.33  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                   71.18       -34.2%        46.82  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  145.84        -7.3%       135.25  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  255.22        +6.7%       272.35  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  209.24       -23.6%       159.96  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  243.73       -10.9%       217.28  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  214.25        +5.6%       226.32  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                13286.46        -3.3%     12851.55  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g xfs lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  687.76        +2.4%       704.52  lkp-nex04/JBOD-12HDD-thresh=1000M/xfs-100dd-1-3.5.0
                  705.09        +0.0%       705.11  lkp-nex04/JBOD-12HDD-thresh=1000M/xfs-10dd-1-3.5.0
                  702.21        -0.1%       701.72  lkp-nex04/JBOD-12HDD-thresh=1000M/xfs-1dd-1-3.5.0
                  664.86       +21.8%       809.81  lkp-nex04/JBOD-12HDD-thresh=100M/xfs-100dd-1-3.5.0
                  609.97       +13.6%       693.12  lkp-nex04/JBOD-12HDD-thresh=100M/xfs-10dd-1-3.5.0
                  708.30        +0.8%       713.68  lkp-nex04/JBOD-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                  701.19        -0.0%       700.85  lkp-nex04/JBOD-12HDD-thresh=8G/xfs-10dd-1-3.5.0
                  701.69        -0.1%       701.01  lkp-nex04/JBOD-12HDD-thresh=8G/xfs-1dd-1-3.5.0
                  699.98        -0.4%       697.40  lkp-nex04/RAID0-12HDD-thresh=1000M/xfs-10dd-1-3.5.0
                  653.92        +0.3%       656.07  lkp-nex04/RAID0-12HDD-thresh=1000M/xfs-1dd-1-3.5.0
                  650.25        +0.5%       653.32  lkp-nex04/RAID0-12HDD-thresh=100M/xfs-10dd-1-3.5.0
                  612.47        -2.9%       594.93  lkp-nex04/RAID0-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                  694.90        +0.0%       695.19  lkp-nex04/RAID0-12HDD-thresh=8G/xfs-10dd-1-3.5.0
                  607.37       +14.2%       693.36  lkp-nex04/RAID0-12HDD-thresh=8G/xfs-1dd-1-3.5.0
                  273.54       +27.1%       347.67  lkp-nex04/RAID5-12HDD-thresh=1000M/xfs-10dd-1-3.5.0
                  277.00       +30.6%       361.71  lkp-nex04/RAID5-12HDD-thresh=1000M/xfs-1dd-1-3.5.0
                  194.74        +6.6%       207.62  lkp-nex04/RAID5-12HDD-thresh=100M/xfs-10dd-1-3.5.0
                  288.92       +21.2%       350.05  lkp-nex04/RAID5-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                  278.33       +26.4%       351.78  lkp-nex04/RAID5-12HDD-thresh=8G/xfs-10dd-1-3.5.0
                  285.64       +24.2%       354.68  lkp-nex04/RAID5-12HDD-thresh=8G/xfs-1dd-1-3.5.0
                10998.15        +6.3%     11693.60  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g btrfs lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  703.26        -0.1%       702.57  lkp-nex04/JBOD-12HDD-thresh=1000M/btrfs-10dd-1-3.5.0
                  701.88        -0.0%       701.85  lkp-nex04/JBOD-12HDD-thresh=1000M/btrfs-1dd-1-3.5.0
                  697.67        +7.1%       747.07  lkp-nex04/JBOD-12HDD-thresh=100M/btrfs-10dd-1-3.5.0
                  712.91        -0.4%       710.36  lkp-nex04/JBOD-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  702.02        -0.1%       701.26  lkp-nex04/JBOD-12HDD-thresh=8G/btrfs-10dd-1-3.5.0
                  702.06        -0.1%       701.66  lkp-nex04/JBOD-12HDD-thresh=8G/btrfs-1dd-1-3.5.0
                  709.01        -0.7%       703.83  lkp-nex04/RAID0-12HDD-thresh=1000M/btrfs-10dd-1-3.5.0
                  696.67        -4.2%       667.22  lkp-nex04/RAID0-12HDD-thresh=1000M/btrfs-1dd-1-3.5.0
                  822.15        +0.1%       823.01  lkp-nex04/RAID0-12HDD-thresh=100M/btrfs-10dd-1-3.5.0
                  685.14        +2.9%       705.35  lkp-nex04/RAID0-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  702.55        -0.0%       702.23  lkp-nex04/RAID0-12HDD-thresh=8G/btrfs-10dd-1-3.5.0
                  674.09        -7.1%       626.31  lkp-nex04/RAID0-12HDD-thresh=8G/btrfs-1dd-1-3.5.0
                  270.81       +21.0%       327.76  lkp-nex04/RAID5-12HDD-thresh=1000M/btrfs-10dd-1-3.5.0
                  267.19       +15.8%       309.36  lkp-nex04/RAID5-12HDD-thresh=1000M/btrfs-1dd-1-3.5.0
                  273.89       +25.3%       343.10  lkp-nex04/RAID5-12HDD-thresh=100M/btrfs-10dd-1-3.5.0
                  276.31       +19.7%       330.87  lkp-nex04/RAID5-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  251.25       +17.3%       294.80  lkp-nex04/RAID5-12HDD-thresh=8G/btrfs-10dd-1-3.5.0
                  267.48        +7.1%       286.47  lkp-nex04/RAID5-12HDD-thresh=8G/btrfs-1dd-1-3.5.0
                10116.34        +2.7%     10385.07  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g ext4 lkp-st02-x8664/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  900.62        +0.1%       901.66  lkp-st02-x8664/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  898.13        +1.4%       910.73  lkp-st02-x8664/JBOD-12HDD-thresh=2G/ext4-1dd-1-3.5.0
                  166.95        +3.8%       173.33  lkp-st02-x8664/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  176.14        +2.8%       181.01  lkp-st02-x8664/RAID5-12HDD-thresh=2G/ext4-1dd-1-3.5.0
                   25.84        +0.3%        25.92  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_mmap_0_4k-1-3.5.0
                   92.34        -4.8%        87.88  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_mmap_0_64k-1-3.5.0
                   21.20        +2.1%        21.65  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_mmap_1_4k-1-3.5.0
                   90.43        +1.6%        91.90  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_mmap_1_64k-1-3.5.0
                   28.69        -1.8%        28.18  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_sync_0_4k-1-3.5.0
                  201.86        +0.2%       202.17  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randrw_sync_0_64k-1-3.5.0
                   28.43        -0.2%        28.37  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_mmap_0_4k-1-3.5.0
                  110.25        -0.1%       110.20  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_mmap_0_64k-1-3.5.0
                   31.20        +0.5%        31.36  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_sync_0_4k-1-3.5.0
                  289.28        +1.0%       292.08  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_sync_0_64k-1-3.5.0
                   20.50        +0.9%        20.67  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_sync_1_4k-1-3.5.0
                  294.64        +0.4%       295.94  lkp-st02-x8664/jbod_12hdd/ext4-fio_jbod_12hdd_randwrite_sync_1_64k-1-3.5.0
                 3376.51        +0.8%      3403.05  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g xfs lkp-st02-x8664/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  976.57        -4.8%       929.50  lkp-st02-x8664/JBOD-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                 1003.33        +2.3%      1026.41  lkp-st02-x8664/JBOD-12HDD-thresh=2G/xfs-1dd-1-3.5.0
                  796.67        -2.1%       780.09  lkp-st02-x8664/RAID0-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                  754.89        +0.3%       757.24  lkp-st02-x8664/RAID0-12HDD-thresh=2G/xfs-1dd-1-3.5.0
                  183.18        +7.6%       197.02  lkp-st02-x8664/RAID5-12HDD-thresh=100M/xfs-1dd-1-3.5.0
                  191.62        +9.0%       208.92  lkp-st02-x8664/RAID5-12HDD-thresh=2G/xfs-1dd-1-3.5.0
                   71.83        -1.0%        71.13  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_mmap_0_4k-1-3.5.0
                  104.93        -1.3%       103.56  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_mmap_0_64k-1-3.5.0
                   25.90        -0.4%        25.79  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_mmap_1_4k-1-3.5.0
                   88.13        +1.1%        89.06  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_mmap_1_64k-1-3.5.0
                   88.63        +0.2%        88.85  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_sync_0_4k-1-3.5.0
                  291.55        +0.1%       291.70  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randrw_sync_0_64k-1-3.5.0
                   87.44        -1.5%        86.15  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randwrite_mmap_0_4k-1-3.5.0
                  122.64        -1.6%       120.69  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randwrite_mmap_0_64k-1-3.5.0
                  507.15        +0.2%       508.12  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randwrite_sync_0_64k-1-3.5.0
                   32.09        -0.8%        31.85  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randwrite_sync_1_4k-1-3.5.0
                  331.16        +0.2%       331.77  lkp-st02-x8664/jbod_12hdd/xfs-fio_jbod_12hdd_randwrite_sync_1_64k-1-3.5.0
                 5657.70        -0.2%      5647.85  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g btrfs lkp-st02-x8664/*/*-{3.5.0,3.6.0-rc1+}
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  970.57        -2.9%       942.80  lkp-st02-x8664/JBOD-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  965.95        -0.1%       964.91  lkp-st02-x8664/JBOD-12HDD-thresh=2G/btrfs-1dd-1-3.5.0
                  813.94        -2.3%       794.99  lkp-st02-x8664/RAID0-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  860.05       -11.1%       764.50  lkp-st02-x8664/RAID0-12HDD-thresh=2G/btrfs-1dd-1-3.5.0
                  164.02       +15.3%       189.09  lkp-st02-x8664/RAID5-12HDD-thresh=100M/btrfs-1dd-1-3.5.0
                  163.78       +14.1%       186.94  lkp-st02-x8664/RAID5-12HDD-thresh=2G/btrfs-1dd-1-3.5.0
                 3938.30        -2.4%      3843.24  TOTAL write_bw

Thanks,
Fengguang
Theodore Ts'o - Aug. 17, 2012, 1:15 p.m.
Thanks Fengguang:

For the record, I was able to find my own easy repro last night, using
only a 220 meg partition:

# mke2fs -t ext4 -b 1024 -J size=1 /dev/vdc
# mount -t ext2 /dev/vdc /vdc
# mkdir /vdc/a
# cd /vdc/a
# seq 1 210000  | xargs -n 1 fallocate -l 1m
# seq 1 2 210000  | xargs /bin/rm
# mkdir /vdc/b
# cd /vdc/b
# seq 1 103 | xargs -n 1 fallocate -l 1g
# cd /
# umount /vdc
# mount -t ext4 -o commit=10000 /dev/vdc /vdc
# rm -rf /vdc/b

For future reference, there are a couple of things that are of
interest to ext4 developers when trying to create repros:

1)  The use of mounting with ext2 to speed up the setup.

2)  The first two "seq ... | xargs ..." commands to create a very
fragmented file system.

3)  Using a 1k block size file system to stress the extent tree code
and htree directory (since it's easier to make larger tree structures).

4)  The use of the mount option commit=10000 to test what happens when
the journal is full (without using a nice, fast device such as a RAID
array, and without burning write cycles on an expensive flash device.)

						- Ted
Wu Fengguang - Aug. 17, 2012, 1:22 p.m.
Hi Ted,

On Fri, Aug 17, 2012 at 09:15:58AM -0400, Theodore Ts'o wrote:
> Thanks Fengguang:
> 
> For the record, I was able to find my own easy repro last night, using
> only a 220 meg partition:
> 
> # mke2fs -t ext4 -b 1024 -J size=1 /dev/vdc
> # mount -t ext2 /dev/vdc /vdc
> # mkdir /vdc/a
> # cd /vdc/a
> # seq 1 210000  | xargs -n 1 fallocate -l 1m
> # seq 1 2 210000  | xargs /bin/rm
> # mkdir /vdc/b
> # cd /vdc/b
> # seq 1 103 | xargs -n 1 fallocate -l 1g
> # cd /
> # umount /vdc
> # mount -t ext4 -o commit=10000 /dev/vdc /vdc
> # rm -rf /vdc/b

It makes a nice and simple test script; I'd very much like to add it to my
0day test system :-)

> For future reference, there are a couple of things that are of
> interest to ext4 developers when trying to create repros:
> 
> 1)  The use of mounting with ext2 to speed up the setup.
> 
> 2)  The first two "seq ... | xargs ..." commands to create a very
> fragmented file system.
> 
> 3)  Using a 1k block size file system to stress the extent tree code
> and htree directory (since it's easier to make larger tree structures).
> 
> 4)  The use of the mount option commit=10000 to test what happens when
> the journal is full (without using a nice, fast device such as a RAID
> array, and without burning write cycles on an expensive flash device.)

Thanks for the directions! I'll make that a big comment.

Thanks,
Fengguang
Theodore Ts'o - Aug. 17, 2012, 1:40 p.m.
On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> Ted,
> 
> I find ext4 write performance dropped by 3.3% on average in the
> 3.6-rc1 merge window. xfs and btrfs are fine.
> 
> Two machines are tested. The performance regression happens in the
> lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> not see regression, which is equipped with HDD drives. I'll continue
> to repeat the tests and report variations.

Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
fs/ext4 fs/jbd2" and I don't see anything that I would expect would
cause that.  There are the lock elimination changes for Direct I/O
overwrites, but that shouldn't matter for your tests which are
measuring buffered writes, correct?

Is there any chance you could do me a favor and do a git bisect
restricted to commits involving fs/ext4 and fs/jbd2?

Many thanks,

						- Ted
Wu Fengguang - Aug. 17, 2012, 2:13 p.m.
On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> > Ted,
> > 
> > I find ext4 write performance dropped by 3.3% on average in the
> > 3.6-rc1 merge window. xfs and btrfs are fine.
> > 
> > Two machines are tested. The performance regression happens in the
> > lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> > not see regression, which is equipped with HDD drives. I'll continue
> > to repeat the tests and report variations.
> 
> Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> cause that.  There are the lock elimination changes for Direct I/O
> overwrites, but that shouldn't matter for your tests which are
> measuring buffered writes, correct?
> 
> Is there any chance you could do me a favor and do a git bisect
> restricted to commits involving fs/ext4 and fs/jbd2?

No problem :)

Thanks,
Fengguang
Wu Fengguang - Aug. 17, 2012, 2:25 p.m.
[CC md list]

On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> > Ted,
> > 
> > I find ext4 write performance dropped by 3.3% on average in the
> > 3.6-rc1 merge window. xfs and btrfs are fine.
> > 
> > Two machines are tested. The performance regression happens in the
> > lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> > not see regression, which is equipped with HDD drives. I'll continue
> > to repeat the tests and report variations.
> 
> Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> cause that.  There are the lock elimination changes for Direct I/O
> overwrites, but that shouldn't matter for your tests which are
> measuring buffered writes, correct?
> 
> Is there any chance you could do me a favor and do a git bisect
> restricted to commits involving fs/ext4 and fs/jbd2?

I noticed that the regressions all happen in the RAID0/RAID5 cases.
So it may be some interactions between the RAID/ext4 code?

I'll try to get some ext2/3 numbers, which should have fewer changes on the fs side.

wfg@bee /export/writeback% ./compare -g ext4 lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}     
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  720.62        -1.5%       710.16  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  706.04        -0.0%       705.86  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  702.86        -0.2%       701.74  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  702.41        -0.0%       702.06  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                  779.52        +6.5%       830.11  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  646.70        +4.9%       678.59  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  704.49        +2.6%       723.00  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  704.21        +1.2%       712.47  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  705.26        -1.2%       696.61  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  703.37        +0.1%       703.76  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  701.66        -0.1%       700.83  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  701.17        +0.0%       701.36  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                  675.08       -10.5%       604.29  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  676.52        -2.7%       658.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  512.70        +4.0%       533.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  524.61        -0.3%       522.90  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                  709.76       -15.7%       598.44  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  681.39        -2.1%       667.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  524.16        +0.8%       528.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  699.77       -19.2%       565.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  675.79        -1.9%       663.17  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  484.84        -7.4%       448.83  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  470.40        -3.2%       455.31  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                  167.97       -38.7%       103.03  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  243.67        -9.1%       221.41  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  248.98       +12.2%       279.33  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  208.45       +14.1%       237.86  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                   71.18       -34.2%        46.82  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  145.84        -7.3%       135.25  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  255.22        +6.7%       272.35  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  243.09       +20.7%       293.30  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  209.24       -23.6%       159.96  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  243.73       -10.9%       217.28  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  214.25        +5.6%       226.32  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  207.16       +13.4%       234.98  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                17572.12        -1.9%     17240.05  TOTAL write_bw
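
The middle column appears to be the relative change of the 3.6.0-rc1+ value
against the 3.5.0 value. A small Python sketch of that calculation, checked
against three rows copied from the table above (pct_change is a name chosen
here, not part of the compare script):

```python
def pct_change(base, new):
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# (3.5.0, 3.6.0-rc1+) pairs copied from the table above.
rows = {
    "JBOD thresh=1000M 100dd-1": (720.62, 710.16),
    "RAID5 thresh=1000M 100dd-1": (167.97, 103.03),
    "TOTAL write_bw": (17572.12, 17240.05),
}

for name, (base, new) in rows.items():
    print(f"{name:28s} {pct_change(base, new):+6.1f}%")
```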

Thanks,
Fengguang
Theodore Ts'o - Aug. 17, 2012, 3:37 p.m.
On Fri, Aug 17, 2012 at 11:13:18PM +0800, Fengguang Wu wrote:
> 
> Obviously the major regressions happen to the 100dd over raid cases.
> Some 10dd cases are also impacted.
> 
> The attached graphs show that everything fluctuates more in
> 3.6.0-rc1 for the lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1 case.

Hmm... I'm not seeing any differences in the block allocation code, or
in ext4's buffered writeback code paths, which would be the most
likely cause of such problems.  Maybe a quick eyeball of the blktrace
to see if we're doing something pathologically stupid?

You could also try running a filefrag -v on a few of the dd files to
see if there's any significant difference, although as I said, it
doesn't look like there were any significant changes in the block
allocation code between v3.5 and v3.6-rc1 --- although I suppose
changes in timing could have caused the block allocation
decisions to be different, so it's worth checking that out.

Thanks, regards,

						- Ted
Christoph Hellwig - Aug. 17, 2012, 5:48 p.m.
On Fri, Aug 17, 2012 at 09:15:58AM -0400, Theodore Ts'o wrote:
> Thanks Fengguang:
> 
> For the record, I was able to find my own easy repro, last night using
> only a 220 meg partition:
> 
> # mke2fs -t ext4 -b 1024 -J size=1 /dev/vdc
> # mount -t ext2 /dev/vdc /vdc
> # mkdir /vdc/a
> # cd /vdc/a
> # seq 1 210000  | xargs -n 1 fallocate -l 1m
> # seq 1 2 210000  | xargs /bin/rm
> # mkdir /vdc/b
> # cd /vdc/b
> # seq 1 103 | xargs -n 1 fallocate -l 1g
> # cd /
> # umount /vdc
> # mount -t ext4 -o commit=10000 /dev/vdc /vdc
> # rm -rf /vdc/b

Can you submit this for xfstests?

Theodore Ts'o - Aug. 17, 2012, 8:34 p.m.
On Fri, Aug 17, 2012 at 01:48:41PM -0400, Christoph Hellwig wrote:
> 
> Can you submit this for xfstests?
> 

This is actually something I wanted to ask you guys about.  There are
a series of ext4-specific tests that I could potentially add, but I
wasn't sure how welcome they would be in xfstests.  Assuming that
ext4-specific tests would be welcome, is there a number range for
these ext4-specific tests that I should use?

BTW, we have an extension to xfstests that we've been using inside
Google where Google-internal tests have a "g" prefix (i.e., g001,
g002, etc.).  That way we didn't need to worry about conflicts between
newly added upstream xfstests, and ones which were added internally.
Would it make sense to start using some kind of prefix such as "e001"
for ext2/3/4 specific tests?

Regards,

						- Ted
Neil Brown - Aug. 17, 2012, 8:44 p.m.
On Fri, 17 Aug 2012 22:25:26 +0800 Fengguang Wu <fengguang.wu@intel.com>
wrote:

> [CC md list]
> 
> On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> > On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> > > Ted,
> > > 
> > > I find ext4 write performance dropped by 3.3% on average in the
> > > 3.6-rc1 merge window. xfs and btrfs are fine.
> > > 
> > > Two machines are tested. The performance regression happens in the
> > > lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> > > not see regression, which is equipped with HDD drives. I'll continue
> > > to repeat the tests and report variations.
> > 
> > Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> > fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> > cause that.  There are the lock elimination changes for Direct I/O
> > overwrites, but that shouldn't matter for your tests which are
> > measuring buffered writes, correct?
> > 
> > Is there any chance you could do me a favor and do a git bisect
> > restricted to commits involving fs/ext4 and fs/jbd2?
> 
> I noticed that the regressions all happen in the RAID0/RAID5 cases.
> So it may be some interactions between the RAID/ext4 code?

I'm aware of some performance regression in RAID5 which I will be drilling
down into next week.  Some things are faster, but some are slower :-(

RAID0 should be unchanged though - I don't think I've changed anything there.

Looking at your numbers, JBOD ranges from  +6.5% to -1.5%
                        RAID0 ranges from  +4.0% to -19.2%
                        RAID5 ranges from +20.7% to -38.7%

I'm guessing + is good and - is bad?
The RAID5 numbers don't surprise me.  The RAID0 do.
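
The per-layout ranges can be recomputed from the compare output by grouping
rows on the layout prefix of the path. A Python sketch over a subset of the
rows from the quoted table (the regular expression and the layout_ranges
name are assumptions about the output format, not part of the compare
script); on the rows posted, the RAID5 minimum comes out as -38.7%:

```python
import re

# A subset of rows copied from the compare output, including each layout's extremes.
SAMPLE = """\
720.62   -1.5%   710.16  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
779.52   +6.5%   830.11  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
512.70   +4.0%   533.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
699.77  -19.2%   565.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
167.97  -38.7%   103.03  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
243.09  +20.7%   293.30  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-2-3.5.0
"""

# value, signed percent, value, then "<machine>/<LAYOUT>-..."
ROW = re.compile(r"^\s*[\d.]+\s+([+-][\d.]+)%\s+[\d.]+\s+[^/]+/([A-Z0-9]+)-")


def layout_ranges(text):
    """Map each layout (JBOD, RAID0, RAID5) to its (min, max) percent change."""
    ranges = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers, separators and the TOTAL row
        pct, layout = float(m.group(1)), m.group(2)
        lo, hi = ranges.get(layout, (pct, pct))
        ranges[layout] = (min(lo, pct), max(hi, pct))
    return ranges


for layout, (lo, hi) in sorted(layout_ranges(SAMPLE).items()):
    print(f"{layout}: {hi:+.1f}% to {lo:+.1f}%")
```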

> 
> I'll try to get some ext2/3 numbers, which should have fewer changes on the fs side.

Thanks.  That will be useful.

NeilBrown


> 
> wfg@bee /export/writeback% ./compare -g ext4 lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}     
>                    3.5.0                3.6.0-rc1+
> ------------------------  ------------------------
>                   720.62        -1.5%       710.16  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
>                   706.04        -0.0%       705.86  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
>                   702.86        -0.2%       701.74  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
>                   702.41        -0.0%       702.06  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
>                   779.52        +6.5%       830.11  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
>                   646.70        +4.9%       678.59  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-1-3.5.0
>                   704.49        +2.6%       723.00  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
>                   704.21        +1.2%       712.47  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-2-3.5.0
>                   705.26        -1.2%       696.61  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-1-3.5.0
>                   703.37        +0.1%       703.76  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-1-3.5.0
>                   701.66        -0.1%       700.83  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-1-3.5.0
>                   701.17        +0.0%       701.36  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-2-3.5.0
>                   675.08       -10.5%       604.29  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
>                   676.52        -2.7%       658.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
>                   512.70        +4.0%       533.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
>                   524.61        -0.3%       522.90  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
>                   709.76       -15.7%       598.44  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-1-3.5.0
>                   681.39        -2.1%       667.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-1-3.5.0
>                   524.16        +0.8%       528.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-1dd-2-3.5.0
>                   699.77       -19.2%       565.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
>                   675.79        -1.9%       663.17  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-1-3.5.0
>                   484.84        -7.4%       448.83  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-1-3.5.0
>                   470.40        -3.2%       455.31  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-2-3.5.0
>                   167.97       -38.7%       103.03  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
>                   243.67        -9.1%       221.41  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
>                   248.98       +12.2%       279.33  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
>                   208.45       +14.1%       237.86  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
>                    71.18       -34.2%        46.82  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-1-3.5.0
>                   145.84        -7.3%       135.25  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-1-3.5.0
>                   255.22        +6.7%       272.35  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
>                   243.09       +20.7%       293.30  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-2-3.5.0
>                   209.24       -23.6%       159.96  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-1-3.5.0
>                   243.73       -10.9%       217.28  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-1-3.5.0
>                   214.25        +5.6%       226.32  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-1-3.5.0
>                   207.16       +13.4%       234.98  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-2-3.5.0
>                 17572.12        -1.9%     17240.05  TOTAL write_bw
> 
> Thanks,
> Fengguang
Christoph Hellwig - Aug. 17, 2012, 9:05 p.m.
On Fri, Aug 17, 2012 at 04:34:38PM -0400, Theodore Ts'o wrote:
> On Fri, Aug 17, 2012 at 01:48:41PM -0400, Christoph Hellwig wrote:
> > 
> > Can you submit this for xfstests?
> > 
> 
> This is actually something I wanted to ask you guys about.  There are
> a series of ext4-specific tests that I could potentially add, but I
> wasn't sure how welcome they would be in xfstests.  Assuming that
> ext4-specific tests would be welcome, is there a number range for
> these ext4-specific tests that I should use?

Dave actually has an outstanding series to move tests from the toplevel
directory to directories for categories.  We already have a lot of
btrfs-specific tests that have a separate directory, as well as xfs
specific ones, ext4 would just follow this model.  For this specific
test it actually seems fairly generic except for the commit interval,
so I'd love to run it for all filesystems, just setting the interval for
ext4.

> BTW, we have an extension to xfstests that we've been using inside
> Google where Google-internal tests have a "g" prefix (i.e., g001,
> g002, etc.).  That way we didn't need to worry about conflicts between
> newly added upstream xfstests, and ones which were added internally.
> Would it make sense to start using some kind of prefix such as "e001"
> for ext2/3/4 specific tests?

Can you take a look at Dave's series if that helps you?  I haven't
really reviewed it much myself yet, but I'll try to get to it ASAP.

Dave Chinner - Aug. 17, 2012, 10:55 p.m.
On Fri, Aug 17, 2012 at 05:05:27PM -0400, Christoph Hellwig wrote:
> On Fri, Aug 17, 2012 at 04:34:38PM -0400, Theodore Ts'o wrote:
> > On Fri, Aug 17, 2012 at 01:48:41PM -0400, Christoph Hellwig wrote:
> > > 
> > > Can you submit this for xfstests?
> > > 
> > 
> > This is actually something I wanted to ask you guys about.  There are
> > a series of ext4-specific tests that I could potentially add, but I
> > wasn't sure how welcome they would be in xfstests.  Assuming that
> > ext4-specific tests would be welcome, is there a number range for
> > these ext4-specific tests that I should use?
> 
> Dave actually has an outstanding series to move tests from the toplevel
> directory to directories for categories.

And a whole lot more stuff, like a separate results directory, being
able to run just a directory of tests rather than a group (e.g. just
run ext4 specific tests), being able to use names rather than
numbers for tests (not quite there yet), being able to exclude
different tests (e.g. for older distro testing with wont-fix bugs),
etc.

Basically, all those things I talked about at the LSF/MM conference
about making xfstests easier to use, develop and deploy for the wider
filesystem community are started in the patchsets here:

http://oss.sgi.com/archives/xfs/2012-07/msg00361.html
http://oss.sgi.com/archives/xfs/2012-07/msg00373.html

"This moves all the tests into a ./tests subdirectory, and sorts them into
classes of related tests. Those are:

        tests/generic:  valid for all filesystems
        tests/shared:   valid for a limited number of filesystems
        tests/xfs:      xfs specific tests
        tests/btrfs     btrfs specific tests
        tests/ext4      ext4 specific tests
        tests/udf       udf specific tests

Each directory has its own group file to determine what groups the
tests are associated with. Tests are run in exactly the same way as
before, but when trying to run individual tests you need to specify
the class as well. e.g. the old way:

# ./check 001

The new way:

# ./check generic/001

...."

> We already have a lot of
> btrfs-specific tests that have a separate directory, as well as xfs
> specific ones, ext4 would just follow this model.  For this specific
> test it actually seems fairly generic except for the commit interval,
> so I'd love to run it for all filesystems, just setting the interval for
> ext4.

Yeah, anything that is not deeply filesystem specific should be
written as a generic test so that it can run on all filesystems. If
it's mostly generic, with a small fs specific extension, that
extension is easy to do under a 'if [ $FSTYP = "ext4" ]; then'
branch....

> > BTW, we have an extension to xfstests that we've been using inside
> > Google where Google-internal tests have a "g" prefix (i.e., g001,
> > g002, etc.).  That way we didn't need to worry about conflicts between
> > newly added upstream xfstests, and ones which were added internally.
> > Would it make sense to start using some kind of prefix such as "e001"
> > for ext2/3/4 specific tests?

No.  The whole point of moving to multiple directories is to allow
easy extension for domain specific tests without having to hack up
the check script or play other games with test naming. Duplicate
names in different test subdirectories are most certainly allowed.

> Can you take a look at Dave's series if that helps you?  I haven't
> really reviewed it much myself yet, but I'll try to get to it ASAP.

Well, I'd appreciate it if somebody looked at it. It's been almost
a month since I posted it and all I've heard is crickets so far...

Cheers,

Dave.
Theodore Ts'o - Aug. 17, 2012, 11:11 p.m.
On Sat, Aug 18, 2012 at 08:55:24AM +1000, Dave Chinner wrote:
> 
> No.  The whole point of moving to multiple directories is to allow
> easy extension for domain specific tests without having to hack up
> the check script or play other games with test naming. Duplicate
> names in different test subdirectories are most certainly allowed.

Oh, I agree, using separate directories is *way* better than the hack
we're using internally.  The main benefit of what we did was the
patches were minimally intrusive....

> > Can you take a look at Dave's series if that helps you?  I haven't
> > really reviewed it much myself yet, but I'll try to get to it ASAP.
> 
> Well, I'd appreciate it if somebody looked at it. It's been almost
> a month since I posted it and all I've heard is crickets so far...

I definitely want to look at it, but realistically, I probably won't
have time until after San Diego....  I've been crazy busy lately.

Cheers,

						- Ted
Wu Fengguang - Aug. 21, 2012, 9:42 a.m.
On Sat, Aug 18, 2012 at 06:44:57AM +1000, NeilBrown wrote:
> On Fri, 17 Aug 2012 22:25:26 +0800 Fengguang Wu <fengguang.wu@intel.com>
> wrote:
> 
> > [CC md list]
> > 
> > On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> > > On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> > > > Ted,
> > > > 
> > > > I find ext4 write performance dropped by 3.3% on average in the
> > > > 3.6-rc1 merge window. xfs and btrfs are fine.
> > > > 
> > > > Two machines are tested. The performance regression happens in the
> > > > lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> > > > not see regression, which is equipped with HDD drives. I'll continue
> > > > to repeat the tests and report variations.
> > > 
> > > Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> > > fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> > > cause that.  There are the lock elimination changes for Direct I/O
> > > overwrites, but that shouldn't matter for your tests which are
> > > measuring buffered writes, correct?
> > > 
> > > Is there any chance you could do me a favor and do a git bisect
> > > restricted to commits involving fs/ext4 and fs/jbd2?
> > 
> > I noticed that the regressions all happen in the RAID0/RAID5 cases.
> > So it may be some interactions between the RAID/ext4 code?
> 
> I'm aware of some performance regression in RAID5 which I will be drilling
> down into next week.  Some things are faster, but some are slower :-(
> 
> RAID0 should be unchanged though - I don't think I've changed anything there.
> 
> Looking at your numbers, JBOD ranges from  +6.5% to -1.5%
>                         RAID0 ranges from  +4.0% to -19.2%
>                         RAID5 ranges from +20.7% to -38.7%
> 
> I'm guessing + is good and - is bad?

Yes.

> The RAID5 numbers don't surprise me.  The RAID0 do.

You are right. I did more tests and it's now obvious that RAID0 is
mostly fine. The major regressions are in the RAID5 10/100dd cases.
JBOD is performing better in 3.6.0-rc1 :-)

> > 
> > I'll try to get some ext2/3 numbers, which should have fewer changes on the fs side.
> 
> Thanks.  That will be useful.

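Here are the more complete results (change in TOTAL write_bw from 3.5.0
to 3.6.0-rc1+ for each storage layout, filesystem and dd count).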

   RAID5     ext4    100dd    -7.3%
   RAID5     ext4     10dd    -2.2%
   RAID5     ext4      1dd   +12.1%
   RAID5     ext3    100dd    -3.1%
   RAID5     ext3     10dd   -11.5%
   RAID5     ext3      1dd    +8.9%
   RAID5     ext2    100dd   -10.5%
   RAID5     ext2     10dd    -5.2%
   RAID5     ext2      1dd   +10.0%
   RAID0     ext4    100dd    +1.7%
   RAID0     ext4     10dd    -0.9%
   RAID0     ext4      1dd    -1.1%
   RAID0     ext3    100dd    -4.2%
   RAID0     ext3     10dd    -0.2%
   RAID0     ext3      1dd    -1.0%
   RAID0     ext2    100dd   +11.3%
   RAID0     ext2     10dd    +4.7%
   RAID0     ext2      1dd    -1.6%
    JBOD     ext4    100dd    +5.9%
    JBOD     ext4     10dd    +6.0%
    JBOD     ext4      1dd    +0.6%
    JBOD     ext3    100dd    +6.1%
    JBOD     ext3     10dd    +1.9%
    JBOD     ext3      1dd    +1.7%
    JBOD     ext2    100dd    +9.9%
    JBOD     ext2     10dd    +9.4%
    JBOD     ext2      1dd    +0.5%

wfg@bee /export/writeback% ./compare-groups 'RAID5 RAID0 JBOD' 'ext4 ext3 ext2' '100dd 10dd 1dd' lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}          
RAID5 ext4 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  167.97       -38.7%       103.03  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  130.42       -21.7%       102.06  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-2-3.5.0
                   83.45       +10.2%        91.96  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-3-3.5.0
                  105.97       +11.5%       118.12  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-4-3.5.0
                   71.18       -34.2%        46.82  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                   52.79        +1.1%        53.36  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-2-3.5.0
                   40.75        -5.1%        38.69  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-3-3.5.0
                   42.79       +14.5%        48.99  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-4-3.5.0
                  209.24       -23.6%       159.96  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  176.21       +11.3%       196.16  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-2-3.5.0
                  158.12        +3.7%       163.99  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-3-3.5.0
                  180.18        +6.4%       191.74  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-4-3.5.0
                 1419.08        -7.3%      1314.88  TOTAL write_bw

RAID5 ext4 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  243.67        -9.1%       221.41  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  212.84       +16.7%       248.39  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-2-3.5.0
                  145.84        -7.3%       135.25  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  124.61        +3.2%       128.65  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-2-3.5.0
                  243.73       -10.9%       217.28  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  229.35        -2.8%       222.82  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-2-3.5.0
                 1200.03        -2.2%      1173.81  TOTAL write_bw
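
The TOTAL row looks like it is computed from the summed bandwidths of the
group rather than by averaging the per-run percentages, which is worth
keeping in mind when reading the summary. A Python sketch using the six
RAID5 ext4 10dd rows above (variable names are chosen here for
illustration; the sums differ from the printed totals in the last digit,
presumably because the table rounds each row):

```python
# (3.5.0, 3.6.0-rc1+) write_bw pairs for the six RAID5 ext4 10dd runs above.
runs = [
    (243.67, 221.41),
    (212.84, 248.39),
    (145.84, 135.25),
    (124.61, 128.65),
    (243.73, 217.28),
    (229.35, 222.82),
]

base_total = sum(base for base, _ in runs)
new_total = sum(new for _, new in runs)

# Change of the summed bandwidth, which matches the TOTAL row.
total_pct = (new_total - base_total) / base_total * 100.0

# Plain mean of the per-run changes, shown only for contrast.
mean_pct = sum((new - base) / base * 100.0 for base, new in runs) / len(runs)

print(f"sum 3.5.0       = {base_total:.2f}")
print(f"sum 3.6.0-rc1+  = {new_total:.2f}")
print(f"change of sums  = {total_pct:+.1f}%")
print(f"mean of changes = {mean_pct:+.1f}%")
```

The two figures need not agree, because runs with higher absolute bandwidth
weigh more in the sum.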

RAID5 ext4 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  248.98       +12.2%       279.33  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  208.45       +14.1%       237.86  lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                  255.22        +6.7%       272.35  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  243.09       +20.7%       293.30  lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  214.25        +5.6%       226.32  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  207.16       +13.4%       234.98  lkp-nex04/RAID5-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                 1377.15       +12.1%      1544.14  TOTAL write_bw

RAID5 ext3 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                   72.75        -5.8%        68.50  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-100dd-1-3.5.0
                   52.04        +0.8%        52.45  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-100dd-2-3.5.0
                   48.85       +19.2%        58.21  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-100dd-3-3.5.0
                   47.04        +9.4%        51.44  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-100dd-4-3.5.0
                   53.89        -7.4%        49.90  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-100dd-1-3.5.0
                   43.00       -10.7%        38.39  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-100dd-2-3.5.0
                   37.82        +0.8%        38.11  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-100dd-3-3.5.0
                   39.59        -4.0%        38.02  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-100dd-4-3.5.0
                   54.45       -15.0%        46.26  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-100dd-1-3.5.0
                   45.81        -4.5%        43.77  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-100dd-2-3.5.0
                   51.20       -12.6%        44.75  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-100dd-3-3.5.0
                   47.39        -3.9%        45.53  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-100dd-4-3.5.0
                  593.84        -3.1%       575.32  TOTAL write_bw

RAID5 ext3 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                   50.29       -10.2%        45.14  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-10dd-1-3.5.0
                   48.46        -7.1%        45.04  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-10dd-2-3.5.0
                   67.11       -17.8%        55.16  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-10dd-1-3.5.0
                   75.45       -28.2%        54.21  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-10dd-2-3.5.0
                   42.08        +6.4%        44.78  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-10dd-1-3.5.0
                   40.48        +4.8%        42.44  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-10dd-2-3.5.0
                  323.87       -11.5%       286.76  TOTAL write_bw

RAID5 ext3 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  190.20       +14.5%       217.69  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-1dd-1-3.5.0
                  192.30        +9.4%       210.43  lkp-nex04/RAID5-12HDD-thresh=1000M/ext3-1dd-2-3.5.0
                  193.63       +14.0%       220.64  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-1dd-1-3.5.0
                  224.33        -6.8%       209.07  lkp-nex04/RAID5-12HDD-thresh=100M/ext3-1dd-2-3.5.0
                  188.30       +14.6%       215.83  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-1dd-1-3.5.0
                  179.04       +10.7%       198.17  lkp-nex04/RAID5-12HDD-thresh=8G/ext3-1dd-2-3.5.0
                 1167.79        +8.9%      1271.83  TOTAL write_bw

RAID5 ext2 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                   76.75       -17.9%        62.98  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-100dd-1-3.5.0
                   72.32        -7.7%        66.73  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-100dd-2-3.5.0
                   56.48        +2.2%        57.75  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-100dd-3-3.5.0
                   56.81        -1.8%        55.81  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-100dd-4-3.5.0
                   58.58        -5.0%        55.67  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-100dd-1-3.5.0
                   60.02        -3.1%        58.15  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-100dd-2-3.5.0
                   54.01        -9.1%        49.12  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-100dd-3-3.5.0
                   61.00       -22.3%        47.38  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-100dd-4-3.5.0
                   50.98       -15.2%        43.22  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-100dd-1-3.5.0
                   49.52       -14.8%        42.18  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-100dd-2-3.5.0
                   48.35       -17.2%        40.04  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-100dd-3-3.5.0
                   49.14       -14.8%        41.88  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-100dd-4-3.5.0
                  693.96       -10.5%       620.90  TOTAL write_bw

RAID5 ext2 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                   46.88        -0.5%        46.67  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-10dd-1-3.5.0
                   49.98        -8.8%        45.59  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-10dd-2-3.5.0
                   45.01        +0.7%        45.32  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-10dd-1-3.5.0
                   84.88       -25.5%        63.27  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-10dd-2-3.5.0
                   44.49       +15.9%        51.56  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-10dd-1-3.5.0
                   43.73        +5.6%        46.19  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-10dd-2-3.5.0
                  314.97        -5.2%       298.60  TOTAL write_bw

RAID5 ext2 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  234.85        +7.6%       252.80  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-1dd-1-3.5.0
                  250.77       +17.9%       295.65  lkp-nex04/RAID5-12HDD-thresh=1000M/ext2-1dd-2-3.5.0
                  205.84        +4.9%       215.93  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-1dd-1-3.5.0
                  213.89        +7.2%       229.37  lkp-nex04/RAID5-12HDD-thresh=100M/ext2-1dd-2-3.5.0
                  217.70       +13.1%       246.25  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-1dd-1-3.5.0
                  241.22        +8.3%       261.19  lkp-nex04/RAID5-12HDD-thresh=8G/ext2-1dd-2-3.5.0
                 1364.27       +10.0%      1501.19  TOTAL write_bw

RAID0 ext4 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  675.08       -10.5%       604.29  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  640.40        -0.8%       635.21  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-2-3.5.0
                  370.03        +4.9%       388.06  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-3-3.5.0
                  376.90        +6.1%       399.96  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-4-3.5.0
                  709.76       -15.7%       598.44  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  399.91       +52.7%       610.76  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-2-3.5.0
                  342.58        +6.3%       364.24  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-3-3.5.0
                  300.55       +24.6%       374.34  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-4-3.5.0
                  699.77       -19.2%       565.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  582.28        -1.5%       573.28  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-2-3.5.0
                  491.00        +8.4%       532.13  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-3-3.5.0
                  485.84        +9.5%       532.12  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-4-3.5.0
                 6074.09        +1.7%      6178.38  TOTAL write_bw

RAID0 ext4 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  676.52        -2.7%       658.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  626.18        -3.2%       606.21  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-2-3.5.0
                  681.39        -2.1%       667.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  630.30        +3.4%       651.81  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-2-3.5.0
                  675.79        -1.9%       663.17  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  665.04        +1.3%       673.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-2-3.5.0
                 3955.21        -0.9%      3920.37  TOTAL write_bw

RAID0 ext4 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  512.70        +4.0%       533.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  524.61        -0.3%       522.90  lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                  524.16        +0.8%       528.25  lkp-nex04/RAID0-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  484.84        -7.4%       448.83  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  470.40        -3.2%       455.31  lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                 2516.71        -1.1%      2488.51  TOTAL write_bw

RAID0 ext3 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  500.94        -4.0%       481.11  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-100dd-1-3.5.0
                  494.13        -4.3%       473.13  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-100dd-2-3.5.0
                  513.57        -9.2%       466.07  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-100dd-3-3.5.0
                  490.19        -5.9%       461.42  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-100dd-4-3.5.0
                  511.08        -2.9%       496.04  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-100dd-1-3.5.0
                  520.57        -7.6%       480.95  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-100dd-2-3.5.0
                  523.62        -5.2%       496.52  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-100dd-3-3.5.0
                  497.72        -0.1%       497.16  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-100dd-4-3.5.0
                  470.99        -5.0%       447.64  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-100dd-1-3.5.0
                  444.63        +2.0%       453.54  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-100dd-2-3.5.0
                  448.25        -4.7%       427.18  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-100dd-3-3.5.0
                  475.57        -3.1%       460.84  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-100dd-4-3.5.0
                 5891.26        -4.2%      5641.62  TOTAL write_bw

RAID0 ext3 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  560.26        +2.8%       576.15  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-10dd-1-3.5.0
                  583.44        +0.5%       586.08  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-10dd-2-3.5.0
                  566.37        -3.2%       548.19  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-10dd-1-3.5.0
                  579.37        -2.1%       567.13  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-10dd-2-3.5.0
                  623.24        +0.2%       624.71  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-10dd-1-3.5.0
                  624.26        +0.7%       628.74  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-10dd-2-3.5.0
                 3536.93        -0.2%      3531.00  TOTAL write_bw

RAID0 ext3 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  351.69        -2.3%       343.73  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-1dd-1-3.5.0
                  355.50        -4.1%       340.77  lkp-nex04/RAID0-12HDD-thresh=1000M/ext3-1dd-2-3.5.0
                  383.34        +0.9%       386.96  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-1dd-1-3.5.0
                  385.74        +1.3%       390.69  lkp-nex04/RAID0-12HDD-thresh=100M/ext3-1dd-2-3.5.0
                  315.53        -0.2%       314.81  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-1dd-1-3.5.0
                  319.52        -1.9%       313.36  lkp-nex04/RAID0-12HDD-thresh=8G/ext3-1dd-2-3.5.0
                 2111.31        -1.0%      2090.32  TOTAL write_bw

RAID0 ext2 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  694.06        +0.0%       694.22  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-100dd-1-3.5.0
                  693.19        -0.1%       692.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-100dd-2-3.5.0
                  686.16        +1.2%       694.35  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-100dd-3-3.5.0
                  691.17        -0.2%       690.13  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-100dd-4-3.5.0
                  668.16        +1.3%       677.07  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-100dd-1-3.5.0
                  404.60       +62.9%       658.90  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-100dd-2-3.5.0
                  346.48       +81.1%       627.62  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-100dd-3-3.5.0
                  373.48       +71.3%       639.85  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-100dd-4-3.5.0
                  691.96        +0.2%       693.29  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-100dd-1-3.5.0
                  690.73        +0.5%       694.40  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-100dd-2-3.5.0
                  692.65        +0.1%       693.27  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-100dd-3-3.5.0
                  690.10        +0.2%       691.71  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-100dd-4-3.5.0
                 7322.74       +11.3%      8147.20  TOTAL write_bw

RAID0 ext2 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  247.29       +23.2%       304.58  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-10dd-1-3.5.0
                  697.35        +0.4%       700.38  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-10dd-2-3.5.0
                  662.14        +1.8%       673.83  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-10dd-1-3.5.0
                  613.81       +10.0%       675.44  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-10dd-2-3.5.0
                  337.37        +5.5%       355.95  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-10dd-1-3.5.0
                  682.57        +0.0%       682.90  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-10dd-2-3.5.0
                 3240.53        +4.7%      3393.07  TOTAL write_bw

RAID0 ext2 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  526.72        -4.1%       505.29  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-1dd-1-3.5.0
                  516.77        -0.6%       513.81  lkp-nex04/RAID0-12HDD-thresh=1000M/ext2-1dd-2-3.5.0
                  617.83        +0.3%       619.45  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-1dd-1-3.5.0
                  617.49        +0.6%       621.09  lkp-nex04/RAID0-12HDD-thresh=100M/ext2-1dd-2-3.5.0
                  502.60        -1.0%       497.39  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-1dd-1-3.5.0
                  504.82        -5.7%       475.86  lkp-nex04/RAID0-12HDD-thresh=8G/ext2-1dd-2-3.5.0
                 3286.22        -1.6%      3232.89  TOTAL write_bw

JBOD ext4 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  720.62        -1.5%       710.16  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
                  469.82       +14.3%       536.78  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-2-3.5.0
                  666.90        -2.6%       649.61  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-3-3.5.0
                  343.93       +24.1%       426.86  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-4-3.5.0
                  779.52        +6.5%       830.11  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
                  457.65        -1.4%       451.18  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-2-3.5.0
                  739.08        +5.7%       781.16  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-3-3.5.0
                  332.98        -9.6%       301.13  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-4-3.5.0
                  705.26        -1.2%       696.61  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-1-3.5.0
                  565.73       +16.8%       660.76  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-2-3.5.0
                  647.47        +5.3%       681.63  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-3-3.5.0
                  416.22       +25.5%       522.50  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-4-3.5.0
                 6845.19        +5.9%      7248.49  TOTAL write_bw

JBOD ext4 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  706.04        -0.0%       705.86  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
                  525.34       +12.1%       589.06  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-2-3.5.0
                  646.70        +4.9%       678.59  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-1-3.5.0
                  335.12       +25.1%       419.10  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-2-3.5.0
                  703.37        +0.1%       703.76  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-1-3.5.0
                  665.60        +5.4%       701.85  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-2-3.5.0
                 3582.17        +6.0%      3798.23  TOTAL write_bw

JBOD ext4 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  702.86        -0.2%       701.74  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
                  702.41        -0.0%       702.06  lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
                  704.49        +2.6%       723.00  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
                  704.21        +1.2%       712.47  lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-2-3.5.0
                  701.66        -0.1%       700.83  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-1-3.5.0
                  701.17        +0.0%       701.36  lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-2-3.5.0
                 4216.81        +0.6%      4241.46  TOTAL write_bw

JBOD ext3 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  683.15        -3.8%       657.31  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-100dd-1-3.5.0
                  711.48        -0.1%       710.83  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-100dd-2-3.5.0
                  677.50        +0.0%       677.62  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-100dd-3-3.5.0
                  713.31        -0.5%       709.88  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-100dd-4-3.5.0
                  648.70       +16.1%       753.15  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-100dd-1-3.5.0
                  633.24       +26.5%       801.02  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-100dd-2-3.5.0
                  568.48       +23.8%       703.49  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-100dd-3-3.5.0
                  680.59       +21.6%       827.77  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-100dd-4-3.5.0
                  656.73        -0.9%       651.09  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-100dd-1-3.5.0
                  697.29        -0.3%       695.49  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-100dd-2-3.5.0
                  669.99        -1.9%       657.24  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-100dd-3-3.5.0
                  697.73        -2.1%       683.17  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-100dd-4-3.5.0
                 8038.18        +6.1%      8528.06  TOTAL write_bw

JBOD ext3 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  669.69        -1.0%       663.26  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-10dd-1-3.5.0
                  704.60        +0.3%       707.03  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-10dd-2-3.5.0
                  629.95        +3.0%       648.55  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-10dd-1-3.5.0
                  616.65        +9.6%       676.08  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-10dd-2-3.5.0
                  691.77        +0.6%       695.88  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-10dd-1-3.5.0
                  706.05        -0.2%       704.95  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-10dd-2-3.5.0
                 4018.71        +1.9%      4095.75  TOTAL write_bw

JBOD ext3 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  700.88        +0.1%       701.30  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-1dd-1-3.5.0
                  699.78        +0.1%       700.27  lkp-nex04/JBOD-12HDD-thresh=1000M/ext3-1dd-2-3.5.0
                  700.07        -0.0%       699.98  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-1dd-1-3.5.0
                  598.09       +11.7%       668.09  lkp-nex04/JBOD-12HDD-thresh=100M/ext3-1dd-2-3.5.0
                  700.53        -0.0%       700.47  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-1dd-1-3.5.0
                  700.65        +0.1%       701.43  lkp-nex04/JBOD-12HDD-thresh=8G/ext3-1dd-2-3.5.0
                 4100.00        +1.7%      4171.54  TOTAL write_bw

JBOD ext2 100dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  644.32        +6.1%       683.56  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-100dd-1-3.5.0
                  558.44       +16.9%       652.63  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-100dd-2-3.5.0
                  443.68       +30.1%       577.18  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-100dd-3-3.5.0
                  449.28       +36.8%       614.49  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-100dd-4-3.5.0
                  526.02       +10.2%       579.52  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-100dd-1-3.5.0
                  442.03       +10.6%       488.71  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-100dd-2-3.5.0
                  375.04        -5.5%       354.36  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-100dd-3-3.5.0
                  365.83        +3.9%       379.96  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-100dd-4-3.5.0
                  693.56        +0.8%       699.06  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-100dd-1-3.5.0
                  661.00        +3.3%       682.82  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-100dd-2-3.5.0
                  584.28        +9.1%       637.22  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-100dd-3-3.5.0
                  657.01        +4.2%       684.28  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-100dd-4-3.5.0
                 6400.48        +9.9%      7033.79  TOTAL write_bw

JBOD ext2 10dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  662.51        +3.4%       685.05  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-10dd-1-3.5.0
                  665.07        +3.2%       686.05  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-10dd-2-3.5.0
                  431.26       +26.9%       547.16  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-10dd-1-3.5.0
                  397.42       +39.4%       553.83  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-10dd-2-3.5.0
                  685.99        +0.9%       691.90  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-10dd-1-3.5.0
                  685.68        +1.2%       693.94  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-10dd-2-3.5.0
                 3527.93        +9.4%      3857.93  TOTAL write_bw

JBOD ext2 1dd
                   3.5.0                3.6.0-rc1+
------------------------  ------------------------
                  718.45        -0.1%       717.89  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-1dd-1-3.5.0
                  717.25        +0.2%       718.89  lkp-nex04/JBOD-12HDD-thresh=1000M/ext2-1dd-2-3.5.0
                  686.97        +1.0%       693.82  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-1dd-1-3.5.0
                  683.79        +1.5%       694.03  lkp-nex04/JBOD-12HDD-thresh=100M/ext2-1dd-2-3.5.0
                  699.79        +0.1%       700.41  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-1dd-1-3.5.0
                  700.22        +0.1%       701.15  lkp-nex04/JBOD-12HDD-thresh=8G/ext2-1dd-2-3.5.0
                 4206.46        +0.5%      4226.19  TOTAL write_bw
Wu Fengguang - Aug. 21, 2012, 12:07 p.m.
On Tue, Aug 21, 2012 at 05:42:21PM +0800, Fengguang Wu wrote:
> On Sat, Aug 18, 2012 at 06:44:57AM +1000, NeilBrown wrote:
> > On Fri, 17 Aug 2012 22:25:26 +0800 Fengguang Wu <fengguang.wu@intel.com>
> > wrote:
> > 
> > > [CC md list]
> > > 
> > > On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> > > > On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> > > > > Ted,
> > > > > 
> > > > > I find ext4 write performance dropped by 3.3% on average in the
> > > > > 3.6-rc1 merge window. xfs and btrfs are fine.
> > > > > 
> > > > > Two machines are tested. The performance regression happens in the
> > > > > lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> > > > > not see regression, which is equipped with HDD drives. I'll continue
> > > > > to repeat the tests and report variations.
> > > > 
> > > > Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> > > > fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> > > > cause that.  There are the lock elimination changes for Direct I/O
> > > > overwrites, but that shouldn't matter for your tests which are
> > > > measuring buffered writes, correct?
> > > > 
> > > > Is there any chance you could do me a favor and do a git bisect
> > > > restricted to commits involving fs/ext4 and fs/jbd2?
> > > 
> > > I noticed that the regressions all happen in the RAID0/RAID5 cases.
> > > So it may be some interactions between the RAID/ext4 code?
> > 
> > I'm aware of some performance regression in RAID5 which I will be drilling
> > down into next week.  Some things are faster, but some are slower :-(
> > 
> > RAID0 should be unchanged though - I don't think I've changed anything there.
> > 
> > Looking at your numbers, JBOD ranges from  +6.5% to -1.5%
> >                         RAID0 ranges from  +4.0% to -19.2%
> >                         RAID5 ranges from +20.7% to -39.7%
> > 
> > I'm guessing + is good and - is bad?
> 
> Yes.
> 
> > The RAID5 numbers don't surprise me.  The RAID0 do.
> 
> You are right. I did more tests and it's now obvious that RAID0 is
> mostly fine. The major regressions are in the RAID5 10/100dd cases.
> JBOD is performing better in 3.6.0-rc1 :-)
> 
> > > 
> > > I'll try to get some ext2/3 numbers, which should have less changes on the fs side.
> > 
> > Thanks.  That will be useful.
> 
> Here are the more complete results.
> 
>    RAID5     ext4    100dd    -7.3%
>    RAID5     ext4     10dd    -2.2%
>    RAID5     ext4      1dd   +12.1%
>    RAID5     ext3    100dd    -3.1%
>    RAID5     ext3     10dd   -11.5%
>    RAID5     ext3      1dd    +8.9%
>    RAID5     ext2    100dd   -10.5%
>    RAID5     ext2     10dd    -5.2%
>    RAID5     ext2      1dd   +10.0%
>    RAID0     ext4    100dd    +1.7%
>    RAID0     ext4     10dd    -0.9%
>    RAID0     ext4      1dd    -1.1%
>    RAID0     ext3    100dd    -4.2%
>    RAID0     ext3     10dd    -0.2%
>    RAID0     ext3      1dd    -1.0%
>    RAID0     ext2    100dd   +11.3%
>    RAID0     ext2     10dd    +4.7%
>    RAID0     ext2      1dd    -1.6%
>     JBOD     ext4    100dd    +5.9%
>     JBOD     ext4     10dd    +6.0%
>     JBOD     ext4      1dd    +0.6%
>     JBOD     ext3    100dd    +6.1%
>     JBOD     ext3     10dd    +1.9%
>     JBOD     ext3      1dd    +1.7%
>     JBOD     ext2    100dd    +9.9%
>     JBOD     ext2     10dd    +9.4%
>     JBOD     ext2      1dd    +0.5%

And here are the xfs/btrfs results. Very impressive RAID5 improvements!

   RAID5    btrfs    100dd   +25.8%
   RAID5    btrfs     10dd   +21.3%
   RAID5    btrfs      1dd   +14.3%
   RAID5      xfs    100dd   +32.8%
   RAID5      xfs     10dd   +21.5%
   RAID5      xfs      1dd   +25.2%
   RAID0    btrfs    100dd    -7.4%
   RAID0    btrfs     10dd    -0.2%
   RAID0    btrfs      1dd    -2.8%
   RAID0      xfs    100dd   +18.8%
   RAID0      xfs     10dd    +0.0%
   RAID0      xfs      1dd    +3.8%
    JBOD    btrfs    100dd    -0.0%
    JBOD    btrfs     10dd    +2.3%
    JBOD    btrfs      1dd    -0.1%
    JBOD      xfs    100dd    +8.3%
    JBOD      xfs     10dd    +4.1%
    JBOD      xfs      1dd    +0.1%

Thanks,
Fengguang
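
For readers checking the numbers: the percentages in the summary tables
appear to be computed from each group's "TOTAL write_bw" row (the relative
change from 3.5.0 to 3.6.0-rc1+), not as an average of the per-run
percentages. A minimal sketch under that assumption; the function names are
illustrative and are not taken from the ./compare script:

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def format_delta(old: float, new: float) -> str:
    """Format the change the way the tables do, e.g. '-10.5%'."""
    return f"{percent_change(old, new):+.1f}%"


# TOTAL write_bw rows copied from the tables above: (3.5.0, 3.6.0-rc1+).
totals = {
    "RAID5 ext2 100dd": (693.96, 620.90),
    "RAID0 ext2 100dd": (7322.74, 8147.20),
    "JBOD ext4 10dd": (3582.17, 3798.23),
}

for name, (old, new) in totals.items():
    print(f"{name:<18}{format_delta(old, new):>8}")
```

The three results agree with the corresponding lines of the summary
(-10.5%, +11.3% and +6.0%).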
Shaohua Li - Aug. 22, 2012, 4:07 a.m.
On 8/22/12 11:57 AM, Yuanhan Liu wrote:
>  On Fri, Aug 17, 2012 at 10:25:26PM +0800, Fengguang Wu wrote:
> > [CC md list]
> >
> > On Fri, Aug 17, 2012 at 09:40:39AM -0400, Theodore Ts'o wrote:
> >> On Fri, Aug 17, 2012 at 02:09:15PM +0800, Fengguang Wu wrote:
> >>> Ted,
> >>>
> >>> I find ext4 write performance dropped by 3.3% on average in the
> >>> 3.6-rc1 merge window. xfs and btrfs are fine.
> >>>
> >>> Two machines are tested. The performance regression happens in the
> >>> lkp-nex04 machine, which is equipped with 12 SSD drives. lkp-st02 does
> >>> not see regression, which is equipped with HDD drives. I'll continue
> >>> to repeat the tests and report variations.
> >>
> >> Hmm... I've checked out the commits in "git log v3.5..v3.6-rc1 --
> >> fs/ext4 fs/jbd2" and I don't see anything that I would expect would
> >> cause that. There are the lock elimination changes for Direct I/O
> >> overwrites, but that shouldn't matter for your tests which are
> >> measuring buffered writes, correct?
> >>
> >> Is there any chance you could do me a favor and do a git bisect
> >> restricted to commits involving fs/ext4 and fs/jbd2?
> >
> > I noticed that the regressions all happen in the RAID0/RAID5 cases.
> > So it may be some interactions between the RAID/ext4 code?
> >
> > I'll try to get some ext2/3 numbers, which should have less changes on the fs side.
> >
> > wfg@bee /export/writeback% ./compare -g ext4 lkp-nex04/*/*-{3.5.0,3.6.0-rc1+}
> > 3.5.0 3.6.0-rc1+
> > ------------------------ ------------------------
> > 720.62 -1.5% 710.16 lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
> > 706.04 -0.0% 705.86 lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
> > 702.86 -0.2% 701.74 lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
> > 702.41 -0.0% 702.06 lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
> > 779.52 +6.5% 830.11 lkp-nex04/JBOD-12HDD-thresh=100M/ext4-100dd-1-3.5.0
> > 646.70 +4.9% 678.59 lkp-nex04/JBOD-12HDD-thresh=100M/ext4-10dd-1-3.5.0
> > 704.49 +2.6% 723.00 lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-1-3.5.0
> > 704.21 +1.2% 712.47 lkp-nex04/JBOD-12HDD-thresh=100M/ext4-1dd-2-3.5.0
> > 705.26 -1.2% 696.61 lkp-nex04/JBOD-12HDD-thresh=8G/ext4-100dd-1-3.5.0
> > 703.37 +0.1% 703.76 lkp-nex04/JBOD-12HDD-thresh=8G/ext4-10dd-1-3.5.0
> > 701.66 -0.1% 700.83 lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-1-3.5.0
> > 701.17 +0.0% 701.36 lkp-nex04/JBOD-12HDD-thresh=8G/ext4-1dd-2-3.5.0
> > 675.08 -10.5% 604.29 lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
> > 676.52 -2.7% 658.38 lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
> > 512.70 +4.0% 533.22 lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
> > 524.61 -0.3% 522.90 lkp-nex04/RAID0-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
> > 709.76 -15.7% 598.44 lkp-nex04/RAID0-12HDD-thresh=100M/ext4-100dd-1-3.5.0
> > 681.39 -2.1% 667.25 lkp-nex04/RAID0-12HDD-thresh=100M/ext4-10dd-1-3.5.0
> > 524.16 +0.8% 528.25 lkp-nex04/RAID0-12HDD-thresh=100M/ext4-1dd-2-3.5.0
> > 699.77 -19.2% 565.54 lkp-nex04/RAID0-12HDD-thresh=8G/ext4-100dd-1-3.5.0
> > 675.79 -1.9% 663.17 lkp-nex04/RAID0-12HDD-thresh=8G/ext4-10dd-1-3.5.0
> > 484.84 -7.4% 448.83 lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-1-3.5.0
> > 470.40 -3.2% 455.31 lkp-nex04/RAID0-12HDD-thresh=8G/ext4-1dd-2-3.5.0
> > 167.97 -38.7% 103.03 lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-100dd-1-3.5.0
> > 243.67 -9.1% 221.41 lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-10dd-1-3.5.0
> > 248.98 +12.2% 279.33 lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-1-3.5.0
> > 208.45 +14.1% 237.86 lkp-nex04/RAID5-12HDD-thresh=1000M/ext4-1dd-2-3.5.0
> > 71.18 -34.2% 46.82 lkp-nex04/RAID5-12HDD-thresh=100M/ext4-100dd-1-3.5.0
> > 145.84 -7.3% 135.25 lkp-nex04/RAID5-12HDD-thresh=100M/ext4-10dd-1-3.5.0
> > 255.22 +6.7% 272.35 lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-1-3.5.0
> > 243.09 +20.7% 293.30 lkp-nex04/RAID5-12HDD-thresh=100M/ext4-1dd-2-3.5.0
> > 209.24 -23.6% 159.96 lkp-nex04/RAID5-12HDD-thresh=8G/ext4-100dd-1-3.5.0
> > 243.73 -10.9% 217.28 lkp-nex04/RAID5-12HDD-thresh=8G/ext4-10dd-1-3.5.0
>
>  Hi,
>
>  I did some investigation into this issue and found that we are blocked in
>  get_active_stripe() most of the time. That is understandable, since
>  max_nr_stripes is currently set to 256, which is rather small, so I tried
>  different values. Please see the following patch for detailed numbers.
>
>  The test machine is same as above.
>
>  From 85c27fca12b770da5bc8ec9f26a22cb414e84c68 Mon Sep 17 00:00:00 2001
>  From: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>  Date: Wed, 22 Aug 2012 10:51:48 +0800
>  Subject: [RFC PATCH] md/raid5: increase NR_STRIPES to 1024
>
>  A stripe head is a resource that must be held before doing any IO, and
>  the number of stripe heads is limited to 256 by default. In the 10dd
>  case, we found that writers are blocked in get_active_stripe() most of
>  the time (please see the ps output attached).
>
>  Thus I tried different values for NR_STRIPES, and here are some
>  numbers (ext4 only) I got with each setting:
>
>  write bandwidth:
>  ================
>  3.5.0-rc1-256+: (here 256 means the maximum number of stripe heads is set to 256)
>  write bandwidth: 280
>  3.5.0-rc1-1024+:
>  write bandwidth: 421 (+50.4%)
>  3.5.0-rc1-4096+:
>  write bandwidth: 506 (+80.7%)
>  3.5.0-rc1-32768+:
>  write bandwidth: 615 (+119.6%)
>
>  (Here 'sh' means with Shaohua's "multiple threads to handle strips" patch [0])
>  3.5.0-rc3-strip-sh+-256:
>  write bandwidth: 465
>
>  3.5.0-rc3-strip-sh+-1024:
>  write bandwidth: 599
>
>  3.5.0-rc3-strip-sh+-32768:
>  write bandwidth: 615
>
>  The kernel may be a bit old, but I think the data are still valid.
>  I haven't tried Shaohua's latest patch, though.
>
>  As you can see from the data above, the write bandwidth increases
>  (a lot) as we increase NR_STRIPES: the bigger NR_STRIPES is, the
>  better the write bandwidth we get. But we can't set NR_STRIPES to too
>  large a number, especially by default, or it would need lots of memory.
>  Given the numbers I got with Shaohua's patch applied, I guess 1024 would
>  be a nice value; it's not too big, yet we gain more than 110% in
>  performance.
>
>  Comments? BTW, I have a more flexible (and, at the same time, more stupid)
>  idea: change max_nr_stripes dynamically based on need?
>
>  Here I also attached more data: the script I used to get those numbers,
>  the ps output, and the iostat -kx 3 output.
>
>  The script does its job in a straightforward way: it starts NR dd
>  processes in the background, traces the writeback/global_dirty_state
>  event in the background to compute the write bandwidth, and samples the
>  ps output regularly.
>
>  ---
>  [0]: patch: http://lwn.net/Articles/500200/
>
>  Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>  ---
>  drivers/md/raid5.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
>  diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>  index adda94d..82dca53 100644
>  --- a/drivers/md/raid5.c
>  +++ b/drivers/md/raid5.c
>  @@ -62,7 +62,7 @@
>  * Stripe cache
>  */
>
>  -#define NR_STRIPES 256
>  +#define NR_STRIPES 1024
>  #define STRIPE_SIZE PAGE_SIZE
>  #define STRIPE_SHIFT (PAGE_SHIFT - 9)
>  #define STRIPE_SECTORS (STRIPE_SIZE>>9)

Does reverting commit 8811b5968f6216e fix the problem?

Thanks,
Shaohua

Neil Brown - Aug. 22, 2012, 6 a.m.
On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

>  
> -#define NR_STRIPES		256
> +#define NR_STRIPES		1024

Changing one magic number into another magic number might help your case, but
it is not really a general solution.

Possibly making sure that max_nr_stripes is at least some multiple of the
chunk size might make sense, but I wouldn't want to see a very large multiple.

I think the problems with RAID5 are deeper than that.  Hopefully I'll figure
out exactly what the best fix is soon - I'm trying to look into it.

I don't think the size of the cache is a big part of the solution.  I think
correct scheduling of IO is the real answer.

Thanks,
NeilBrown
Yuanhan Liu - Aug. 22, 2012, 6:31 a.m.
On Wed, Aug 22, 2012 at 04:00:25PM +1000, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> 
> >  
> > -#define NR_STRIPES		256
> > +#define NR_STRIPES		1024
> 
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.

Agreed.

> 
> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
> 
> I think the problems with RAID5 are deeper than that.  Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.
> 
> I don't think the size of the cache is a big part of the solution.  I think
> correct scheduling of IO is the real answer.

Yes, it should not be. But with less max_nr_stripes, the chance to get a
full strip write is less, and maybe that's the reason why the chance to
block at get_active_strip() is more; and also, the reading is more.

The perfect case would be no reading at all; with max_nr_stripes set
to 32768 (the maximum we can set now), you will find there is very little
reading (almost zero; please see the iostat output I attached in my earlier
email).
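
To illustrate why a full-stripe write needs no reading, here is a simplified
user-space sketch, not the md implementation. It assumes single parity
(RAID5), that nothing is already cached, and that the cheaper of the two
classic strategies is chosen; rmw_reads(), rcw_reads() and reads_for_write()
are hypothetical names.

```c
#include <assert.h>

/* Read-modify-write: read the old copy of each written block, plus parity. */
static unsigned int rmw_reads(unsigned int written)
{
	return written + 1;
}

/* Reconstruct-write: read every data block that is NOT being written. */
static unsigned int rcw_reads(unsigned int data_disks, unsigned int written)
{
	return data_disks - written;
}

/*
 * Reads needed to update parity when `written` of the `data_disks` data
 * blocks in one stripe are overwritten, taking the cheaper strategy.
 */
static unsigned int reads_for_write(unsigned int data_disks,
				    unsigned int written)
{
	unsigned int rmw, rcw;

	assert(written >= 1 && written <= data_disks);

	rmw = rmw_reads(written);
	rcw = rcw_reads(data_disks, written);

	return rmw < rcw ? rmw : rcw;
}
```

For four data disks, writing all four blocks needs zero reads, while writing a
single block needs two (the old data and the old parity).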

Anyway, I do agree this should not be the big part of the solution. If
we can handle those stripes faster, I guess 256 would be enough.

Thanks,
Yuanhan Liu
Andreas Dilger - Aug. 22, 2012, 7:14 a.m.
On 2012-08-22, at 12:00 AM, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
>> 
>> -#define NR_STRIPES		256
>> +#define NR_STRIPES		1024
> 
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.

We've actually been carrying a patch for a few years in Lustre to
increase NR_STRIPES to 2048 and make it a configurable module
parameter.  This made a noticeable improvement to performance
on fast systems.

> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
> 
> I think the problems with RAID5 are deeper than that.  Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.

The other MD RAID-5/6 patches that we have change the page submission
order to avoid the need to merge pages in the elevator so much, and a
patch to allow zero-copy IO submission if the caller marks the page for
direct IO (indicating it will not be modified until after IO completes).
This avoids a lot of overhead on fast systems.

This isn't really my area of expertise, but patches against RHEL6
could be seen at http://review.whamcloud.com/1142 if you want to
take a look.  I don't know if that code is at all relevant to what
is in 3.x today.

> I don't think the size of the cache is a big part of the solution.  I think
> correct scheduling of IO is the real answer.

My experience is that on fast systems the IO scheduler just gets in the
way.  Submitting larger contiguous IOs to each disk in the first place
is far better than trying to merge small IOs again at the back end.

Cheers, Andreas
Dan Williams - Aug. 22, 2012, 8:47 p.m.
On Tue, Aug 21, 2012 at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
>
>>
>> -#define NR_STRIPES           256
>> +#define NR_STRIPES           1024
>
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.
>
> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that.  Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.
>
> I don't think the size of the cache is a big part of the solution.  I think
> correct scheduling of IO is the real answer.

Not sure if this is what we are seeing here, but we still have the
unresolved fast parity effect whereby slower parity calculation gives
a larger time to coalesce writes.  I saw this effect when playing with
xor offload.

--
Dan
Neil Brown - Aug. 22, 2012, 9:59 p.m.
On Wed, 22 Aug 2012 13:47:07 -0700 Dan Williams <djbw@fb.com> wrote:

> On Tue, Aug 21, 2012 at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> > On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > wrote:
> >
> >>
> >> -#define NR_STRIPES           256
> >> +#define NR_STRIPES           1024
> >
> > Changing one magic number into another magic number might help your case, but
> > it is not really a general solution.
> >
> > Possibly making sure that max_nr_stripes is at least some multiple of the
> > chunk size might make sense, but I wouldn't want to see a very large multiple.
> >
> > I think the problems with RAID5 are deeper than that.  Hopefully I'll figure
> > out exactly what the best fix is soon - I'm trying to look into it.
> >
> > I don't think the size of the cache is a big part of the solution.  I think
> > correct scheduling of IO is the real answer.
> 
> Not sure if this is what we are seeing here, but we still have the
> unresolved fast parity effect whereby slower parity calculation gives
> a larger time to coalesce writes.  I saw this effect when playing with
> xor offload.

I did find a case where inserting a printk made it go faster again.
Replacing that with msleep(2) worked as well. :-)

I'm looking for a more robust solution though.
Thanks for the reminder.

NeilBrown

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 769151d..fa829dc 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2432,6 +2432,10 @@  ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 
 	/* the header must be checked already in ext4_ext_remove_space() */
 	ext_debug("truncate since %u in leaf to %u\n", start, end);
+	if (!path[depth].p_hdr && !path[depth].p_bh) {
+		EXT4_ERROR_INODE(inode, "depth %d", depth);
+		BUG_ON(1);
+	}
 	if (!path[depth].p_hdr)
 		path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
 	eh = path[depth].p_hdr;
@@ -2730,6 +2734,10 @@  cont:
 		/* this is index block */
 		if (!path[i].p_hdr) {
 			ext_debug("initialize header\n");
+			if (!path[i].p_hdr && !path[i].p_bh) {
+				EXT4_ERROR_INODE(inode, "i=%d", i);
+				BUG_ON(1);
+			}
 			path[i].p_hdr = ext_block_hdr(path[i].p_bh);
 		}
 
@@ -2828,6 +2836,7 @@  out:
 	kfree(path);
 	if (err == -EAGAIN) {
 		path = NULL;
+		i = 0;
 		goto again;
 	}
 	ext4_journal_stop(handle);