Message ID: 4D9B49A6.7000709@redhat.com
State: Accepted, archived
On 4/5/11 9:56 AM, Eric Sandeen wrote:
> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>> On 4/5/11 1:10 AM, Andreas Dilger wrote:
>>> On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
>>>> Block devices may set minimum or optimal IO hints equal to
>>>> blocksize; in this case there is really nothing for ext4
>>>> to do with this information (i.e. search for a block-aligned
>>>> allocation?) so don't set fs geometry with single-block
>>>> values.
>>>>
>>>> Zeev also reported that with a block-sized stripe, the
>>>> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
>>>> oddly enough.
>>>>
>>>> Reported-by: Zeev Tarantov <zeev.tarantov@gmail.com>
>>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>>> ---
>>>>
>>>> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
>>>> index 9798b88..74b838c 100644
>>>> --- a/misc/mke2fs.c
>>>> +++ b/misc/mke2fs.c
>>>> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
>>>>  	if ((opt_io == 0) && (psector_size > blocksize))
>>>>  		opt_io = psector_size;
>>>>
>>>> -	fs_param->s_raid_stride = min_io / blocksize;
>>>> -	fs_param->s_raid_stripe_width = opt_io / blocksize;
>>>> +	/* setting stripe/stride to blocksize is pointless */
>>>> +	if (min_io > blocksize)
>>>> +		fs_param->s_raid_stride = min_io / blocksize;
>>>> +	if (opt_io > blocksize)
>>>> +		fs_param->s_raid_stripe_width = opt_io / blocksize;
>>>
>>> I don't think it is harmful to specify an mballoc alignment that is
>>> an even multiple of the underlying device IO size (e.g. at least
>>> 256kB or 512kB).
>>>
>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>> have the same performance problem again.
>>>
>>> Cheers, Andreas
>>
>> I need to look into why ext4_mb_scan_aligned is so inefficient for a
>> block-sized stripe.
>>
>> In practice I don't think we've seen this problem with stripe size at
>> 4 or 8 or 16 blocks; it may just be less apparent.  I think the
>> function steps through by stripe-sized units, and if that is 1 block,
>> it's a lot of stepping.
>>
>> 	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> ...
>> 		if (!mb_test_bit(i, bitmap)) {
>
> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>  	i = (a * sbi->s_stripe) - first_group_block;
>
>  	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> -		if (!mb_test_bit(i, bitmap)) {
> -			max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> -			if (max >= sbi->s_stripe) {
> -				ac->ac_found++;
> -				ac->ac_b_ex = ex;
> -				ext4_mb_use_best_found(ac, e4b);
> -				break;
> -			}
> +		i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
> +		max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> +		if (max >= sbi->s_stripe) {
> +			ac->ac_found++;
> +			ac->ac_b_ex = ex;
> +			ext4_mb_use_best_found(ac, e4b);
> +			break;
>  		}
> -		i += sbi->s_stripe;
>  	}
> }
>
> totally untested, but I think we have better ways to step through the bitmap.

I tested it ;)  Seems to work fine, though I probably should see how
things actually got allocated.

Creating an fs with 1-block stripes & widths as with the original
report, and copying a (built) 2.6 linux kernel tree, it took about 7
minutes, and looped in the while() loop above 328215171 times.

With the patch above, it took 6m30s, and looped 25055 times.

For a filesystem with no stripe/stride set, it took 6m26s.

Hm, but subsequent tests w/ the tiny stripe set came in around 6m30s
as well, so there's no obvious speedup.  Still, avoiding all that
looping seems beneficial; I can send a patch after I make sure
allocation is still happening as expected.

Zeev, if you'd like to test that patch above with your profiling,
that'd be awesome.
Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
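The iteration counts Eric reports (328215171 loops before the change, 25055 after) come from the difference between probing one block per step and jumping straight to the next zero bit. A minimal userspace model of the two strategies is sketched below; `test_bit()`, `scan_stepwise()`, and `find_next_zero_bit()` are simplified stand-ins for the kernel helpers, not the real implementations.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Simplified stand-in for mb_test_bit(). */
static int test_bit(const unsigned long *map, size_t i)
{
	return (map[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1;
}

/* Old loop shape: probe one stripe-sized step at a time; with a
 * 1-block stripe this touches every single bit in the group. */
static size_t scan_stepwise(const unsigned long *map, size_t nbits,
			    size_t start, size_t stripe)
{
	size_t i;

	for (i = start; i < nbits; i += stripe)
		if (!test_bit(map, i))
			return i;
	return nbits;
}

/* find_next_zero_bit-style scan: a fully-allocated word is skipped
 * with one comparison instead of one probe per bit. */
static size_t find_next_zero_bit(const unsigned long *map, size_t nbits,
				 size_t start)
{
	size_t i = start;

	while (i < nbits) {
		if (i % BITS_PER_LONG == 0 && map[i / BITS_PER_LONG] == ~0UL)
			i += BITS_PER_LONG;	/* whole word busy: skip it */
		else if (test_bit(map, i))
			i++;
		else
			return i;
	}
	return nbits;
}
```

On a mostly-full bitmap the second scan does roughly one comparison per word rather than one per block, which matches the drop in loop counts above.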
On Wed, Apr 6, 2011 at 01:21, Eric Sandeen <sandeen@redhat.com> wrote:
> Zeev, if you'd like to test that patch above with your profiling,
> that'd be awesome.

My wall-clock timings are strange and I disregard them.
On a desktop with X turned off and perf recording system-wide, I trust
the profile to show everything.

In the no-stride case, I don't think you've made it any worse.

In the stripe=stride=1 case, it went from:

[ perf record: Woken up 13 times to write data ]
[ perf record: Captured and wrote 3.163 MB perf.data (~138211 samples) ]
# Events: 14K cycles
#
# Overhead      Command        Shared Object  Symbol
# ........  ...........  ...................  ......
#
    13.08%         gzip                 gzip  [.] zip
    12.65%  flush-253:0    [kernel.kallsyms]  [k] ext4_mb_scan_aligned
    12.46%         gzip                 gzip  [.] treat_file.part.4.2264
     9.34%  flush-253:0   [csnappy_compress]  [k] snappy_compress_fragment
     5.47%       md5sum               md5sum  [.] digest_file.isra.2.2089
     4.70%       md5sum  [csnappy_decompress] [k] snappy_decompress_noheader
     4.29%       md5sum               md5sum  [.] 0x2fe7
     2.16%         gzip         libc-2.13.so  [.] __memcpy_ssse3
     1.79%  flush-253:0    [kernel.kallsyms]  [k] memcpy
     1.52%          tar    [kernel.kallsyms]  [k] copy_user_generic_string
     1.08%          tar    [kernel.kallsyms]  [k] _raw_spin_lock
     1.00%         gzip    [kernel.kallsyms]  [k] copy_user_generic_string
     0.87%  flush-253:0    [kernel.kallsyms]  [k] __memset
     0.58%       md5sum               md5sum  [.] __libc_csu_init
     0.54%       md5sum               [zram]  [k] zram_make_request
     0.50%       md5sum    [kernel.kallsyms]  [k] copy_user_generic_string
     0.45%          tar    [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     0.44%          tar    [kernel.kallsyms]  [k] __memset
     0.39%         gzip                 gzip  [.] treat_stdin.2262
     0.35%         gzip                 gzip  [.] treat_file.2267
     0.33%      swapper    [kernel.kallsyms]  [k] mwait_idle
     0.28%           dd    [kernel.kallsyms]  [k] system_call
     0.26%       md5sum    [kernel.kallsyms]  [k] memcpy
     0.25%  flush-253:0               [zram]  [k] zram_make_request
     0.25%  flush-253:0    [kernel.kallsyms]  [k] _raw_spin_lock
     0.23%          tar    [kernel.kallsyms]  [k] ext4_mb_scan_aligned
     0.22%         gzip                 gzip  [.] compress_block.2644.2190
     0.20%          tar    [kernel.kallsyms]  [k] __find_get_block

to:

[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 3.091 MB perf.data (~135042 samples) ]
# Events: 13K cycles
#
# Overhead      Command        Shared Object  Symbol
# ........  ...........  ...................  ......
#
    15.10%         gzip                 gzip  [.] zip
    14.70%         gzip                 gzip  [.] treat_file.part.4.2264
    11.10%  flush-253:0   [csnappy_compress]  [k] snappy_compress_fragment
     6.12%       md5sum               md5sum  [.] digest_file.isra.2.2089
     5.44%       md5sum  [csnappy_decompress] [k] snappy_decompress_noheader
     5.10%       md5sum               md5sum  [.] 0x3262
     2.42%         gzip         libc-2.13.so  [.] __memcpy_ssse3
     2.03%  flush-253:0    [kernel.kallsyms]  [k] memcpy
     1.52%          tar    [kernel.kallsyms]  [k] copy_user_generic_string
     1.13%         gzip    [kernel.kallsyms]  [k] copy_user_generic_string
     0.80%  flush-253:0    [kernel.kallsyms]  [k] __memset
     0.62%       md5sum               md5sum  [.] __libc_csu_init
     0.60%       md5sum               [zram]  [k] zram_make_request
     0.57%         gzip                 gzip  [.] treat_stdin.2262
     0.56%       md5sum    [kernel.kallsyms]  [k] copy_user_generic_string
     0.55%          tar    [kernel.kallsyms]  [k] __memset
     0.49%          tar    [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     0.39%       md5sum    [kernel.kallsyms]  [k] memcpy
     0.37%         gzip                 gzip  [.] treat_file.2267
     0.37%      swapper    [kernel.kallsyms]  [k] mwait_idle
     0.28%           dd    [kernel.kallsyms]  [k] system_call
     0.28%  flush-253:0    [kernel.kallsyms]  [k] _raw_spin_lock
     0.25%         gzip                 gzip  [.] compress_block.2644.2190
     0.22%  flush-253:0    [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     0.22%          tar    [kernel.kallsyms]  [k] __ext4_get_inode_loc
     0.22%          tar    [kernel.kallsyms]  [k] _raw_spin_lock
     0.21%          tar    [kernel.kallsyms]  [k] mark_page_accessed
     0.20%          tar    [kernel.kallsyms]  [k] __find_get_block
     0.20%          tar                  tar  [.] contains_dot_dot

So that's great, but I still don't want mke2fs to set a stripe and
stride of 1, because that's silly - no code improvement in the
allocator is ever going to be able to use that information.

> Thanks,
> -Eric

Thank you!
-Z.T.
On 4/5/11 11:51 PM, Zeev Tarantov wrote:
> On Wed, Apr 6, 2011 at 01:21, Eric Sandeen <sandeen@redhat.com> wrote:
>> Zeev, if you'd like to test that patch above with your profiling,
>> that'd be awesome.
>
> My wall-clock timings are strange and I disregard them.
> On a desktop with X turned off and perf recording system-wide, I trust
> the profile to show everything.

Agreed, the wallclock time was just a very coarse sanity check while
off at a conference :)

> In the no-stride case, I don't think you've made it any worse.

It shouldn't even go down this path, so I hope not ;)

> In the stripe=stride=1 case, it went from:
>
> [ perf record: Woken up 13 times to write data ]
> [ perf record: Captured and wrote 3.163 MB perf.data (~138211 samples) ]
> # Events: 14K cycles
> #
>     13.08%         gzip               gzip  [.] zip
>     12.65%  flush-253:0  [kernel.kallsyms]  [k] ext4_mb_scan_aligned
...
> to:
>
> [ perf record: Woken up 12 times to write data ]
> [ perf record: Captured and wrote 3.091 MB perf.data (~135042 samples) ]
> # Events: 13K cycles
> #
>     15.10%         gzip                gzip  [.] zip
>     14.70%         gzip                gzip  [.] treat_file.part.4.2264
>     11.10%  flush-253:0  [csnappy_compress]  [k] snappy_compress_fragment
>      6.12%       md5sum              md5sum  [.] digest_file.isra.2.2089
...
>      0.23%          tar   [kernel.kallsyms]  [k] ext4_mb_scan_aligned
...
> So that's great, but I still don't want mke2fs to set a stripe and
> stride of 1, because that's silly - no code improvement in the
> allocator is ever going to be able to use that information.

I totally agree, and I'm sure the mkfs patch will go in as well.

Thanks for the verification of the kernel patch, too.  It should help
a bit for sane stripe sizes, too, with less impact as the stripe size
gets bigger.

Thanks,
-Eric

>> Thanks,
>> -Eric
>
> Thank you!
> -Z.T.
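The mke2fs change being agreed on here is just a guard against recording single-block geometry hints. A hypothetical userspace restatement of that guard (the struct and function names are illustrative stand-ins, not the real mke2fs code, which operates on `fs_param` inside `get_device_geometry()`):

```c
/* Illustrative stand-in for the superblock fields mke2fs fills in. */
struct fs_geom {
	unsigned int s_raid_stride;
	unsigned int s_raid_stripe_width;
};

/* Mirrors the patched calculation: a stride or stripe width of one
 * block tells the allocator nothing useful, so only record device IO
 * hints that are strictly larger than the filesystem block size. */
static void set_raid_geometry(struct fs_geom *g, unsigned int min_io,
			      unsigned int opt_io, unsigned int blocksize)
{
	if (min_io > blocksize)
		g->s_raid_stride = min_io / blocksize;
	if (opt_io > blocksize)
		g->s_raid_stripe_width = opt_io / blocksize;
}
```

With zram-style hints of min_io = opt_io = 4096 on a 4kB-block filesystem, both fields stay zero, so the allocator never enters the aligned-scan path at all.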
On 2011-04-05, at 10:56 AM, Eric Sandeen wrote:
> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>> Andreas Dilger wrote:
>>> I don't think it is harmful to specify an mballoc alignment that is
>>> an even multiple of the underlying device IO size (e.g. at least
>>> 256kB or 512kB).
>>>
>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>> have the same performance problem again.
>>>
>>> Cheers, Andreas
>>
>> I need to look into why ext4_mb_scan_aligned is so inefficient for a
>> block-sized stripe.
>>
>> In practice I don't think we've seen this problem with stripe size at
>> 4 or 8 or 16 blocks; it may just be less apparent.  I think the
>> function steps through by stripe-sized units, and if that is 1 block,
>> it's a lot of stepping.
>>
>> 	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> ...
>> 		if (!mb_test_bit(i, bitmap)) {
>
> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>  	i = (a * sbi->s_stripe) - first_group_block;
>
>  	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> -		if (!mb_test_bit(i, bitmap)) {
> -			max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> -			if (max >= sbi->s_stripe) {
> -				ac->ac_found++;
> -				ac->ac_b_ex = ex;
> -				ext4_mb_use_best_found(ac, e4b);
> -				break;
> -			}
> +		i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
> +		max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> +		if (max >= sbi->s_stripe) {
> +			ac->ac_found++;
> +			ac->ac_b_ex = ex;
> +			ext4_mb_use_best_found(ac, e4b);
> +			break;
>  		}
> -		i += sbi->s_stripe;
>  	}
> }
>
> totally untested, but I think we have better ways to step through the bitmap.

This changes the allocation completely, AFAICS.  Instead of doing
checks for chunks of free space aligned on sbi->s_stripe boundaries,
it is instead finding the first free space of size s_stripe regardless
of alignment.  That is not good for RAID back-ends, and is the primary
reason for ext4_mb_scan_aligned() to exist.

I think my original assertion holds - that regardless of what "optimal
IO" size is reported by the underlying device, doing larger allocations
at the mballoc level that are even multiples of this size isn't
harmful.  That avoids not only the performance impact of 4kB-sized
"optimal IO", but also the (lesser) impact of 8kB-64kB "optimal IO"
allocations as well.

Cheers, Andreas
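One way to reconcile the two goals - skipping runs of busy bits quickly while still only considering stripe-aligned chunks, as Andreas requires - is to round the position returned by the zero-bit search up to the next stripe boundary before testing the chunk. The following is a hypothetical userspace sketch, not a kernel patch; `test_bit()` and `find_next_zero_bit()` are deliberately simplified models of the kernel helpers (the real `mb_find_next_zero_bit()` skips whole words at a time).

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static int test_bit(const unsigned long *map, size_t i)
{
	return (map[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1;
}

/* Simplified per-bit model of find_next_zero_bit. */
static size_t find_next_zero_bit(const unsigned long *map, size_t nbits,
				 size_t start)
{
	while (start < nbits && test_bit(map, start))
		start++;
	return start;
}

/* Return the start of the first completely free, stripe-aligned chunk,
 * or nbits if there is none. */
static size_t scan_aligned(const unsigned long *map, size_t nbits,
			   size_t stripe)
{
	size_t i = 0;

	while (i < nbits) {
		i = find_next_zero_bit(map, nbits, i);
		/* Round up to the next stripe boundary so candidates
		 * stay aligned for the RAID back-end. */
		i = (i + stripe - 1) / stripe * stripe;
		if (i + stripe > nbits)
			break;

		size_t j = i;
		while (j < i + stripe && !test_bit(map, j))
			j++;
		if (j == i + stripe)
			return i;	/* whole aligned stripe is free */
		i = j + 1;		/* resume past the busy bit */
	}
	return nbits;
}
```

Unaligned free fragments are skipped rather than returned, preserving the property Andreas describes, while the zero-bit search still avoids probing every allocated block individually.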
On 4/7/11 5:13 PM, Andreas Dilger wrote:
> On 2011-04-05, at 10:56 AM, Eric Sandeen wrote:
>> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>>> Andreas Dilger wrote:
>>>> I don't think it is harmful to specify an mballoc alignment that is
>>>> an even multiple of the underlying device IO size (e.g. at least
>>>> 256kB or 512kB).
>>>>
>>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>>> have the same performance problem again.
>>>>
>>>> Cheers, Andreas
>>>
>>> I need to look into why ext4_mb_scan_aligned is so inefficient for a
>>> block-sized stripe.
>>>
>>> In practice I don't think we've seen this problem with stripe size at
>>> 4 or 8 or 16 blocks; it may just be less apparent.  I think the
>>> function steps through by stripe-sized units, and if that is 1 block,
>>> it's a lot of stepping.
>>>
>>> 	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>>> ...
>>> 		if (!mb_test_bit(i, bitmap)) {
>>
>> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>>
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>>  	i = (a * sbi->s_stripe) - first_group_block;
>>
>>  	while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> -		if (!mb_test_bit(i, bitmap)) {
>> -			max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> -			if (max >= sbi->s_stripe) {
>> -				ac->ac_found++;
>> -				ac->ac_b_ex = ex;
>> -				ext4_mb_use_best_found(ac, e4b);
>> -				break;
>> -			}
>> +		i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
>> +		max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> +		if (max >= sbi->s_stripe) {
>> +			ac->ac_found++;
>> +			ac->ac_b_ex = ex;
>> +			ext4_mb_use_best_found(ac, e4b);
>> +			break;
>>  		}
>> -		i += sbi->s_stripe;
>>  	}
>> }
>>
>> totally untested, but I think we have better ways to step through the bitmap.
>
> This changes the allocation completely, AFAICS.  Instead of doing
> checks for chunks of free space aligned on sbi->s_stripe boundaries,
> it is instead finding the first free space of size s_stripe
> regardless of alignment.  That is not good for RAID back-ends, and is
> the primary reason for ext4_mb_scan_aligned() to exist.

Oh, er, right.  It's what I get for coding-at-conference, sorry.

I do wonder if test-bit/advance/test-bit/advance can be made a bit
more efficient with something like find_next_bit; I just did it wrong
:(  I'll revisit it when I get back home.

> I think my original assertion holds - that regardless of what the
> "optimal IO" size reported by the underlying device, doing larger
> allocations at the mballoc level that are even multiples of this size
> isn't harmful.  That avoids not only the performance impact of
> 4kB-sized "optimal IO", but also the (lesser) impact of 8kB-64kB
> "optimal IO" allocations as well.
>
> Cheers, Andreas

I'll give that some thought; really, the whole align-on-a-stripe
mechanism needs work, at least outside of the Lustre workload :)

Thanks,
-Eric