Message ID | 1353947981-15219-4-git-send-email-tytso@mit.edu |
---|---|
State | Superseded, archived |
Headers | show |
> This function efficiently counts the number of bits in a block of > memory. Would it be worth the annoying build- and run-time machinery to detect and use the -msse4.2 __builtin_popcount() gcc intrinsic? > +static unsigned int popcount32(unsigned int w) > +{ > + unsigned int res = w - ((w >> 1) & 0x55555555); > + res = (res & 0x33333333) + ((res >> 2) & 0x33333333); > + res = (res + (res >> 4)) & 0x0F0F0F0F; > + res = res + (res >> 8); > + return (res + (res >> 16)) & 0x000000FF; > +} 40042b: 89 c2 mov %eax,%edx 400432: d1 ea shr %edx 400434: 81 e2 55 55 55 55 and $0x55555555,%edx 40043a: 29 d0 sub %edx,%eax 40043c: 89 c2 mov %eax,%edx 40043e: c1 e8 02 shr $0x2,%eax 400441: 81 e2 33 33 33 33 and $0x33333333,%edx 400447: 25 33 33 33 33 and $0x33333333,%eax 40044c: 01 d0 add %edx,%eax 40044e: 89 c2 mov %eax,%edx 400450: c1 ea 04 shr $0x4,%edx 400453: 01 d0 add %edx,%eax 400455: 25 0f 0f 0f 0f and $0xf0f0f0f,%eax 40045a: 89 c2 mov %eax,%edx 40045c: c1 ea 08 shr $0x8,%edx 40045f: 01 d0 add %edx,%eax 400461: 89 c6 mov %eax,%esi 400463: c1 ee 10 shr $0x10,%esi 400468: 0f b6 f0 movzbl %al,%esi __builtin_popcount(): 40047e: f3 0f b8 f0 popcnt %eax,%esi - z -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 26, 2012 at 03:17:45PM -0800, Zach Brown wrote: > > This function efficiently counts the number of bits in a block of > > memory. > > Would it be worth the annoying build- and run-time machinery to detect > and use the -msse4.2 __builtin_popcount() gcc intrinsic? I thought about doing it, but I was in a bit of a hurry implementing this patch set, and I wasn't even sure how to correctly implement the build- and run-time machinery (i.e., detecting whether the gcc you're compiling with supports __builtin_popcount, and implementing a run-time fallback is the CPU doesn't support popcount instruction --- which by the way isn't properly part of SSE 4.2; it has its own separate CPUID bit, IIRC). Is there some userspace application licensed under LGPLv2 which does this cleanly from which I could borrow code? I suppose I should first check and see how much difference it makes to with a hard-coded use __builtin_popcnt(). If it makes a sufficiently large improvement, it's probably worth the hair of implementing the fallback machinery. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 26, 2012 at 08:45:05PM -0500, Theodore Ts'o wrote: > I suppose I should first check and see how much difference it makes to > with a hard-coded use __builtin_popcnt(). If it makes a sufficiently > large improvement, it's probably worth the hair of implementing the > fallback machinery. I did some quick benchmarking, and the difference it makes when checking 4TB's worth of bitmaps is negligble: slow popcount: 0.2623 fast popcount: 0.0700 For a 128TB's worth of bitmaps, the time difference is: slow popcount: 8.0185 fast popcount: 2.2066 I measured running e2fsck on an empty 128TB file system, and that took 202 CPU seconds (assuming all of the fs metadata blocks are in cache), so with this optimization we would save at most 3%. (For comparison, using an unmodified 1.42.6 e2fsck, it burned 392.7 CPU seconds.) My conclusion is that using __builtin_popcnt() is a nice-to-have, and if someone sends me patches I'll probably take them as a optimization, but it's not super high priority for me. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 27, 2012 at 12:16:17AM -0500, Theodore Ts'o wrote: > I did some quick benchmarking, and the difference it makes when > checking 4TB's worth of bitmaps is negligble: > > slow popcount: 0.2623 > fast popcount: 0.0700 > > For a 128TB's worth of bitmaps, the time difference is: > > slow popcount: 8.0185 > fast popcount: 2.2066 > > I measured running e2fsck on an empty 128TB file system, and that took > 202 CPU seconds (assuming all of the fs metadata blocks are in cache), > so with this optimization we would save at most 3%. (For comparison, > using an unmodified 1.42.6 e2fsck, it burned 392.7 CPU seconds.) Nice, thanks for taking the time to get numbers. > My conclusion is that using __builtin_popcnt() is a nice-to-have, and > if someone sends me patches I'll probably take them as a optimization, > but it's not super high priority for me. Agreed. I'll chuck it at the end of my fun-projects-some-day list as well, but getting it right for all the platforms that e2fsprogs supports.. meh :). - z -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 26, 2012 at 08:45:05PM -0500, Theodore Ts'o wrote: > build- and run-time machinery (i.e., detecting whether the gcc you're > compiling with supports __builtin_popcount, and implementing a > run-time fallback is the CPU doesn't support popcount instruction --- > which by the way isn't properly part of SSE 4.2; it has its own > separate CPUID bit, IIRC). *nod* > Is there some userspace application licensed under LGPLv2 which does > this cleanly from which I could borrow code? Not that I know of, off the top of my head. I think I'd first check the usual crypto suspects :). - z -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 27, 2012 at 09:50:23AM -0800, Zach Brown wrote: > > Agreed. I'll chuck it at the end of my fun-projects-some-day list as > well, but getting it right for all the platforms that e2fsprogs > supports.. meh :). It's not strictly necessary to get things right for all platforms; we already have some accelerations using asm statements which only work for one platform already --- although it's already the case that I didn't bother to make 64-bit set/clear/test bit optimizations for x86, mainly because I didn't think it was worth it, especially on modern CPU's. (And with the red/black tree backend for bitmaps, the asm bitops are even less important.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/lib/ext2fs/bitops.c b/lib/ext2fs/bitops.c index 9322a35..bd9e768 100644 --- a/lib/ext2fs/bitops.c +++ b/lib/ext2fs/bitops.c @@ -116,3 +116,38 @@ int ext2fs_test_bit64(__u64 nr, const void * addr) return (mask & *ADDR); } +static unsigned int popcount8(unsigned int w) +{ + unsigned int res = w - ((w >> 1) & 0x55); + res = (res & 0x33) + ((res >> 2) & 0x33); + return (res + (res >> 4)) & 0x0F; +} + +static unsigned int popcount32(unsigned int w) +{ + unsigned int res = w - ((w >> 1) & 0x55555555); + res = (res & 0x33333333) + ((res >> 2) & 0x33333333); + res = (res + (res >> 4)) & 0x0F0F0F0F; + res = res + (res >> 8); + return (res + (res >> 16)) & 0x000000FF; +} + +unsigned int ext2fs_bitcount(const void *addr, unsigned int nbytes) +{ + const unsigned char *cp = addr; + const __u32 *p = addr; + unsigned int res = 0; + + if ((((unsigned long) addr) & 3) == 0) { + while (nbytes > 4) { + res += popcount32(*p++); + nbytes -= 4; + } + cp = (const unsigned char *) p; + } + while (nbytes > 0) { + res += popcount8(*cp++); + nbytes--; + } + return res; +} diff --git a/lib/ext2fs/bitops.h b/lib/ext2fs/bitops.h index 526870f..406999b 100644 --- a/lib/ext2fs/bitops.h +++ b/lib/ext2fs/bitops.h @@ -686,6 +686,7 @@ extern int ext2fs_test_bit(unsigned int nr, const void * addr); extern int ext2fs_set_bit64(__u64 nr,void * addr); extern int ext2fs_clear_bit64(__u64 nr, void * addr); extern int ext2fs_test_bit64(__u64 nr, const void * addr); +extern unsigned int ext2fs_bitcount(const void *addr, unsigned int nbytes); #ifdef NO_INLINE_FUNCS extern void ext2fs_fast_set_bit(unsigned int nr,void * addr); diff --git a/lib/ext2fs/tst_bitmaps.c b/lib/ext2fs/tst_bitmaps.c index 5da3693..2a76292 100644 --- a/lib/ext2fs/tst_bitmaps.c +++ b/lib/ext2fs/tst_bitmaps.c @@ -270,6 +270,7 @@ void dump_bitmap(ext2fs_generic_bitmap bmap, unsigned int start, unsigned num) for (i=0; i < len; i++) printf("%02x", buf[i]); printf("\n"); + printf("bits set: %u\n", ext2fs_bitcount(buf, len)); free(buf); } diff --git a/lib/ext2fs/tst_bitmaps_cmds b/lib/ext2fs/tst_bitmaps_cmds index 9a4f3d0..31e2a60 100644 --- a/lib/ext2fs/tst_bitmaps_cmds +++ b/lib/ext2fs/tst_bitmaps_cmds @@ -53,5 +53,47 @@ cleari 5 testi 17 testi 6 testi 4 +clearb 7 12 +dump_bb +setb 1 +dump_bb +setb 2 +dump_bb +setb 3 +dump_bb +setb 4 +dump_bb +setb 5 +dump_bb +setb 6 +dump_bb +setb 7 +dump_bb +setb 8 +dump_bb +setb 10 +setb 12 +setb 14 +setb 17 +setb 19 +setb 24 +setb 26 +setb 27 +setb 30 +setb 31 +setb 32 +setb 35 +setb 39 +setb 40 +setb 44 +setb 46 +setb 47 +setb 49 +setb 51 +setb 52 +clearb 2 +clearb 3 +clearb 7 +dump_bb quit diff --git a/lib/ext2fs/tst_bitmaps_exp b/lib/ext2fs/tst_bitmaps_exp index 2d406ce..2d62b66 100644 --- a/lib/ext2fs/tst_bitmaps_exp +++ b/lib/ext2fs/tst_bitmaps_exp @@ -36,6 +36,7 @@ tst_bitmaps: testb 16 Block 16 is set tst_bitmaps: dump_bb block bitmap: 00f80000000000000000000000000000 +bits set: 5 tst_bitmaps: ffzb 11 16 First unmarked block is 11 tst_bitmaps: ffzb 12 16 @@ -64,6 +65,7 @@ tst_bitmaps: setb 12 7 Marking blocks 12 to 18 tst_bitmaps: dump_bb block bitmap: 00f80300000000000000000000000000 +bits set: 7 tst_bitmaps: seti 2 Setting inode 2, was clear before tst_bitmaps: seti 5 @@ -82,6 +84,7 @@ tst_bitmaps: testi 1 Inode 1 is clear tst_bitmaps: dump_ib inode bitmap: 1e000000 +bits set: 4 tst_bitmaps: ffzi 1 6 First unmarked inode is 1 tst_bitmaps: ffzi 2 5 @@ -110,5 +113,99 @@ tst_bitmaps: testi 6 Inode 6 is clear tst_bitmaps: testi 4 Inode 4 is clear +tst_bitmaps: clearb 7 12 +Clearing blocks 7 to 18 +tst_bitmaps: dump_bb +block bitmap: 00000000000000000000000000000000 +bits set: 0 +tst_bitmaps: setb 1 +Setting block 1, was clear before +tst_bitmaps: dump_bb +block bitmap: 01000000000000000000000000000000 +bits set: 1 +tst_bitmaps: setb 2 +Setting block 2, was clear before +tst_bitmaps: dump_bb +block bitmap: 03000000000000000000000000000000 +bits set: 2 +tst_bitmaps: setb 3 +Setting block 3, was clear before +tst_bitmaps: dump_bb +block bitmap: 07000000000000000000000000000000 +bits set: 3 +tst_bitmaps: setb 4 +Setting block 4, was clear before +tst_bitmaps: dump_bb +block bitmap: 0f000000000000000000000000000000 +bits set: 4 +tst_bitmaps: setb 5 +Setting block 5, was clear before +tst_bitmaps: dump_bb +block bitmap: 1f000000000000000000000000000000 +bits set: 5 +tst_bitmaps: setb 6 +Setting block 6, was clear before +tst_bitmaps: dump_bb +block bitmap: 3f000000000000000000000000000000 +bits set: 6 +tst_bitmaps: setb 7 +Setting block 7, was clear before +tst_bitmaps: dump_bb +block bitmap: 7f000000000000000000000000000000 +bits set: 7 +tst_bitmaps: setb 8 +Setting block 8, was clear before +tst_bitmaps: dump_bb +block bitmap: ff000000000000000000000000000000 +bits set: 8 +tst_bitmaps: setb 10 +Setting block 10, was clear before +tst_bitmaps: setb 12 +Setting block 12, was clear before +tst_bitmaps: setb 14 +Setting block 14, was clear before +tst_bitmaps: setb 17 +Setting block 17, was clear before +tst_bitmaps: setb 19 +Setting block 19, was clear before +tst_bitmaps: setb 24 +Setting block 24, was clear before +tst_bitmaps: setb 26 +Setting block 26, was clear before +tst_bitmaps: setb 27 +Setting block 27, was clear before +tst_bitmaps: setb 30 +Setting block 30, was clear before +tst_bitmaps: setb 31 +Setting block 31, was clear before +tst_bitmaps: setb 32 +Setting block 32, was clear before +tst_bitmaps: setb 35 +Setting block 35, was clear before +tst_bitmaps: setb 39 +Setting block 39, was clear before +tst_bitmaps: setb 40 +Setting block 40, was clear before +tst_bitmaps: setb 44 +Setting block 44, was clear before +tst_bitmaps: setb 46 +Setting block 46, was clear before +tst_bitmaps: setb 47 +Setting block 47, was clear before +tst_bitmaps: setb 49 +Setting block 49, was clear before +tst_bitmaps: setb 51 +Setting block 51, was clear before +tst_bitmaps: setb 52 +Setting block 52, was clear before +tst_bitmaps: clearb 2 +Clearing block 2, was set before +tst_bitmaps: clearb 3 +Clearing block 3, was set before +tst_bitmaps: clearb 7 +Clearing block 7, was set before +tst_bitmaps: dump_bb +block bitmap: b92a85e6c4680d000000000000000000 +bits set: 25 tst_bitmaps: quit tst_bitmaps: