Patchwork lib/ext2fs: Use __builtin_popcount when available Signed-off-by: Cristian Rodríguez <crrodriguez@opensuse.org>

login
register
mail settings
Submitter Cristian Rodríguez
Date Jan. 6, 2013, 3:04 p.m.
Message ID <1357484683-3021-1-git-send-email-crrodriguez@opensuse.org>
Download mbox | patch
Permalink /patch/209776/
State Rejected
Headers show

Comments

Cristian Rodríguez - Jan. 6, 2013, 3:04 p.m.
---
 lib/ext2fs/bitops.c | 4 ++++
 1 file changed, 4 insertions(+)
Theodore Ts'o - Jan. 6, 2013, 10:20 p.m.
On Sun, Jan 06, 2013 at 12:04:43PM -0300, Cristian Rodríguez wrote:
> diff --git a/lib/ext2fs/bitops.c b/lib/ext2fs/bitops.c
> index 7c3f215..0668469 100644
> --- a/lib/ext2fs/bitops.c
> +++ b/lib/ext2fs/bitops.c
> @@ -125,11 +125,15 @@ static unsigned int popcount8(unsigned int w)
>  
>  static unsigned int popcount32(unsigned int w)
>  {
> +#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
> +	return __builtin_popcount(w);
> +#else

The problem is that if you compile popcount32 with your patch on an
x86 using gcc 4.7.2 and run it on a Intel core i7 CPU without using
the -march=core-avx-i compiler option, __builtin_popcount() is 18%
***slower*** than the optimized, non-looping code that is currently in
e2fsprogs.  And if you do compile with -march=core-avx-i, then of
course it's faster, but most distributions won't be able to use that
compiler option, since the resulting binary will crash on older intel
CPU's.  So by default, accepting this patch will slow down things, not
speed things up, for the most common case.

I could imagine a patch which explicitly checks the CPU type using the
x86 cpuinfo instructure, and only optionally uses the popcount
instruction if it exists, and uses the optimized code if it doesn't,
but then again, it's not clear how much it's worth it.

The difference it makes when checking 4TB's worth of bitmaps is negligible:

slow popcount: 0.2623
fast popcount: 0.0700

For a 128TB file system worth of bitmaps, the time difference is more
measurable:

slow popcount: 8.0185
fast popcount: 2.2066

... but running e2fsck on an empty 128TB file system uses 202 CPU
seconds (assuming all of the fs metadata bocks are in cache).  So
trying to use the popcount instruction would save at most 3%.  (For
comparison, e2fsck form 1.42.6 takes 392.7 CPU seconds, so we've
already gotten most of the speed improvements already).

Still, 6 CPU seconds is 6 CPU seconds.... but only if it applies for
the vast majority of e2fsprogs users, which means (a) people who are
using distributions which provide compiled binaries, such as SuSE,
OpenSuSE, RHEL, Fedora, Debian, Ubuntu, etc., and (b) people who are
running on an x86 CPU.

So if you'd like to write a patch which explicitly uses x86-specific
__asm__ statements to call cpuinfo and popcount explicitly, I'd be
happy to take such a patch.

Better yet would be if gcc was taught how to recognize the C
statements in popcount32(), since that's one of the standard,
intelligent ways of implementing popcount32 (as opposed to the stupid
way which is apparently used by gcc's runtime library, sigh), and
inserted code which automatically did the cpuinfo check and fallback
to popcount if the CPU supports it.  This would be even better since
then gcc could do a single cpuinfo check, and then automatically use
the faster SSE[234] instructions as necessary, without having to ask
application programmers to put in gcc-specific __builtin_* functions
which have the property of pessimizing the application, if you don't
ask GCC to build a cpu-specific executable which crashes and burns if
run on an older CPU....

Regards,

						- Ted


P.S.  This is what __popcountdi2 looks like from gcc 4.7.2's runtime.
Note the looping construct, which is perhaps one of the more stupider
ways to implement popcount.  Whoever implemented this should get 100
flogs with a wet noodle....

0000000000400520 <__popcountdi2>:
  400520:	 48 8b 35 99 ab 2a 00	mov    0x2aab99(%rip),%rsi
  400527:	 31 c0                	xor    %eax,%eax
  400529:	 31 c9                	xor    %ecx,%ecx
  40052b:	 0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  400530:	 48 89 fa             	mov    %rdi,%rdx
  400533:	 48 d3 ea             	shr    %cl,%rdx
  400536:	 83 c1 08             	add    $0x8,%ecx
  400539:	 81 e2 ff 00 00 00    	and    $0xff,%edx
  40053f:	 0f b6 14 16          	movzbl (%rsi,%rdx,1),%edx
  400543:	 01 d0                	add    %edx,%eax
  400545:	 83 f9 40             	cmp    $0x40,%ecx
  400548:	 75 e6                	jne    400530 <__popcountdi2+0x10>
  40054a:	 f3 c3                	repz retq 
  40054c:	 90                   	nop
  40054d:	 90                   	nop
  40054e:	 90                   	nop
  40054f:	 90                   	nop

Compare and contrast that with what we are using in e2fsprogs:

  	unsigned int res = w - ((w >> 1) & 0x55555555);
  	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
  	res = (res + (res >> 4)) & 0x0F0F0F0F;
  	res = res + (res >> 8);
  	return (res + (res >> 16)) & 0x000000FF;

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cristian Rodríguez - Jan. 7, 2013, 12:53 a.m.
El dom 06 ene 2013 19:20:20 CLST, Theodore Ts'o escribió:


> I could imagine a patch which explicitly checks the CPU type using the
> x86 cpuinfo instructure, and only optionally uses the popcount
> instruction if it exists, and uses the optimized code if it doesn't,
> but then again, it's not clear how much it's worth it.

The first incarnation of this patch did exactly that, too ugly didnt 
not send it. ;)

> Better yet would be if gcc was taught how to recognize the C
> statements in popcount32(), since that's one of the standard,
> intelligent ways of implementing popcount32 (as opposed to the stupid
> way which is apparently used by gcc's runtime library, sigh), and
> inserted code which automatically did the cpuinfo check and fallback
> to popcount if the CPU supports it.  This would be even better since
> then gcc could do a single cpuinfo check, and then automatically use
> the faster SSE[234] instructions as necessary, without having to ask
> application programmers to put in gcc-specific __builtin_* functions

Yeah, I asked GCC developers exactly this, was told to fill a 
enhancement request.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 7, 2013, 1:31 a.m.
On Sun, Jan 06, 2013 at 09:53:47PM -0300, Cristian Rodríguez wrote:
> 
> Yeah, I asked GCC developers exactly this, was told to fill a
> enhancement request.

If you could also sned them a bug/enhancement request to use a more
optimized version of __popcountdi2, that would be great.  I'm not sure
it helps e2fsprogs much, since it's too hard for us to tell whether we
are using a version of the gcc runtime that has a optimized or
unuptomized version of builtin_popcount().

But since it doesn't make that much difference, my preference is to
just ignore builtin_popcount() for now.  If someone is really using
128TB ext4 file systems, and cares about that extra 6 seconds of CPU,
it's probably going to require the ugly approach of using x86 asm
statements to determine whether or not we're running on a CPU that
supports the popcount instruction or not....

						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cristian Rodríguez - Jan. 7, 2013, 1:46 a.m.
El dom 06 ene 2013 22:31:56 CLST, Theodore Ts'o escribió:
> On Sun, Jan 06, 2013 at 09:53:47PM -0300, Cristian Rodríguez wrote:
>>
>> Yeah, I asked GCC developers exactly this, was told to fill a
>> enhancement request.
>
> If you could also sned them a bug/enhancement request to use a more
> optimized version of __popcountdi2, that would be great.  I'm not sure
> it helps e2fsprogs much, since it's too hard for us to tell whether we
> are using a version of the gcc runtime that has a optimized or
> unuptomized version of builtin_popcount().
>
> But since it doesn't make that much difference, my preference is to
> just ignore builtin_popcount() for now.  If someone is really using
> 128TB ext4 file systems, and cares about that extra 6 seconds of CPU,
> it's probably going to require the ugly approach of using x86 asm
> statements to determine whether or not we're running on a CPU that
> supports the popcount instruction or not....

with a recent compiler it goes something like this..

unsigned int popcnt(unsigned int w) __attribute__ ((ifunc 
("resolve_popcnt")));

__attribute__ ((__target__ ("popcnt")))
static unsigned int hw_popcnt(unsigned int w)
{
    return __builtin_popcount(w);
}

static unsigned int soft_popcnt(unsigned int w)
{
    return __builtin_popcount(w);
}

static void (*resolve_popcnt (void)) (void)
{
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        return hw_popcnt;
#else
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
        if (ecx & bit_POPCNT)
            return hw_popcnt;
#endif
    /* If magic does not work, or running old cpu.. */
        return soft_popcnt;
}

then call "popcnt" function in the code, this flies in x86 && ELF && 
GCC >= 4.6 only though.
The CPU detection code only runs once at load time btw.



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 7, 2013, 1:56 a.m.
On Sun, Jan 06, 2013 at 10:46:34PM -0300, Cristian Rodríguez wrote:
> 
> with a recent compiler it goes something like this..

Actually, that's less ugly than I feared...  if we wanted to pretty it
up a little, we could simply omit the __get_cpuid() bit, since that's
x86 specific, and just use the __builtin_cpu_supports() if it is
available on gcc 4.8 and newer.  If it is not available, we could just
always use the software version of popcnt, since it's really not that
bad.

What about the people who prefer to use clang instad of gcc; does
clang have support for any of this magic?

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/lib/ext2fs/bitops.c b/lib/ext2fs/bitops.c
index 7c3f215..0668469 100644
--- a/lib/ext2fs/bitops.c
+++ b/lib/ext2fs/bitops.c
@@ -125,11 +125,15 @@  static unsigned int popcount8(unsigned int w)
 
 static unsigned int popcount32(unsigned int w)
 {
+#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
+	return __builtin_popcount(w);
+#else
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
 	res = (res + (res >> 4)) & 0x0F0F0F0F;
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
+#endif
 }
 
 unsigned int ext2fs_bitcount(const void *addr, unsigned int nbytes)