diff mbox

bitops: provide an inline implementation of find_first_bit

Message ID 1387711957-703-1-git-send-email-aurelien@aurel32.net
State New
Headers show

Commit Message

Aurelien Jarno Dec. 22, 2013, 11:32 a.m. UTC
find_first_bit has started to be used heavily in TCG code. The current
implementation based on find_next_bit is not optimal and can't be
optimized be the compiler if the bit array has a fixed size, which is
the case most of the time.

This new implementation does not use find_next_bit and is yet small
enough to be inlined.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Corentin Chary <corentin.chary@gmail.com>

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 include/qemu/bitops.h |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Richard Henderson Dec. 22, 2013, 5:04 p.m. UTC | #1
On 12/22/2013 03:32 AM, Aurelien Jarno wrote:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized be the compiler if the bit array has a fixed size, which is
> the case most of the time.
> 
> This new implementation does not use find_next_bit and is yet small
> enough to be inlined.
> 
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Corentin Chary <corentin.chary@gmail.com>
> 
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
>  include/qemu/bitops.h |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~
Paolo Bonzini June 19, 2014, 8:36 a.m. UTC | #2
Il 22/12/2013 12:32, Aurelien Jarno ha scritto:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized be the compiler if the bit array has a fixed size, which is
> the case most of the time.

If you mean by fully unrolling the loop, that's right.

However...

> -    return find_next_bit(addr, size, 0);
> +    unsigned long result, tmp;
> +
> +    for (result = 0; result < size; result += BITS_PER_LONG) {
> +        tmp = *addr++;
> +        if (tmp) {
> +            result += ctzl(tmp);
> +            return result < size ? result : size;
> +        }
> +    }
> +    /* Not found */
> +    return size;
>  }
>
>  /**
>

... you probably want to limit this to bitmaps that are of constant 
size, and small enough that the compiler will unroll them.

So it probably would be a good idea to add an

     if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
         return find_next_bit(addr, size, 0);

Not urgent since TCG is the only user of find_first_bit right now, but 
worth considering.

Paolo
Aurelien Jarno June 20, 2014, 8:48 a.m. UTC | #3
On Thu, Jun 19, 2014 at 10:36:18AM +0200, Paolo Bonzini wrote:
> Il 22/12/2013 12:32, Aurelien Jarno ha scritto:
> >find_first_bit has started to be used heavily in TCG code. The current
> >implementation based on find_next_bit is not optimal and can't be
> >optimized be the compiler if the bit array has a fixed size, which is
> >the case most of the time.
> 
> If you mean by fully unrolling the loop, that's right.
> 
> However...
> 
> >-    return find_next_bit(addr, size, 0);
> >+    unsigned long result, tmp;
> >+
> >+    for (result = 0; result < size; result += BITS_PER_LONG) {
> >+        tmp = *addr++;
> >+        if (tmp) {
> >+            result += ctzl(tmp);
> >+            return result < size ? result : size;
> >+        }
> >+    }
> >+    /* Not found */
> >+    return size;
> > }
> >
> > /**
> >
> 
> ... you probably want to limit this to bitmaps that are of constant
> size, and small enough that the compiler will unroll them.
> 
> So it probably would be a good idea to add an
> 
>     if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
>         return find_next_bit(addr, size, 0);
> 
> Not urgent since TCG is the only user of find_first_bit right now,
> but worth considering.

In practice on x86_64, this function takes 27 instructions in the
general case, and 18 instructions in the fixed case, even for big
sizes. I therefore think that checking if the size is constant is a good
idea, but we should not make any test on the size itself and trust the
compiler to correctly decide if the loop should be unrolled or not.

I will send a patch about that in the next days.
Paolo Bonzini June 20, 2014, 8:58 a.m. UTC | #4
Il 20/06/2014 10:48, Aurelien Jarno ha scritto:
> In practice on x86_64, this function takes 27 instructions in the
> general case, and 18 instructions in the fixed case, even for big
> sizes. I therefore think that checking if the size is constant is a good
> idea, but we should not make any test on the size itself and trust the
> compiler to correctly decide if the loop should be unrolled or not.

But if the size is large enough that the compiler will (likely) not 
unroll the function, then it should pay off to use the more optimized 
code in find_next_bit.

This of course is unless you expect find_first_bit to return a small 
value and not be used in a loop; and dually expect find_next_bit's usage 
to be more like walking sparser bitmaps in a loop.

This actually makes sense, and then there's no need to change anything.

Paolo
Aurelien Jarno June 20, 2014, 9:43 a.m. UTC | #5
On Fri, Jun 20, 2014 at 10:58:31AM +0200, Paolo Bonzini wrote:
> Il 20/06/2014 10:48, Aurelien Jarno ha scritto:
> >In practice on x86_64, this function takes 27 instructions in the
> >general case, and 18 instructions in the fixed case, even for big
> >sizes. I therefore think that checking if the size is constant is a good
> >idea, but we should not make any test on the size itself and trust the
> >compiler to correctly decide if the loop should be unrolled or not.
> 
> But if the size is large enough that the compiler will (likely) not
> unroll the function, then it should pay off to use the more
> optimized code in find_next_bit.

The point there is that given find_next_bit is a generalized version of
find_first_bit, it is actually slower. I originally noticed that by
running profiling tools and noticing this function appeared relatively
high for what it is supposed to do.

> This of course is unless you expect find_first_bit to return a small
> value and not be used in a loop; and dually expect find_next_bit's
> usage to be more like walking sparser bitmaps in a loop.

I think that's the point. In the TCG case, this is used to map the
temp allocation to answer the question "give me a free temp". That said
people might invent new usages.

> This actually makes sense, and then there's no need to change anything.
> 
> Paolo
>
diff mbox

Patch

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 304c90c..041142a 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -157,7 +157,17 @@  unsigned long find_next_zero_bit(const unsigned long *addr,
 static inline unsigned long find_first_bit(const unsigned long *addr,
                                            unsigned long size)
 {
-    return find_next_bit(addr, size, 0);
+    unsigned long result, tmp;
+
+    for (result = 0; result < size; result += BITS_PER_LONG) {
+        tmp = *addr++;
+        if (tmp) {
+            result += ctzl(tmp);
+            return result < size ? result : size;
+        }
+    }
+    /* Not found */
+    return size;
 }
 
 /**