
[RFC] find_next_bit optimizations

Message ID 513DDFA3.1020308@dlhnet.de
State New

Commit Message

Peter Lieven March 11, 2013, 1:44 p.m. UTC
Hi,

I have always had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding
the next dirty bit seems to be one of the culprits (apart from the locking, which Paolo is working on removing).

I have the following proposal, which seems to help a lot in my case; I just wanted to get some feedback on it.
I applied the same unrolling idea as in buffer_is_zero().

Peter

Comments

Peter Maydell March 11, 2013, 2:04 p.m. UTC | #1
On 11 March 2013 13:44, Peter Lieven <pl@dlhnet.de> wrote:

> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr,
> unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

This change at least is unnecessary -- I just checked, and gcc
is already smart enough to turn the % operation into a logical
and. The generated object files are identical.
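
For reference, a minimal standalone illustration of the equivalence (assuming an unsigned operand and a power-of-two BITS_PER_LONG, which both hold here):

    /* Illustration only, not QEMU code: with an unsigned operand and a
     * power-of-two divisor, gcc is expected to emit the same single AND
     * instruction for both of these once optimization is enabled. */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    unsigned long mod_form(unsigned long offset)
    {
        return offset % BITS_PER_LONG;
    }

    unsigned long mask_form(unsigned long offset)
    {
        return offset & (BITS_PER_LONG - 1);
    }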

-- PMM
Paolo Bonzini March 11, 2013, 2:14 p.m. UTC | #2
Il 11/03/2013 14:44, Peter Lieven ha scritto:
> Hi,
> 
> I ever since had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from removing locking which
> Paolo is working on).
> 
> I have to following proposal which seems to help a lot in my case. Just
> wanted to have some feedback.
> I applied the same unrolling idea like in buffer_is_zero().
> 
> Peter
> 
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
> 
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
> *addr, unsigned long size,
>          result += BITS_PER_LONG;
>      }
>      while (size & ~(BITS_PER_LONG-1)) {
> +        while (!(size & (4*BITS_PER_LONG-1))) {

This really means

       if (!(size & (4*BITS_PER_LONG-1))) {
           while (1) {
               ...
           }
       }

because the subtraction will not change the result of the "while" loop
condition.

What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".

Please change both while loops to use a ">=" condition, it's easier to read.
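
Something like this rough, untested sketch (same loop bodies as in the patch; only the two conditions change):

       while (size >= BITS_PER_LONG) {
           while (size >= 4*BITS_PER_LONG) {
               ... /* unrolled 4-word zero check, as in the patch */
           }
           ... /* existing single-word check */
       }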

Paolo

> +            d0 = *p;
> +            d1 = *(p+1);
> +            d2 = *(p+2);
> +            d3 = *(p+3);
> +            if (d0 || d1 || d2 || d3) {
> +                break;
> +            }
> +            p+=4;
> +            result += 4*BITS_PER_LONG;
> +            size -= 4*BITS_PER_LONG;
> +        }
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }
Peter Lieven March 11, 2013, 2:22 p.m. UTC | #3
Am 11.03.2013 um 15:14 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 11/03/2013 14:44, Peter Lieven ha scritto:
>> Hi,
>> 
>> I ever since had a few VMs which are very hard to migrate because of a
>> lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from removing locking which
>> Paolo is working on).
>> 
>> I have to following proposal which seems to help a lot in my case. Just
>> wanted to have some feedback.
>> I applied the same unrolling idea like in buffer_is_zero().
>> 
>> Peter
>> 
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>     unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>> 
>>     if (offset >= size) {
>>         return size;
>>     }
>>     size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>     if (offset) {
>>         tmp = *(p++);
>>         tmp &= (~0UL << offset);
>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>> *addr, unsigned long size,
>>         result += BITS_PER_LONG;
>>     }
>>     while (size & ~(BITS_PER_LONG-1)) {
>> +        while (!(size & (4*BITS_PER_LONG-1))) {
> 
> This really means
> 
>       if (!(size & (4*BITS_PER_LONG-1))) {
>           while (1) {
>               ...
>           }
>       }
> 
> because the subtraction will not change the result of the "while" loop
> condition.

Are you sure? The above is working nicely for me (I wonder why ;-)).
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
propose. If size & (4*BITS_PER_LONG-1) is not zero, it's not divisible
by 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

> 
> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG).
> 
> Please change both while loops to use a ">=" condition, it's easier to read.

Good idea, it's easier to understand.

Does anyone have evidence on whether unrolling 4 longs is optimal on today's processors?
I just chose 4 longs because that is what buffer_is_zero() uses.

Peter

> 
> Paolo
> 
>> +            d0 = *p;
>> +            d1 = *(p+1);
>> +            d2 = *(p+2);
>> +            d3 = *(p+3);
>> +            if (d0 || d1 || d2 || d3) {
>> +                break;
>> +            }
>> +            p+=4;
>> +            result += 4*BITS_PER_LONG;
>> +            size -= 4*BITS_PER_LONG;
>> +        }
>>         if ((tmp = *(p++))) {
>>             goto found_middle;
>>         }
>
Peter Lieven March 11, 2013, 2:29 p.m. UTC | #4
Am 11.03.2013 um 15:22 schrieb Peter Lieven <pl@dlhnet.de>:

> 
> Am 11.03.2013 um 15:14 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 11/03/2013 14:44, Peter Lieven ha scritto:
>>> Hi,
>>> 
>>> I ever since had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from removing locking which
>>> Paolo is working on).
>>> 
>>> I have to following proposal which seems to help a lot in my case. Just
>>> wanted to have some feedback.
>>> I applied the same unrolling idea like in buffer_is_zero().
>>> 
>>> Peter
>>> 
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>    const unsigned long *p = addr + BITOP_WORD(offset);
>>>    unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>    unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>> 
>>>    if (offset >= size) {
>>>        return size;
>>>    }
>>>    size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>    if (offset) {
>>>        tmp = *(p++);
>>>        tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>        result += BITS_PER_LONG;
>>>    }
>>>    while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>> 
>> This really means
>> 
>>      if (!(size & (4*BITS_PER_LONG-1))) {
>>          while (1) {
>>              ...
>>          }
>>      }
>> 
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero its not dividable
> by 4*BITS_PER_LONG. But it see it might be a problem for size == 0.
> 
>> 
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG).
>> 
>> Please change both while loops to use a ">=" condition, it's easier to read.

Thinking about it again: in case a bit is found, this might lead to unnecessary iterations
of the while loop if the bit is in d1, d2, or d3.


> 
> Good idea, its easier to understand.
> 
> Has anyone evidence if unrolling 4 longs is optimal on today processors?
> I just chooses 4 longs because it was the same in buffer_is_zero.
> 
> Peter
> 
>> 
>> Paolo
>> 
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>        if ((tmp = *(p++))) {
>>>            goto found_middle;
>>>        }
>> 
>
Paolo Bonzini March 11, 2013, 2:35 p.m. UTC | #5
Il 11/03/2013 15:22, Peter Lieven ha scritto:
> 
> Am 11.03.2013 um 15:14 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 11/03/2013 14:44, Peter Lieven ha scritto:
>>> Hi,
>>>
>>> I ever since had a few VMs which are very hard to migrate because of a
>>> lot of memory I/O. I found that finding the next dirty bit
>>> seemed to be one of the culprits (apart from removing locking which
>>> Paolo is working on).
>>>
>>> I have to following proposal which seems to help a lot in my case. Just
>>> wanted to have some feedback.
>>> I applied the same unrolling idea like in buffer_is_zero().
>>>
>>> Peter
>>>
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long
>>> *addr, unsigned long size,
>>>         result += BITS_PER_LONG;
>>>     }
>>>     while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>       if (!(size & (4*BITS_PER_LONG-1))) {
>>           while (1) {
>>               ...
>>           }
>>       }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
> 
> Are you sure? The above is working nicely for me (wondering why ;-))

   while (!(size & (4*BITS_PER_LONG-1)))    =>
   while (!(size % (4*BITS_PER_LONG)))      =>
   while ((size % (4*BITS_PER_LONG)) == 0)

Subtracting 4*BITS_PER_LONG doesn't change the modulus. (Concretely,
assuming BITS_PER_LONG == 64: if size enters the loop at 512, it goes
512 -> 256 -> 0, and every one of those values satisfies size % 256 == 0,
so the condition never becomes false and the loop can only exit through
the break.)

> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you
> propose. If size & (4*BITS_PER_LONG-1) is not zero its not dividable
> by 4*BITS_PER_LONG. But it see it might be a problem for size == 0.

In fact I'm not really sure why it works for you. :)

>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG).
>>
>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Good idea, its easier to understand.
> 
>>> Please change both while loops to use a ">=" condition, it's easier to read.
> 
> Thinking again, in case a bit is found, this might lead to unnecessary iterations
> in the while loop if the bit is in d1, d2 or d3.

How would that be different in your patch?  But you can solve it by
making two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.
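
For illustration, a rough, untested sketch of that two-loop shape, written as a standalone helper with made-up names (this is not the patch that was eventually merged):

    /* Hypothetical sketch only: skip zero words four at a time, then
     * fall back to the original one-word-at-a-time loop. */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    unsigned long scan_for_set_bit(const unsigned long *p, unsigned long size)
    {
        unsigned long result = 0;
        unsigned long d0, d1, d2, d3;

        while (size >= 4*BITS_PER_LONG) {
            d0 = p[0];
            d1 = p[1];
            d2 = p[2];
            d3 = p[3];
            if (d0 || d1 || d2 || d3) {
                break;                  /* a set bit lies in these 4 words */
            }
            p += 4;
            result += 4*BITS_PER_LONG;
            size -= 4*BITS_PER_LONG;
        }
        while (size >= BITS_PER_LONG) {
            if (*p) {
                return result + __builtin_ctzl(*p);  /* the "found_middle" case */
            }
            p++;
            result += BITS_PER_LONG;
            size -= BITS_PER_LONG;
        }
        return result;                  /* partial trailing word not handled here */
    }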

Paolo

> 
> Peter
> 
>>
>> Paolo
>>
>>> +            d0 = *p;
>>> +            d1 = *(p+1);
>>> +            d2 = *(p+2);
>>> +            d3 = *(p+3);
>>> +            if (d0 || d1 || d2 || d3) {
>>> +                break;
>>> +            }
>>> +            p+=4;
>>> +            result += 4*BITS_PER_LONG;
>>> +            size -= 4*BITS_PER_LONG;
>>> +        }
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
>>
>
Stefan Hajnoczi March 12, 2013, 8:35 a.m. UTC | #6
On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
> 
> I have to following proposal which seems to help a lot in my case. Just wanted to have some feedback.

Hi Peter,
Do you have any performance numbers for this patch?  I'm just curious
how big the win is.

Stefan
Peter Lieven March 12, 2013, 8:41 a.m. UTC | #7
Am 12.03.2013 um 09:35 schrieb Stefan Hajnoczi <stefanha@gmail.com>:

> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
>> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
>> 
>> I have to following proposal which seems to help a lot in my case. Just wanted to have some feedback.
> 
> Hi Peter,
> Do you have any performance numbers for this patch?  I'm just curious
> how big the win is.

Hi Stefan,

please see my recent email to the list with the final patch.
The win is up to 100%: worst-case execution time (whole
array is zero) is halved on x86_64.
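
For anyone curious how such a number could be reproduced, a hypothetical micro-benchmark sketch (not from this thread; assumes it is compiled and linked together with util/bitops.c) might look like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Prototype as in the patch below (normally from QEMU's bitops header). */
    unsigned long find_next_bit(const unsigned long *addr,
                                unsigned long size, unsigned long offset);

    int main(void)
    {
        const unsigned long nbits = 1UL << 25;        /* 4 MiB all-zero bitmap */
        unsigned long *bitmap = calloc(nbits / 8, 1);
        unsigned long sink = 0;
        clock_t t0, t1;
        int i;

        if (!bitmap) {
            return 1;
        }
        t0 = clock();
        for (i = 0; i < 1000; i++) {
            sink += find_next_bit(bitmap, nbits, 0);  /* worst case: returns nbits */
        }
        t1 = clock();
        printf("1000 scans: %.1f ms (sink %lu)\n",
               (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC, sink);
        free(bitmap);
        return 0;
    }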

Peter

> 
> Stefan
Stefan Hajnoczi March 12, 2013, 3:12 p.m. UTC | #8
On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
> 
> Am 12.03.2013 um 09:35 schrieb Stefan Hajnoczi <stefanha@gmail.com>:
> 
> > On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> >> I ever since had a few VMs which are very hard to migrate because of a lot of memory I/O. I found that finding the next dirty bit
> >> seemed to be one of the culprits (apart from removing locking which Paolo is working on).
> >> 
> >> I have to following proposal which seems to help a lot in my case. Just wanted to have some feedback.
> > 
> > Hi Peter,
> > Do you have any performance numbers for this patch?  I'm just curious
> > how big the win is.
> 
> Hi Stefan,
> 
> please see my recent email to the list with the final patch.
> The win is up to 100%. Worst case execution time (whole
> array is zero) is halved on x86_64.

Thanks!

Stefan

Patch

--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@  unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
      const unsigned long *p = addr + BITOP_WORD(offset);
      unsigned long result = offset & ~(BITS_PER_LONG-1);
      unsigned long tmp;
+    unsigned long d0,d1,d2,d3;

      if (offset >= size) {
          return size;
      }
      size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
      if (offset) {
          tmp = *(p++);
          tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@  unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
          result += BITS_PER_LONG;
      }
      while (size & ~(BITS_PER_LONG-1)) {
+        while (!(size & (4*BITS_PER_LONG-1))) {
+            d0 = *p;
+            d1 = *(p+1);
+            d2 = *(p+2);
+            d3 = *(p+3);
+            if (d0 || d1 || d2 || d3) {
+                break;
+            }
+            p+=4;
+            result += 4*BITS_PER_LONG;
+            size -= 4*BITS_PER_LONG;
+        }
          if ((tmp = *(p++))) {
              goto found_middle;
          }