Submitted by Peter Lieven on March 11, 2013, 1:44 p.m.

Message ID: 513DDFA3.1020308@dlhnet.de
State: New


On 11 March 2013 13:44, Peter Lieven <pl@dlhnet.de> wrote:
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

This change at least is unnecessary -- I just checked, and gcc is already
smart enough to turn the % operation into a logical and. The generated
object files are identical.

-- PMM

On 11/03/2013 14:44, Peter Lieven wrote:
> Hi,
>
> I have ever since had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit
> seemed to be one of the culprits (apart from removing locking, which
> Paolo is working on).
>
> I have the following proposal, which seems to help a lot in my case. I just
> wanted to have some feedback.
> I applied the same unrolling idea as in buffer_is_zero().
>
> Peter
>
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>          result += BITS_PER_LONG;
>      }
>      while (size & ~(BITS_PER_LONG-1)) {
> +        while (!(size & (4*BITS_PER_LONG-1))) {

This really means

    if (!(size & (4*BITS_PER_LONG-1))) {
        while (1) {
            ...
        }
    }

because the subtraction will not change the result of the "while" loop
condition.

What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
in turn means "while (size >= 4*BITS_PER_LONG)".

Please change both while loops to use a ">=" condition; it's easier to read.

Paolo

> +            d0 = *p;
> +            d1 = *(p+1);
> +            d2 = *(p+2);
> +            d3 = *(p+3);
> +            if (d0 || d1 || d2 || d3) {
> +                break;
> +            }
> +            p+=4;
> +            result += 4*BITS_PER_LONG;
> +            size -= 4*BITS_PER_LONG;
> +        }
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 14:44, Peter Lieven wrote:
>> Hi,
>>
>> [original patch posting snipped]
>>
>>      while (size & ~(BITS_PER_LONG-1)) {
>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>
> This really means
>
>     if (!(size & (4*BITS_PER_LONG-1))) {
>         while (1) {
>             ...
>         }
>     }
>
> because the subtraction will not change the result of the "while" loop
> condition.

Are you sure? The above is working nicely for me (wondering why ;-)).
I think !(size & (4*BITS_PER_LONG-1)) is the same as what you propose.
If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible by
4*BITS_PER_LONG. But I see it might be a problem for size == 0.

> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
> in turn means "while (size >= 4*BITS_PER_LONG)".
>
> Please change both while loops to use a ">=" condition; it's easier to read.

Good idea, it's easier to understand.

Has anyone evidence whether unrolling 4 longs is optimal on today's
processors? I just chose 4 longs because it was the same in
buffer_is_zero().

Peter

> Paolo
>
>> [remainder of the quoted patch snipped]

On 11.03.2013 at 15:22, Peter Lieven <pl@dlhnet.de> wrote:
>
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 14:44, Peter Lieven wrote:
>>> [original patch posting snipped]
>>>
>>>      while (size & ~(BITS_PER_LONG-1)) {
>>> +        while (!(size & (4*BITS_PER_LONG-1))) {
>>
>> This really means
>>
>>     if (!(size & (4*BITS_PER_LONG-1))) {
>>         while (1) {
>>             ...
>>         }
>>     }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-)).
> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you propose.
> If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible by
> 4*BITS_PER_LONG. But I see it might be a problem for size == 0.
>
>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition; it's easier to read.

Thinking again: in case a bit is found, this might lead to unnecessary
iterations in the while loop if the bit is in d1, d2 or d3.

> Good idea, it's easier to understand.
>
> Has anyone evidence whether unrolling 4 longs is optimal on today's
> processors? I just chose 4 longs because it was the same in
> buffer_is_zero().
>
> Peter

On 11/03/2013 15:22, Peter Lieven wrote:
>
> On 11.03.2013 at 15:14, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> [original patch posting and earlier discussion snipped]
>>
>> This really means
>>
>>     if (!(size & (4*BITS_PER_LONG-1))) {
>>         while (1) {
>>             ...
>>         }
>>     }
>>
>> because the subtraction will not change the result of the "while" loop
>> condition.
>
> Are you sure? The above is working nicely for me (wondering why ;-)).

while (!(size & (4*BITS_PER_LONG-1)))
=> while (!(size % (4*BITS_PER_LONG)))
=> while ((size % (4*BITS_PER_LONG)) == 0)

Subtracting 4*BITS_PER_LONG doesn't change the modulus.

> I think !(size & (4*BITS_PER_LONG-1)) is the same as what you propose.
> If size & (4*BITS_PER_LONG-1) is not zero, it is not divisible by
> 4*BITS_PER_LONG. But I see it might be a problem for size == 0.

In fact I'm not really sure why it works for you. :)

>> What you want is probably "while (size & ~(4*BITS_PER_LONG-1))", which
>> in turn means "while (size >= 4*BITS_PER_LONG)".
>>
>> Please change both while loops to use a ">=" condition; it's easier to read.
>
> Thinking again: in case a bit is found, this might lead to unnecessary
> iterations in the while loop if the bit is in d1, d2 or d3.

How would that be different in your patch? But you can solve it by making
two >= loops, one checking for 4*BITS_PER_LONG and one checking
BITS_PER_LONG.

Paolo

On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
> I have ever since had a few VMs which are very hard to migrate because of a
> lot of memory I/O. I found that finding the next dirty bit seemed to be one
> of the culprits (apart from removing locking, which Paolo is working on).
>
> I have the following proposal, which seems to help a lot in my case. I just
> wanted to have some feedback.

Hi Peter,

Do you have any performance numbers for this patch? I'm just curious
how big the win is.

Stefan

On 12.03.2013 at 09:35, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 02:44:03PM +0100, Peter Lieven wrote:
>> [original patch posting snipped]
>
> Hi Peter,
> Do you have any performance numbers for this patch? I'm just curious
> how big the win is.

Hi Stefan,

please see my recent email to the list with the final patch.
The win is up to 100%. Worst-case execution time (whole
array is zero) is halved on x86_64.

Peter

> Stefan

On Tue, Mar 12, 2013 at 09:41:04AM +0100, Peter Lieven wrote:
> On 12.03.2013 at 09:35, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> Do you have any performance numbers for this patch? I'm just curious
>> how big the win is.
>
> Hi Stefan,
>
> please see my recent email to the list with the final patch.
> The win is up to 100%. Worst-case execution time (whole
> array is zero) is halved on x86_64.

Thanks!

Stefan

--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -43,6 +44,18 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         result += BITS_PER_LONG;
     }
     while (size & ~(BITS_PER_LONG-1)) {
+        while (!(size & (4*BITS_PER_LONG-1))) {
+            d0 = *p;
+            d1 = *(p+1);
+            d2 = *(p+2);
+            d3 = *(p+3);
+            if (d0 || d1 || d2 || d3) {
+                break;
+            }
+            p+=4;
+            result += 4*BITS_PER_LONG;
+            size -= 4*BITS_PER_LONG;
+        }
         if ((tmp = *(p++))) {
             goto found_middle;
         }