diff mbox

[02/20] add hierarchical bitmap data type and test cases

Message ID 1355319999-30627-3-git-send-email-pbonzini@redhat.com
State New
Headers show

Commit Message

Paolo Bonzini Dec. 12, 2012, 1:46 p.m. UTC
HBitmaps provides an array of bits.  The bits are stored as usual in an
array of unsigned longs, but HBitmap is also optimized to provide fast
iteration over set bits; going from one bit to the next is O(logB n)
worst case, with B = sizeof(long) * CHAR_BIT: the result is low enough
that the number of levels is in fact fixed.

In order to do this, it stacks multiple bitmaps with progressively coarser
granularity; in all levels except the last, bit N is set iff the N-th
unsigned long is nonzero in the immediately next level.  When iteration
completes on the last level it can examine the 2nd-last level to quickly
skip entire words, and even do so recursively to skip blocks of 64 words or
powers thereof (32 on 32-bit machines).

Given an index in the bitmap, it can be split in group of bits like
this (for the 64-bit case):

     bits 0-57 => word in the last bitmap     | bits 58-63 => bit in the word
     bits 0-51 => word in the 2nd-last bitmap | bits 52-57 => bit in the word
     bits 0-45 => word in the 3rd-last bitmap | bits 46-51 => bit in the word

So it is easy to move up simply by shifting the index right by
log2(BITS_PER_LONG) bits.  To move down, you shift the index left
similarly, and add the word index within the group.  Iteration uses
ffs (find first set bit) to find the next word to examine; this
operation can be done in constant time in most current architectures.

Setting or clearing a range of m bits on all levels, the work to perform
is O(m + m/W + m/W^2 + ...), which is O(m) like on a regular bitmap.

When iterating on a bitmap, each bit (on any level) is only visited
once.  Hence, The total cost of visiting a bitmap with m bits in it is
the number of bits that are set in all bitmaps.  Unless the bitmap is
extremely sparse, this is also O(m + m/W + m/W^2 + ...), so the amortized
cost of advancing from one bit to the next is usually constant.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hbitmap.c            | 400 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hbitmap.h            | 207 ++++++++++++++++++++++++++
 tests/Makefile       |   2 +
 tests/test-hbitmap.c | 408 +++++++++++++++++++++++++++++++++++++++++++++++++++
 trace-events         |   5 +
 5 files changed, 1022 insertions(+)
 create mode 100644 hbitmap.c
 create mode 100644 hbitmap.h
 create mode 100644 tests/test-hbitmap.c

Comments

Eric Blake Dec. 14, 2012, 12:04 a.m. UTC | #1
On 12/12/2012 06:46 AM, Paolo Bonzini wrote:
> HBitmaps provides an array of bits.  The bits are stored as usual in an
> array of unsigned longs, but HBitmap is also optimized to provide fast
> iteration over set bits; going from one bit to the next is O(logB n)
> worst case, with B = sizeof(long) * CHAR_BIT: the result is low enough
> that the number of levels is in fact fixed.
> 

> 
> Reviewed-by: Laszlo Ersek <lersek@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hbitmap.c            | 400 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hbitmap.h            | 207 ++++++++++++++++++++++++++
>  tests/Makefile       |   2 +
>  tests/test-hbitmap.c | 408 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  trace-events         |   5 +
>  5 files changed, 1022 insertions(+)
>  create mode 100644 hbitmap.c
>  create mode 100644 hbitmap.h
>  create mode 100644 tests/test-hbitmap.c

I've looked through this several times, but did another once-over, and
you can feel free to add:

Reviewed-by: Eric Blake <eblake@redhat.com>
Stefan Hajnoczi Jan. 11, 2013, 6:27 p.m. UTC | #2
On Wed, Dec 12, 2012 at 02:46:21PM +0100, Paolo Bonzini wrote:
> diff --git a/hbitmap.c b/hbitmap.c
> new file mode 100644
> index 0000000..ae59f39
> --- /dev/null
> +++ b/hbitmap.c
> @@ -0,0 +1,400 @@
> +/*
> + * Hierarchical Bitmap Data Type
> + *
> + * Copyright Red Hat, Inc., 2012
> + *
> + * Author: Paolo Bonzini <pbonzini@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "osdep.h"
> +#include "hbitmap.h"
> +#include "host-utils.h"
> +#include "trace.h"
> +#include <string.h>
> +#include <glib.h>
> +#include <assert.h>

System headers before application headers, please.
diff mbox

Patch

diff --git a/hbitmap.c b/hbitmap.c
new file mode 100644
index 0000000..ae59f39
--- /dev/null
+++ b/hbitmap.c
@@ -0,0 +1,400 @@ 
+/*
+ * Hierarchical Bitmap Data Type
+ *
+ * Copyright Red Hat, Inc., 2012
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "osdep.h"
+#include "hbitmap.h"
+#include "host-utils.h"
+#include "trace.h"
+#include <string.h>
+#include <glib.h>
+#include <assert.h>
+
+/* HBitmaps provides an array of bits.  The bits are stored as usual in an
+ * array of unsigned longs, but HBitmap is also optimized to provide fast
+ * iteration over set bits; going from one bit to the next is O(logB n)
+ * worst case, with B = sizeof(long) * CHAR_BIT: the result is low enough
+ * that the number of levels is in fact fixed.
+ *
+ * In order to do this, it stacks multiple bitmaps with progressively coarser
+ * granularity; in all levels except the last, bit N is set iff the N-th
+ * unsigned long is nonzero in the immediately next level.  When iteration
+ * completes on the last level it can examine the 2nd-last level to quickly
+ * skip entire words, and even do so recursively to skip blocks of 64 words or
+ * powers thereof (32 on 32-bit machines).
+ *
+ * Given an index in the bitmap, it can be split in group of bits like
+ * this (for the 64-bit case):
+ *
+ *   bits 0-57 => word in the last bitmap     | bits 58-63 => bit in the word
+ *   bits 0-51 => word in the 2nd-last bitmap | bits 52-57 => bit in the word
+ *   bits 0-45 => word in the 3rd-last bitmap | bits 46-51 => bit in the word
+ *
+ * So it is easy to move up simply by shifting the index right by
+ * log2(BITS_PER_LONG) bits.  To move down, you shift the index left
+ * similarly, and add the word index within the group.  Iteration uses
+ * ffs (find first set bit) to find the next word to examine; this
+ * operation can be done in constant time in most current architectures.
+ *
+ * Setting or clearing a range of m bits on all levels, the work to perform
+ * is O(m + m/W + m/W^2 + ...), which is O(m) like on a regular bitmap.
+ *
+ * When iterating on a bitmap, each bit (on any level) is only visited
+ * once.  Hence, The total cost of visiting a bitmap with m bits in it is
+ * the number of bits that are set in all bitmaps.  Unless the bitmap is
+ * extremely sparse, this is also O(m + m/W + m/W^2 + ...), so the amortized
+ * cost of advancing from one bit to the next is usually constant (worst case
+ * O(logB n) as in the non-amortized complexity).
+ */
+
+struct HBitmap {
+    /* Number of total bits in the bottom level.  */
+    uint64_t size;
+
+    /* Number of set bits in the bottom level.  */
+    uint64_t count;
+
+    /* A scaling factor.  Given a granularity of G, each bit in the bitmap will
+     * will actually represent a group of 2^G elements.  Each operation on a
+     * range of bits first rounds the bits to determine which group they land
+     * in, and then affect the entire page; iteration will only visit the first
+     * bit of each group.  Here is an example of operations in a size-16,
+     * granularity-1 HBitmap:
+     *
+     *    initial state            00000000
+     *    set(start=0, count=9)    11111000 (iter: 0, 2, 4, 6, 8)
+     *    reset(start=1, count=3)  00111000 (iter: 4, 6, 8)
+     *    set(start=9, count=2)    00111100 (iter: 4, 6, 8, 10)
+     *    reset(start=5, count=5)  00000000
+     *
+     * From an implementation point of view, when setting or resetting bits,
+     * the bitmap will scale bit numbers right by this amount of bits.  When
+     * iterating, the bitmap will scale bit numbers left by this amount of
+     * bits.
+     */
+    int granularity;
+
+    /* A number of progressively less coarse bitmaps (i.e. level 0 is the
+     * coarsest).  Each bit in level N represents a word in level N+1 that
+     * has a set bit, except the last level where each bit represents the
+     * actual bitmap.
+     *
+     * Note that all bitmaps have the same number of levels.  Even a 1-bit
+     * bitmap will still allocate HBITMAP_LEVELS arrays.
+     */
+    unsigned long *levels[HBITMAP_LEVELS];
+};
+
+static inline int popcountl(unsigned long l)
+{
+    return BITS_PER_LONG == 32 ? ctpop32(l) : ctpop64(l);
+}
+
+/* Advance hbi to the next nonzero word and return it.  hbi->pos
+ * is updated.  Returns zero if we reach the end of the bitmap.
+ */
+unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
+{
+    size_t pos = hbi->pos;
+    const HBitmap *hb = hbi->hb;
+    unsigned i = HBITMAP_LEVELS - 1;
+
+    unsigned long cur;
+    do {
+        cur = hbi->cur[--i];
+        pos >>= BITS_PER_LEVEL;
+    } while (cur == 0);
+
+    /* Check for end of iteration.  We always use fewer than BITS_PER_LONG
+     * bits in the level 0 bitmap; thus we can repurpose the most significant
+     * bit as a sentinel.  The sentinel is set in hbitmap_alloc and ensures
+     * that the above loop ends even without an explicit check on i.
+     */
+
+    if (i == 0 && cur == (1UL << (BITS_PER_LONG - 1))) {
+        return 0;
+    }
+    for (; i < HBITMAP_LEVELS - 1; i++) {
+        /* Shift back pos to the left, matching the right shifts above.
+         * The index of this word's least significant set bit provides
+         * the low-order bits.
+         */
+        pos = (pos << BITS_PER_LEVEL) + ffsl(cur) - 1;
+        hbi->cur[i] = cur & (cur - 1);
+
+        /* Set up next level for iteration.  */
+        cur = hb->levels[i + 1][pos];
+    }
+
+    hbi->pos = pos;
+    trace_hbitmap_iter_skip_words(hbi->hb, hbi, pos, cur);
+
+    assert(cur);
+    return cur;
+}
+
+void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first)
+{
+    unsigned i, bit;
+    uint64_t pos;
+
+    hbi->hb = hb;
+    pos = first >> hb->granularity;
+    hbi->pos = pos >> BITS_PER_LEVEL;
+    hbi->granularity = hb->granularity;
+
+    for (i = HBITMAP_LEVELS; i-- > 0; ) {
+        bit = pos & (BITS_PER_LONG - 1);
+        pos >>= BITS_PER_LEVEL;
+
+        /* Drop bits representing items before first.  */
+        hbi->cur[i] = hb->levels[i][pos] & ~((1UL << bit) - 1);
+
+        /* We have already added level i+1, so the lowest set bit has
+         * been processed.  Clear it.
+         */
+        if (i != HBITMAP_LEVELS - 1) {
+            hbi->cur[i] &= ~(1UL << bit);
+        }
+    }
+}
+
+bool hbitmap_empty(const HBitmap *hb)
+{
+    return hb->count == 0;
+}
+
+int hbitmap_granularity(const HBitmap *hb)
+{
+    return hb->granularity;
+}
+
+uint64_t hbitmap_count(const HBitmap *hb)
+{
+    return hb->count << hb->granularity;
+}
+
+/* Count the number of set bits between start and end, not accounting for
+ * the granularity.  Also an example of how to use hbitmap_iter_next_word.
+ */
+static uint64_t hb_count_between(HBitmap *hb, uint64_t start, uint64_t last)
+{
+    HBitmapIter hbi;
+    uint64_t count = 0;
+    uint64_t end = last + 1;
+    unsigned long cur;
+    size_t pos;
+
+    hbitmap_iter_init(&hbi, hb, start << hb->granularity);
+    for (;;) {
+        pos = hbitmap_iter_next_word(&hbi, &cur);
+        if (pos >= (end >> BITS_PER_LEVEL)) {
+            break;
+        }
+        count += popcountl(cur);
+    }
+
+    if (pos == (end >> BITS_PER_LEVEL)) {
+        /* Drop bits representing the END-th and subsequent items.  */
+        int bit = end & (BITS_PER_LONG - 1);
+        cur &= (1UL << bit) - 1;
+        count += popcountl(cur);
+    }
+
+    return count;
+}
+
+/* Setting starts at the last layer and propagates up if an element
+ * changes from zero to non-zero.
+ */
+static inline bool hb_set_elem(unsigned long *elem, uint64_t start, uint64_t last)
+{
+    unsigned long mask;
+    bool changed;
+
+    assert((last >> BITS_PER_LEVEL) == (start >> BITS_PER_LEVEL));
+    assert(start <= last);
+
+    mask = 2UL << (last & (BITS_PER_LONG - 1));
+    mask -= 1UL << (start & (BITS_PER_LONG - 1));
+    changed = (*elem == 0);
+    *elem |= mask;
+    return changed;
+}
+
+/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)... */
+static void hb_set_between(HBitmap *hb, int level, uint64_t start, uint64_t last)
+{
+    size_t pos = start >> BITS_PER_LEVEL;
+    size_t lastpos = last >> BITS_PER_LEVEL;
+    bool changed = false;
+    size_t i;
+
+    i = pos;
+    if (i < lastpos) {
+        uint64_t next = (start | (BITS_PER_LONG - 1)) + 1;
+        changed |= hb_set_elem(&hb->levels[level][i], start, next - 1);
+        for (;;) {
+            start = next;
+            next += BITS_PER_LONG;
+            if (++i == lastpos) {
+                break;
+            }
+            changed |= (hb->levels[level][i] == 0);
+            hb->levels[level][i] = ~0UL;
+        }
+    }
+    changed |= hb_set_elem(&hb->levels[level][i], start, last);
+
+    /* If there was any change in this layer, we may have to update
+     * the one above.
+     */
+    if (level > 0 && changed) {
+        hb_set_between(hb, level - 1, pos, lastpos);
+    }
+}
+
+void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count)
+{
+    /* Compute range in the last layer.  */
+    uint64_t last = start + count - 1;
+
+    trace_hbitmap_set(hb, start, count,
+                      start >> hb->granularity, last >> hb->granularity);
+
+    start >>= hb->granularity;
+    last >>= hb->granularity;
+    count = last - start + 1;
+
+    hb->count += count - hb_count_between(hb, start, last);
+    hb_set_between(hb, HBITMAP_LEVELS - 1, start, last);
+}
+
+/* Resetting works the other way round: propagate up if the new
+ * value is zero.
+ */
+static inline bool hb_reset_elem(unsigned long *elem, uint64_t start, uint64_t last)
+{
+    unsigned long mask;
+    bool blanked;
+
+    assert((last >> BITS_PER_LEVEL) == (start >> BITS_PER_LEVEL));
+    assert(start <= last);
+
+    mask = 2UL << (last & (BITS_PER_LONG - 1));
+    mask -= 1UL << (start & (BITS_PER_LONG - 1));
+    blanked = *elem != 0 && ((*elem & ~mask) == 0);
+    *elem &= ~mask;
+    return blanked;
+}
+
+/* The recursive workhorse (the depth is limited to HBITMAP_LEVELS)... */
+static void hb_reset_between(HBitmap *hb, int level, uint64_t start, uint64_t last)
+{
+    size_t pos = start >> BITS_PER_LEVEL;
+    size_t lastpos = last >> BITS_PER_LEVEL;
+    bool changed = false;
+    size_t i;
+
+    i = pos;
+    if (i < lastpos) {
+        uint64_t next = (start | (BITS_PER_LONG - 1)) + 1;
+
+        /* Here we need a more complex test than when setting bits.  Even if
+         * something was changed, we must not blank bits in the upper level
+         * unless the lower-level word became entirely zero.  So, remove pos
+         * from the upper-level range if bits remain set.
+         */
+        if (hb_reset_elem(&hb->levels[level][i], start, next - 1)) {
+            changed = true;
+        } else {
+            pos++;
+        }
+
+        for (;;) {
+            start = next;
+            next += BITS_PER_LONG;
+            if (++i == lastpos) {
+                break;
+            }
+            changed |= (hb->levels[level][i] != 0);
+            hb->levels[level][i] = 0UL;
+        }
+    }
+
+    /* Same as above, this time for lastpos.  */
+    if (hb_reset_elem(&hb->levels[level][i], start, last)) {
+        changed = true;
+    } else {
+        lastpos--;
+    }
+
+    if (level > 0 && changed) {
+        hb_reset_between(hb, level - 1, pos, lastpos);
+    }
+}
+
+void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
+{
+    /* Compute range in the last layer.  */
+    uint64_t last = start + count - 1;
+
+    trace_hbitmap_reset(hb, start, count,
+                        start >> hb->granularity, last >> hb->granularity);
+
+    start >>= hb->granularity;
+    last >>= hb->granularity;
+
+    hb->count -= hb_count_between(hb, start, last);
+    hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
+}
+
+bool hbitmap_get(const HBitmap *hb, uint64_t item)
+{
+    /* Compute position and bit in the last layer.  */
+    uint64_t pos = item >> hb->granularity;
+    unsigned long bit = 1UL << (pos & (BITS_PER_LONG - 1));
+
+    return (hb->levels[HBITMAP_LEVELS - 1][pos >> BITS_PER_LEVEL] & bit) != 0;
+}
+
+void hbitmap_free(HBitmap *hb)
+{
+    unsigned i;
+    for (i = HBITMAP_LEVELS; i-- > 0; ) {
+        g_free(hb->levels[i]);
+    }
+    g_free(hb);
+}
+
+HBitmap *hbitmap_alloc(uint64_t size, int granularity)
+{
+    HBitmap *hb = g_malloc0(sizeof(struct HBitmap));
+    unsigned i;
+
+    assert(granularity >= 0 && granularity < 64);
+    size = (size + (1ULL << granularity) - 1) >> granularity;
+    assert(size <= ((uint64_t)1 << HBITMAP_LOG_MAX_SIZE));
+
+    hb->size = size;
+    hb->granularity = granularity;
+    for (i = HBITMAP_LEVELS; i-- > 0; ) {
+        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
+        hb->levels[i] = g_malloc0(size * sizeof(unsigned long));
+    }
+
+    /* We necessarily have free bits in level 0 due to the definition
+     * of HBITMAP_LEVELS, so use one for a sentinel.  This speeds up
+     * hbitmap_iter_skip_words.
+     */
+    assert(size == 1);
+    hb->levels[0][0] |= 1UL << (BITS_PER_LONG - 1);
+    return hb;
+}
diff --git a/hbitmap.h b/hbitmap.h
new file mode 100644
index 0000000..7ddfb66
--- /dev/null
+++ b/hbitmap.h
@@ -0,0 +1,207 @@ 
+/*
+ * Hierarchical Bitmap Data Type
+ *
+ * Copyright Red Hat, Inc., 2012
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef HBITMAP_H
+#define HBITMAP_H 1
+
+#include <limits.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include "bitops.h"
+
+typedef struct HBitmap HBitmap;
+typedef struct HBitmapIter HBitmapIter;
+
+#define BITS_PER_LEVEL         (BITS_PER_LONG == 32 ? 5 : 6)
+
+/* For 32-bit, the largest that fits in a 4 GiB address space.
+ * For 64-bit, the number of sectors in 1 PiB.  Good luck, in
+ * either case... :)
+ */
+#define HBITMAP_LOG_MAX_SIZE   (BITS_PER_LONG == 32 ? 34 : 41)
+
+/* We need to place a sentinel in level 0 to speed up iteration.  Thus,
+ * we do this instead of HBITMAP_LOG_MAX_SIZE / BITS_PER_LEVEL.  The
+ * difference is that it allocates an extra level when HBITMAP_LOG_MAX_SIZE
+ * is an exact multiple of BITS_PER_LEVEL.
+ */
+#define HBITMAP_LEVELS         ((HBITMAP_LOG_MAX_SIZE / BITS_PER_LEVEL) + 1)
+
+struct HBitmapIter {
+    const HBitmap *hb;
+
+    /* Copied from hb for access in the inline functions (hb is opaque).  */
+    int granularity;
+
+    /* Entry offset into the last-level array of longs.  */
+    size_t pos;
+
+    /* The currently-active path in the tree.  Each item of cur[i] stores
+     * the bits (i.e. the subtrees) yet to be processed under that node.
+     */
+    unsigned long cur[HBITMAP_LEVELS];
+};
+
+/**
+ * hbitmap_alloc:
+ * @size: Number of bits in the bitmap.
+ * @granularity: Granularity of the bitmap.  Aligned groups of 2^@granularity
+ * bits will be represented by a single bit.  Each operation on a
+ * range of bits first rounds the bits to determine which group they land
+ * in, and then affect the entire set; iteration will only visit the first
+ * bit of each group.
+ *
+ * Allocate a new HBitmap.
+ */
+HBitmap *hbitmap_alloc(uint64_t size, int granularity);
+
+/**
+ * hbitmap_empty:
+ * @hb: HBitmap to operate on.
+ *
+ * Return whether the bitmap is empty.
+ */
+bool hbitmap_empty(const HBitmap *hb);
+
+/**
+ * hbitmap_granularity:
+ * @hb: HBitmap to operate on.
+ *
+ * Return the granularity of the HBitmap.
+ */
+int hbitmap_granularity(const HBitmap *hb);
+
+/**
+ * hbitmap_count:
+ * @hb: HBitmap to operate on.
+ *
+ * Return the number of bits set in the HBitmap.
+ */
+uint64_t hbitmap_count(const HBitmap *hb);
+
+/**
+ * hbitmap_set:
+ * @hb: HBitmap to operate on.
+ * @start: First bit to set (0-based).
+ * @count: Number of bits to set.
+ *
+ * Set a consecutive range of bits in an HBitmap.
+ */
+void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
+
+/**
+ * hbitmap_reset:
+ * @hb: HBitmap to operate on.
+ * @start: First bit to reset (0-based).
+ * @count: Number of bits to reset.
+ *
+ * Reset a consecutive range of bits in an HBitmap.
+ */
+void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);
+
+/**
+ * hbitmap_get:
+ * @hb: HBitmap to operate on.
+ * @item: Bit to query (0-based).
+ *
+ * Return whether the @item-th bit in an HBitmap is set.
+ */
+bool hbitmap_get(const HBitmap *hb, uint64_t item);
+
+/**
+ * hbitmap_free:
+ * @hb: HBitmap to operate on.
+ *
+ * Free an HBitmap and all of its associated memory.
+ */
+void hbitmap_free(HBitmap *hb);
+
+/**
+ * hbitmap_iter_init:
+ * @hbi: HBitmapIter to initialize.
+ * @hb: HBitmap to iterate on.
+ * @first: First bit to visit (0-based).
+ *
+ * Set up @hbi to iterate on the HBitmap @hb.  hbitmap_iter_next will return
+ * the lowest-numbered bit that is set in @hb, starting at @first.
+ *
+ * Concurrent setting of bits is acceptable, and will at worst cause the
+ * iteration to miss some of those bits.  Resetting bits before the current
+ * position of the iterator is also okay.  However, concurrent resetting of
+ * bits can lead to unexpected behavior if the iterator has not yet reached
+ * those bits.
+ */
+void hbitmap_iter_init(HBitmapIter *hbi, const HBitmap *hb, uint64_t first);
+
+/* hbitmap_iter_skip_words:
+ * @hbi: HBitmapIter to operate on.
+ *
+ * Internal function used by hbitmap_iter_next and hbitmap_iter_next_word.
+ */
+unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi);
+
+/**
+ * hbitmap_iter_next:
+ * @hbi: HBitmapIter to operate on.
+ *
+ * Return the next bit that is set in @hbi's associated HBitmap,
+ * or -1 if all remaining bits are zero.
+ */
+static inline int64_t hbitmap_iter_next(HBitmapIter *hbi)
+{
+    unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1];
+    int64_t item;
+
+    if (cur == 0) {
+        cur = hbitmap_iter_skip_words(hbi);
+        if (cur == 0) {
+            return -1;
+        }
+    }
+
+    /* The next call will resume work from the next bit.  */
+    hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
+    item = ((uint64_t)hbi->pos << BITS_PER_LEVEL) + ffsl(cur) - 1;
+
+    return item << hbi->granularity;
+}
+
+/**
+ * hbitmap_iter_next_word:
+ * @hbi: HBitmapIter to operate on.
+ * @p_cur: Location where to store the next non-zero word.
+ *
+ * Return the index of the next nonzero word that is set in @hbi's
+ * associated HBitmap, and set *p_cur to the content of that word
+ * (bits before the index that was passed to hbitmap_iter_init are
+ * trimmed on the first call).  Return -1, and set *p_cur to zero,
+ * if all remaining words are zero.
+ */
+static inline size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long *p_cur)
+{
+    unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1];
+
+    if (cur == 0) {
+        cur = hbitmap_iter_skip_words(hbi);
+        if (cur == 0) {
+            *p_cur = 0;
+            return -1;
+        }
+    }
+
+    /* The next call will resume work from the next word.  */
+    hbi->cur[HBITMAP_LEVELS - 1] = 0;
+    *p_cur = cur;
+    return hbi->pos;
+}
+
+
+#endif
diff --git a/tests/Makefile b/tests/Makefile
index b60f0fb..b04ecb4 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -17,6 +17,7 @@  check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 check-unit-y += tests/test-aio$(EXESUF)
 check-unit-y += tests/test-thread-pool$(EXESUF)
+check-unit-y += tests/test-hbitmap$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -54,6 +55,7 @@  tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(coroutine-obj-y) $(tools
 tests/test-aio$(EXESUF): tests/test-aio.o $(coroutine-obj-y) $(tools-obj-y) $(block-obj-y) libqemustub.a
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(coroutine-obj-y) $(tools-obj-y) $(block-obj-y) libqemustub.a
 tests/test-iov$(EXESUF): tests/test-iov.o iov.o
+tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o hbitmap.o $(trace-obj-y)
 
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/qapi-schema-test.json $(SRC_PATH)/scripts/qapi-types.py
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
new file mode 100644
index 0000000..b5d76a7
--- /dev/null
+++ b/tests/test-hbitmap.c
@@ -0,0 +1,408 @@ 
+/*
+ * Hierarchical bitmap unit-tests.
+ *
+ * Copyright (C) 2012 Red Hat Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include <glib.h>
+#include <stdarg.h>
+#include "hbitmap.h"
+
+#define LOG_BITS_PER_LONG          (BITS_PER_LONG == 32 ? 5 : 6)
+
+#define L1                         BITS_PER_LONG
+#define L2                         (BITS_PER_LONG * L1)
+#define L3                         (BITS_PER_LONG * L2)
+
+typedef struct TestHBitmapData {
+    HBitmap       *hb;
+    unsigned long *bits;
+    size_t         size;
+    int            granularity;
+} TestHBitmapData;
+
+
+/* Check that the HBitmap and the shadow bitmap contain the same data,
+ * ignoring the same "first" bits.
+ */
+static void hbitmap_test_check(TestHBitmapData *data,
+                               uint64_t first)
+{
+    uint64_t count = 0;
+    size_t pos;
+    int bit;
+    HBitmapIter hbi;
+    int64_t i, next;
+
+    hbitmap_iter_init(&hbi, data->hb, first);
+
+    i = first;
+    for (;;) {
+        next = hbitmap_iter_next(&hbi);
+        if (next < 0) {
+            next = data->size;
+        }
+
+        while (i < next) {
+            pos = i >> LOG_BITS_PER_LONG;
+            bit = i & (BITS_PER_LONG - 1);
+            i++;
+            g_assert_cmpint(data->bits[pos] & (1UL << bit), ==, 0);
+        }
+
+        if (next == data->size) {
+            break;
+        }
+
+        pos = i >> LOG_BITS_PER_LONG;
+        bit = i & (BITS_PER_LONG - 1);
+        i++;
+        count++;
+        g_assert_cmpint(data->bits[pos] & (1UL << bit), !=, 0);
+    }
+
+    if (first == 0) {
+        g_assert_cmpint(count << data->granularity, ==, hbitmap_count(data->hb));
+    }
+}
+
+/* This is provided instead of a test setup function so that the sizes
+   are kept in the test functions (and not in main()) */
+static void hbitmap_test_init(TestHBitmapData *data,
+                              uint64_t size, int granularity)
+{
+    size_t n;
+    data->hb = hbitmap_alloc(size, granularity);
+
+    n = (size + BITS_PER_LONG - 1) / BITS_PER_LONG;
+    if (n == 0) {
+        n = 1;
+    }
+    data->bits = g_new0(unsigned long, n);
+    data->size = size;
+    data->granularity = granularity;
+    hbitmap_test_check(data, 0);
+}
+
+static void hbitmap_test_teardown(TestHBitmapData *data,
+                                  const void *unused)
+{
+    if (data->hb) {
+        hbitmap_free(data->hb);
+        data->hb = NULL;
+    }
+    if (data->bits) {
+        g_free(data->bits);
+        data->bits = NULL;
+    }
+}
+
+/* Set a range in the HBitmap and in the shadow "simple" bitmap.
+ * The two bitmaps are then tested against each other.
+ */
+static void hbitmap_test_set(TestHBitmapData *data,
+                             uint64_t first, uint64_t count)
+{
+    hbitmap_set(data->hb, first, count);
+    while (count-- != 0) {
+        size_t pos = first >> LOG_BITS_PER_LONG;
+        int bit = first & (BITS_PER_LONG - 1);
+        first++;
+
+        data->bits[pos] |= 1UL << bit;
+    }
+
+    if (data->granularity == 0) {
+        hbitmap_test_check(data, 0);
+    }
+}
+
+/* Reset a range in the HBitmap and in the shadow "simple" bitmap.
+ */
+static void hbitmap_test_reset(TestHBitmapData *data,
+                               uint64_t first, uint64_t count)
+{
+    hbitmap_reset(data->hb, first, count);
+    while (count-- != 0) {
+        size_t pos = first >> LOG_BITS_PER_LONG;
+        int bit = first & (BITS_PER_LONG - 1);
+        first++;
+
+        data->bits[pos] &= ~(1UL << bit);
+    }
+
+    if (data->granularity == 0) {
+        hbitmap_test_check(data, 0);
+    }
+}
+
+static void hbitmap_test_check_get(TestHBitmapData *data)
+{
+    uint64_t count = 0;
+    uint64_t i;
+
+    for (i = 0; i < data->size; i++) {
+        size_t pos = i >> LOG_BITS_PER_LONG;
+        int bit = i & (BITS_PER_LONG - 1);
+        unsigned long val = data->bits[pos] & (1UL << bit);
+        count += hbitmap_get(data->hb, i);
+        g_assert_cmpint(hbitmap_get(data->hb, i), ==, val != 0);
+    }
+    g_assert_cmpint(count, ==, hbitmap_count(data->hb));
+}
+
+static void test_hbitmap_zero(TestHBitmapData *data,
+                               const void *unused)
+{
+    hbitmap_test_init(data, 0, 0);
+}
+
+static void test_hbitmap_unaligned(TestHBitmapData *data,
+                                   const void *unused)
+{
+    hbitmap_test_init(data, L3 + 23, 0);
+    hbitmap_test_set(data, 0, 1);
+    hbitmap_test_set(data, L3 + 22, 1);
+}
+
+static void test_hbitmap_iter_empty(TestHBitmapData *data,
+                                    const void *unused)
+{
+    hbitmap_test_init(data, L1, 0);
+}
+
+static void test_hbitmap_iter_partial(TestHBitmapData *data,
+                                      const void *unused)
+{
+    hbitmap_test_init(data, L3, 0);
+    hbitmap_test_set(data, 0, L3);
+    hbitmap_test_check(data, 1);
+    hbitmap_test_check(data, L1 - 1);
+    hbitmap_test_check(data, L1);
+    hbitmap_test_check(data, L1 * 2 - 1);
+    hbitmap_test_check(data, L2 - 1);
+    hbitmap_test_check(data, L2);
+    hbitmap_test_check(data, L2 + 1);
+    hbitmap_test_check(data, L2 + L1);
+    hbitmap_test_check(data, L2 + L1 * 2 - 1);
+    hbitmap_test_check(data, L2 * 2 - 1);
+    hbitmap_test_check(data, L2 * 2);
+    hbitmap_test_check(data, L2 * 2 + 1);
+    hbitmap_test_check(data, L2 * 2 + L1);
+    hbitmap_test_check(data, L2 * 2 + L1 * 2 - 1);
+    hbitmap_test_check(data, L3 / 2);
+}
+
+static void test_hbitmap_iter_past(TestHBitmapData *data,
+                                    const void *unused)
+{
+    hbitmap_test_init(data, L3, 0);
+    hbitmap_test_set(data, 0, L3);
+    hbitmap_test_check(data, L3);
+}
+
+static void test_hbitmap_set_all(TestHBitmapData *data,
+                                 const void *unused)
+{
+    hbitmap_test_init(data, L3, 0);
+    hbitmap_test_set(data, 0, L3);
+}
+
+static void test_hbitmap_get_all(TestHBitmapData *data,
+                                 const void *unused)
+{
+    hbitmap_test_init(data, L3, 0);
+    hbitmap_test_set(data, 0, L3);
+    hbitmap_test_check_get(data);
+}
+
+static void test_hbitmap_get_some(TestHBitmapData *data,
+                                  const void *unused)
+{
+    hbitmap_test_init(data, 2 * L2, 0);
+    hbitmap_test_set(data, 10, 1);
+    hbitmap_test_check_get(data);
+    hbitmap_test_set(data, L1 - 1, 1);
+    hbitmap_test_check_get(data);
+    hbitmap_test_set(data, L1, 1);
+    hbitmap_test_check_get(data);
+    hbitmap_test_set(data, L2 - 1, 1);
+    hbitmap_test_check_get(data);
+    hbitmap_test_set(data, L2, 1);
+    hbitmap_test_check_get(data);
+}
+
+static void test_hbitmap_set_one(TestHBitmapData *data,
+                                 const void *unused)
+{
+    hbitmap_test_init(data, 2 * L2, 0);
+    hbitmap_test_set(data, 10, 1);
+    hbitmap_test_set(data, L1 - 1, 1);
+    hbitmap_test_set(data, L1, 1);
+    hbitmap_test_set(data, L2 - 1, 1);
+    hbitmap_test_set(data, L2, 1);
+}
+
+static void test_hbitmap_set_two_elem(TestHBitmapData *data,
+                                      const void *unused)
+{
+    hbitmap_test_init(data, 2 * L2, 0);
+    hbitmap_test_set(data, L1 - 1, 2);
+    hbitmap_test_set(data, L1 * 2 - 1, 4);
+    hbitmap_test_set(data, L1 * 4, L1 + 1);
+    hbitmap_test_set(data, L1 * 8 - 1, L1 + 1);
+    hbitmap_test_set(data, L2 - 1, 2);
+    hbitmap_test_set(data, L2 + L1 - 1, 8);
+    hbitmap_test_set(data, L2 + L1 * 4, L1 + 1);
+    hbitmap_test_set(data, L2 + L1 * 8 - 1, L1 + 1);
+}
+
+static void test_hbitmap_set(TestHBitmapData *data,
+                             const void *unused)
+{
+    hbitmap_test_init(data, L3 * 2, 0);
+    hbitmap_test_set(data, L1 - 1, L1 + 2);
+    hbitmap_test_set(data, L1 * 3 - 1, L1 + 2);
+    hbitmap_test_set(data, L1 * 5, L1 * 2 + 1);
+    hbitmap_test_set(data, L1 * 8 - 1, L1 * 2 + 1);
+    hbitmap_test_set(data, L2 - 1, L1 + 2);
+    hbitmap_test_set(data, L2 + L1 * 2 - 1, L1 + 2);
+    hbitmap_test_set(data, L2 + L1 * 4, L1 * 2 + 1);
+    hbitmap_test_set(data, L2 + L1 * 7 - 1, L1 * 2 + 1);
+    hbitmap_test_set(data, L2 * 2 - 1, L3 * 2 - L2 * 2);
+}
+
+static void test_hbitmap_set_twice(TestHBitmapData *data,
+                                   const void *unused)
+{
+    hbitmap_test_init(data, L1 * 3, 0);
+    hbitmap_test_set(data, 0, L1 * 3);
+    hbitmap_test_set(data, L1, 1);
+}
+
+static void test_hbitmap_set_overlap(TestHBitmapData *data,
+                                     const void *unused)
+{
+    hbitmap_test_init(data, L3 * 2, 0);
+    hbitmap_test_set(data, L1 - 1, L1 + 2);
+    hbitmap_test_set(data, L1 * 2 - 1, L1 * 2 + 2);
+    hbitmap_test_set(data, 0, L1 * 3);
+    hbitmap_test_set(data, L1 * 8 - 1, L2);
+    hbitmap_test_set(data, L2, L1);
+    hbitmap_test_set(data, L2 - L1 - 1, L1 * 8 + 2);
+    hbitmap_test_set(data, L2, L3 - L2 + 1);
+    hbitmap_test_set(data, L3 - L1, L1 * 3);
+    hbitmap_test_set(data, L3 - 1, 3);
+    hbitmap_test_set(data, L3 - 1, L2);
+}
+
+static void test_hbitmap_reset_empty(TestHBitmapData *data,
+                                     const void *unused)
+{
+    hbitmap_test_init(data, L3, 0);
+    hbitmap_test_reset(data, 0, L3);
+}
+
+static void test_hbitmap_reset(TestHBitmapData *data,
+                               const void *unused)
+{
+    hbitmap_test_init(data, L3 * 2, 0);
+    hbitmap_test_set(data, L1 - 1, L1 + 2);
+    hbitmap_test_reset(data, L1 * 2 - 1, L1 * 2 + 2);
+    hbitmap_test_set(data, 0, L1 * 3);
+    hbitmap_test_reset(data, L1 * 8 - 1, L2);
+    hbitmap_test_set(data, L2, L1);
+    hbitmap_test_reset(data, L2 - L1 - 1, L1 * 8 + 2);
+    hbitmap_test_set(data, L2, L3 - L2 + 1);
+    hbitmap_test_reset(data, L3 - L1, L1 * 3);
+    hbitmap_test_set(data, L3 - 1, 3);
+    hbitmap_test_reset(data, L3 - 1, L2);
+    hbitmap_test_set(data, 0, L3 * 2);
+    hbitmap_test_reset(data, 0, L1);
+    hbitmap_test_reset(data, 0, L2);
+    hbitmap_test_reset(data, L3, L3);
+    hbitmap_test_set(data, L3 / 2, L3);
+}
+
+static void test_hbitmap_granularity(TestHBitmapData *data,
+                                     const void *unused)
+{
+    /* Note that hbitmap_test_check has to be invoked manually in this test.  */
+    hbitmap_test_init(data, L1, 1);
+    hbitmap_test_set(data, 0, 1);
+    g_assert_cmpint(hbitmap_count(data->hb), ==, 2);
+    hbitmap_test_check(data, 0);
+    hbitmap_test_set(data, 2, 1);
+    g_assert_cmpint(hbitmap_count(data->hb), ==, 4);
+    hbitmap_test_check(data, 0);
+    hbitmap_test_set(data, 0, 3);
+    g_assert_cmpint(hbitmap_count(data->hb), ==, 4);
+    hbitmap_test_reset(data, 0, 1);
+    g_assert_cmpint(hbitmap_count(data->hb), ==, 2);
+}
+
+static void test_hbitmap_iter_granularity(TestHBitmapData *data,
+                                          const void *unused)
+{
+    HBitmapIter hbi;
+
+    /* Note that hbitmap_test_check has to be invoked manually in this test.  */
+    hbitmap_test_init(data, 131072 << 7, 7);
+    hbitmap_iter_init(&hbi, data->hb, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+
+    hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
+    hbitmap_iter_init(&hbi, data->hb, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+
+    hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+
+    hbitmap_test_set(data, (131072 << 7) - 8, 8);
+    hbitmap_iter_init(&hbi, data->hb, 0);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+
+    hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
+    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
+}
+
+static void hbitmap_test_add(const char *testpath,
+                                   void (*test_func)(TestHBitmapData *data, const void *user_data))
+{
+    g_test_add(testpath, TestHBitmapData, NULL, NULL, test_func,
+               hbitmap_test_teardown);
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+    hbitmap_test_add("/hbitmap/size/0", test_hbitmap_zero);
+    hbitmap_test_add("/hbitmap/size/unaligned", test_hbitmap_unaligned);
+    hbitmap_test_add("/hbitmap/iter/empty", test_hbitmap_iter_empty);
+    hbitmap_test_add("/hbitmap/iter/past", test_hbitmap_iter_past);
+    hbitmap_test_add("/hbitmap/iter/partial", test_hbitmap_iter_partial);
+    hbitmap_test_add("/hbitmap/iter/granularity", test_hbitmap_iter_granularity);
+    hbitmap_test_add("/hbitmap/get/all", test_hbitmap_get_all);
+    hbitmap_test_add("/hbitmap/get/some", test_hbitmap_get_some);
+    hbitmap_test_add("/hbitmap/set/all", test_hbitmap_set_all);
+    hbitmap_test_add("/hbitmap/set/one", test_hbitmap_set_one);
+    hbitmap_test_add("/hbitmap/set/two-elem", test_hbitmap_set_two_elem);
+    hbitmap_test_add("/hbitmap/set/general", test_hbitmap_set);
+    hbitmap_test_add("/hbitmap/set/twice", test_hbitmap_set_twice);
+    hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
+    hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
+    hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
+    hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
+    g_test_run();
+
+    return 0;
+}
diff --git a/trace-events b/trace-events
index 6c6cbf1..93e8901 100644
--- a/trace-events
+++ b/trace-events
@@ -1022,3 +1022,8 @@  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %
 spapr_pci_rtas_ibm_query_interrupt_source_number(unsigned ioa, unsigned intr) "queries for #%u, IRQ%u"
 spapr_pci_msi_write(uint64_t addr, uint64_t data, uint32_t dt_irq) "@%"PRIx64"<=%"PRIx64" IRQ %u"
 spapr_pci_lsi_set(const char *busname, int pin, uint32_t irq) "%s PIN%d IRQ %u"
+
+# hbitmap.c
+hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long cur) "hb %p hbi %p pos %"PRId64" cur 0x%lx"
+hbitmap_reset(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
+hbitmap_set(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64