
hw/virtio/balloon: Fixes for different host page sizes

Message ID 1460548364-27469-1-git-send-email-thuth@redhat.com
State New

Commit Message

Thomas Huth April 13, 2016, 11:52 a.m. UTC
The balloon code currently calls madvise() with TARGET_PAGE_SIZE
as length parameter, and an address which is directly based on
the page address supplied by the guest. Since the virtio-balloon
protocol is always based on 4k addresses/sizes, no matter
what the host and guest are using as page size, this has a couple
of issues which could even lead to data loss in the worst case.

TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
4k, we also destroy the 4k areas after the current one - which
might be wrong since the guest did not want to free that area yet (in
case the guest used a smaller MMU page size than the hard-coded
TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
parameter for the madvise() call instead.

Then, there's yet another problem: If the host page size is bigger
than the 4k balloon page size, we can not simply call madvise() on
each of the 4k balloon addresses that we get from the guest - since
the madvise() always evicts the whole host page, not only a 4k area!
So in this case, we've got to track the 4k fragments of a host page
and only call madvise(DONTNEED) when all fragments have been collected.
This of course only works fine if the guest sends consecutive 4k
fragments - which is the case in the most important scenarios that
I try to address here (like a ppc64 guest with 64k page size that
is running on a ppc64 host with 64k page size). In case the guest
uses a page size that is smaller than the host page size, we might
need to add some additional logic here later to increase the
probability of being able to release memory, but at least the guest
should no longer crash due to unintentionally evicted pages.
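
As a concrete example of the data loss: take a 4k-page-size guest on a
64k-page-size ppc64 host, where the guest balloons a single 4k page that
happens to sit at a 64k-aligned host address. The old code does

        qemu_madvise(addr, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);

and the host kernel rounds the 4k length up to its 64k page size and
discards the entire host page - silently zero-filling the 15 neighbouring
4k guest pages that are still in use.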

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 I've tested this patch with both a 4k page size ppc64 guest
 and a 64k page size ppc64 guest on a 64k page size ppc64 host.
 With this patch applied, I was not able to crash the guests
 anymore (the 4k guest easily crashes without this patch).
 And looking at the output of the "free" command on the host,
 the ballooning also still works as expected.

 hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
 include/hw/virtio/virtio-balloon.h |  3 ++
 2 files changed, 65 insertions(+), 6 deletions(-)

Comments

Andrew Jones April 13, 2016, 12:37 p.m. UTC | #1
On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
> 
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want to free that area yet (in
> case the guest used a smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
> parameter for the madvise() call instead.
> 
> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we can not simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> the madvise() always evicts the whole host page, not only a 4k area!
> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> This of course only works fine if the guest sends consecutive 4k
> fragments - which is the case in the most important scenarios that
> I try to address here (like a ppc64 guest with 64k page size that
> is running on a ppc64 host with 64k page size). In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should no longer crash due to unintentionally evicted pages.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  I've tested this patch with both a 4k page size ppc64 guest
>  and a 64k page size ppc64 guest on a 64k page size ppc64 host.
>  With this patch applied, I was not able to crash the guests
>  anymore (the 4k guest easily crashes without this patch).
>  And looking at the output of the "free" command on the host,
>  the ballooning also still works as expected.
> 
>  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index c74101e..886faa8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,13 +35,56 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>  
> -static void balloon_page(void *addr, int deflate)
> +#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
> +
> +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    size_t host_page_size;
> +    void *aligned_addr;
> +
> +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> +        return;
> +    }
> +
> +    host_page_size = getpagesize();
> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> +
> +    /* Host page size > balloon page size ==> use aligned host address */
> +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> +    if (deflate) {
> +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> +         * also be able to use the memory again without this call, so let's
> +         * only do it for the first, aligned fragment of a host page and
> +         * ignore it for the others.
> +         */
> +        if (addr == aligned_addr) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> +        }
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    } else {
> +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> +        /* If we end up here, that means we want to evict balloon pages, but
> +         * the host's page size is bigger than the 4k pages from the balloon.
> +         * Since madvise() only works on the granularity of the host page size,
> +         * we've got to collect all the 4k fragments from the guest first
> +         * before we can issue the MADV_DONTNEED call.
> +         */
> +        if (aligned_addr != s->current_addr) {
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;

So here's where we blow away a partial frag list when the guest advertises
pages in non-sequential order. Obviously not ideal, since in the worst case
we'll never call madvise-dontneed. However, this patch fixes a problem, and
likely works with normal guest behavior, so we're probably fine. It might
be nice to have some accounting to measure how often this happens, though.
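
Something minimal would already help, e.g. bumping a counter whenever a
partially collected host page gets dropped (just a sketch against this
patch; "partial_resets" is a hypothetical field, not part of the series):

        if (aligned_addr != s->current_addr) {
            if (s->current_addr != BALLOON_NO_CURRENT_PAGE) {
                /* an incomplete fragment list is being thrown away */
                s->partial_resets++;    /* hypothetical counter */
            }
            memset(s->fragment_bits, 0, s->fragment_bits_size);
            s->current_addr = aligned_addr;
        }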

> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
>      }
>  #endif
>  }
> @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>                 should be OK because we only want a single page.  */
>              addr = section.offset_within_region;
> -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> +            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
> @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
> +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> +                                 + sizeof(long) * 8 - 1) / 8;
> +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
> +
>      reset_stats(s);
>  
>      register_savevm(dev, "virtio-balloon", -1, 1,
> @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    g_free(s->fragment_bits);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
>          g_free(s->stats_vq_elem);
>          s->stats_vq_elem = NULL;
>      }
> +
> +    if (s->fragment_bits) {
> +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
>  }
>  
>  static void virtio_balloon_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..04b7c0c 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;
>  
>  #endif
> -- 
> 1.8.3.1
> 
>

Other than the accounting suggestion which can be taken or left,

Reviewed-by: Andrew Jones <drjones@redhat.com>

drew
Michael S. Tsirkin April 13, 2016, 1:15 p.m. UTC | #2
On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
> 
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want to free that area yet (in
> case the guest used a smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
> parameter for the madvise() call instead.

this makes absolute sense.


> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we can not simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> the madvise() always evicts the whole host page, not only a 4k area!

Does it really round length up?
Wow, it does:
        len = (len_in + ~PAGE_MASK) & PAGE_MASK;

which seems to be undocumented, but has been there forever.
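
Concretely, on a 64k-page kernel (PAGE_MASK == ~0xffffUL) a 4k request
gets rounded up to the full host page:

        len = (4096 + 0xffff) & ~0xffff = 0x10000  /* 64k */

so MADV_DONTNEED on one aligned balloon page throws away the whole 64k
host page.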


> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> This of course only works fine if the guest sends consecutive 4k
> fragments - which is the case in the most important scenarios that
> I try to address here (like a ppc64 guest with 64k page size that
> is running on a ppc64 host with 64k page size). In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should no longer crash due to unintentionally evicted pages.
> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  I've tested this patch with both a 4k page size ppc64 guest
>  and a 64k page size ppc64 guest on a 64k page size ppc64 host.
>  With this patch applied, I was not able to crash the guests
>  anymore (the 4k guest easily crashes without this patch).
>  And looking at the output of the "free" command on the host,
>  the ballooning also still works as expected.
> 
>  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index c74101e..886faa8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,13 +35,56 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>  
> -static void balloon_page(void *addr, int deflate)
> +#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
> +
> +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    size_t host_page_size;
> +    void *aligned_addr;
> +
> +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> +        return;
> +    }
> +
> +    host_page_size = getpagesize();
> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> +
> +    /* Host page size > balloon page size ==> use aligned host address */
> +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> +    if (deflate) {
> +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> +         * also be able to use the memory again without this call, so let's
> +         * only do it for the first, aligned fragment of a host page and
> +         * ignore it for the others.
> +         */
> +        if (addr == aligned_addr) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> +        }
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    } else {
> +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> +        /* If we end up here, that means we want to evict balloon pages, but
> +         * the host's page size is bigger than the 4k pages from the balloon.
> +         * Since madvise() only works on the granularity of the host page size,
> +         * we've got to collect all the 4k fragments from the guest first
> +         * before we can issue the MADV_DONTNEED call.
> +         */
> +        if (aligned_addr != s->current_addr) {
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;
> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
>      }
>  #endif
>  }
> @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>                 should be OK because we only want a single page.  */
>              addr = section.offset_within_region;
> -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> +            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
> @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
> +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> +                                 + sizeof(long) * 8 - 1) / 8;
> +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
> +
>      reset_stats(s);
>  
>      register_savevm(dev, "virtio-balloon", -1, 1,
> @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    g_free(s->fragment_bits);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
>          g_free(s->stats_vq_elem);
>          s->stats_vq_elem = NULL;
>      }
> +
> +    if (s->fragment_bits) {
> +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
>  }
>  
>  static void virtio_balloon_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..04b7c0c 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;
>  
>  #endif
> -- 
> 1.8.3.1


It looks like fragment_bits would have to be migrated.
Which is a lot of complexity.
And workarounds for specific guest behaviour are really ugly.
There are patches on-list to maintain a balloon bitmap -
that will enable fixing it cleanly.
How about we just skip madvise if host page size is > balloon
page size, for 2.6?
Thomas Huth April 13, 2016, 2:51 p.m. UTC | #3
On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
>> as length parameter, and an address which is directly based on
>> the page address supplied by the guest. Since the virtio-balloon
>> protocol is always based on 4k addresses/sizes, no matter
>> what the host and guest are using as page size, this has a couple
>> of issues which could even lead to data loss in the worst case.
>>
>> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
>> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
>> 4k, we also destroy the 4k areas after the current one - which
>> might be wrong since the guest did not want to free that area yet (in
>> case the guest used a smaller MMU page size than the hard-coded
>> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
>> called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
>> parameter for the madvise() call instead.
> 
> this makes absolute sense.
> 
>> Then, there's yet another problem: If the host page size is bigger
>> than the 4k balloon page size, we can not simply call madvise() on
>> each of the 4k balloon addresses that we get from the guest - since
>> the madvise() always evicts the whole host page, not only a 4k area!
> 
> Does it really round length up?
> Wow, it does:
>         len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> 
> which seems to be undocumented, but has been there forever.

Yes, that's ugly - I also had to take a look at the kernel sources to
understand what this call is supposed to do when being called with a
size < PAGE_SIZE.

>> So in this case, we've got to track the 4k fragments of a host page
>> and only call madvise(DONTNEED) when all fragments have been collected.
>> This of course only works fine if the guest sends consecutive 4k
>> fragments - which is the case in the most important scenarios that
>> I try to address here (like a ppc64 guest with 64k page size that
>> is running on a ppc64 host with 64k page size). In case the guest
>> uses a page size that is smaller than the host page size, we might
>> need to add some additional logic here later to increase the
>> probability of being able to release memory, but at least the guest
>> should no longer crash due to unintentionally evicted pages.
...
>>  static void virtio_balloon_instance_init(Object *obj)
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index 35f62ac..04b7c0c 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>      int64_t stats_last_update;
>>      int64_t stats_poll_interval;
>>      uint32_t host_features;
>> +    void *current_addr;
>> +    unsigned long *fragment_bits;
>> +    int fragment_bits_size;
>>  } VirtIOBalloon;
>>  
>>  #endif
>> -- 
>> 1.8.3.1
> 
> 
> It looks like fragment_bits would have to be migrated.
> Which is a lot of complexity.

I think we could simply omit this for now. In case somebody migrates the
VM while the ballooning is going on, we'd loose the information for one
host page, so we might miss one madvise(DONTNEED), but I think we could
live with that.

> And workarounds for specific guest behaviour are really ugly.
> There are patches on-list to maintain a balloon bitmap -
> that will enable fixing it cleanly.

Ah, good to know, I wasn't aware of them yet, so that will be a chance
for a really proper final solution, I hope.

> How about we just skip madvise if host page size is > balloon
> page size, for 2.6?

That would mean a regression compared to what we have today. Currently,
the ballooning is working OK for 64k guests on a 64k ppc host - rather
by chance than on purpose, but it's working. The guest is always sending
all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
for every one of them, but the kernel is ignoring madvise() on
non-64k-aligned addresses, so we end up with a situation where the
madvise() frees a whole 64k page which is also declared as free by the
guest.
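
To illustrate: for one 64k guest page the guest sends 16 consecutive 4k
PFNs, so QEMU currently ends up doing roughly

        madvise(base + 0x0000, 4096, MADV_DONTNEED);  /* aligned: kernel rounds len up to 64k */
        madvise(base + 0x1000, 4096, MADV_DONTNEED);  /* EINVAL: addr not host-page-aligned */
        ...
        madvise(base + 0xf000, 4096, MADV_DONTNEED);  /* EINVAL */

Only the first, aligned call succeeds, and the kernel's length rounding
extends it to exactly the 64k page that the guest declared free.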

I think we should either take this patch as it is right now (without
adding extra code for migration) and later update it to the bitmap code
by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
fix it properly after the bitmap code has been applied. But disabling
the balloon code for 64k guests on 64k hosts completely does not sound
very appealing to me. What do you think?

 Thomas
Michael S. Tsirkin April 13, 2016, 5:07 p.m. UTC | #4
On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> > On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> >> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> >> as length parameter, and an address which is directly based on
> >> the page address supplied by the guest. Since the virtio-balloon
> >> protocol is always based on 4k based addresses/sizes, no matter
> >> what the host and guest are using as page size, this has a couple
> >> of issues which could even lead to data loss in the worst case.
> >>
> >> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> >> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> >> 4k, we also destroy the 4k areas after the current one - which
> >> might be wrong since the guest did not want to free that area yet (in
> >> case the guest used a smaller MMU page size than the hard-coded
> >> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> >> called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
> >> parameter for the madvise() call instead.
> > 
> > this makes absolute sense.
> > 
> >> Then, there's yet another problem: If the host page size is bigger
> >> than the 4k balloon page size, we can not simply call madvise() on
> >> each of the 4k balloon addresses that we get from the guest - since
> >> the madvise() always evicts the whole host page, not only a 4k area!
> > 
> > Does it really round length up?
> > Wow, it does:
> >         len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> > 
> > which seems to be undocumented, but has been there forever.
> 
> Yes, that's ugly - I also had to take a look at the kernel sources to
> understand what this call is supposed to do when being called with a
> size < PAGE_SIZE.
> 
> >> So in this case, we've got to track the 4k fragments of a host page
> >> and only call madvise(DONTNEED) when all fragments have been collected.
> >> This of course only works fine if the guest sends consecutive 4k
> >> fragments - which is the case in the most important scenarios that
> >> I try to address here (like a ppc64 guest with 64k page size that
> >> is running on a ppc64 host with 64k page size). In case the guest
> >> uses a page size that is smaller than the host page size, we might
> >> need to add some additional logic here later to increase the
> >> probability of being able to release memory, but at least the guest
> >> should no longer crash due to unintentionally evicted pages.
> ...
> >>  static void virtio_balloon_instance_init(Object *obj)
> >> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >> index 35f62ac..04b7c0c 100644
> >> --- a/include/hw/virtio/virtio-balloon.h
> >> +++ b/include/hw/virtio/virtio-balloon.h
> >> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >>      int64_t stats_last_update;
> >>      int64_t stats_poll_interval;
> >>      uint32_t host_features;
> >> +    void *current_addr;
> >> +    unsigned long *fragment_bits;
> >> +    int fragment_bits_size;
> >>  } VirtIOBalloon;
> >>  
> >>  #endif
> >> -- 
> >> 1.8.3.1
> > 
> > 
> > It looks like fragment_bits would have to be migrated.
> > Which is a lot of complexity.
> 
> I think we could simply omit this for now. In case somebody migrates the
> VM while the ballooning is going on, we'd loose the information for one
> host page, so we might miss one madvise(DONTNEED), but I think we could
> live with that.
> 
> > And workarounds for specific guest behaviour are really ugly.
> > There are patches on-list to maintain a balloon bitmap -
> > that will enable fixing it cleanly.
> 
> Ah, good to know, I wasn't aware of them yet, so that will be a chance
> for a really proper final solution, I hope.
> 
> > How about we just skip madvise if host page size is > balloon
> > page size, for 2.6?
> 
> That would mean a regression compared to what we have today. Currently,
> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> by chance than on purpose, but it's working. The guest is always sending
> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> for every one of them, but the kernel is ignoring madvise() on
> non-64k-aligned addresses, so we end up with a situation where the
> madvise() frees a whole 64k page which is also declared as free by the
> guest.
> 
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later update it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
> 
>  Thomas

True. As a simple hack - how about disabling madvise when host page size >
target page size?
Thomas Huth April 13, 2016, 5:38 p.m. UTC | #5
On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
...
>>>> Then, there's yet another problem: If the host page size is bigger
>>>> than the 4k balloon page size, we can not simply call madvise() on
>>>> each of the 4k balloon addresses that we get from the guest - since
>>>> the madvise() always evicts the whole host page, not only a 4k area!
>>>>
>>>> So in this case, we've got to track the 4k fragments of a host page
>>>> and only call madvise(DONTNEED) when all fragments have been collected.
>>>> This of course only works fine if the guest sends consecutive 4k
>>>> fragments - which is the case in the most important scenarios that
>>>> I try to address here (like a ppc64 guest with 64k page size that
>>>> is running on a ppc64 host with 64k page size). In case the guest
>>>> uses a page size that is smaller than the host page size, we might
>>>> need to add some additional logic here later to increase the
>>>> probability of being able to release memory, but at least the guest
>>>> should no longer crash due to unintentionally evicted pages.
>> ...
>>>>  static void virtio_balloon_instance_init(Object *obj)
>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>>> index 35f62ac..04b7c0c 100644
>>>> --- a/include/hw/virtio/virtio-balloon.h
>>>> +++ b/include/hw/virtio/virtio-balloon.h
>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>>      int64_t stats_last_update;
>>>>      int64_t stats_poll_interval;
>>>>      uint32_t host_features;
>>>> +    void *current_addr;
>>>> +    unsigned long *fragment_bits;
>>>> +    int fragment_bits_size;
>>>>  } VirtIOBalloon;
>>>>  
>>>>  #endif
>>>
>>> It looks like fragment_bits would have to be migrated.
>>> Which is a lot of complexity.
...
>>> How about we just skip madvise if host page size is > balloon
>>> page size, for 2.6?
>>
>> That would mean a regression compared to what we have today. Currently,
>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
>> by chance than on purpose, but it's working. The guest is always sending
>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
>> for every one of them, but the kernel is ignoring madvise() on
>> non-64k-aligned addresses, so we end up with a situation where the
>> madvise() frees a whole 64k page which is also declared as free by the
>> guest.
>>
>> I think we should either take this patch as it is right now (without
>> adding extra code for migration) and later update it to the bitmap code
>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
>> fix it properly after the bitmap code has been applied. But disabling
>> the balloon code for 64k guests on 64k hosts completely does not sound
>> very appealing to me. What do you think?
>>
>>  Thomas
> 
> True. As a simple hack - how about disabling madvise when host page size >
> target page size?

That could work - but is there a generic way in QEMU to get the current
page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
Or would that mean polluting the virtio-balloon code with ugly #ifdefs?

 Thomas
Michael S. Tsirkin April 13, 2016, 5:55 p.m. UTC | #6
On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> > On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> >> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> >>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> ...
> >>>> Then, there's yet another problem: If the host page size is bigger
> >>>> than the 4k balloon page size, we can not simply call madvise() on
> >>>> each of the 4k balloon addresses that we get from the guest - since
> >>>> the madvise() always evicts the whole host page, not only a 4k area!
> >>>>
> >>>> So in this case, we've got to track the 4k fragments of a host page
> >>>> and only call madvise(DONTNEED) when all fragments have been collected.
> >>>> This of course only works fine if the guest sends consecutive 4k
> >>>> fragments - which is the case in the most important scenarios that
> >>>> I try to address here (like a ppc64 guest with 64k page size that
> >>>> is running on a ppc64 host with 64k page size). In case the guest
> >>>> uses a page size that is smaller than the host page size, we might
> >>>> need to add some additional logic here later to increase the
> >>>> probability of being able to release memory, but at least the guest
> >>>> should no longer crash due to unintentionally evicted pages.
> >> ...
> >>>>  static void virtio_balloon_instance_init(Object *obj)
> >>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >>>> index 35f62ac..04b7c0c 100644
> >>>> --- a/include/hw/virtio/virtio-balloon.h
> >>>> +++ b/include/hw/virtio/virtio-balloon.h
> >>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >>>>      int64_t stats_last_update;
> >>>>      int64_t stats_poll_interval;
> >>>>      uint32_t host_features;
> >>>> +    void *current_addr;
> >>>> +    unsigned long *fragment_bits;
> >>>> +    int fragment_bits_size;
> >>>>  } VirtIOBalloon;
> >>>>  
> >>>>  #endif
> >>>
> >>> It looks like fragment_bits would have to be migrated.
> >>> Which is a lot of complexity.
> ...
> >>> How about we just skip madvise if host page size is > balloon
> >>> page size, for 2.6?
> >>
> >> That would mean a regression compared to what we have today. Currently,
> >> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> >> by chance than on purpose, but it's working. The guest is always sending
> >> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> >> for every one of them, but the kernel is ignoring madvise() on
> >> non-64k-aligned addresses, so we end up with a situation where the
> >> madvise() frees a whole 64k page which is also declared as free by the
> >> guest.
> >>
> >> I think we should either take this patch as it is right now (without
> >> adding extra code for migration) and later update it to the bitmap code
> >> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> >> fix it properly after the bitmap code has been applied. But disabling
> >> the balloon code for 64k guests on 64k hosts completely does not sound
> >> very appealing to me. What do you think?
> >>
> >>  Thomas
> > 
> > True. As a simple hack - how about disabling madvise when host page size >
> > target page size?
> 
> That could work - but is there a generic way in QEMU to get the current
> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
> Or would that mean polluting the virtio-balloon code with ugly #ifdefs?
> 
>  Thomas

let's just use TARGET_PAGE_SIZE, that's the best I can think of.
Thomas Huth April 13, 2016, 6:11 p.m. UTC | #7
On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
>> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
>>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
>>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>> ...
>>>>>> Then, there's yet another problem: If the host page size is bigger
>>>>>> than the 4k balloon page size, we can not simply call madvise() on
>>>>>> each of the 4k balloon addresses that we get from the guest - since
>>>>>> the madvise() always evicts the whole host page, not only a 4k area!
>>>>>>
>>>>>> So in this case, we've got to track the 4k fragments of a host page
>>>>>> and only call madvise(DONTNEED) when all fragments have been collected.
>>>>>> This of course only works fine if the guest sends consecutive 4k
>>>>>> fragments - which is the case in the most important scenarios that
>>>>>> I try to address here (like a ppc64 guest with 64k page size that
>>>>>> is running on a ppc64 host with 64k page size). In case the guest
>>>>>> uses a page size that is smaller than the host page size, we might
>>>>>> need to add some additional logic here later to increase the
>>>>>> probability of being able to release memory, but at least the guest
>>>>>> should no longer crash due to unintentionally evicted pages.
>>>> ...
>>>>>>  static void virtio_balloon_instance_init(Object *obj)
>>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>>>>> index 35f62ac..04b7c0c 100644
>>>>>> --- a/include/hw/virtio/virtio-balloon.h
>>>>>> +++ b/include/hw/virtio/virtio-balloon.h
>>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>>>>      int64_t stats_last_update;
>>>>>>      int64_t stats_poll_interval;
>>>>>>      uint32_t host_features;
>>>>>> +    void *current_addr;
>>>>>> +    unsigned long *fragment_bits;
>>>>>> +    int fragment_bits_size;
>>>>>>  } VirtIOBalloon;
>>>>>>  
>>>>>>  #endif
>>>>>
>>>>> It looks like fragment_bits would have to be migrated.
>>>>> Which is a lot of complexity.
>> ...
>>>>> How about we just skip madvise if host page size is > balloon
>>>>> page size, for 2.6?
>>>>
>>>> That would mean a regression compared to what we have today. Currently,
>>>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
>>>> by chance than on purpose, but it's working. The guest is always sending
>>>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
>>>> for every one of them, but the kernel is ignoring madvise() on
>>>> non-64k-aligned addresses, so we end up with a situation where the
>>>> madvise() frees a whole 64k page which is also declared as free by the
>>>> guest.
>>>>
>>>> I think we should either take this patch as it is right now (without
>>>> adding extra code for migration) and later update it to the bitmap code
>>>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
>>>> fix it properly after the bitmap code has been applied. But disabling
>>>> the balloon code for 64k guests on 64k hosts completely does not sound
>>>> very appealing to me. What do you think?
>>>
>>> True. As a simple hack - how about disabling madvise when host page size >
>>> target page size?
>>
>> That could work - but is there a generic way in QEMU to get the current
>> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
>> Or would that mean polluting the virtio-balloon code with ugly #ifdefs?
> 
> let's just use TARGET_PAGE_SIZE, that's the best I can think of.

That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
defined to 4096 here. The Linux kernel then switches the real page size
during runtime to 65536. So we'd need a way to detect this automatically...

 Thomas
Michael S. Tsirkin April 13, 2016, 6:14 p.m. UTC | #8
On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> > On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
> >> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> >>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> >>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> >>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> >> ...
> >>>>>> Then, there's yet another problem: If the host page size is bigger
> >>>>>> than the 4k balloon page size, we can not simply call madvise() on
> >>>>>> each of the 4k balloon addresses that we get from the guest - since
> >>>>>> the madvise() always evicts the whole host page, not only a 4k area!
> >>>>>>
> >>>>>> So in this case, we've got to track the 4k fragments of a host page
> >>>>>> and only call madvise(DONTNEED) when all fragments have been collected.
> >>>>>> This of course only works fine if the guest sends consecutive 4k
> >>>>>> fragments - which is the case in the most important scenarios that
> >>>>>> I try to address here (like a ppc64 guest with 64k page size that
> >>>>>> is running on a ppc64 host with 64k page size). In case the guest
> >>>>>> uses a page size that is smaller than the host page size, we might
> >>>>>> need to add some additional logic here later to increase the
> >>>>>> probability of being able to release memory, but at least the guest
> >>>>>> should no longer crash due to unintentionally evicted pages.
> >>>> ...
> >>>>>>  static void virtio_balloon_instance_init(Object *obj)
> >>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >>>>>> index 35f62ac..04b7c0c 100644
> >>>>>> --- a/include/hw/virtio/virtio-balloon.h
> >>>>>> +++ b/include/hw/virtio/virtio-balloon.h
> >>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >>>>>>      int64_t stats_last_update;
> >>>>>>      int64_t stats_poll_interval;
> >>>>>>      uint32_t host_features;
> >>>>>> +    void *current_addr;
> >>>>>> +    unsigned long *fragment_bits;
> >>>>>> +    int fragment_bits_size;
> >>>>>>  } VirtIOBalloon;
> >>>>>>  
> >>>>>>  #endif
> >>>>>
> >>>>> It looks like fragment_bits would have to be migrated.
> >>>>> Which is a lot of complexity.
> >> ...
> >>>>> How about we just skip madvise if host page size is > balloon
> >>>>> page size, for 2.6?
> >>>>
> >>>> That would mean a regression compared to what we have today. Currently,
> >>>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> >>>> by chance than on purpose, but it's working. The guest is always sending
> >>>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> >>>> for every one of them, but the kernel is ignoring madvise() on
> >>>> non-64k-aligned addresses, so we end up with a situation where the
> >>>> madvise() frees a whole 64k page which is also declared as free by the
> >>>> guest.
> >>>>
> >>>> I think we should either take this patch as it is right now (without
> >>>> adding extra code for migration) and later update it to the bitmap code
> >>>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> >>>> fix it properly after the bitmap code has been applied. But disabling
> >>>> the balloon code for 64k guests on 64k hosts completely does not sound
> >>>> very appealing to me. What do you think?
> >>>
> >>> True. As a simple hack - how about disabling madvise when host page size >
> >>> target page size?
> >>
> >> That could work - but is there a generic way in QEMU to get the current
> >> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
> >> Or would that mean polluting the virtio-balloon code with ugly #ifdefs?
> > 
> > let's just use TARGET_PAGE_SIZE, that's the best I can think of.
> 
> That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
> defined to 4096 here. The Linux kernel then switches the real page size
> during runtime to 65536. So we'd need a way to detect this automatically...
> 
>  Thomas

I see. I don't know how to do this.  If we can't find a quick fix, leave
this part around for now then?  Just fix the page size.
Andrew Jones April 13, 2016, 6:21 p.m. UTC | #9
On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> On 13.04.2016 19:55, Michael S. Tsirkin wrote:
> > On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:
> >> On 13.04.2016 19:07, Michael S. Tsirkin wrote:
> >>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> >>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> >>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> >> ...
> >>>>>> Then, there's yet another problem: If the host page size is bigger
> >>>>>> than the 4k balloon page size, we can not simply call madvise() on
> >>>>>> each of the 4k balloon addresses that we get from the guest - since
> >>>>>> the madvise() always evicts the whole host page, not only a 4k area!
> >>>>>>
> >>>>>> So in this case, we've got to track the 4k fragments of a host page
> >>>>>> and only call madvise(DONTNEED) when all fragments have been collected.
> >>>>>> This of course only works fine if the guest sends consecutive 4k
> >>>>>> fragments - which is the case in the most important scenarios that
> >>>>>> I try to address here (like a ppc64 guest with 64k page size that
> >>>>>> is running on a ppc64 host with 64k page size). In case the guest
> >>>>>> uses a page size that is smaller than the host page size, we might
> >>>>>> need to add some additional logic here later to increase the
> >>>>>> probability of being able to release memory, but at least the guest
> >>>>>> should no longer crash due to unintentionally evicted pages.
> >>>> ...
> >>>>>>  static void virtio_balloon_instance_init(Object *obj)
> >>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >>>>>> index 35f62ac..04b7c0c 100644
> >>>>>> --- a/include/hw/virtio/virtio-balloon.h
> >>>>>> +++ b/include/hw/virtio/virtio-balloon.h
> >>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >>>>>>      int64_t stats_last_update;
> >>>>>>      int64_t stats_poll_interval;
> >>>>>>      uint32_t host_features;
> >>>>>> +    void *current_addr;
> >>>>>> +    unsigned long *fragment_bits;
> >>>>>> +    int fragment_bits_size;
> >>>>>>  } VirtIOBalloon;
> >>>>>>  
> >>>>>>  #endif
> >>>>>
> >>>>> It looks like fragment_bits would have to be migrated.
> >>>>> Which is a lot of complexity.
> >> ...
> >>>>> How about we just skip madvise if host page size is > balloon
> >>>>> page size, for 2.6?
> >>>>
> >>>> That would mean a regression compared to what we have today. Currently,
> >>>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> >>>> by chance than on purpose, but it's working. The guest is always sending
> >>>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> >>>> for every one of them, but the kernel is ignoring madvise() on
> >>>> non-64k-aligned addresses, so we end up with a situation where the
> >>>> madvise() frees a whole 64k page which is also declared as free by the
> >>>> guest.
> >>>>
> >>>> I think we should either take this patch as it is right now (without
> >>>> adding extra code for migration) and later update it to the bitmap code
> >>>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> >>>> fix it properly after the bitmap code has been applied. But disabling
> >>>> the balloon code for 64k guests on 64k hosts completely does not sound
> >>>> very appealing to me. What do you think?
> >>>
> >>> True. As a simple hack - how about disabling madvise when host page size >
> >>> target page size?
> >>
> >> That could work - but is there a generic way in QEMU to get the current
> >> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
> >> Or would that mean to pollute the virtio-balloon code with ugly #ifdefs?
> > 
> > let's just use TARGET_PAGE_SIZE, that's the best I can think of.
> 
> That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
> defined to 4096 here. The Linux kernel then switches the real page size
> during runtime to 65536. So we'd need a way to detect this automatically...

Won't work for ARM either. TARGET_PAGE_SIZE is always 1K. The host will
always have at least 4K, but the guest could have as much as 64K. There's
no non-hacky way (that I know of) to determine what it is.

Thanks,
drew
David Gibson April 14, 2016, 3:37 a.m. UTC | #10
On Wed, 13 Apr 2016 13:52:44 +0200
Thomas Huth <thuth@redhat.com> wrote:

> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
> 
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want to free that area yet (in
> case the guest used a smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) and use that as the size
> parameter for the madvise() call instead.
> 
> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we can not simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> the madvise() always evicts the whole host page, not only a 4k area!
> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> This of course only works fine if the guest sends consecutive 4k
> fragments - which is the case in the most important scenarios that
> I try to address here (like a ppc64 guest with 64k page size that
> is running on a ppc64 host with 64k page size). In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should no longer crash due to unintentionally evicted pages.

Yeah, this is a serious bug.  I think the patch is basically ok, except
for a couple of minor points noted below.

It's a bit more complex than I'd generally like to apply during hard
freeze, but given the seriousness of the problem, I think we should put
this in for qemu-2.6.

> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
> >  I've tested this patch with both a 4k page size ppc64 guest
> >  and a 64k page size ppc64 guest on a 64k page size ppc64 host.
> >  With this patch applied, I was not able to crash the guests
> >  anymore (the 4k guest easily crashes without this patch).
>  And looking at the output of the "free" command on the host,
>  the ballooning also still works as expected.
> 
>  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index c74101e..886faa8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,13 +35,56 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>  
> -static void balloon_page(void *addr, int deflate)
> +#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
> +
> +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    size_t host_page_size;
> +    void *aligned_addr;
> +
> +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> +        return;
> +    }
> +
> +    host_page_size = getpagesize();

I think you actually want getrampagesize() here (or rather some wrapper
around that to handle the non-kvm cases properly) - to cover the case
where the VM has been built using hugepages in the host.  For the
normal case there's already the qemu_real_host_page_size global, so you
shouldn't need to call getpagesize() yourself.
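
i.e. roughly (a sketch only - if I remember correctly getrampagesize()
is ppc/KVM-specific today, so a generic wrapper would be needed first):

        host_page_size = qemu_real_host_page_size;  /* instead of getpagesize() */

with the getrampagesize()-based value substituted for hugepage-backed RAM.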

> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> +
> > +    /* Host page size > balloon page size ==> use aligned host address */
> +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> +    if (deflate) {
> +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> +         * also be able to use the memory again without this call, so let's
> +         * only do it for the first, aligned fragment of a host page and
> +         * ignore it for the others.
> +         */
> +        if (addr == aligned_addr) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> +        }
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    } else {
> +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> +        /* If we end up here, that means we want to evict balloon pages, but
> +         * the host's page size is bigger than the 4k pages from the balloon.
> +         * Since madvise() only works on the granularity of the host page size,
> +         * we've got to collect all the 4k fragments from the guest first
> +         * before we can issue the MADV_DONTNEED call.
> +         */
> +        if (aligned_addr != s->current_addr) {

I'd suggest a (non-fatal) error_report() here if s->current_addr !=
BALLOON_NO_CURRENT_PAGE.
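
e.g. (sketch):

        if (s->current_addr != BALLOON_NO_CURRENT_PAGE) {
            error_report("balloon: discarding partially collected host page %p",
                         s->current_addr);
        }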

> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;
> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {

If you invert the sense of the bitmap, you can just check for an
all-zero bitmap here, which might be a little simpler.
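
Something like (sketch, assuming bitmap_fill()/bitmap_empty() from
"qemu/bitmap.h"):

        /* when starting a new host page, mark all fragments as missing */
        bitmap_fill(s->fragment_bits, max_frags);
        ...
        /* clear the bit for the fragment we just received; the host page
         * is complete once no bits remain set */
        clear_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
        if (bitmap_empty(s->fragment_bits, max_frags)) {
            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
        }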

> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
>      }
>  #endif
>  }
> @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>                 should be OK because we only want a single page.  */
>              addr = section.offset_within_region;
> -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> +            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
> @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
> +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> +                                 + sizeof(long) * 8 - 1) / 8;
> +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
> +
>      reset_stats(s);
>  
>      register_savevm(dev, "virtio-balloon", -1, 1,
> @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    g_free(s->fragment_bits);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
>          g_free(s->stats_vq_elem);
>          s->stats_vq_elem = NULL;
>      }
> +
> +    if (s->fragment_bits) {
> +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
>  }
>  
>  static void virtio_balloon_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..04b7c0c 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;
>  
>  #endif
> -- 
> 1.8.3.1
>
David Gibson April 14, 2016, 3:39 a.m. UTC | #11
On Wed, 13 Apr 2016 16:15:25 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> > The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> > as length parameter, and an address which is directly based on
> > the page address supplied by the guest. Since the virtio-balloon
> > protocol is always based on 4k based addresses/sizes, no matter
> > what the host and guest are using as page size, this has a couple
> > of issues which could even lead to data loss in the worst case.
> > 
> > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> > 4k, we also destroy the 4k areas after the current one - which
> > might be wrong since the guest did not want to free that area yet (in
> > case the guest used a smaller MMU page size than the hard-coded
> > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size
> > parameter for the madvise() call instead.  
> 
> this makes absolute sense.
> 
> 
> > Then, there's yet another problem: If the host page size is bigger
> > than the 4k balloon page size, we can not simply call madvise() on
> > each of the 4k balloon addresses that we get from the guest - since
> > the madvise() always evicts the whole host page, not only a 4k area!  
> 
> Does it really round length up?
> Wow, it does:
>         len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> 
> which seems to be undocumented, but has been there forever.
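
To illustrate, a minimal standalone sketch of that behaviour (not taken
from the patch): the start address must be page-aligned or madvise()
fails with EINVAL, but the length is silently rounded up.

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* On a 64k-page host the kernel rounds the 4k length up to 64k,
     * so this discards the whole host page containing addr, not just
     * the 4k that was requested. */
    static int discard_4k(void *addr)
    {
        uintptr_t mask = (uintptr_t)getpagesize() - 1;
        void *page = (void *)((uintptr_t)addr & ~mask);

        return madvise(page, 4096, MADV_DONTNEED);
    }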
> 
> 
> > So in this case, we've got to track the 4k fragments of a host page
> > and only call madvise(DONTNEED) when all fragments have been collected.
> > This of course only works fine if the guest sends consecutive 4k
> > fragments - which is the case in the most important scenarios that
> > I try to address here (like a ppc64 guest with 64k page size that
> > is running on a ppc64 host with 64k page size). In case the guest
> > uses a page size that is smaller than the host page size, we might
> > need to add some more additional logic here later to increase the
> > probability of being able to release memory, but at least the guest
> > should now not crash anymore due to unintentionally evicted pages.
> > 
> > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > ---
> >  I've tested this patch with both, a 4k page size ppc64 guest
> >  and a 64k page size ppc64 guest on a 64k page size ppc64 host.
> >  With this patch applied, I was not able to crash the
> >  guests anymore (the 4k guest easily crashes without this patch).
> >  And looking at the output of the "free" command on the host,
> >  the ballooning also still works as expected.
> > 
> >  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
> >  include/hw/virtio/virtio-balloon.h |  3 ++
> >  2 files changed, 65 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index c74101e..886faa8 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -35,13 +35,56 @@
> >  #include "hw/virtio/virtio-bus.h"
> >  #include "hw/virtio/virtio-access.h"
> >  
> > -static void balloon_page(void *addr, int deflate)
> > +#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
> > +#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
> > +
> > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
> >  {
> >  #if defined(__linux__)
> > -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> > -                                         kvm_has_sync_mmu())) {
> > -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> > -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > +    size_t host_page_size;
> > +    void *aligned_addr;
> > +
> > +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> > +        return;
> > +    }
> > +
> > +    host_page_size = getpagesize();
> > +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> > +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> > +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > +        return;
> > +    }
> > +
> > +    /* Host page size > balloon page size ==> use aligned host address */
> > +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> > +    if (deflate) {
> > +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> > +         * also be able to use the memory again without this call, so let's
> > +         * only do it for the first, aligned fragment of a host page and
> > +         * ignore it for the others.
> > +         */
> > +        if (addr == aligned_addr) {
> > +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> > +        }
> > +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +    } else {
> > +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> > +        /* If we end up here, that means we want to evict balloon pages, but
> > +         * the host's page size is bigger than the 4k pages from the balloon.
> > +         * Since madvise() only works on the granularity of the host page size,
> > +         * we've got to collect all the 4k fragments from the guest first
> > +         * before we can issue the MADV_DONTNEED call.
> > +         */
> > +        if (aligned_addr != s->current_addr) {
> > +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> > +            s->current_addr = aligned_addr;
> > +        }
> > +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> > +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
> > +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> > +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> > +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +        }
> >      }
> >  #endif
> >  }
> > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >              /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> >                 should be OK because we only want a single page.  */
> >              addr = section.offset_within_region;
> > -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> > +            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
> >                           !!(vq == s->dvq));
> >              memory_region_unref(section.mr);
> >          }
> > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> >      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> >  
> > +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> > +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> > +                                 + sizeof(long) * 8 - 1) / 8;
> > +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> > +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +    }
> > +
> >      reset_stats(s);
> >  
> >      register_savevm(dev, "virtio-balloon", -1, 1,
> > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
> >      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> >      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
> >  
> > +    g_free(s->fragment_bits);
> >      balloon_stats_destroy_timer(s);
> >      qemu_remove_balloon_handler(s);
> >      unregister_savevm(dev, "virtio-balloon", s);
> > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
> >          g_free(s->stats_vq_elem);
> >          s->stats_vq_elem = NULL;
> >      }
> > +
> > +    if (s->fragment_bits) {
> > +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> > +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > +    }
> >  }
> >  
> >  static void virtio_balloon_instance_init(Object *obj)
> > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > index 35f62ac..04b7c0c 100644
> > --- a/include/hw/virtio/virtio-balloon.h
> > +++ b/include/hw/virtio/virtio-balloon.h
> > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >      int64_t stats_last_update;
> >      int64_t stats_poll_interval;
> >      uint32_t host_features;
> > +    void *current_addr;
> > +    unsigned long *fragment_bits;
> > +    int fragment_bits_size;
> >  } VirtIOBalloon;
> >  
> >  #endif
> > -- 
> > 1.8.3.1  
> 
> 
> It looks like fragment_bits would have to be migrated.

Hmm.. do they?  If we just ignore migration, isn't the worst that
happens that we just keep one extra page allocated?

> Which is a lot of complexity.
> And work arounds for specific guest behaviour are really ugly.
> There are patches on-list to maintain a balloon bitmap -
> that will enable fixing it cleanly.
> How about we just skip madvise if host page size is > balloon
> page size, for 2.6?
> 
> -- 
> MST
David Gibson April 14, 2016, 3:45 a.m. UTC | #12
On Wed, 13 Apr 2016 21:14:44 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote:
> > On 13.04.2016 19:55, Michael S. Tsirkin wrote:  
> > > On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote:  
> > >> On 13.04.2016 19:07, Michael S. Tsirkin wrote:  
> > >>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:  
> > >>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote:  
> > >>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:  
> > >> ...  
> > >>>>>> Then, there's yet another problem: If the host page size is bigger
> > >>>>>> than the 4k balloon page size, we can not simply call madvise() on
> > >>>>>> each of the 4k balloon addresses that we get from the guest - since
> > >>>>>> the madvise() always evicts the whole host page, not only a 4k area!
> > >>>>>>
> > >>>>>> So in this case, we've got to track the 4k fragments of a host page
> > >>>>>> and only call madvise(DONTNEED) when all fragments have been collected.
> > >>>>>> This of course only works fine if the guest sends consecutive 4k
> > >>>>>> fragments - which is the case in the most important scenarios that
> > >>>>>> I try to address here (like a ppc64 guest with 64k page size that
> > >>>>>> is running on a ppc64 host with 64k page size). In case the guest
> > >>>>>> uses a page size that is smaller than the host page size, we might
> > >>>>>> need to add some more additional logic here later to increase the
> > >>>>>> probability of being able to release memory, but at least the guest
> > >>>>>> should now not crash anymore due to unintentionally evicted pages.  
> > >>>> ...  
> > >>>>>>  static void virtio_balloon_instance_init(Object *obj)
> > >>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > >>>>>> index 35f62ac..04b7c0c 100644
> > >>>>>> --- a/include/hw/virtio/virtio-balloon.h
> > >>>>>> +++ b/include/hw/virtio/virtio-balloon.h
> > >>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> > >>>>>>      int64_t stats_last_update;
> > >>>>>>      int64_t stats_poll_interval;
> > >>>>>>      uint32_t host_features;
> > >>>>>> +    void *current_addr;
> > >>>>>> +    unsigned long *fragment_bits;
> > >>>>>> +    int fragment_bits_size;
> > >>>>>>  } VirtIOBalloon;
> > >>>>>>  
> > >>>>>>  #endif  
> > >>>>>
> > >>>>> It looks like fragment_bits would have to be migrated.
> > >>>>> Which is a lot of complexity.  
> > >> ...  
> > >>>>> How about we just skip madvise if host page size is > balloon
> > >>>>> page size, for 2.6?  
> > >>>>
> > >>>> That would mean a regression compared to what we have today. Currently,
> > >>>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> > >>>> by chance than on purpose, but it's working. The guest is always sending
> > >>>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> > >>>> for every one of them, but the kernel is ignoring madvise() on
> > >>>> non-64k-aligned addresses, so we end up with a situation where the
> > >>>> madvise() frees a whole 64k page which is also declared as free by the
> > >>>> guest.
> > >>>>
> > >>>> I think we should either take this patch as it is right now (without
> > >>>> adding extra code for migration) and later update it to the bitmap code
> > >>>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> > >>>> fix it properly after the bitmap code has been applied. But disabling
> > >>>> the balloon code for 64k guests on 64k hosts completely does not sound
> > >>>> very appealing to me. What do you think?  
> > >>>
> > >>> True. As a simple hack - how about disabling madvise when host page size >
> > >>> target page size?  
> > >>
> > >> That could work - but is there a generic way in QEMU to get the current
> > >> page size from a guest (since this might differ from TARGET_PAGE_SIZE)?
> > >> Or would that mean polluting the virtio-balloon code with ugly #ifdefs?
> > > 
> > > let's just use TARGET_PAGE_SIZE, that's the best I can think of.  
> > 
> > That won't work - at least not on ppc: TARGET_PAGE_SIZE is always
> > defined to 4096 here. The Linux kernel then switches the real page size
> > during runtime to 65536. So we'd need a way to detect this automatically...
> > 
> >  Thomas  
> 
> I see. I don't know how to do this.  If we can't find a quick fix, leave
> this part around for now then?  Just fix the page size.

I'm not sure what you're suggesting by "fix the page size".
TARGET_PAGE_SIZE remains 4kiB because that's the default hardware page
size.  However, modern Linux guests always use 64kiB pages in
practice.  This isn't a global setting, it just means it uses 64kiB
pages for every mapping it establishes, so there's really no sane way
for qemu to determine the "guest page size" because that's not really a
well-defined concept.

Even having some flag that's cleared if the guest ever makes a < 64kiB
mapping won't work, because I think there are a few edge cases where a
"64kiB" guest which uses 64kiB pages for all RAM mappings will use 4kiB
pages for certain IO mappings.
Dr. David Alan Gilbert April 14, 2016, 11:47 a.m. UTC | #13
* Thomas Huth (thuth@redhat.com) wrote:

> That would mean a regression compared to what we have today. Currently,
> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> by chance than on purpose, but it's working. The guest is always sending
> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> for every one of them, but the kernel is ignoring madvise() on
> non-64k-aligned addresses, so we end up with a situation where the
> madvise() frees a whole 64k page which is also declared as free by the
> guest.

I wouldn't worry about migrating your fragment map; but I wonder if it
needs to be that complex - does the guest normally do something more sane
like do the 4k pages in order and so you've just got to track the last
page it tried rather than having a full map?
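
Something along those lines might look like this - just a sketch, with
run_start and run_len as hypothetical new fields replacing the
fragment map:

    static bool host_page_complete(VirtIOBalloon *s, void *addr,
                                   size_t host_page_size)
    {
        void *base = (void *)((uintptr_t)addr & ~(host_page_size - 1));

        if (addr == base) {
            /* first fragment of a host page: start a new run */
            s->run_start = base;
            s->run_len = BALLOON_PAGE_SIZE;
        } else if (s->run_start == base &&
                   (char *)addr == (char *)s->run_start + s->run_len) {
            /* the next consecutive fragment */
            s->run_len += BALLOON_PAGE_SIZE;
        } else {
            /* out-of-order fragment: give up on this host page */
            s->run_start = NULL;
            return false;
        }
        return s->run_len == host_page_size;
    }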

A side question is whether the behaviour that's seen by virtio_balloon_handle_output
is always actually the full 64k page;  it calls balloon_page once
for each message/element - but if all of those elements add back up to the full
page, perhaps it makes more sense to reassemble it there?

> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later update it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?

Yeh I agree; your existing code should work and I don't think we should
break 64k-on-64k.

Dave
> 
>  Thomas
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Thomas Huth April 14, 2016, 12:19 p.m. UTC | #14
On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:
> * Thomas Huth (thuth@redhat.com) wrote:
> 
>> That would mean a regression compared to what we have today. Currently,
>> the ballooning is working OK for 64k guests on a 64k ppc host - rather
>> by chance than on purpose, but it's working. The guest is always sending
>> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
>> for every one of them, but the kernel is ignoring madvise() on
>> non-64k-aligned addresses, so we end up with a situation where the
>> madvise() frees a whole 64k page which is also declared as free by the
>> guest.
> 
> I wouldn't worry about migrating your fragment map; but I wonder if it
> needs to be that complex - does the guest normally do something more sane
> like do the 4k pages in order and so you've just got to track the last
> page it tried rather than having a full map?

That's maybe a little bit easier and might work for well-known Linux
guests, but IMHO it's even more a hack than my approach: If the Linux
driver one day is switched to send the pages in the opposite order, or
if somebody tries to run a non-wellknown (i.e. non-Linux) guest, this
does not work at all anymore.

> A side question is whether the behaviour that's seen by virtio_balloon_handle_output
> is always actually the full 64k page;  it calls balloon_page once
> for each message/element - but if all of those elements add back up to the full
> page, perhaps it makes more sense to reassemble it there?

That might work for 64k page size guests ... but for 4k guests, I think
you'll have a hard time reassembling a page there more easily than with
my current approach. Or do you have a clever algorithm in mind that
could do the job well there?

 Thomas
Dr. David Alan Gilbert April 14, 2016, 6:34 p.m. UTC | #15
* Thomas Huth (thuth@redhat.com) wrote:
> On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:
> > * Thomas Huth (thuth@redhat.com) wrote:
> > 
> >> That would mean a regression compared to what we have today. Currently,
> >> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> >> by chance than on purpose, but it's working. The guest is always sending
> >> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> >> for every one of them, but the kernel is ignoring madvise() on
> >> non-64k-aligned addresses, so we end up with a situation where the
> >> madvise() frees a whole 64k page which is also declared as free by the
> >> guest.
> > 
> > I wouldn't worry about migrating your fragment map; but I wonder if it
> > needs to be that complex - does the guest normally do something more sane
> > like do the 4k pages in order and so you've just got to track the last
> > page it tried rather than having a full map?
> 
> That's maybe a little bit easier and might work for well-known Linux
> guests, but IMHO it's even more a hack than my approach: If the Linux
> driver one day is switched to send the pages in the opposite order, or
> if somebody tries to run a non-wellknown (i.e. non-Linux) guest, this
> does not work at all anymore.

True.

> > A side question is whether the behaviour that's seen by virtio_balloon_handle_output
> > is always actually the full 64k page;  it calls balloon_page once
> > for each message/element - but if all of those elements add back up to the full
> > page, perhaps it makes more sense to reassemble it there?
> 
> That might work for 64k page size guests ... but for 4k guests, I think
> > you'll have a hard time reassembling a page there more easily than with
> my current approach. Or do you have a clever algorithm in mind that
> could do the job well there?

No, I didn't; I just have an ulterior motive which is trying to
do as few madvise's as possible, and while virtio_balloon_handle_output sees
potentially quite a few requests at once, balloon_page is stuck down
there at the bottom without any idea of whether there are any more coming.
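
Roughly this kind of coalescing, as a sketch only - run_addr, run_len
and balloon_flush_run() are hypothetical; the idea is one madvise per
run of consecutive 4k requests rather than one per request:

    if (s->run_len && (char *)addr == (char *)s->run_addr + s->run_len) {
        /* the request extends the current run of consecutive 4k pages */
        s->run_len += BALLOON_PAGE_SIZE;
    } else {
        /* flush the finished run with a single madvise, start a new one */
        balloon_flush_run(s);
        s->run_addr = addr;
        s->run_len = BALLOON_PAGE_SIZE;
    }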

Dave

>  Thomas
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
David Gibson April 15, 2016, 4:26 a.m. UTC | #16
On Thu, 14 Apr 2016 19:34:05 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Thomas Huth (thuth@redhat.com) wrote:
> > On 14.04.2016 13:47, Dr. David Alan Gilbert wrote:  
> > > * Thomas Huth (thuth@redhat.com) wrote:
> > >   
> > >> That would mean a regression compared to what we have today. Currently,
> > >> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> > >> by chance than on purpose, but it's working. The guest is always sending
> > >> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> > >> for every one of them, but the kernel is ignoring madvise() on
> > >> non-64k-aligned addresses, so we end up with a situation where the
> > >> madvise() frees a whole 64k page which is also declared as free by the
> > >> guest.  
> > > 
> > > I wouldn't worry about migrating your fragment map; but I wonder if it
> > > needs to be that complex - does the guest normally do something more sane
> > > like do the 4k pages in order and so you've just got to track the last
> > > page it tried rather than having a full map?  
> > 
> > That's maybe a little bit easier and might work for well-known Linux
> > guests, but IMHO it's even more a hack than my approach: If the Linux
> > driver one day is switched to send the pages in the opposite order, or
> > if somebody tries to run a non-wellknown (i.e. non-Linux) guest, this
> > does not work at all anymore.  
> 
> True.

TBH, I'm not sure that basing off the last ballooned sub-page will even be
that much easier to implement, or at least to implement in a way that's
convincingly correct even for the limited cases it's supposed to work
in.

> > > A side question is whether the behaviour that's seen by virtio_balloon_handle_output
> > > is always actually the full 64k page;  it calls balloon_page once
> > > for each message/element - but if all of those elements add back up to the full
> > > page, perhaps it makes more sense to reassemble it there?  
> > 
> > That might work for 64k page size guests ... but for 4k guests, I think
> > you'll have a hard time reassembling a page there more easily than with
> > my current approach. Or do you have a clever algorithm in mind that
> > could do the job well there?  
> 
> No, I didn't; I just have an ulterior motive which is trying to
> do as few madvise's as possible, and while virtio_balloon_handle_output sees
> potentially quite a few requests at once, balloon_page is stuck down
> there at the bottom without any idea of whether there are any more coming.
> 
> Dave
> 
> >  Thomas
> >   
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Jitendra Kolhe May 23, 2016, 6:25 a.m. UTC | #17
On 4/13/2016 8:21 PM, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>>> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
>>> as length parameter, and an address which is directly based on
>>> the page address supplied by the guest. Since the virtio-balloon
>>> protocol is always based on 4k based addresses/sizes, no matter
>>> what the host and guest are using as page size, this has a couple
>>> of issues which could even lead to data loss in the worst case.
>>>
>>> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
>>> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
>>> 4k, we also destroy the 4k areas after the current one - which
>>> might be wrong since the guest did not want to free that area yet (in
>>> case the guest used a smaller MMU page size than the hard-coded
>>> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
>>> called BALLOON_PAGE_SIZE (which is 4096) to use this as the size
>>> parameter for the madvise() call instead.
>>
>> this makes absolute sense.
>>
>>> Then, there's yet another problem: If the host page size is bigger
>>> than the 4k balloon page size, we can not simply call madvise() on
>>> each of the 4k balloon addresses that we get from the guest - since
>>> the madvise() always evicts the whole host page, not only a 4k area!
>>
>> Does it really round length up?
>> Wow, it does:
>>         len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>> which seems to be undocumented, but has been there forever.
> 
> Yes, that's ugly - I also had to take a look at the kernel sources to
> understand what this call is supposed to do when being called with a
> size < PAGE_SIZE.
> 
>>> So in this case, we've got to track the 4k fragments of a host page
>>> and only call madvise(DONTNEED) when all fragments have been collected.
>>> This of course only works fine if the guest sends consecutive 4k
>>> fragments - which is the case in the most important scenarios that
>>> I try to address here (like a ppc64 guest with 64k page size that
>>> is running on a ppc64 host with 64k page size). In case the guest
>>> uses a page size that is smaller than the host page size, we might
>>> need to add some more additional logic here later to increase the
>>> probability of being able to release memory, but at least the guest
>>> should now not crash anymore due to unintentionally evicted pages.
> ...
>>>  static void virtio_balloon_instance_init(Object *obj)
>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>> index 35f62ac..04b7c0c 100644
>>> --- a/include/hw/virtio/virtio-balloon.h
>>> +++ b/include/hw/virtio/virtio-balloon.h
>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>      int64_t stats_last_update;
>>>      int64_t stats_poll_interval;
>>>      uint32_t host_features;
>>> +    void *current_addr;
>>> +    unsigned long *fragment_bits;
>>> +    int fragment_bits_size;
>>>  } VirtIOBalloon;
>>>  
>>>  #endif
>>> -- 
>>> 1.8.3.1
>>
>>
>> It looks like fragment_bits would have to be migrated.
>> Which is a lot of complexity.
> 
> I think we could simply omit this for now. In case somebody migrates the
> VM while the ballooning is going on, we'd lose the information for one
> host page, so we might miss one madvise(DONTNEED), but I think we could
> live with that.
> 
>> And work arounds for specific guest behaviour are really ugly.
>> There are patches on-list to maintain a balloon bitmap -
>> that will enable fixing it cleanly.
> 
> Ah, good to know, I wasn't aware of them yet, so that will be a chance
> for a really proper final solution, I hope.
> 
>> How about we just skip madvise if host page size is > balloon
>> page size, for 2.6?
> 
> That would mean a regression compared to what we have today. Currently,
> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> by chance than on purpose, but it's working. The guest is always sending
> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> for every one of them, but the kernel is ignoring madvise() on
> non-64k-aligned addresses, so we end up with a situation where the
> madvise() frees a whole 64k page which is also declared as free by the
> guest.
> 
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later update it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
> 

Please find below the v3 version of the balloon bitmap patch series, as
posted on the qemu-devel list.

  balloon: maintain bitmap for pages released by guest balloon driver.
  balloon: add balloon bitmap migration capability and setup bitmap
    migration status.
  balloon: reset balloon bitmap ramblock size on source and target.
  migration: skip scanning and migrating ram pages released by
    virtio-balloon driver.

- Jitendra

>  Thomas
>
diff mbox

Patch

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index c74101e..886faa8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -35,13 +35,56 @@ 
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-static void balloon_page(void *addr, int deflate)
+#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
+#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
+
+static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
 {
 #if defined(__linux__)
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
-        qemu_madvise(addr, TARGET_PAGE_SIZE,
-                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+    size_t host_page_size;
+    void *aligned_addr;
+
+    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
+        return;
+    }
+
+    host_page_size = getpagesize();
+    if (host_page_size <= BALLOON_PAGE_SIZE) {
+        qemu_madvise(addr, BALLOON_PAGE_SIZE,
+                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+        return;
+    }
+
+    /* Host page size > balloon page size ==> use aligned host address */
+    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
+    if (deflate) {
+        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
+         * also be able to use the memory again without this call, so let's
+         * only do it for the first, aligned fragment of a host page and
+         * ignore it for the others.
+         */
+        if (addr == aligned_addr) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
+        }
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    } else {
+        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
+        /* If we end up here, that means we want to evict balloon pages, but
+         * the host's page size is bigger than the 4k pages from the balloon.
+         * Since madvise() only works on the granularity of the host page size,
+         * we've got to collect all the 4k fragments from the guest first
+         * before we can issue the MADV_DONTNEED call.
+         */
+        if (aligned_addr != s->current_addr) {
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = aligned_addr;
+        }
+        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
+        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
+            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
+            memset(s->fragment_bits, 0, s->fragment_bits_size);
+            s->current_addr = BALLOON_NO_CURRENT_PAGE;
+        }
     }
 #endif
 }
@@ -240,7 +283,7 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             /* Using memory_region_get_ram_ptr is bending the rules a bit, but
                should be OK because we only want a single page.  */
             addr = section.offset_within_region;
-            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
+            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
             memory_region_unref(section.mr);
         }
@@ -455,6 +498,13 @@  static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
     s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
 
+    if (getpagesize() > BALLOON_PAGE_SIZE) {
+        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
+                                 + sizeof(long) * 8 - 1) / 8;
+        s->fragment_bits = g_malloc0(s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
+
     reset_stats(s);
 
     register_savevm(dev, "virtio-balloon", -1, 1,
@@ -466,6 +516,7 @@  static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
 
+    g_free(s->fragment_bits);
     balloon_stats_destroy_timer(s);
     qemu_remove_balloon_handler(s);
     unregister_savevm(dev, "virtio-balloon", s);
@@ -480,6 +531,11 @@  static void virtio_balloon_device_reset(VirtIODevice *vdev)
         g_free(s->stats_vq_elem);
         s->stats_vq_elem = NULL;
     }
+
+    if (s->fragment_bits) {
+        memset(s->fragment_bits, 0, s->fragment_bits_size);
+        s->current_addr = BALLOON_NO_CURRENT_PAGE;
+    }
 }
 
 static void virtio_balloon_instance_init(Object *obj)
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..04b7c0c 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -43,6 +43,9 @@  typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    void *current_addr;
+    unsigned long *fragment_bits;
+    int fragment_bits_size;
 } VirtIOBalloon;
 
 #endif