[RFC,5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size

Message ID 20181012032431.32693-6-david@gibson.dropbear.id.au
State New
Headers show
Series
  • Improve balloon handling of pagesizes other than 4kiB
Related show

Commit Message

David Gibson Oct. 12, 2018, 3:24 a.m.
The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
on the host side, we can only actually discard memory in units of the host
page size.

At present we handle this very badly: we silently ignore balloon requests
that aren't host page aligned, and for requests that are host page aligned
we discard the entire host page.  The latter potentially corrupts guest
memory if its page size is smaller than the host's.

We could just disable the balloon if the host page size is not 4kiB, but
that would break a the special case where host and guest have the same page
size, but that's larger than 4kiB.  Thius case currently works by accident:
when the guest puts its page into the balloon, it will submit balloon
requests for each 4kiB subpage.  Most will be ignored, but the one which
happens to be host page aligned will discard the whole lot.

This occurs in practice routinely for POWER KVM systems, since both host
and guest typically use 64kiB pages.

To make this safe, without breaking that useful case, we need to
accumulate 4kiB balloon requests until we have a whole contiguous host page
at which point we can discard it.

We could in principle do that across all guest memory, but it would require
a large bitmap to track.  This patch represents a compromise: instead we
track ballooned subpages for a single contiguous host page at a time.  This
means that if the guest discards all 4kiB chunks of a host page in
succession, we will discard it.  In particular that means the balloon will
continue to work for the (host page size) == (guest page size) > 4kiB case.

If the guest scatters 4kiB requests across different host pages, we don't
discard anything, and issue a warning.  Not ideal, but at least we don't
corrupt guest memory as the previous version could.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
 include/hw/virtio/virtio-balloon.h |  3 ++
 2 files changed, 60 insertions(+), 10 deletions(-)

Comments

David Hildenbrand Oct. 12, 2018, 8:06 a.m. | #1
On 12/10/2018 05:24, David Gibson wrote:
> The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> on the host side, we can only actually discard memory in units of the host
> page size.
> 
> At present we handle this very badly: we silently ignore balloon requests
> that aren't host page aligned, and for requests that are host page aligned
> we discard the entire host page.  The latter potentially corrupts guest
> memory if its page size is smaller than the host's.
> 
> We could just disable the balloon if the host page size is not 4kiB, but
> that would break a the special case where host and guest have the same page
> size, but that's larger than 4kiB.  Thius case currently works by accident:
> when the guest puts its page into the balloon, it will submit balloon
> requests for each 4kiB subpage.  Most will be ignored, but the one which
> happens to be host page aligned will discard the whole lot.

Actually I would vote for disabling it if the host page size is not 4k.
This is broken by design and should not be worked around.

I consider 64k a similar case as huge pages in the hypervisor (from a
virtio-balloon interface point of view - which is broken by design for
these cases).

> 
> This occurs in practice routinely for POWER KVM systems, since both host
> and guest typically use 64kiB pages.
> 
> To make this safe, without breaking that useful case, we need to
> accumulate 4kiB balloon requests until we have a whole contiguous host page
> at which point we can discard it.

... and if you have a 4k guest it will just randomly report pages and
your 67 LOC tweak is useless. And this can and will easily happen.

> 
> We could in principle do that across all guest memory, but it would require
> a large bitmap to track.  This patch represents a compromise: instead we

Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
... migration?)

> track ballooned subpages for a single contiguous host page at a time.  This
> means that if the guest discards all 4kiB chunks of a host page in
> succession, we will discard it.  In particular that means the balloon will
> continue to work for the (host page size) == (guest page size) > 4kiB case.
> 
> If the guest scatters 4kiB requests across different host pages, we don't
> discard anything, and issue a warning.  Not ideal, but at least we don't
> corrupt guest memory as the previous version could.

My take would be to somehow disable the balloon on the hypervisor side
in case the host page size is not 4k. Like not allowing to realize it.
No corruptions, no warnings people will ignore.

> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 4435905c87..39573ef2e3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,33 +33,80 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +typedef struct PartiallyBalloonedPage {
> +    RAMBlock *rb;
> +    ram_addr_t base;
> +    unsigned long bitmap[];
> +} PartiallyBalloonedPage;
> +
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr offset)
>  {
>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>      RAMBlock *rb;
>      size_t rb_page_size;
> -    ram_addr_t ram_offset;
> +    int subpages;
> +    ram_addr_t ram_offset, host_page_base;
>  
>      /* XXX is there a better way to get to the RAMBlock than via a
>       * host address? */
>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>      rb_page_size = qemu_ram_pagesize(rb);
> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> +
> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> +        /* Easy case */
>  
> -    /* Silently ignore hugepage RAM blocks */
> -    if (rb_page_size != getpagesize()) {
> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
>          return;
>      }
>  
> -    /* Silently ignore unaligned requests */
> -    if (ram_offset & (rb_page_size - 1)) {
> -        return;
> +    /* Hard case
> +     *
> +     * We've put a piece of a larger host page into the balloon - we
> +     * need to keep track until we have a whole host page to
> +     * discard
> +     */
> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> +
> +    if (balloon->pbp
> +        && (rb != balloon->pbp->rb
> +            || host_page_base != balloon->pbp->base)) {
> +        /* We've partially ballooned part of a host page, but now
> +         * we're trying to balloon part of a different one.  Too hard,
> +         * give up on the old partial page */
> +        warn_report("Unable to insert a partial page into virtio-balloon");

I am pretty sure that you can create misleading warnings in case you
migrate at the wrong time. (migrate while half the 64k page is inflated,
on the new host the other part is inflated - a warning when switching to
another 64k page).

> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
>      }
>  
> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> -    /* We ignore errors from ram_block_discard_range(), because it has
> -     * already reported them, and failing to discard a balloon page is
> -     * not fatal */
> +    if (!balloon->pbp) {
> +        /* Starting on a new host page */
> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> +        balloon->pbp->rb = rb;
> +        balloon->pbp->base = host_page_base;
> +    }
> +
> +    bitmap_set(balloon->pbp->bitmap,
> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> +               subpages);
> +
> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> +        /* We've accumulated a full host page, we can actually discard
> +         * it now */
> +
> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
> +
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
> +    }
>  }
No, not a fan of this approach.
David Hildenbrand Oct. 12, 2018, 8:32 a.m. | #2
On 12/10/2018 05:24, David Gibson wrote:
> The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> on the host side, we can only actually discard memory in units of the host
> page size.
> 
> At present we handle this very badly: we silently ignore balloon requests
> that aren't host page aligned, and for requests that are host page aligned
> we discard the entire host page.  The latter potentially corrupts guest
> memory if its page size is smaller than the host's.
> 
> We could just disable the balloon if the host page size is not 4kiB, but
> that would break a the special case where host and guest have the same page
> size, but that's larger than 4kiB.  Thius case currently works by accident:
> when the guest puts its page into the balloon, it will submit balloon
> requests for each 4kiB subpage.  Most will be ignored, but the one which
> happens to be host page aligned will discard the whole lot.
> 
> This occurs in practice routinely for POWER KVM systems, since both host
> and guest typically use 64kiB pages.
> 
> To make this safe, without breaking that useful case, we need to
> accumulate 4kiB balloon requests until we have a whole contiguous host page
> at which point we can discard it.
> 
> We could in principle do that across all guest memory, but it would require
> a large bitmap to track.  This patch represents a compromise: instead we
> track ballooned subpages for a single contiguous host page at a time.  This
> means that if the guest discards all 4kiB chunks of a host page in
> succession, we will discard it.  In particular that means the balloon will
> continue to work for the (host page size) == (guest page size) > 4kiB case.
> 
> If the guest scatters 4kiB requests across different host pages, we don't
> discard anything, and issue a warning.  Not ideal, but at least we don't
> corrupt guest memory as the previous version could.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 4435905c87..39573ef2e3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,33 +33,80 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +typedef struct PartiallyBalloonedPage {
> +    RAMBlock *rb;
> +    ram_addr_t base;
> +    unsigned long bitmap[];

BTW, it might be easier to only remember the last inflated page and
incrementing it when you see the successor.

initialize last_page to -1ull on realize/reset


if (QEMU_IS_ALIGNED(addr, PAGE_SIZE)) {
	/* start of a new potential page block */
	last_page == addr;
} else if (addr == last_page + BALLOON_PAGE_SIZE) {
	/* next successor */
	last_page == addr;
	if (QEMU_IS_ALIGNED(last_page + BALLOON_PAGE_SIZE, PAGE_SIZE)) {
		ramblock_discard()....
	}
} else {
	last_page = -1ull;
}


> +} PartiallyBalloonedPage;
> +
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr offset)
>  {
>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>      RAMBlock *rb;
>      size_t rb_page_size;
> -    ram_addr_t ram_offset;
> +    int subpages;
> +    ram_addr_t ram_offset, host_page_base;
>  
>      /* XXX is there a better way to get to the RAMBlock than via a
>       * host address? */
>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>      rb_page_size = qemu_ram_pagesize(rb);
> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> +
> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> +        /* Easy case */
>  
> -    /* Silently ignore hugepage RAM blocks */
> -    if (rb_page_size != getpagesize()) {
> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
>          return;
>      }
>  
> -    /* Silently ignore unaligned requests */
> -    if (ram_offset & (rb_page_size - 1)) {
> -        return;
> +    /* Hard case
> +     *
> +     * We've put a piece of a larger host page into the balloon - we
> +     * need to keep track until we have a whole host page to
> +     * discard
> +     */
> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> +
> +    if (balloon->pbp
> +        && (rb != balloon->pbp->rb
> +            || host_page_base != balloon->pbp->base)) {
> +        /* We've partially ballooned part of a host page, but now
> +         * we're trying to balloon part of a different one.  Too hard,
> +         * give up on the old partial page */
> +        warn_report("Unable to insert a partial page into virtio-balloon");
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
>      }
>  
> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> -    /* We ignore errors from ram_block_discard_range(), because it has
> -     * already reported them, and failing to discard a balloon page is
> -     * not fatal */
> +    if (!balloon->pbp) {
> +        /* Starting on a new host page */
> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> +        balloon->pbp->rb = rb;
> +        balloon->pbp->base = host_page_base;
> +    }
> +
> +    bitmap_set(balloon->pbp->bitmap,
> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> +               subpages);
> +
> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> +        /* We've accumulated a full host page, we can actually discard
> +         * it now */
> +
> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
> +
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
> +    }
>  }
>  
>  static const char *balloon_stat_names[] = {
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index e0df3528c8..99dcd6d105 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -30,6 +30,8 @@ typedef struct virtio_balloon_stat_modern {
>         uint64_t val;
>  } VirtIOBalloonStatModern;
>  
> +typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
> +
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
>      VirtQueue *ivq, *dvq, *svq;
> @@ -42,6 +44,7 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    PartiallyBalloonedPage *pbp;
>  } VirtIOBalloon;
>  
>  #endif
>
David Gibson Oct. 13, 2018, 6:40 a.m. | #3
On Fri, Oct 12, 2018 at 10:06:50AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> > on the host side, we can only actually discard memory in units of the host
> > page size.
> > 
> > At present we handle this very badly: we silently ignore balloon requests
> > that aren't host page aligned, and for requests that are host page aligned
> > we discard the entire host page.  The latter potentially corrupts guest
> > memory if its page size is smaller than the host's.
> > 
> > We could just disable the balloon if the host page size is not 4kiB, but
> > that would break a the special case where host and guest have the same page
> > size, but that's larger than 4kiB.  Thius case currently works by accident:
> > when the guest puts its page into the balloon, it will submit balloon
> > requests for each 4kiB subpage.  Most will be ignored, but the one which
> > happens to be host page aligned will discard the whole lot.
> 
> Actually I would vote for disabling it if the host page size is not 4k.
> This is broken by design and should not be worked around.

I really don't think that's feasible.  This is not some theoretical
edge case: essentially every KVM on POWER system out there with
balloon enabled is in this situation.  And AIUI, a bunch of common
management layers routinely enable the balloon.  Just disabling this
for non-4k systems will break existing, supported production setups.

> I consider 64k a similar case as huge pages in the hypervisor (from a
> virtio-balloon interface point of view - which is broken by design for
> these cases).

Technically it is very similar, but again, this is in routine use on
real systems out there.

> > This occurs in practice routinely for POWER KVM systems, since both host
> > and guest typically use 64kiB pages.
> > 
> > To make this safe, without breaking that useful case, we need to
> > accumulate 4kiB balloon requests until we have a whole contiguous host page
> > at which point we can discard it.
> 
> ... and if you have a 4k guest it will just randomly report pages and
> your 67 LOC tweak is useless.

Yes.

> And this can and will easily happen.

What cases are you thinking of?  For POWER at least, 4k on 64k will be
vastly less common than 64k on 64k.

> > We could in principle do that across all guest memory, but it would require
> > a large bitmap to track.  This patch represents a compromise: instead we
> 
> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
> ... migration?)

Quite, that's why I didn't do it.  Although, I don't actually think
migration is such an issue: we *already* essentially lose track of
which pages are inside the balloon across migration.

> > track ballooned subpages for a single contiguous host page at a time.  This
> > means that if the guest discards all 4kiB chunks of a host page in
> > succession, we will discard it.  In particular that means the balloon will
> > continue to work for the (host page size) == (guest page size) > 4kiB case.
> > 
> > If the guest scatters 4kiB requests across different host pages, we don't
> > discard anything, and issue a warning.  Not ideal, but at least we don't
> > corrupt guest memory as the previous version could.
> 
> My take would be to somehow disable the balloon on the hypervisor side
> in case the host page size is not 4k. Like not allowing to realize it.
> No corruptions, no warnings people will ignore.

No, that's even worse than just having it silently do nothing on
non-4k setups.  Silently being useless we might get away with, we'll
just waste memory, but failing the device load will absolutely break
existing setups.

> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >  include/hw/virtio/virtio-balloon.h |  3 ++
> >  2 files changed, 60 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 4435905c87..39573ef2e3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -33,33 +33,80 @@
> >  
> >  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >  
> > +typedef struct PartiallyBalloonedPage {
> > +    RAMBlock *rb;
> > +    ram_addr_t base;
> > +    unsigned long bitmap[];
> > +} PartiallyBalloonedPage;
> > +
> >  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >                                   MemoryRegion *mr, hwaddr offset)
> >  {
> >      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >      RAMBlock *rb;
> >      size_t rb_page_size;
> > -    ram_addr_t ram_offset;
> > +    int subpages;
> > +    ram_addr_t ram_offset, host_page_base;
> >  
> >      /* XXX is there a better way to get to the RAMBlock than via a
> >       * host address? */
> >      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >      rb_page_size = qemu_ram_pagesize(rb);
> > +    host_page_base = ram_offset & ~(rb_page_size - 1);
> > +
> > +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> > +        /* Easy case */
> >  
> > -    /* Silently ignore hugepage RAM blocks */
> > -    if (rb_page_size != getpagesize()) {
> > +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> >          return;
> >      }
> >  
> > -    /* Silently ignore unaligned requests */
> > -    if (ram_offset & (rb_page_size - 1)) {
> > -        return;
> > +    /* Hard case
> > +     *
> > +     * We've put a piece of a larger host page into the balloon - we
> > +     * need to keep track until we have a whole host page to
> > +     * discard
> > +     */
> > +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> > +
> > +    if (balloon->pbp
> > +        && (rb != balloon->pbp->rb
> > +            || host_page_base != balloon->pbp->base)) {
> > +        /* We've partially ballooned part of a host page, but now
> > +         * we're trying to balloon part of a different one.  Too hard,
> > +         * give up on the old partial page */
> > +        warn_report("Unable to insert a partial page into virtio-balloon");
> 
> I am pretty sure that you can create misleading warnings in case you
> migrate at the wrong time. (migrate while half the 64k page is inflated,
> on the new host the other part is inflated - a warning when switching to
> another 64k page).

Yes we can get bogus warnings across migration with this.  I was
considering that an acceptable price, but I'm open to better
alternatives.

> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> >      }
> >  
> > -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> > -    /* We ignore errors from ram_block_discard_range(), because it has
> > -     * already reported them, and failing to discard a balloon page is
> > -     * not fatal */
> > +    if (!balloon->pbp) {
> > +        /* Starting on a new host page */
> > +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> > +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> > +        balloon->pbp->rb = rb;
> > +        balloon->pbp->base = host_page_base;
> > +    }
> > +
> > +    bitmap_set(balloon->pbp->bitmap,
> > +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> > +               subpages);
> > +
> > +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> > +        /* We've accumulated a full host page, we can actually discard
> > +         * it now */
> > +
> > +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> > +
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> > +    }
> >  }
> No, not a fan of this approach.

I can see why, but I really can't see what else to do without breaking
existing, supported, working (albeit by accident) setups.
David Gibson Oct. 13, 2018, 6:41 a.m. | #4
On Fri, Oct 12, 2018 at 10:32:34AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> > on the host side, we can only actually discard memory in units of the host
> > page size.
> > 
> > At present we handle this very badly: we silently ignore balloon requests
> > that aren't host page aligned, and for requests that are host page aligned
> > we discard the entire host page.  The latter potentially corrupts guest
> > memory if its page size is smaller than the host's.
> > 
> > We could just disable the balloon if the host page size is not 4kiB, but
> > that would break a the special case where host and guest have the same page
> > size, but that's larger than 4kiB.  Thius case currently works by accident:
> > when the guest puts its page into the balloon, it will submit balloon
> > requests for each 4kiB subpage.  Most will be ignored, but the one which
> > happens to be host page aligned will discard the whole lot.
> > 
> > This occurs in practice routinely for POWER KVM systems, since both host
> > and guest typically use 64kiB pages.
> > 
> > To make this safe, without breaking that useful case, we need to
> > accumulate 4kiB balloon requests until we have a whole contiguous host page
> > at which point we can discard it.
> > 
> > We could in principle do that across all guest memory, but it would require
> > a large bitmap to track.  This patch represents a compromise: instead we
> > track ballooned subpages for a single contiguous host page at a time.  This
> > means that if the guest discards all 4kiB chunks of a host page in
> > succession, we will discard it.  In particular that means the balloon will
> > continue to work for the (host page size) == (guest page size) > 4kiB case.
> > 
> > If the guest scatters 4kiB requests across different host pages, we don't
> > discard anything, and issue a warning.  Not ideal, but at least we don't
> > corrupt guest memory as the previous version could.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >  include/hw/virtio/virtio-balloon.h |  3 ++
> >  2 files changed, 60 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 4435905c87..39573ef2e3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -33,33 +33,80 @@
> >  
> >  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >  
> > +typedef struct PartiallyBalloonedPage {
> > +    RAMBlock *rb;
> > +    ram_addr_t base;
> > +    unsigned long bitmap[];
> 
> BTW, it might be easier to only remember the last inflated page and
> incrementing it when you see the successor.

That would be marginally simpler, but I was preferring not to rely on
the guest always doing things in a particular order.

> 
> initialize last_page to -1ull on realize/reset
> 
> 
> if (QEMU_IS_ALIGNED(addr, PAGE_SIZE)) {
> 	/* start of a new potential page block */
> 	last_page == addr;
> } else if (addr == last_page + BALLOON_PAGE_SIZE) {
> 	/* next successor */
> 	last_page == addr;
> 	if (QEMU_IS_ALIGNED(last_page + BALLOON_PAGE_SIZE, PAGE_SIZE)) {
> 		ramblock_discard()....
> 	}
> } else {
> 	last_page = -1ull;
> }
> 
> 
> > +} PartiallyBalloonedPage;
> > +
> >  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >                                   MemoryRegion *mr, hwaddr offset)
> >  {
> >      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >      RAMBlock *rb;
> >      size_t rb_page_size;
> > -    ram_addr_t ram_offset;
> > +    int subpages;
> > +    ram_addr_t ram_offset, host_page_base;
> >  
> >      /* XXX is there a better way to get to the RAMBlock than via a
> >       * host address? */
> >      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >      rb_page_size = qemu_ram_pagesize(rb);
> > +    host_page_base = ram_offset & ~(rb_page_size - 1);
> > +
> > +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> > +        /* Easy case */
> >  
> > -    /* Silently ignore hugepage RAM blocks */
> > -    if (rb_page_size != getpagesize()) {
> > +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> >          return;
> >      }
> >  
> > -    /* Silently ignore unaligned requests */
> > -    if (ram_offset & (rb_page_size - 1)) {
> > -        return;
> > +    /* Hard case
> > +     *
> > +     * We've put a piece of a larger host page into the balloon - we
> > +     * need to keep track until we have a whole host page to
> > +     * discard
> > +     */
> > +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> > +
> > +    if (balloon->pbp
> > +        && (rb != balloon->pbp->rb
> > +            || host_page_base != balloon->pbp->base)) {
> > +        /* We've partially ballooned part of a host page, but now
> > +         * we're trying to balloon part of a different one.  Too hard,
> > +         * give up on the old partial page */
> > +        warn_report("Unable to insert a partial page into virtio-balloon");
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> >      }
> >  
> > -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> > -    /* We ignore errors from ram_block_discard_range(), because it has
> > -     * already reported them, and failing to discard a balloon page is
> > -     * not fatal */
> > +    if (!balloon->pbp) {
> > +        /* Starting on a new host page */
> > +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> > +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> > +        balloon->pbp->rb = rb;
> > +        balloon->pbp->base = host_page_base;
> > +    }
> > +
> > +    bitmap_set(balloon->pbp->bitmap,
> > +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> > +               subpages);
> > +
> > +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> > +        /* We've accumulated a full host page, we can actually discard
> > +         * it now */
> > +
> > +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> > +
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> > +    }
> >  }
> >  
> >  static const char *balloon_stat_names[] = {
> > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > index e0df3528c8..99dcd6d105 100644
> > --- a/include/hw/virtio/virtio-balloon.h
> > +++ b/include/hw/virtio/virtio-balloon.h
> > @@ -30,6 +30,8 @@ typedef struct virtio_balloon_stat_modern {
> >         uint64_t val;
> >  } VirtIOBalloonStatModern;
> >  
> > +typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
> > +
> >  typedef struct VirtIOBalloon {
> >      VirtIODevice parent_obj;
> >      VirtQueue *ivq, *dvq, *svq;
> > @@ -42,6 +44,7 @@ typedef struct VirtIOBalloon {
> >      int64_t stats_last_update;
> >      int64_t stats_poll_interval;
> >      uint32_t host_features;
> > +    PartiallyBalloonedPage *pbp;
> >  } VirtIOBalloon;
> >  
> >  #endif
> > 
> 
>
David Hildenbrand Oct. 15, 2018, 7:08 a.m. | #5
>>> This occurs in practice routinely for POWER KVM systems, since both host
>>> and guest typically use 64kiB pages.
>>>
>>> To make this safe, without breaking that useful case, we need to
>>> accumulate 4kiB balloon requests until we have a whole contiguous host page
>>> at which point we can discard it.
>>
>> ... and if you have a 4k guest it will just randomly report pages and
>> your 67 LOC tweak is useless.
> 
> Yes.
> 
>> And this can and will easily happen.
> 
> What cases are you thinking of?  For POWER at least, 4k on 64k will be
> vastly less common than 64k on 64k.

Okay, I was wondering why we don't get tons of bug reports for 4k on
64k. So 64k on 64k while using ballooning is supported and heavily used
then I guess? Because as you said, 4k on 64k triggers memory corruptions.

> 
>>> We could in principle do that across all guest memory, but it would require
>>> a large bitmap to track.  This patch represents a compromise: instead we
>>
>> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
>> ... migration?)
> 
> Quite, that's why I didn't do it.  Although, I don't actually think
> migration is such an issue: we *already* essentially lose track of
> which pages are inside the balloon across migration.

Well, we migrate zero pages that get replaces by zero pages on the
target. So at least KSM could recover them.

> 
>>> track ballooned subpages for a single contiguous host page at a time.  This
>>> means that if the guest discards all 4kiB chunks of a host page in
>>> succession, we will discard it.  In particular that means the balloon will
>>> continue to work for the (host page size) == (guest page size) > 4kiB case.
>>>
>>> If the guest scatters 4kiB requests across different host pages, we don't
>>> discard anything, and issue a warning.  Not ideal, but at least we don't
>>> corrupt guest memory as the previous version could.
>>
>> My take would be to somehow disable the balloon on the hypervisor side
>> in case the host page size is not 4k. Like not allowing to realize it.
>> No corruptions, no warnings people will ignore.
> 
> No, that's even worse than just having it silently do nothing on
> non-4k setups.  Silently being useless we might get away with, we'll
> just waste memory, but failing the device load will absolutely break
> existing setups.

Silently consume more memory is very very evil. Think about
auto-ballooning setups
https://www.ovirt.org/documentation/how-to/autoballooning/

But on the other hand, this has been broken forever for huge pages
without printing warnings. Oh man, virtio-balloon ...

Disallowing to realize will only break migration from an old host to a
new host. But migration will bail out right away. We could of course
glue this to a compat machine, but I guess the point you have is that
customer want to continue using this "works by accident without memory
corruptions" virtio-balloon implementation.

> 
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> ---
>>>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>>>  include/hw/virtio/virtio-balloon.h |  3 ++
>>>  2 files changed, 60 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>>> index 4435905c87..39573ef2e3 100644
>>> --- a/hw/virtio/virtio-balloon.c
>>> +++ b/hw/virtio/virtio-balloon.c
>>> @@ -33,33 +33,80 @@
>>>  
>>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>>  
>>> +typedef struct PartiallyBalloonedPage {
>>> +    RAMBlock *rb;
>>> +    ram_addr_t base;
>>> +    unsigned long bitmap[];
>>> +} PartiallyBalloonedPage;
>>> +
>>>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>>>                                   MemoryRegion *mr, hwaddr offset)
>>>  {
>>>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>>>      RAMBlock *rb;
>>>      size_t rb_page_size;
>>> -    ram_addr_t ram_offset;
>>> +    int subpages;
>>> +    ram_addr_t ram_offset, host_page_base;
>>>  
>>>      /* XXX is there a better way to get to the RAMBlock than via a
>>>       * host address? */
>>>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>>>      rb_page_size = qemu_ram_pagesize(rb);
>>> +    host_page_base = ram_offset & ~(rb_page_size - 1);
>>> +
>>> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
>>> +        /* Easy case */
>>>  
>>> -    /* Silently ignore hugepage RAM blocks */
>>> -    if (rb_page_size != getpagesize()) {
>>> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>> +         * has already reported them, and failing to discard a balloon
>>> +         * page is not fatal */
>>>          return;
>>>      }
>>>  
>>> -    /* Silently ignore unaligned requests */
>>> -    if (ram_offset & (rb_page_size - 1)) {
>>> -        return;
>>> +    /* Hard case
>>> +     *
>>> +     * We've put a piece of a larger host page into the balloon - we
>>> +     * need to keep track until we have a whole host page to
>>> +     * discard
>>> +     */
>>> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
>>> +
>>> +    if (balloon->pbp
>>> +        && (rb != balloon->pbp->rb
>>> +            || host_page_base != balloon->pbp->base)) {
>>> +        /* We've partially ballooned part of a host page, but now
>>> +         * we're trying to balloon part of a different one.  Too hard,
>>> +         * give up on the old partial page */
>>> +        warn_report("Unable to insert a partial page into virtio-balloon");
>>
>> I am pretty sure that you can create misleading warnings in case you
>> migrate at the wrong time. (migrate while half the 64k page is inflated,
>> on the new host the other part is inflated - a warning when switching to
>> another 64k page).
> 
> Yes we can get bogus warnings across migration with this.  I was
> considering that an acceptable price, but I'm open to better
> alternatives.

Is maybe reporting a warning on a 64k host when realizing the better
approach than on every inflation?

"host page size does not match virtio-balloon page size. If the guest
has a different page size than the host, inflating the balloon might not
effectively free up memory."

Or reporting a warning whenever changing the balloon target size.

> 
>>> +        free(balloon->pbp);
>>> +        balloon->pbp = NULL;
>>>      }
>>>  
>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
>>> -    /* We ignore errors from ram_block_discard_range(), because it has
>>> -     * already reported them, and failing to discard a balloon page is
>>> -     * not fatal */
>>> +    if (!balloon->pbp) {
>>> +        /* Starting on a new host page */
>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
>>> +        balloon->pbp->rb = rb;
>>> +        balloon->pbp->base = host_page_base;
>>> +    }
>>> +
>>> +    bitmap_set(balloon->pbp->bitmap,
>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
>>> +               subpages);
>>> +
>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
>>> +        /* We've accumulated a full host page, we can actually discard
>>> +         * it now */
>>> +
>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>> +         * has already reported them, and failing to discard a balloon
>>> +         * page is not fatal */
>>> +
>>> +        free(balloon->pbp);
>>> +        balloon->pbp = NULL;
>>> +    }
>>>  }
>> No, not a fan of this approach.
> 
> I can see why, but I really can't see what else to do without breaking
> existing, supported, working (albeit by accident) setups.
> 

Is there any reason to use this more complicated "allow random freeing"
approach over a simplistic sequential freeing I propose? Then we can
simply migrate the last freed page and should be fine.

As far as I know Linux guests have been freeing and reporting these
pages sequentially, or is that not true? Are you aware of other
implementations that we might miss?
David Gibson Oct. 17, 2018, 3:28 a.m. | #6
On Mon, Oct 15, 2018 at 09:08:45AM +0200, David Hildenbrand wrote:
> >>> This occurs in practice routinely for POWER KVM systems, since both host
> >>> and guest typically use 64kiB pages.
> >>>
> >>> To make this safe, without breaking that useful case, we need to
> >>> accumulate 4kiB balloon requests until we have a whole contiguous host page
> >>> at which point we can discard it.
> >>
> >> ... and if you have a 4k guest it will just randomly report pages and
> >> your 67 LOC tweak is useless.
> > 
> > Yes.
> > 
> >> And this can and will easily happen.
> > 
> > What cases are you thinking of?  For POWER at least, 4k on 64k will be
> > vastly less common than 64k on 64k.
> 
> Okay, I was wondering why we don't get tons of bug reports for 4k on
> 64k.

Right, and the answer is because 64k has been the normal configuration
on every major ppc64 distro for years (for both host and guest).

> So 64k on 64k while using ballooning is supported and heavily used
> then I guess? Because as you said, 4k on 64k triggers memory corruptions.

Right.

> >>> We could in principle do that across all guest memory, but it would require
> >>> a large bitmap to track.  This patch represents a compromise: instead we
> >>
> >> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
> >> ... migration?)
> > 
> > Quite, that's why I didn't do it.  Although, I don't actually think
> > migration is such an issue: we *already* essentially lose track of
> > which pages are inside the balloon across migration.
> 
> Well, we migrate zero pages that get replaces by zero pages on the
> target. So at least KSM could recover them.

Right, but there's no explicit state being transferred.  In a sense,
KSM and zero page handling across migration resets the balloon to a
state optimized for the current memory state of the guest.

> >>> track ballooned subpages for a single contiguous host page at a time.  This
> >>> means that if the guest discards all 4kiB chunks of a host page in
> >>> succession, we will discard it.  In particular that means the balloon will
> >>> continue to work for the (host page size) == (guest page size) > 4kiB case.
> >>>
> >>> If the guest scatters 4kiB requests across different host pages, we don't
> >>> discard anything, and issue a warning.  Not ideal, but at least we don't
> >>> corrupt guest memory as the previous version could.
> >>
> >> My take would be to somehow disable the balloon on the hypervisor side
> >> in case the host page size is not 4k. Like not allowing to realize it.
> >> No corruptions, no warnings people will ignore.
> > 
> > No, that's even worse than just having it silently do nothing on
> > non-4k setups.  Silently being useless we might get away with, we'll
> > just waste memory, but failing the device load will absolutely break
> > existing setups.
> 
> Silently consume more memory is very very evil. Think about
> auto-ballooning setups
> https://www.ovirt.org/documentation/how-to/autoballooning/

Well, sure, I don't think this is a good idea.

> But on the other hand, this has been broken forever for huge pages
> without printing warnings. Oh man, virtio-balloon ...
> 
> Disallowing to realize will only break migration from an old host to a
> new host. But migration will bail out right away.

The issue isn't really migration here.  If we prevent the balloon
device from realizing, guests which already had it configured simply
won't be able to start after qemu is updated.

> We could of course
> glue this to a compat machine,

We could, but what would it accomplish?  We'd still have to do
something for the old machine versions - so we're back at either
allowing memory corruption like we do right now, or batch gathering as
my patch proposes.

> but I guess the point you have is that
> customer want to continue using this "works by accident without memory
> corruptions" virtio-balloon implementation.

Right.  That's why I really think we need this batch-gathering
approach, ugly and irritating though it is.

> 
> > 
> >>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>> ---
> >>>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >>>  include/hw/virtio/virtio-balloon.h |  3 ++
> >>>  2 files changed, 60 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >>> index 4435905c87..39573ef2e3 100644
> >>> --- a/hw/virtio/virtio-balloon.c
> >>> +++ b/hw/virtio/virtio-balloon.c
> >>> @@ -33,33 +33,80 @@
> >>>  
> >>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >>>  
> >>> +typedef struct PartiallyBalloonedPage {
> >>> +    RAMBlock *rb;
> >>> +    ram_addr_t base;
> >>> +    unsigned long bitmap[];
> >>> +} PartiallyBalloonedPage;
> >>> +
> >>>  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >>>                                   MemoryRegion *mr, hwaddr offset)
> >>>  {
> >>>      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >>>      RAMBlock *rb;
> >>>      size_t rb_page_size;
> >>> -    ram_addr_t ram_offset;
> >>> +    int subpages;
> >>> +    ram_addr_t ram_offset, host_page_base;
> >>>  
> >>>      /* XXX is there a better way to get to the RAMBlock than via a
> >>>       * host address? */
> >>>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >>>      rb_page_size = qemu_ram_pagesize(rb);
> >>> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> >>> +
> >>> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> >>> +        /* Easy case */
> >>>  
> >>> -    /* Silently ignore hugepage RAM blocks */
> >>> -    if (rb_page_size != getpagesize()) {
> >>> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>> +         * has already reported them, and failing to discard a balloon
> >>> +         * page is not fatal */
> >>>          return;
> >>>      }
> >>>  
> >>> -    /* Silently ignore unaligned requests */
> >>> -    if (ram_offset & (rb_page_size - 1)) {
> >>> -        return;
> >>> +    /* Hard case
> >>> +     *
> >>> +     * We've put a piece of a larger host page into the balloon - we
> >>> +     * need to keep track until we have a whole host page to
> >>> +     * discard
> >>> +     */
> >>> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> >>> +
> >>> +    if (balloon->pbp
> >>> +        && (rb != balloon->pbp->rb
> >>> +            || host_page_base != balloon->pbp->base)) {
> >>> +        /* We've partially ballooned part of a host page, but now
> >>> +         * we're trying to balloon part of a different one.  Too hard,
> >>> +         * give up on the old partial page */
> >>> +        warn_report("Unable to insert a partial page into virtio-balloon");
> >>
> >> I am pretty sure that you can create misleading warnings in case you
> >> migrate at the wrong time. (migrate while half the 64k page is inflated,
> >> on the new host the other part is inflated - a warning when switching to
> >> another 64k page).
> > 
> > Yes we can get bogus warnings across migration with this.  I was
> > considering that an acceptable price, but I'm open to better
> > alternatives.
> 
> Is maybe reporting a warning on a 64k host when realizing the better
> approach than on every inflation?
> 
> "host page size does not match virtio-balloon page size. If the guest
> has a different page size than the host, inflating the balloon might not
> effectively free up memory."

That might work - I'll see what I can come up with.  One complication
is that theoretically at least, you can have multiple host page sizes
(main memory in normal pages, a DIMM in hugepages).  That makes the
condition on which the warning should be issued a bit fiddly to work
out.

> Or reporting a warning whenever changing the balloon target size.

I'm not sure what you mean by this.

> 
> > 
> >>> +        free(balloon->pbp);
> >>> +        balloon->pbp = NULL;
> >>>      }
> >>>  
> >>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>> -    /* We ignore errors from ram_block_discard_range(), because it has
> >>> -     * already reported them, and failing to discard a balloon page is
> >>> -     * not fatal */
> >>> +    if (!balloon->pbp) {
> >>> +        /* Starting on a new host page */
> >>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> >>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> >>> +        balloon->pbp->rb = rb;
> >>> +        balloon->pbp->base = host_page_base;
> >>> +    }
> >>> +
> >>> +    bitmap_set(balloon->pbp->bitmap,
> >>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> >>> +               subpages);
> >>> +
> >>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> >>> +        /* We've accumulated a full host page, we can actually discard
> >>> +         * it now */
> >>> +
> >>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> >>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>> +         * has already reported them, and failing to discard a balloon
> >>> +         * page is not fatal */
> >>> +
> >>> +        free(balloon->pbp);
> >>> +        balloon->pbp = NULL;
> >>> +    }
> >>>  }
> >> No, not a fan of this approach.
> > 
> > I can see why, but I really can't see what else to do without breaking
> > existing, supported, working (albeit by accident) setups.
> 
> Is there any reason to use this more complicated "allow random freeing"
> approach over a simplistic sequential freeing I propose? Then we can
> simply migrate the last freed page and should be fine.

Well.. your approach is probably simpler in terms of the calculations
that need to be done, though only very slightly.  I think my approach
is conceptually clearer though, since we're explicitly checking for
exactly the condition we need, rather than something we thing should
match up with that condition.

Even if the extra robustness is strictly theoretical, I prefer the
idea of more robust but slightly longer code versus less robust and
slightly shorter code - at least when the latter will want a long
detailed comment explaining why it's correct.

> As far as I know Linux guests have been freeing and reporting these
> pages sequentially, or is that not true? Are you aware of other
> implementations that we might miss?

No, I'm not aware of any other implementations we're likely to care
about.
David Hildenbrand Oct. 17, 2018, 9:56 a.m. | #7
>>>> I am pretty sure that you can create misleading warnings in case you
>>>> migrate at the wrong time. (migrate while half the 64k page is inflated,
>>>> on the new host the other part is inflated - a warning when switching to
>>>> another 64k page).
>>>
>>> Yes we can get bogus warnings across migration with this.  I was
>>> considering that an acceptable price, but I'm open to better
>>> alternatives.
>>
>> Is maybe reporting a warning on a 64k host when realizing the better
>> approach than on every inflation?
>>
>> "host page size does not match virtio-balloon page size. If the guest
>> has a different page size than the host, inflating the balloon might not
>> effectively free up memory."
> 
> That might work - I'll see what I can come up with.  One complication
> is that theoretically at least, you can have multiple host page sizes
> (main memory in normal pages, a DIMM in hugepages).  That makes the
> condition on which the warning should be issued a bit fiddly to work
> out.

I assume issuing a warning on these strange systems would not hurt after
all. ("there is a chance this might not work")

> 
>> Or reporting a warning whenever changing the balloon target size.
> 
> I'm not sure what you mean by this.

I mean when setting e.g. qmp_balloon() to something != 0. This avoids a
warning when a virtio-balloon device is silently created (e.g. by
libvirt?) but never used.

Checking in virtio_balloon_to_target would be sufficient I guess.

> 
>>
>>>
>>>>> +        free(balloon->pbp);
>>>>> +        balloon->pbp = NULL;
>>>>>      }
>>>>>  
>>>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
>>>>> -    /* We ignore errors from ram_block_discard_range(), because it has
>>>>> -     * already reported them, and failing to discard a balloon page is
>>>>> -     * not fatal */
>>>>> +    if (!balloon->pbp) {
>>>>> +        /* Starting on a new host page */
>>>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
>>>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
>>>>> +        balloon->pbp->rb = rb;
>>>>> +        balloon->pbp->base = host_page_base;
>>>>> +    }
>>>>> +
>>>>> +    bitmap_set(balloon->pbp->bitmap,
>>>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
>>>>> +               subpages);
>>>>> +
>>>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
>>>>> +        /* We've accumulated a full host page, we can actually discard
>>>>> +         * it now */
>>>>> +
>>>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
>>>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>>>> +         * has already reported them, and failing to discard a balloon
>>>>> +         * page is not fatal */
>>>>> +
>>>>> +        free(balloon->pbp);
>>>>> +        balloon->pbp = NULL;
>>>>> +    }
>>>>>  }
>>>> No, not a fan of this approach.
>>>
>>> I can see why, but I really can't see what else to do without breaking
>>> existing, supported, working (albeit by accident) setups.
>>
>> Is there any reason to use this more complicated "allow random freeing"
>> approach over a simplistic sequential freeing I propose? Then we can
>> simply migrate the last freed page and should be fine.
> 
> Well.. your approach is probably simpler in terms of the calculations
> that need to be done, though only very slightly.  I think my approach
> is conceptually clearer though, since we're explicitly checking for
> exactly the condition we need, rather than something we thing should
> match up with that condition.

I prefer to keep it simple where possible. We expect sequential freeing,
so it's easy to implement with only one additional uint64_t that can be
easily migrated. Having to use bitmaps + alloc/free is not really needed.

If you insist, at least try to get rid of the malloc to e.g. simplify
migration. (otherwise, please add freeing code on unrealize(), I guess
you are missing that right now)
David Gibson Oct. 23, 2018, 8:02 a.m. | #8
On Wed, Oct 17, 2018 at 11:56:09AM +0200, David Hildenbrand wrote:
> >>>> I am pretty sure that you can create misleading warnings in case you
> >>>> migrate at the wrong time. (migrate while half the 64k page is inflated,
> >>>> on the new host the other part is inflated - a warning when switching to
> >>>> another 64k page).
> >>>
> >>> Yes we can get bogus warnings across migration with this.  I was
> >>> considering that an acceptable price, but I'm open to better
> >>> alternatives.
> >>
> >> Is maybe reporting a warning on a 64k host when realizing the better
> >> approach than on every inflation?
> >>
> >> "host page size does not match virtio-balloon page size. If the guest
> >> has a different page size than the host, inflating the balloon might not
> >> effectively free up memory."
> > 
> > That might work - I'll see what I can come up with.  One complication
> > is that theoretically at least, you can have multiple host page sizes
> > (main memory in normal pages, a DIMM in hugepages).  That makes the
> > condition on which the warning should be issued a bit fiddly to work
> > out.
> 
> I assume issuing a warning on these strange systems would not hurt after
> all. ("there is a chance this might not work")

Sure.

> >> Or reporting a warning whenever changing the balloon target size.
> > 
> > I'm not sure what you mean by this.
> 
> I mean when setting e.g. qmp_balloon() to something != 0. This avoids a
> warning when a virtio-balloon device is silently created (e.g. by
> libvirt?) but never used.

Ah, right.  Yeah, that's a good idea.

> Checking in virtio_balloon_to_target would be sufficient I guess.
> 
> > 
> >>
> >>>
> >>>>> +        free(balloon->pbp);
> >>>>> +        balloon->pbp = NULL;
> >>>>>      }
> >>>>>  
> >>>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>>>> -    /* We ignore errors from ram_block_discard_range(), because it has
> >>>>> -     * already reported them, and failing to discard a balloon page is
> >>>>> -     * not fatal */
> >>>>> +    if (!balloon->pbp) {
> >>>>> +        /* Starting on a new host page */
> >>>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> >>>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> >>>>> +        balloon->pbp->rb = rb;
> >>>>> +        balloon->pbp->base = host_page_base;
> >>>>> +    }
> >>>>> +
> >>>>> +    bitmap_set(balloon->pbp->bitmap,
> >>>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> >>>>> +               subpages);
> >>>>> +
> >>>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> >>>>> +        /* We've accumulated a full host page, we can actually discard
> >>>>> +         * it now */
> >>>>> +
> >>>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> >>>>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>>>> +         * has already reported them, and failing to discard a balloon
> >>>>> +         * page is not fatal */
> >>>>> +
> >>>>> +        free(balloon->pbp);
> >>>>> +        balloon->pbp = NULL;
> >>>>> +    }
> >>>>>  }
> >>>> No, not a fan of this approach.
> >>>
> >>> I can see why, but I really can't see what else to do without breaking
> >>> existing, supported, working (albeit by accident) setups.
> >>
> >> Is there any reason to use this more complicated "allow random freeing"
> >> approach over a simplistic sequential freeing I propose? Then we can
> >> simply migrate the last freed page and should be fine.
> > 
> > Well.. your approach is probably simpler in terms of the calculations
> > that need to be done, though only very slightly.  I think my approach
> > is conceptually clearer though, since we're explicitly checking for
> > exactly the condition we need, rather than something we thing should
> > match up with that condition.
> 
> I prefer to keep it simple where possible. We expect sequential freeing,
> so it's easy to implement with only one additional uint64_t that can be
> easily migrated. Having to use bitmaps + alloc/free is not really needed.
> 
> If you insist, at least try to get rid of the malloc to e.g. simplify
> migration.

I don't think there's really anything to do for migration; we already
effectively lose the state of the balloon across migration.  I think
it's fine to lose a little extra transitional state.

> (otherwise, please add freeing code on unrealize(), I guess
> you are missing that right now)

Ah, good point, I need to fix that.
David Hildenbrand Oct. 23, 2018, 3:13 p.m. | #9
>>>>> I can see why, but I really can't see what else to do without breaking
>>>>> existing, supported, working (albeit by accident) setups.
>>>>
>>>> Is there any reason to use this more complicated "allow random freeing"
>>>> approach over a simplistic sequential freeing I propose? Then we can
>>>> simply migrate the last freed page and should be fine.
>>>
>>> Well.. your approach is probably simpler in terms of the calculations
>>> that need to be done, though only very slightly.  I think my approach
>>> is conceptually clearer though, since we're explicitly checking for
>>> exactly the condition we need, rather than something we thing should
>>> match up with that condition.
>>
>> I prefer to keep it simple where possible. We expect sequential freeing,
>> so it's easy to implement with only one additional uint64_t that can be
>> easily migrated. Having to use bitmaps + alloc/free is not really needed.
>>
>> If you insist, at least try to get rid of the malloc to e.g. simplify
>> migration.
> 
> I don't think there's really anything to do for migration; we already
> effectively lose the state of the balloon across migration.  I think
> it's fine to lose a little extra transitional state.

Yeah, I guess it would only make sense when trying to avoid misleading
warnings (depending on which approach you'll use for when to report
warnings).

Patch

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 4435905c87..39573ef2e3 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,33 +33,80 @@ 
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+typedef struct PartiallyBalloonedPage {
+    RAMBlock *rb;
+    ram_addr_t base;
+    unsigned long bitmap[];
+} PartiallyBalloonedPage;
+
 static void balloon_inflate_page(VirtIOBalloon *balloon,
                                  MemoryRegion *mr, hwaddr offset)
 {
     void *addr = memory_region_get_ram_ptr(mr) + offset;
     RAMBlock *rb;
     size_t rb_page_size;
-    ram_addr_t ram_offset;
+    int subpages;
+    ram_addr_t ram_offset, host_page_base;
 
     /* XXX is there a better way to get to the RAMBlock than via a
      * host address? */
     rb = qemu_ram_block_from_host(addr, false, &ram_offset);
     rb_page_size = qemu_ram_pagesize(rb);
+    host_page_base = ram_offset & ~(rb_page_size - 1);
+
+    if (rb_page_size == BALLOON_PAGE_SIZE) {
+        /* Easy case */
 
-    /* Silently ignore hugepage RAM blocks */
-    if (rb_page_size != getpagesize()) {
+        ram_block_discard_range(rb, ram_offset, rb_page_size);
+        /* We ignore errors from ram_block_discard_range(), because it
+         * has already reported them, and failing to discard a balloon
+         * page is not fatal */
         return;
     }
 
-    /* Silently ignore unaligned requests */
-    if (ram_offset & (rb_page_size - 1)) {
-        return;
+    /* Hard case
+     *
+     * We've put a piece of a larger host page into the balloon - we
+     * need to keep track until we have a whole host page to
+     * discard
+     */
+    subpages = rb_page_size / BALLOON_PAGE_SIZE;
+
+    if (balloon->pbp
+        && (rb != balloon->pbp->rb
+            || host_page_base != balloon->pbp->base)) {
+        /* We've partially ballooned part of a host page, but now
+         * we're trying to balloon part of a different one.  Too hard,
+         * give up on the old partial page */
+        warn_report("Unable to insert a partial page into virtio-balloon");
+        free(balloon->pbp);
+        balloon->pbp = NULL;
     }
 
-    ram_block_discard_range(rb, ram_offset, rb_page_size);
-    /* We ignore errors from ram_block_discard_range(), because it has
-     * already reported them, and failing to discard a balloon page is
-     * not fatal */
+    if (!balloon->pbp) {
+        /* Starting on a new host page */
+        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
+        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
+        balloon->pbp->rb = rb;
+        balloon->pbp->base = host_page_base;
+    }
+
+    bitmap_set(balloon->pbp->bitmap,
+               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
+               subpages);
+
+    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
+        /* We've accumulated a full host page, we can actually discard
+         * it now */
+
+        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
+        /* We ignore errors from ram_block_discard_range(), because it
+         * has already reported them, and failing to discard a balloon
+         * page is not fatal */
+
+        free(balloon->pbp);
+        balloon->pbp = NULL;
+    }
 }
 
 static const char *balloon_stat_names[] = {
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index e0df3528c8..99dcd6d105 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -30,6 +30,8 @@  typedef struct virtio_balloon_stat_modern {
        uint64_t val;
 } VirtIOBalloonStatModern;
 
+typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
+
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
     VirtQueue *ivq, *dvq, *svq;
@@ -42,6 +44,7 @@  typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    PartiallyBalloonedPage *pbp;
 } VirtIOBalloon;
 
 #endif