diff mbox

set VIRTIO_BALLOON_F_MUST_TELL_HOST unconditionally

Message ID 1302878695-10256-1-git-send-email-dave@linux.vnet.ibm.com
State New
Headers show

Commit Message

Dave Hansen April 15, 2011, 2:44 p.m. UTC
There's a patch pending on LKML at the moment:

	http://lkml.org/lkml/2011/4/7/101

The virtio balloon driver has a VIRTIO_BALLOON_F_MUST_TELL_HOST
feature bit.  As of now, qemu-kvm defines the bit, but doesn't set it.
feature bit.  Whenever the bit is set, the guest kernel must always
tell the host before we free pages back to the allocator.  Without
this feature, we might free a page (and have another user touch it)
while the hypervisor is unprepared for it.

But, if the bit is _not_ set, there is no obligation to reverse the
order; we're under no obligation to do _anything_.

This patch makes the "tell host first" logic the only case.  This
should make everybody happy, and reduce the amount of untested or
untestable code in the kernel.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Anthony Liguori <aliguori@linux.vnet.ibm.com>
---
 hw/virtio-balloon.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

Comments

Anthony Liguori April 15, 2011, 4:21 p.m. UTC | #1
On 04/15/2011 09:44 AM, Dave Hansen wrote:
> There's a patch pending on LKML at the moment:
>
> 	http://lkml.org/lkml/2011/4/7/101
>
> The virtio balloon driver has a VIRTIO_BALLOON_F_MUST_TELL_HOST
> feature bit.  As of now, qemu-kvm defines the bit, but doesn't set it.
> feature bit.  Whenever the bit is set, the guest kernel must always
> tell the host before we free pages back to the allocator.  Without
> this feature, we might free a page (and have another user touch it)
> while the hypervisor is unprepared for it.
>
> But, if the bit is _not_ set, there is no obligation to reverse the
> order; we're under no obligation to do _anything_.
>
> This patch makes the "tell host first" logic the only case.  This
> should make everybody happy, and reduce the amount of untested or
> untestable code in the kernel.

It doesn't make me happy.

Why would we do this in QEMU?  This prevents the guest from doing 
ballooning reclaim during OOM.

Regards,

Anthony Liguori

> Signed-off-by: Dave Hansen<dave@linux.vnet.ibm.com>
> Cc: Amit Shah<amit.shah@redhat.com>
> Cc: Anthony Liguori<aliguori@linux.vnet.ibm.com>
> ---
>   hw/virtio-balloon.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
> index 1e6be18..71b0864 100644
> --- a/hw/virtio-balloon.c
> +++ b/hw/virtio-balloon.c
> @@ -198,6 +198,10 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
>   static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
>   {
>       f |= (1<<  VIRTIO_BALLOON_F_STATS_VQ);
> +    /* Kernels>= 2.6.40 do this unconditionally, so set
> +     * the bit to make old kernels do the same thing
> +     */
> +    f |= (1<<  VIRTIO_BALLOON_F_MUST_TELL_HOST);
>       return f;
>   }
>
Dave Hansen April 15, 2011, 4:36 p.m. UTC | #2
On Fri, 2011-04-15 at 11:21 -0500, Anthony Liguori wrote:
> > This patch makes the "tell host first" logic the only case.  This
> > should make everybody happy, and reduce the amount of untested or
> > untestable code in the kernel.
> 
> It doesn't make me happy.

Darn.

> Why would we do this in QEMU?  This prevents the guest from doing 
> ballooning reclaim during OOM.

What the heck is "ballooning reclaim"?  Could you elaborate a bit on how
this happens?  I think I'm missing some subtlety here.

I think 'tell host first' is the only sane way to do it.  Look at the
'tell host second code':

	release_pages_by_pfn(); // let other kernel users at the pages
	tell_host(); // tell the hypervisor they're used again

At the point we've started using the pages again, we haven't *told* the
host that we're using them.  I think that's potentially a problem.  Is
qemu somehow cool with the guest touching pages that are supposed to be
in the balloon and unusable?

-- Dave
Anthony Liguori April 15, 2011, 5:17 p.m. UTC | #3
On 04/15/2011 11:36 AM, Dave Hansen wrote:
> On Fri, 2011-04-15 at 11:21 -0500, Anthony Liguori wrote:
>>> This patch makes the "tell host first" logic the only case.  This
>>> should make everybody happy, and reduce the amount of untested or
>>> untestable code in the kernel.
>> It doesn't make me happy.
> Darn.
>
>> Why would we do this in QEMU?  This prevents the guest from doing
>> ballooning reclaim during OOM.
> What the heck is "ballooning reclaim"?  Could you elaborate a bit on how
> this happens?  I think I'm missing some subtlety here.

If you're in OOM and you need memory, you can't ask the host for more 
and wait for a response.  You have to reclaim it immediately.

See the s390 balloon driver for an example of this.

> I think 'tell host first' is the only sane way to do it.  Look at the
> 'tell host second code':
>
> 	release_pages_by_pfn(); // let other kernel users at the pages
> 	tell_host(); // tell the hypervisor they're used again

> At the point we've started using the pages again, we haven't *told* the
> host that we're using them.  I think that's potentially a problem.  Is
> qemu somehow cool with the guest touching pages that are supposed to be
> in the balloon and unusable?

Yes, of course it is.  All ballooning does is madvise(MADV_DONTNEED).  A 
guest can reclaim the memory as soon as it wants it to.

Regards,

Anthony Liguori

> -- Dave
>
Dave Hansen April 15, 2011, 7:15 p.m. UTC | #4
On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote:
> On 04/15/2011 11:36 AM, Dave Hansen wrote:
> >> Why would we do this in QEMU?  This prevents the guest from doing
> >> ballooning reclaim during OOM.
> > What the heck is "ballooning reclaim"?  Could you elaborate a bit on how
> > this happens?  I think I'm missing some subtlety here.
> 
> If you're in OOM and you need memory, you can't ask the host for more 
> and wait for a response.  You have to reclaim it immediately.

Why not?  The call in to the notifier chain the s390 case is
synchronous.  The OOM only affects one task at a time and won't proceed
elsewhere while this is going on.  

> See the s390 balloon driver for an example of this.

I could buy that in an OOM situation, but that's not what virtio-balloon
is doing.  

> > I think 'tell host first' is the only sane way to do it.  Look at the
> > 'tell host second code':
> >
> > 	release_pages_by_pfn(); // let other kernel users at the pages
> > 	tell_host(); // tell the hypervisor they're used again
> 
> > At the point we've started using the pages again, we haven't *told* the
> > host that we're using them.  I think that's potentially a problem.  Is
> > qemu somehow cool with the guest touching pages that are supposed to be
> > in the balloon and unusable?
> 
> Yes, of course it is.  All ballooning does is madvise(MADV_DONTNEED).  A 
> guest can reclaim the memory as soon as it wants it to.

That's not true on all hypervisors, but as long as it's a guarantee in
qemu, I guess we're OK here.  Let's hope nobody else ever starts to use
virtio-balloon.

Why do we even _tell_ qemu, though?  The MADV_WILLNEED is nice, but far
from being necessary.  We could just skip the entire notification in OOM
situations.

-- Dave
Anthony Liguori April 15, 2011, 7:20 p.m. UTC | #5
On 04/15/2011 02:15 PM, Dave Hansen wrote:
> On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote:
>> On 04/15/2011 11:36 AM, Dave Hansen wrote:
>>>> Why would we do this in QEMU?  This prevents the guest from doing
>>>> ballooning reclaim during OOM.
>>> What the heck is "ballooning reclaim"?  Could you elaborate a bit on how
>>> this happens?  I think I'm missing some subtlety here.
>> If you're in OOM and you need memory, you can't ask the host for more
>> and wait for a response.  You have to reclaim it immediately.
> Why not?  The call in to the notifier chain the s390 case is
> synchronous.  The OOM only affects one task at a time and won't proceed
> elsewhere while this is going on.

Because if we tell the host, we have to wait for the host to ack which 
means we'd sleep waiting for an interrupt.  Can you do this in the OOM path?

>>> At the point we've started using the pages again, we haven't *told* the
>>> host that we're using them.  I think that's potentially a problem.  Is
>>> qemu somehow cool with the guest touching pages that are supposed to be
>>> in the balloon and unusable?
>> Yes, of course it is.  All ballooning does is madvise(MADV_DONTNEED).  A
>> guest can reclaim the memory as soon as it wants it to.
> That's not true on all hypervisors, but as long as it's a guarantee in
> qemu, I guess we're OK here.  Let's hope nobody else ever starts to use
> virtio-balloon.

If you haven't figured out yet, I'm kind of partial to a certain 
hypervisor....

That's why this option exists in the first place.  To let QEMU do it's 
thing and for other hypervisors that are not quite as clever, they could 
advertise this bit.

> Why do we even _tell_ qemu, though?  The MADV_WILLNEED is nice, but far
> from being necessary.  We could just skip the entire notification in OOM
> situations.

There's really no particular reason to other than symmetry.

Regards,

Anthony Liguori

> -- Dave
>
>
Dave Hansen April 18, 2011, 3:37 p.m. UTC | #6
On Fri, 2011-04-15 at 14:20 -0500, Anthony Liguori wrote: 
> On 04/15/2011 02:15 PM, Dave Hansen wrote:
> > On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote:
> >> If you're in OOM and you need memory, you can't ask the host for more
> >> and wait for a response.  You have to reclaim it immediately.
> > Why not?  The call in to the notifier chain the s390 case is
> > synchronous.  The OOM only affects one task at a time and won't proceed
> > elsewhere while this is going on.
> 
> Because if we tell the host, we have to wait for the host to ack which 
> means we'd sleep waiting for an interrupt.  Can you do this in the OOM path?

Sure.  One of the s390 handlers sleeps today.

> > Why do we even _tell_ qemu, though?  The MADV_WILLNEED is nice, but far
> > from being necessary.  We could just skip the entire notification in OOM
> > situations.
> 
> There's really no particular reason to other than symmetry.

When we get there, it sounds like we can simply _skip_ the notification
in the qemu case.  

-- Dave
diff mbox

Patch

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 1e6be18..71b0864 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -198,6 +198,10 @@  static void virtio_balloon_set_config(VirtIODevice *vdev,
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
 {
     f |= (1 << VIRTIO_BALLOON_F_STATS_VQ);
+    /* Kernels >= 2.6.40 do this unconditionally, so set
+     * the bit to make old kernels do the same thing
+     */
+    f |= (1 << VIRTIO_BALLOON_F_MUST_TELL_HOST);
     return f;
 }