[V2] balloon: Don't try fetching info if machine is stopped

Message ID 20100826060513.GF18351@amit-laptop.redhat.com
State New

Commit Message

Amit Shah Aug. 26, 2010, 6:05 a.m. UTC
On (Sun) Aug 22 2010 [16:54:06], Anthony Liguori wrote:
> On 08/19/2010 07:48 PM, Amit Shah wrote:
> >If the machine is stopped and 'info balloon' is invoked, the monitor
> >process just hangs waiting for info from the guest. Return the most
> >recent balloon data in that case.
> >
> >See https://bugzilla.redhat.com/show_bug.cgi?id=623903
> >
> >Reported-by:<lihuang@redhat.com>
> >Signed-off-by: Amit Shah<amit.shah@redhat.com>
> 
> !vm_running is just a special case of an unresponsive guest.  Even
> if the guest was running, if it was oops'd and the administrator
> didn't know, you would have the same issue.
> 
> I'd suggest using a timeout based on rt_clock.  If the stats request
> times out, print an appropriate message to the user.

This is what I have currently. It would need some timer handling in the
save/load case as well, right?

		Amit

From efe4a36423d7dec1aa9b142ac14c82c2dc80abe4 Mon Sep 17 00:00:00 2001
Message-Id: <efe4a36423d7dec1aa9b142ac14c82c2dc80abe4.1282802628.git.amit.shah@redhat.com>
From: Amit Shah <amit.shah@redhat.com>
Date: Thu, 26 Aug 2010 11:17:14 +0530
Subject: [PATCH] balloon: Don't try fetching info if guest is unresponsive

If the guest is unresponsive and 'info balloon' is invoked, the monitor
process just hangs waiting for info from the guest. Return the most
recent balloon data in that case.

A new timer is added, which on expiry, just presents the old data to the
monitor callback functions.

See https://bugzilla.redhat.com/show_bug.cgi?id=623903

Reported-by: <lihuang@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
---
 hw/virtio-balloon.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

Comments

Paolo Bonzini Aug. 26, 2010, 8:05 a.m. UTC | #1
On 08/26/2010 08:05 AM, Amit Shah wrote:
> This is what I have currently. It would need some timer handling in
> the save/load case as well, right?

When loading you won't have any pending "info balloon" command, so I 
think the timer need not be preserved across migration.

Also, 5 seconds for a stopped guest is actually a lot, so maybe Amit's 
original patch or a variant thereof would make sense anyway.

Paolo
Daniel P. Berrangé Aug. 26, 2010, 8:14 a.m. UTC | #2
On Thu, Aug 26, 2010 at 10:05:44AM +0200, Paolo Bonzini wrote:
> On 08/26/2010 08:05 AM, Amit Shah wrote:
> >This is what I have currently. It would need some timer handling in
> >the save/load case as well, right?
> 
> When loading you won't have any pending "info balloon" command, so I 
> think the timer need not be preserved across migration.
> 
> Also, 5 seconds for a stopped guest is actually a lot, so maybe Amit's 
> original patch or a variant thereof would make sense anyway.

We should have a combination of both. If we know the guest is stopped
we should return immediately, otherwise we should use the timer as a
way to cope with a crashed/evil guest.

Daniel
Amit Shah Aug. 26, 2010, 8:17 a.m. UTC | #3
On (Thu) Aug 26 2010 [10:05:44], Paolo Bonzini wrote:
> On 08/26/2010 08:05 AM, Amit Shah wrote:
> >This is what I have currently. It would need some timer handling in
> >the save/load case as well, right?
> 
> When loading you won't have any pending "info balloon" command, so I
> think the timer need not be preserved across migration.
> 
> Also, 5 seconds for a stopped guest is actually a lot,

That's the problem; it's policy. Where and how to specify it?

> so maybe
> Amit's original patch or a variant thereof would make sense anyway.

This seems to be needed though -- as Anthony mentioned, a guest which
has oopsed or similar, incapable of servicing the stats request, is
going to block the monitor command from returning forever.

So it's better to have a timeout, just that we need to decide how much
it should be.

		Amit
Paolo Bonzini Aug. 26, 2010, 8:19 a.m. UTC | #4
On 08/26/2010 10:17 AM, Amit Shah wrote:
>> >
>> >  Also, 5 seconds for a stopped guest is actually a lot,
> That's the problem; it's policy. Where and how to specify it?

For a crashed/oopsed guest even 10 seconds may be okay, as long as it's 
0 for a stopped guest.  We need both patches.

Paolo
Daniel P. Berrangé Aug. 26, 2010, 8:28 a.m. UTC | #5
On Thu, Aug 26, 2010 at 01:47:50PM +0530, Amit Shah wrote:
> On (Thu) Aug 26 2010 [10:05:44], Paolo Bonzini wrote:
> > On 08/26/2010 08:05 AM, Amit Shah wrote:
> > >This is what I have currently. It would need some timer handling in
> > >the save/load case as well, right?
> > 
> > When loading you won't have any pending "info balloon" command, so I
> > think the timer need not be preserved across migration.
> > 
> > Also, 5 seconds for a stopped guest is actually a lot,
> 
> That's the problem; it's policy. Where and how to specify it?

It is unfortunate that this is policy, but we just have to accept
that the current query-balloon command is a flawed design. IMHO
we should just hardcode the timeout at 5 seconds as you do (plus
immediate return for paused guests). Then focus on adding new 
monitor commands/events to deal with balloon query in a way 
that doesn't require this kind of policy in QEMU, and deprecate 
the existing query-balloon command.

Regards,
Daniel
Luiz Capitulino Aug. 26, 2010, 12:57 p.m. UTC | #6
On Thu, 26 Aug 2010 09:28:42 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Thu, Aug 26, 2010 at 01:47:50PM +0530, Amit Shah wrote:
> > On (Thu) Aug 26 2010 [10:05:44], Paolo Bonzini wrote:
> > > On 08/26/2010 08:05 AM, Amit Shah wrote:
> > > >This is what I have currently. It would need some timer handling in
> > > >the save/load case as well, right?
> > > 
> > > When loading you won't have any pending "info balloon" command, so I
> > > think the timer need not be preserved across migration.
> > > 
> > > Also, 5 seconds for a stopped guest is actually a lot,
> > 
> > That's the problem; it's policy. Where and how to specify it?
> 
> It is unfortunate that this is policy, but we just have to accept
> that the current query-balloon command is a flawed design. IMHO
> we should just hardcode the timeout at 5 seconds as you do (plus
> immediate return for paused guests). Then focus on adding new 
> monitor commands/events to deal with balloon query in a way 
> that doesn't require this kind of policy in QEMU, and deprecate 
> the existing query-balloon command.

Agreed, but it's not just that: we've never correctly specified how
commands that talk with the guest should behave.

  *brain dump warning*

We were talking about making all commands work as synchronous and
asynchronous. If we do that, then we'll need a 'global' timeout
for all synchronous commands. We could have a default value and a
command to set it.

 *brain dump warning ends*

I really don't know what to do for 0.13. Probably the hard-coded timer is
the best solution we have, but I'm wondering if it's going to cause
problems in the near future, when we get proper asynchronous command
support.
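
The "global timeout with a default value and a command to set it" idea can be sketched roughly as below. This is only an illustration of the shape such a setting might take: `set_sync_timeout_ms`, `get_sync_timeout_ms` and `sync_deadline_ms` are hypothetical names, not existing monitor commands or QEMU functions, and the 5000 ms default is borrowed from the patch under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical default, taken from the 5 second value in the patch. */
enum { DEFAULT_SYNC_TIMEOUT_MS = 5000 };

/* One process-wide timeout shared by every synchronous command. */
static int64_t sync_timeout_ms = DEFAULT_SYNC_TIMEOUT_MS;

/* What a "set timeout" command handler might call.
 * Returns 0 on success, -1 if the value is rejected. */
int set_sync_timeout_ms(int64_t ms)
{
    if (ms <= 0) {
        return -1; /* keep the previous value */
    }
    sync_timeout_ms = ms;
    return 0;
}

int64_t get_sync_timeout_ms(void)
{
    return sync_timeout_ms;
}

/* Deadline for a synchronous command that starts at now_ms. */
int64_t sync_deadline_ms(int64_t now_ms)
{
    assert(now_ms >= 0);
    return now_ms + sync_timeout_ms;
}
```

Each command handler would compute its deadline through `sync_deadline_ms()` instead of embedding its own constant, so the policy lives in one place.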
Anthony Liguori Aug. 26, 2010, 1:22 p.m. UTC | #7
On 08/26/2010 03:14 AM, Daniel P. Berrange wrote:
> On Thu, Aug 26, 2010 at 10:05:44AM +0200, Paolo Bonzini wrote:
>    
>> On 08/26/2010 08:05 AM, Amit Shah wrote:
>>      
>>> This is what I have currently. It would need some timer handling in
>>> the save/load case as well, right?
>>>        
>> When loading you won't have any pending "info balloon" command, so I
>> think the timer need not be preserved across migration.
>>
>> Also, 5 seconds for a stopped guest is actually a lot, so maybe Amit's
>> original patch or a variant thereof would make sense anyway.
>>      
> We should have a combination of both. If we know the guest is stopped
> we should return immediately, otherwise we should use the timer as a
> way to cope with a crashed/evil guest.
>    

Stopped doesn't necessarily mean that it's permanently stopped or even 
that a user has stopped it.

We stop a guest during live migration and in some other cases (like on 
disk error).

Returning immediately is an optimization on something that should be a 
proper fix.  Otherwise, you have a guest initiated DoS attack on 
management tools.

Regards,

Anthony Liguori

> Daniel
>
Paolo Bonzini Aug. 26, 2010, 1:30 p.m. UTC | #8
On 08/26/2010 02:57 PM, Luiz Capitulino wrote:
> I really don't know what to do for 0.13. Probably the hard-coded timer is
> the best solution we have, but I'm wondering if it's going to cause
> problems in the near future, when we get proper asynchronous command
> support.

Just make it a different command, or make it dependent on whether the 
initial handshaking activated the asynchronous command capability.

Paolo

Patch

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 9fe3886..1ec03b3 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -40,6 +40,7 @@  typedef struct VirtIOBalloon
     size_t stats_vq_offset;
     MonitorCompletion *stats_callback;
     void *stats_opaque_callback_data;
+    QEMUTimer *timer;
 } VirtIOBalloon;
 
 static VirtIOBalloon *to_virtio_balloon(VirtIODevice *vdev)
@@ -137,6 +138,11 @@  static void complete_stats_request(VirtIOBalloon *vb)
     vb->stats_callback = NULL;
 }
 
+static void stats_timer_expired(void *opaque)
+{
+    complete_stats_request(opaque);
+}
+
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = DO_UPCAST(VirtIOBalloon, vdev, vdev);
@@ -148,6 +154,8 @@  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
         return;
     }
 
+    qemu_del_timer(s->timer);
+
     /* Initialize the stats to get rid of any stale values.  This is only
      * needed to handle the case where a guest supports fewer stats than it
      * used to (ie. it has booted into an old kernel).
@@ -215,6 +223,7 @@  static void virtio_balloon_to_target(void *opaque, ram_addr_t target,
         dev->stats_callback = cb;
         dev->stats_opaque_callback_data = cb_data; 
         if (dev->vdev.guest_features & (1 << VIRTIO_BALLOON_F_STATS_VQ)) {
+            qemu_mod_timer(dev->timer, qemu_get_clock(rt_clock) + 5000);
             virtqueue_push(dev->svq, &dev->stats_vq_elem, dev->stats_vq_offset);
             virtio_notify(&dev->vdev, dev->svq);
         } else {
@@ -267,6 +276,8 @@  VirtIODevice *virtio_balloon_init(DeviceState *dev)
     s->dvq = virtio_add_queue(&s->vdev, 128, virtio_balloon_handle_output);
     s->svq = virtio_add_queue(&s->vdev, 128, virtio_balloon_receive_stats);
 
+    s->timer = qemu_new_timer(rt_clock, stats_timer_expired, s);
+
     reset_stats(s);
     qemu_add_balloon_handler(virtio_balloon_to_target, s);
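
The timer behaviour added by this patch can be modelled outside QEMU with a small self-contained sketch. It is not the patch's code: `StatsRequest`, `start_query`, `on_guest_reply`, `on_clock_tick`, `Recorder` and `record_result` are hypothetical names, and a plain millisecond counter stands in for `rt_clock`. The point it illustrates is that a pending query is completed exactly once, either by the guest's reply or by the deadline passing, and that a timeout reports the most recent cached value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*StatsCallback)(void *opaque, int64_t actual);

typedef struct {
    int64_t cached_actual; /* most recent value reported by the guest */
    StatsCallback cb;      /* NULL when no query is pending */
    void *cb_opaque;
    int timer_armed;
    int64_t deadline_ms;
} StatsRequest;

/* Example callback: records how often it ran and the last value seen. */
typedef struct { int calls; int64_t last; } Recorder;

void record_result(void *opaque, int64_t actual)
{
    Recorder *rec = opaque;
    rec->calls++;
    rec->last = actual;
}

/* Complete the pending query, if any, with whatever data is cached. */
static void complete_request(StatsRequest *r)
{
    if (r->cb == NULL) {
        return; /* already completed: nothing to report */
    }
    StatsCallback cb = r->cb;
    r->cb = NULL;        /* clear first so a second completion is a no-op */
    r->timer_armed = 0;  /* the equivalent of deleting the timer */
    cb(r->cb_opaque, r->cached_actual);
}

/* Begin a query and arm the timeout. */
void start_query(StatsRequest *r, StatsCallback cb, void *opaque,
                 int64_t now_ms, int64_t timeout_ms)
{
    assert(r != NULL && cb != NULL);
    r->cb = cb;
    r->cb_opaque = opaque;
    r->deadline_ms = now_ms + timeout_ms;
    r->timer_armed = 1;
}

/* Guest answered: refresh the cache, then complete if still pending. */
void on_guest_reply(StatsRequest *r, int64_t fresh_actual)
{
    r->cached_actual = fresh_actual;
    complete_request(r);
}

/* Clock advanced: if the deadline passed, fall back to the cached data. */
void on_clock_tick(StatsRequest *r, int64_t now_ms)
{
    if (r->timer_armed && now_ms >= r->deadline_ms) {
        complete_request(r);
    }
}
```

In this model a reply that arrives after the timeout only refreshes `cached_actual`; the monitor callback is not invoked a second time because `cb` was cleared when the timer fired.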