Message ID | 1302878695-10256-1-git-send-email-dave@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
On 04/15/2011 09:44 AM, Dave Hansen wrote: > There's a patch pending on LKML at the moment: > > http://lkml.org/lkml/2011/4/7/101 > > The virtio balloon driver has a VIRTIO_BALLOON_F_MUST_TELL_HOST > feature bit. As of now, qemu-kvm defines the bit, but doesn't set it. > feature bit. Whenever the bit is set, the guest kernel must always > tell the host before we free pages back to the allocator. Without > this feature, we might free a page (and have another user touch it) > while the hypervisor is unprepared for it. > > But, if the bit is _not_ set, there is no obligation to reverse the > order; we're under no obligation to do _anything_. > > This patch makes the "tell host first" logic the only case. This > should make everybody happy, and reduce the amount of untested or > untestable code in the kernel. It doesn't make me happy. Why would we do this in QEMU? This prevents the guest from doing ballooning reclaim during OOM. Regards, Anthony Liguori > Signed-off-by: Dave Hansen<dave@linux.vnet.ibm.com> > Cc: Amit Shah<amit.shah@redhat.com> > Cc: Anthony Liguori<aliguori@linux.vnet.ibm.com> > --- > hw/virtio-balloon.c | 4 ++++ > 1 files changed, 4 insertions(+), 0 deletions(-) > > diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c > index 1e6be18..71b0864 100644 > --- a/hw/virtio-balloon.c > +++ b/hw/virtio-balloon.c > @@ -198,6 +198,10 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, > static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f) > { > f |= (1<< VIRTIO_BALLOON_F_STATS_VQ); > + /* Kernels>= 2.6.40 do this unconditionally, so set > + * the bit to make old kernels do the same thing > + */ > + f |= (1<< VIRTIO_BALLOON_F_MUST_TELL_HOST); > return f; > } >
On Fri, 2011-04-15 at 11:21 -0500, Anthony Liguori wrote: > > This patch makes the "tell host first" logic the only case. This > > should make everybody happy, and reduce the amount of untested or > > untestable code in the kernel. > > It doesn't make me happy. Darn. > Why would we do this in QEMU? This prevents the guest from doing > ballooning reclaim during OOM. What the heck is "ballooning reclaim"? Could you elaborate a bit on how this happens? I think I'm missing some subtlety here. I think 'tell host first' is the only sane way to do it. Look at the 'tell host second code': release_pages_by_pfn(); // let other kernel users at the pages tell_host(); // tell the hypervisor they're used again At the point we've started using the pages again, we haven't *told* the host that we're using them. I think that's potentially a problem. Is qemu somehow cool with the guest touching pages that are supposed to be in the balloon and unusable? -- Dave
On 04/15/2011 11:36 AM, Dave Hansen wrote: > On Fri, 2011-04-15 at 11:21 -0500, Anthony Liguori wrote: >>> This patch makes the "tell host first" logic the only case. This >>> should make everybody happy, and reduce the amount of untested or >>> untestable code in the kernel. >> It doesn't make me happy. > Darn. > >> Why would we do this in QEMU? This prevents the guest from doing >> ballooning reclaim during OOM. > What the heck is "ballooning reclaim"? Could you elaborate a bit on how > this happens? I think I'm missing some subtlety here. If you're in OOM and you need memory, you can't ask the host for more and wait for a response. You have to reclaim it immediately. See the s390 balloon driver for an example of this. > I think 'tell host first' is the only sane way to do it. Look at the > 'tell host second code': > > release_pages_by_pfn(); // let other kernel users at the pages > tell_host(); // tell the hypervisor they're used again > At the point we've started using the pages again, we haven't *told* the > host that we're using them. I think that's potentially a problem. Is > qemu somehow cool with the guest touching pages that are supposed to be > in the balloon and unusable? Yes, of course it is. All ballooning does is madvise(MADV_DONTNEED). A guest can reclaim the memory as soon as it wants it to. Regards, Anthony Liguori > -- Dave >
On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote: > On 04/15/2011 11:36 AM, Dave Hansen wrote: > >> Why would we do this in QEMU? This prevents the guest from doing > >> ballooning reclaim during OOM. > > What the heck is "ballooning reclaim"? Could you elaborate a bit on how > > this happens? I think I'm missing some subtlety here. > > If you're in OOM and you need memory, you can't ask the host for more > and wait for a response. You have to reclaim it immediately. Why not? The call in to the notifier chain the s390 case is synchronous. The OOM only affects one task at a time and won't proceed elsewhere while this is going on. > See the s390 balloon driver for an example of this. I could buy that in an OOM situation, but that's not what virtio-balloon is doing. > > I think 'tell host first' is the only sane way to do it. Look at the > > 'tell host second code': > > > > release_pages_by_pfn(); // let other kernel users at the pages > > tell_host(); // tell the hypervisor they're used again > > > At the point we've started using the pages again, we haven't *told* the > > host that we're using them. I think that's potentially a problem. Is > > qemu somehow cool with the guest touching pages that are supposed to be > > in the balloon and unusable? > > Yes, of course it is. All ballooning does is madvise(MADV_DONTNEED). A > guest can reclaim the memory as soon as it wants it to. That's not true on all hypervisors, but as long as it's a guarantee in qemu, I guess we're OK here. Let's hope nobody else ever starts to use virtio-balloon. Why do we even _tell_ qemu, though? The MADV_WILLNEED is nice, but far from being necessary. We could just skip the entire notification in OOM situations. -- Dave
On 04/15/2011 02:15 PM, Dave Hansen wrote: > On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote: >> On 04/15/2011 11:36 AM, Dave Hansen wrote: >>>> Why would we do this in QEMU? This prevents the guest from doing >>>> ballooning reclaim during OOM. >>> What the heck is "ballooning reclaim"? Could you elaborate a bit on how >>> this happens? I think I'm missing some subtlety here. >> If you're in OOM and you need memory, you can't ask the host for more >> and wait for a response. You have to reclaim it immediately. > Why not? The call in to the notifier chain the s390 case is > synchronous. The OOM only affects one task at a time and won't proceed > elsewhere while this is going on. Because if we tell the host, we have to wait for the host to ack which means we'd sleep waiting for an interrupt. Can you do this in the OOM path? >>> At the point we've started using the pages again, we haven't *told* the >>> host that we're using them. I think that's potentially a problem. Is >>> qemu somehow cool with the guest touching pages that are supposed to be >>> in the balloon and unusable? >> Yes, of course it is. All ballooning does is madvise(MADV_DONTNEED). A >> guest can reclaim the memory as soon as it wants it to. > That's not true on all hypervisors, but as long as it's a guarantee in > qemu, I guess we're OK here. Let's hope nobody else ever starts to use > virtio-balloon. If you haven't figured out yet, I'm kind of partial to a certain hypervisor.... That's why this option exists in the first place. To let QEMU do it's thing and for other hypervisors that are not quite as clever, they could advertise this bit. > Why do we even _tell_ qemu, though? The MADV_WILLNEED is nice, but far > from being necessary. We could just skip the entire notification in OOM > situations. There's really no particular reason to other than symmetry. Regards, Anthony Liguori > -- Dave > >
On Fri, 2011-04-15 at 14:20 -0500, Anthony Liguori wrote: > On 04/15/2011 02:15 PM, Dave Hansen wrote: > > On Fri, 2011-04-15 at 12:17 -0500, Anthony Liguori wrote: > >> If you're in OOM and you need memory, you can't ask the host for more > >> and wait for a response. You have to reclaim it immediately. > > Why not? The call in to the notifier chain the s390 case is > > synchronous. The OOM only affects one task at a time and won't proceed > > elsewhere while this is going on. > > Because if we tell the host, we have to wait for the host to ack which > means we'd sleep waiting for an interrupt. Can you do this in the OOM path? Sure. One of the s390 handlers sleeps today. > > Why do we even _tell_ qemu, though? The MADV_WILLNEED is nice, but far > > from being necessary. We could just skip the entire notification in OOM > > situations. > > There's really no particular reason to other than symmetry. When we get there, it sounds like we can simply _skip_ the notification in the qemu case. -- Dave
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c index 1e6be18..71b0864 100644 --- a/hw/virtio-balloon.c +++ b/hw/virtio-balloon.c @@ -198,6 +198,10 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f) { f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); + /* Kernels >= 2.6.40 do this unconditionally, so set + * the bit to make old kernels do the same thing + */ + f |= (1 << VIRTIO_BALLOON_F_MUST_TELL_HOST); return f; }
There's a patch pending on LKML at the moment: http://lkml.org/lkml/2011/4/7/101 The virtio balloon driver has a VIRTIO_BALLOON_F_MUST_TELL_HOST feature bit. As of now, qemu-kvm defines the bit, but doesn't set it. feature bit. Whenever the bit is set, the guest kernel must always tell the host before we free pages back to the allocator. Without this feature, we might free a page (and have another user touch it) while the hypervisor is unprepared for it. But, if the bit is _not_ set, there is no obligation to reverse the order; we're under no obligation to do _anything_. This patch makes the "tell host first" logic the only case. This should make everybody happy, and reduce the amount of untested or untestable code in the kernel. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Amit Shah <amit.shah@redhat.com> Cc: Anthony Liguori <aliguori@linux.vnet.ibm.com> --- hw/virtio-balloon.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-)