Message ID | 1433845144-26889-2-git-send-email-den@openvz.org |
---|---|
State | New |
Headers | show |
Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: > Excessive virtio_balloon inflation can cause invocation of OOM-killer, > when Linux is under severe memory pressure. Various mechanisms are > responsible for correct virtio_balloon memory management. Nevertheless it > is often the case that these control tools does not have enough time to > react on fast changing memory load. As a result OS runs out of memory and > invokes OOM-killer. The balancing of memory by use of the virtio balloon > should not cause the termination of processes while there are pages in the > balloon. Now there is no way for virtio balloon driver to free memory at > the last moment before some process get killed by OOM-killer. > > This does not provide a security breach as balloon itself is running > inside Guest OS and is working in the cooperation with the host. Thus > some improvements from Guest side should be considered as normal. > > To solve the problem, introduce a virtio_balloon callback which is > expected to be called from the oom notifier call chain in out_of_memory() > function. If virtio balloon could release some memory, it will make the > system return and retry the allocation that forced the out of memory > killer to run. > > This behavior should be enabled if and only if appropriate feature bit > is set on the device. It is off by default. The balloon frees pages in this way static void balloon_page(void *addr, int deflate) { #if defined(__linux__) if (!kvm_enabled() || kvm_has_sync_mmu()) qemu_madvise(addr, TARGET_PAGE_SIZE, deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); #endif } The guest can re-touch that page and get a empty zero or the old page back without tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? Christian > > This functionality was recently merged into vanilla Linux. > > commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 > Author: Raushaniya Maksudova <rmaksudova@parallels.com> > Date: Mon Nov 10 09:36:29 2014 +1030 > > This patch adds respective control bits into QEMU. It introduces > deflate-on-oom option for balloon device which does the trick. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Raushaniya Maksudova <rmaksudova@virtuozzo.com> > CC: Anthony Liguori <aliguori@amazon.com> > CC: Michael S. Tsirkin <mst@redhat.com> > --- > hw/virtio/virtio-balloon.c | 6 ++++-- > include/hw/virtio/virtio-balloon.h | 1 + > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index f915c7b..d3f36f8 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -312,8 +312,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, > > static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f) > { > - f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); > - return f; > + VirtIOBalloon *dev = VIRTIO_BALLOON(vdev); > + return f | (1u << VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features; > } > > static void virtio_balloon_stat(void *opaque, BalloonInfo *info) > @@ -423,6 +423,8 @@ static void virtio_balloon_instance_init(Object *obj) > } > > static Property virtio_balloon_properties[] = { > + DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features, > + VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h > index 4ab8f54..7f49b1f 100644 > --- a/include/hw/virtio/virtio-balloon.h > +++ b/include/hw/virtio/virtio-balloon.h > @@ -36,6 +36,7 @@ typedef struct VirtIOBalloon { > QEMUTimer *stats_timer; > int64_t stats_last_update; > int64_t stats_poll_interval; > + uint32_t host_features; > } VirtIOBalloon; > > #endif >
On 09/06/15 13:19, Denis V. Lunev wrote: > Excessive virtio_balloon inflation can cause invocation of OOM-killer, > when Linux is under severe memory pressure. Various mechanisms are > responsible for correct virtio_balloon memory management. Nevertheless it > is often the case that these control tools does not have enough time to > react on fast changing memory load. As a result OS runs out of memory and > invokes OOM-killer. The balancing of memory by use of the virtio balloon > should not cause the termination of processes while there are pages in the > balloon. Now there is no way for virtio balloon driver to free memory at > the last moment before some process get killed by OOM-killer. > > This does not provide a security breach as balloon itself is running > inside Guest OS and is working in the cooperation with the host. Thus > some improvements from Guest side should be considered as normal. > > To solve the problem, introduce a virtio_balloon callback which is > expected to be called from the oom notifier call chain in out_of_memory() > function. If virtio balloon could release some memory, it will make the > system return and retry the allocation that forced the out of memory > killer to run. > > This behavior should be enabled if and only if appropriate feature bit > is set on the device. It is off by default. > > This functionality was recently merged into vanilla Linux. > > commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 > Author: Raushaniya Maksudova <rmaksudova@parallels.com> > Date: Mon Nov 10 09:36:29 2014 +1030 > > This patch adds respective control bits into QEMU. It introduces > deflate-on-oom option for balloon device which does the trick. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Raushaniya Maksudova <rmaksudova@virtuozzo.com> > CC: Anthony Liguori <aliguori@amazon.com> > CC: Michael S. Tsirkin <mst@redhat.com> > --- > hw/virtio/virtio-balloon.c | 6 ++++-- > include/hw/virtio/virtio-balloon.h | 1 + > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index f915c7b..d3f36f8 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -312,8 +312,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, > > static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f) > { > - f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); > - return f; > + VirtIOBalloon *dev = VIRTIO_BALLOON(vdev); > + return f | (1u << VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features; > } > > static void virtio_balloon_stat(void *opaque, BalloonInfo *info) > @@ -423,6 +423,8 @@ static void virtio_balloon_instance_init(Object *obj) > } > > static Property virtio_balloon_properties[] = { > + DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features, > + VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h > index 4ab8f54..7f49b1f 100644 > --- a/include/hw/virtio/virtio-balloon.h > +++ b/include/hw/virtio/virtio-balloon.h > @@ -36,6 +36,7 @@ typedef struct VirtIOBalloon { > QEMUTimer *stats_timer; > int64_t stats_last_update; > int64_t stats_poll_interval; > + uint32_t host_features; > } VirtIOBalloon; > > #endif Michael, I can rebase over uncommitted [PATCH 23/33] virtio-balloon: switch to virtio_add_feature which changes bits assignments for balloon. Should I? Regards, Den
On 09/06/15 13:37, Christian Borntraeger wrote: > Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: >> Excessive virtio_balloon inflation can cause invocation of OOM-killer, >> when Linux is under severe memory pressure. Various mechanisms are >> responsible for correct virtio_balloon memory management. Nevertheless it >> is often the case that these control tools does not have enough time to >> react on fast changing memory load. As a result OS runs out of memory and >> invokes OOM-killer. The balancing of memory by use of the virtio balloon >> should not cause the termination of processes while there are pages in the >> balloon. Now there is no way for virtio balloon driver to free memory at >> the last moment before some process get killed by OOM-killer. >> >> This does not provide a security breach as balloon itself is running >> inside Guest OS and is working in the cooperation with the host. Thus >> some improvements from Guest side should be considered as normal. >> >> To solve the problem, introduce a virtio_balloon callback which is >> expected to be called from the oom notifier call chain in out_of_memory() >> function. If virtio balloon could release some memory, it will make the >> system return and retry the allocation that forced the out of memory >> killer to run. >> >> This behavior should be enabled if and only if appropriate feature bit >> is set on the device. It is off by default. > The balloon frees pages in this way > > static void balloon_page(void *addr, int deflate) > { > #if defined(__linux__) > if (!kvm_enabled() || kvm_has_sync_mmu()) > qemu_madvise(addr, TARGET_PAGE_SIZE, > deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > #endif > } > > The guest can re-touch that page and get a empty zero or the old page back without > tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? > > Christian I'd like to do that :) Actually original version of kernel patch has enabled this unconditionally. But Michael asked to make it configurable and off by default. Den
On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: > On 09/06/15 13:37, Christian Borntraeger wrote: > >Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: > >>Excessive virtio_balloon inflation can cause invocation of OOM-killer, > >>when Linux is under severe memory pressure. Various mechanisms are > >>responsible for correct virtio_balloon memory management. Nevertheless it > >>is often the case that these control tools does not have enough time to > >>react on fast changing memory load. As a result OS runs out of memory and > >>invokes OOM-killer. The balancing of memory by use of the virtio balloon > >>should not cause the termination of processes while there are pages in the > >>balloon. Now there is no way for virtio balloon driver to free memory at > >>the last moment before some process get killed by OOM-killer. > >> > >>This does not provide a security breach as balloon itself is running > >>inside Guest OS and is working in the cooperation with the host. Thus > >>some improvements from Guest side should be considered as normal. > >> > >>To solve the problem, introduce a virtio_balloon callback which is > >>expected to be called from the oom notifier call chain in out_of_memory() > >>function. If virtio balloon could release some memory, it will make the > >>system return and retry the allocation that forced the out of memory > >>killer to run. > >> > >>This behavior should be enabled if and only if appropriate feature bit > >>is set on the device. It is off by default. > >The balloon frees pages in this way > > > >static void balloon_page(void *addr, int deflate) > >{ > >#if defined(__linux__) > > if (!kvm_enabled() || kvm_has_sync_mmu()) > > qemu_madvise(addr, TARGET_PAGE_SIZE, > > deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > >#endif > >} > > > >The guest can re-touch that page and get a empty zero or the old page back without > >tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? > > > >Christian > > I'd like to do that :) Actually original version of kernel patch > has enabled this unconditionally. But Michael asked to make > it configurable and off by default. > > Den That's not the question here. The question is why is it limited by kvm_has_sync_mmu.
On 10/06/15 16:13, Michael S. Tsirkin wrote: > On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: >> On 09/06/15 13:37, Christian Borntraeger wrote: >>> Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: >>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer, >>>> when Linux is under severe memory pressure. Various mechanisms are >>>> responsible for correct virtio_balloon memory management. Nevertheless it >>>> is often the case that these control tools does not have enough time to >>>> react on fast changing memory load. As a result OS runs out of memory and >>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon >>>> should not cause the termination of processes while there are pages in the >>>> balloon. Now there is no way for virtio balloon driver to free memory at >>>> the last moment before some process get killed by OOM-killer. >>>> >>>> This does not provide a security breach as balloon itself is running >>>> inside Guest OS and is working in the cooperation with the host. Thus >>>> some improvements from Guest side should be considered as normal. >>>> >>>> To solve the problem, introduce a virtio_balloon callback which is >>>> expected to be called from the oom notifier call chain in out_of_memory() >>>> function. If virtio balloon could release some memory, it will make the >>>> system return and retry the allocation that forced the out of memory >>>> killer to run. >>>> >>>> This behavior should be enabled if and only if appropriate feature bit >>>> is set on the device. It is off by default. >>> The balloon frees pages in this way >>> >>> static void balloon_page(void *addr, int deflate) >>> { >>> #if defined(__linux__) >>> if (!kvm_enabled() || kvm_has_sync_mmu()) >>> qemu_madvise(addr, TARGET_PAGE_SIZE, >>> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); >>> #endif >>> } >>> >>> The guest can re-touch that page and get a empty zero or the old page back without >>> tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? >>> >>> Christian >> I'd like to do that :) Actually original version of kernel patch >> has enabled this unconditionally. But Michael asked to make >> it configurable and off by default. >> >> Den > That's not the question here. The question is why is it limited by kvm_has_sync_mmu. > original comment about this is quite simple " Until 2.6.27, KVM forced memory pinning so we must disable ballooning unless the kernel actually supports it when using KVM. It's always safe when using TCG." Thus this check is a rudiment of a very-very old kernels. Actually I do not know whether current QEMU will start on such kernels :) Den
Am 10.06.2015 um 15:13 schrieb Michael S. Tsirkin: > On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: >> On 09/06/15 13:37, Christian Borntraeger wrote: >>> Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: >>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer, >>>> when Linux is under severe memory pressure. Various mechanisms are >>>> responsible for correct virtio_balloon memory management. Nevertheless it >>>> is often the case that these control tools does not have enough time to >>>> react on fast changing memory load. As a result OS runs out of memory and >>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon >>>> should not cause the termination of processes while there are pages in the >>>> balloon. Now there is no way for virtio balloon driver to free memory at >>>> the last moment before some process get killed by OOM-killer. >>>> >>>> This does not provide a security breach as balloon itself is running >>>> inside Guest OS and is working in the cooperation with the host. Thus >>>> some improvements from Guest side should be considered as normal. >>>> >>>> To solve the problem, introduce a virtio_balloon callback which is >>>> expected to be called from the oom notifier call chain in out_of_memory() >>>> function. If virtio balloon could release some memory, it will make the >>>> system return and retry the allocation that forced the out of memory >>>> killer to run. >>>> >>>> This behavior should be enabled if and only if appropriate feature bit >>>> is set on the device. It is off by default. >>> The balloon frees pages in this way >>> >>> static void balloon_page(void *addr, int deflate) >>> { >>> #if defined(__linux__) >>> if (!kvm_enabled() || kvm_has_sync_mmu()) >>> qemu_madvise(addr, TARGET_PAGE_SIZE, >>> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); >>> #endif >>> } >>> >>> The guest can re-touch that page and get a empty zero or the old page back without >>> tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? >>> >>> Christian >> >> I'd like to do that :) Actually original version of kernel patch >> has enabled this unconditionally. But Michael asked to make >> it configurable and off by default. >> >> Den > > That's not the question here. The question is why is it limited by kvm_has_sync_mmu. Well we have two interesting options here: VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM For any sane host with ondemand paging just re-accessing the page should simply work. So the common case could be VIRTIO_BALLOON_F_MUST_TELL_HOST == off VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on Only for the rare case of hypervisors without paging or other memory related restrictions we have to enable MUST_TELL_HOST. Now: QEMU knows exactly which case we have, so why not let QEMU tell the guest what the capabilities are. (e.g. sync_mmu ---> no need to tell the host). I can at least imaging that some admin wants to make the the oom case configurable, but a sane default seems to be to not kill random guest processes. Christian
On Fri, Jun 12, 2015 at 01:56:37PM +0200, Christian Borntraeger wrote: > Am 10.06.2015 um 15:13 schrieb Michael S. Tsirkin: > > On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: > >> On 09/06/15 13:37, Christian Borntraeger wrote: > >>> Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: > >>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer, > >>>> when Linux is under severe memory pressure. Various mechanisms are > >>>> responsible for correct virtio_balloon memory management. Nevertheless it > >>>> is often the case that these control tools does not have enough time to > >>>> react on fast changing memory load. As a result OS runs out of memory and > >>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon > >>>> should not cause the termination of processes while there are pages in the > >>>> balloon. Now there is no way for virtio balloon driver to free memory at > >>>> the last moment before some process get killed by OOM-killer. > >>>> > >>>> This does not provide a security breach as balloon itself is running > >>>> inside Guest OS and is working in the cooperation with the host. Thus > >>>> some improvements from Guest side should be considered as normal. > >>>> > >>>> To solve the problem, introduce a virtio_balloon callback which is > >>>> expected to be called from the oom notifier call chain in out_of_memory() > >>>> function. If virtio balloon could release some memory, it will make the > >>>> system return and retry the allocation that forced the out of memory > >>>> killer to run. > >>>> > >>>> This behavior should be enabled if and only if appropriate feature bit > >>>> is set on the device. It is off by default. > >>> The balloon frees pages in this way > >>> > >>> static void balloon_page(void *addr, int deflate) > >>> { > >>> #if defined(__linux__) > >>> if (!kvm_enabled() || kvm_has_sync_mmu()) > >>> qemu_madvise(addr, TARGET_PAGE_SIZE, > >>> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > >>> #endif > >>> } > >>> > >>> The guest can re-touch that page and get a empty zero or the old page back without > >>> tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? > >>> > >>> Christian > >> > >> I'd like to do that :) Actually original version of kernel patch > >> has enabled this unconditionally. But Michael asked to make > >> it configurable and off by default. > >> > >> Den > > > > That's not the question here. The question is why is it limited by kvm_has_sync_mmu. > > Well we have two interesting options here: > > VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM > > For any sane host with ondemand paging just re-accessing the page > should simply work. So the common case could be > VIRTIO_BALLOON_F_MUST_TELL_HOST == off Disabling this breaks useful optimizations such as ability not to migrate memory in the balloon. > VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on AFAIK management tools depend on balloon not deflating below host-specified threshold to avoid OOM on the host. So I don't think we can make this a default, management needs to enable this explicitly. > Only for the rare case of hypervisors without paging or other memory > related restrictions we have to enable MUST_TELL_HOST. > Now: QEMU knows exactly which case we have, so why not let QEMU tell > the guest what the capabilities are. (e.g. sync_mmu ---> no need to > tell the host). > > I can at least imaging that some admin wants to make the the oom case > configurable, but a sane default seems to be to not kill random > guest processes. > > Christian
Am 13.06.2015 um 22:10 schrieb Michael S. Tsirkin: > On Fri, Jun 12, 2015 at 01:56:37PM +0200, Christian Borntraeger wrote: >> Am 10.06.2015 um 15:13 schrieb Michael S. Tsirkin: >>> On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: >>>> On 09/06/15 13:37, Christian Borntraeger wrote: >>>>> Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: >>>>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer, >>>>>> when Linux is under severe memory pressure. Various mechanisms are >>>>>> responsible for correct virtio_balloon memory management. Nevertheless it >>>>>> is often the case that these control tools does not have enough time to >>>>>> react on fast changing memory load. As a result OS runs out of memory and >>>>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon >>>>>> should not cause the termination of processes while there are pages in the >>>>>> balloon. Now there is no way for virtio balloon driver to free memory at >>>>>> the last moment before some process get killed by OOM-killer. >>>>>> >>>>>> This does not provide a security breach as balloon itself is running >>>>>> inside Guest OS and is working in the cooperation with the host. Thus >>>>>> some improvements from Guest side should be considered as normal. >>>>>> >>>>>> To solve the problem, introduce a virtio_balloon callback which is >>>>>> expected to be called from the oom notifier call chain in out_of_memory() >>>>>> function. If virtio balloon could release some memory, it will make the >>>>>> system return and retry the allocation that forced the out of memory >>>>>> killer to run. >>>>>> >>>>>> This behavior should be enabled if and only if appropriate feature bit >>>>>> is set on the device. It is off by default. >>>>> The balloon frees pages in this way >>>>> >>>>> static void balloon_page(void *addr, int deflate) >>>>> { >>>>> #if defined(__linux__) >>>>> if (!kvm_enabled() || kvm_has_sync_mmu()) >>>>> qemu_madvise(addr, TARGET_PAGE_SIZE, >>>>> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); >>>>> #endif >>>>> } >>>>> >>>>> The guest can re-touch that page and get a empty zero or the old page back without >>>>> tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? >>>>> >>>>> Christian >>>> >>>> I'd like to do that :) Actually original version of kernel patch >>>> has enabled this unconditionally. But Michael asked to make >>>> it configurable and off by default. >>>> >>>> Den >>> >>> That's not the question here. The question is why is it limited by kvm_has_sync_mmu. >> >> Well we have two interesting options here: >> >> VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM >> >> For any sane host with ondemand paging just re-accessing the page >> should simply work. So the common case could be >> VIRTIO_BALLOON_F_MUST_TELL_HOST == off > > Disabling this breaks useful optimizations such as > ability not to migrate memory in the balloon. memory in the balloon is usually backed by the empty zero page after the madvise (WONT_NEED will finally result in zap_pte_range for the common case). In a ideal world migration should be able to optimize zero pages. >> VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on > > AFAIK management tools depend on balloon not deflating > below host-specified threshold to avoid OOM on the host. > So I don't think we can make this a default, > management needs to enable this explicitly. If the ballooning is required to keep the host memory managedment from OOM - iow abusing ballooning as memory hotplug between guests then yes better let the guest oom - that makes sense. Now: I think that doing so (not having enough swap in the host if all guests deflate) and relying on balloon semantics is fundamentally broken. Let me explain this: The problem is that we rely on guest cooperation for the host integrity. As I explained using madvise WONT_NEED will replace the current PTEs with invalid/emtpy PTEs. As soon as the guest kernel re-touches the page (e.g. a malicious kernel module - not the balloon driver) it will be backed by the VMAs default method - so usually with a shared R/O copy of the empty zero page. Write accesses will result in a copy-on-write and allocate new memory in the host. There is nothing we can do in the balloon protocol to protect the host against malicious guests allocating all the maximum memory. If you need host integrity against guest memory usage, something like cgroups_memory or so is probably the only reliable way. > >> Only for the rare case of hypervisors without paging or other memory >> related restrictions we have to enable MUST_TELL_HOST. >> Now: QEMU knows exactly which case we have, so why not let QEMU tell >> the guest what the capabilities are. (e.g. sync_mmu ---> no need to >> tell the host). >> >> I can at least imaging that some admin wants to make the the oom case >> configurable, but a sane default seems to be to not kill random >> guest processes. >> >> Christian > >
On Mon, Jun 15, 2015 at 09:01:53AM +0200, Christian Borntraeger wrote: > Am 13.06.2015 um 22:10 schrieb Michael S. Tsirkin: > > On Fri, Jun 12, 2015 at 01:56:37PM +0200, Christian Borntraeger wrote: > >> Am 10.06.2015 um 15:13 schrieb Michael S. Tsirkin: > >>> On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote: > >>>> On 09/06/15 13:37, Christian Borntraeger wrote: > >>>>> Am 09.06.2015 um 12:19 schrieb Denis V. Lunev: > >>>>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer, > >>>>>> when Linux is under severe memory pressure. Various mechanisms are > >>>>>> responsible for correct virtio_balloon memory management. Nevertheless it > >>>>>> is often the case that these control tools does not have enough time to > >>>>>> react on fast changing memory load. As a result OS runs out of memory and > >>>>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon > >>>>>> should not cause the termination of processes while there are pages in the > >>>>>> balloon. Now there is no way for virtio balloon driver to free memory at > >>>>>> the last moment before some process get killed by OOM-killer. > >>>>>> > >>>>>> This does not provide a security breach as balloon itself is running > >>>>>> inside Guest OS and is working in the cooperation with the host. Thus > >>>>>> some improvements from Guest side should be considered as normal. > >>>>>> > >>>>>> To solve the problem, introduce a virtio_balloon callback which is > >>>>>> expected to be called from the oom notifier call chain in out_of_memory() > >>>>>> function. If virtio balloon could release some memory, it will make the > >>>>>> system return and retry the allocation that forced the out of memory > >>>>>> killer to run. > >>>>>> > >>>>>> This behavior should be enabled if and only if appropriate feature bit > >>>>>> is set on the device. It is off by default. > >>>>> The balloon frees pages in this way > >>>>> > >>>>> static void balloon_page(void *addr, int deflate) > >>>>> { > >>>>> #if defined(__linux__) > >>>>> if (!kvm_enabled() || kvm_has_sync_mmu()) > >>>>> qemu_madvise(addr, TARGET_PAGE_SIZE, > >>>>> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > >>>>> #endif > >>>>> } > >>>>> > >>>>> The guest can re-touch that page and get a empty zero or the old page back without > >>>>> tampering the host integrity. This should work for all cases I am aware of (without sync_mmu its a nop anyway) so why not enable that by default? Anything that I missed? > >>>>> > >>>>> Christian > >>>> > >>>> I'd like to do that :) Actually original version of kernel patch > >>>> has enabled this unconditionally. But Michael asked to make > >>>> it configurable and off by default. > >>>> > >>>> Den > >>> > >>> That's not the question here. The question is why is it limited by kvm_has_sync_mmu. > >> > >> Well we have two interesting options here: > >> > >> VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM > >> > >> For any sane host with ondemand paging just re-accessing the page > >> should simply work. So the common case could be > >> VIRTIO_BALLOON_F_MUST_TELL_HOST == off > > > > Disabling this breaks useful optimizations such as > > ability not to migrate memory in the balloon. > > memory in the balloon is usually backed by the empty zero page after > the madvise (WONT_NEED will finally result in zap_pte_range for the > common case). In a ideal world migration should be able to optimize > zero pages. This still involves reading them in as opposed to just skipping them. > > >> VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on > > > > AFAIK management tools depend on balloon not deflating > > below host-specified threshold to avoid OOM on the host. > > So I don't think we can make this a default, > > management needs to enable this explicitly. > > If the ballooning is required to keep the host memory managedment > from OOM - iow abusing ballooning as memory hotplug between guests > then yes better let the guest oom - that makes sense. > > Now: I think that doing so (not having enough swap in the host if > all guests deflate) and relying on balloon semantics is fundamentally > broken. Let me explain this: The problem is that we rely on guest > cooperation for the host integrity. As I explained using madvise > WONT_NEED will replace the current PTEs with invalid/emtpy PTEs. As > soon as the guest kernel re-touches the page (e.g. a malicious > kernel module - not the balloon driver) it will be backed by the VMAs > default method - so usually with a shared R/O copy of the empty > zero page. Write accesses will result in a copy-on-write and allocate > new memory in the host. > There is nothing we can do in the balloon protocol to protect the host > against malicious guests allocating all the maximum memory. If we want to try and harden host, we can unmap it so guest will crash if it touches pages without deflate. > If you need host integrity against guest memory usage, something like > cgroups_memory or so is probably the only reliable way. In the original design, protection against a malicious guest is not the point of the balloon, it's a technology that let you overcommit cooperative guests. > > > >> Only for the rare case of hypervisors without paging or other memory > >> related restrictions we have to enable MUST_TELL_HOST. > >> Now: QEMU knows exactly which case we have, so why not let QEMU tell > >> the guest what the capabilities are. (e.g. sync_mmu ---> no need to > >> tell the host). > >> > >> I can at least imaging that some admin wants to make the the oom case > >> configurable, but a sane default seems to be to not kill random > >> guest processes. > >> > >> Christian > > > >
Am 15.06.2015 um 11:06 schrieb Michael S. Tsirkin: >>> AFAIK management tools depend on balloon not deflating >>> below host-specified threshold to avoid OOM on the host. >>> So I don't think we can make this a default, >>> management needs to enable this explicitly. >> >> If the ballooning is required to keep the host memory managedment >> from OOM - iow abusing ballooning as memory hotplug between guests >> then yes better let the guest oom - that makes sense. >> >> Now: I think that doing so (not having enough swap in the host if >> all guests deflate) and relying on balloon semantics is fundamentally >> broken. Let me explain this: The problem is that we rely on guest >> cooperation for the host integrity. As I explained using madvise >> WONT_NEED will replace the current PTEs with invalid/emtpy PTEs. As >> soon as the guest kernel re-touches the page (e.g. a malicious >> kernel module - not the balloon driver) it will be backed by the VMAs >> default method - so usually with a shared R/O copy of the empty >> zero page. Write accesses will result in a copy-on-write and allocate >> new memory in the host. >> There is nothing we can do in the balloon protocol to protect the host >> against malicious guests allocating all the maximum memory. > > If we want to try and harden host, we can unmap it so guest will crash > if it touches pages without deflate. > >> If you need host integrity against guest memory usage, something like >> cgroups_memory or so is probably the only reliable way. > > In the original design, protection against a malicious guest is not the > point of the balloon, it's a technology that let you overcommit > cooperative guests. Sure. But then its perfectly fine to let the guest reclaim by default, because your statement "AFAIK management tools depend on balloon not deflating below host-specified threshold to avoid OOM on the host. So I don't think we can make this a default, management needs to enable this explicitly." is not true ;-) Christian
On Mon, Jun 15, 2015 at 11:59:05AM +0200, Christian Borntraeger wrote: > Am 15.06.2015 um 11:06 schrieb Michael S. Tsirkin: > > >>> AFAIK management tools depend on balloon not deflating > >>> below host-specified threshold to avoid OOM on the host. > >>> So I don't think we can make this a default, > >>> management needs to enable this explicitly. > >> > >> If the ballooning is required to keep the host memory managedment > >> from OOM - iow abusing ballooning as memory hotplug between guests > >> then yes better let the guest oom - that makes sense. > >> > >> Now: I think that doing so (not having enough swap in the host if > >> all guests deflate) and relying on balloon semantics is fundamentally > >> broken. Let me explain this: The problem is that we rely on guest > >> cooperation for the host integrity. As I explained using madvise > >> WONT_NEED will replace the current PTEs with invalid/emtpy PTEs. As > >> soon as the guest kernel re-touches the page (e.g. a malicious > >> kernel module - not the balloon driver) it will be backed by the VMAs > >> default method - so usually with a shared R/O copy of the empty > >> zero page. Write accesses will result in a copy-on-write and allocate > >> new memory in the host. > >> There is nothing we can do in the balloon protocol to protect the host > >> against malicious guests allocating all the maximum memory. > > > > If we want to try and harden host, we can unmap it so guest will crash > > if it touches pages without deflate. > > > >> If you need host integrity against guest memory usage, something like > >> cgroups_memory or so is probably the only reliable way. > > > > In the original design, protection against a malicious guest is not the > > point of the balloon, it's a technology that let you overcommit > > cooperative guests. > > Sure. But then its perfectly fine to let the guest reclaim by default, > because your statement > "AFAIK management tools depend on balloon not deflating > below host-specified threshold to avoid OOM on the host. > So I don't think we can make this a default, > management needs to enable this explicitly." > > is not true ;-) > > Christian I don't see the connection. Deflate on OOM means "it's OK to deflate if you like, this won't cause any harm". If you set this flag, you can't overcommit host too agressively even with cooperative guests.
Am 15.06.2015 um 12:10 schrieb Michael S. Tsirkin: > On Mon, Jun 15, 2015 at 11:59:05AM +0200, Christian Borntraeger wrote: >> Am 15.06.2015 um 11:06 schrieb Michael S. Tsirkin: >> >>>>> AFAIK management tools depend on balloon not deflating >>>>> below host-specified threshold to avoid OOM on the host. >>>>> So I don't think we can make this a default, >>>>> management needs to enable this explicitly. >>>> >>>> If the ballooning is required to keep the host memory managedment >>>> from OOM - iow abusing ballooning as memory hotplug between guests >>>> then yes better let the guest oom - that makes sense. >>>> >>>> Now: I think that doing so (not having enough swap in the host if >>>> all guests deflate) and relying on balloon semantics is fundamentally >>>> broken. Let me explain this: The problem is that we rely on guest >>>> cooperation for the host integrity. As I explained using madvise >>>> WONT_NEED will replace the current PTEs with invalid/emtpy PTEs. As >>>> soon as the guest kernel re-touches the page (e.g. a malicious >>>> kernel module - not the balloon driver) it will be backed by the VMAs >>>> default method - so usually with a shared R/O copy of the empty >>>> zero page. Write accesses will result in a copy-on-write and allocate >>>> new memory in the host. >>>> There is nothing we can do in the balloon protocol to protect the host >>>> against malicious guests allocating all the maximum memory. >>> >>> If we want to try and harden host, we can unmap it so guest will crash >>> if it touches pages without deflate. >>> >>>> If you need host integrity against guest memory usage, something like >>>> cgroups_memory or so is probably the only reliable way. >>> >>> In the original design, protection against a malicious guest is not the >>> point of the balloon, it's a technology that let you overcommit >>> cooperative guests. >> >> Sure. But then its perfectly fine to let the guest reclaim by default, >> because your statement >> "AFAIK management tools depend on balloon not deflating >> below host-specified threshold to avoid OOM on the host. >> So I don't think we can make this a default, >> management needs to enable this explicitly." >> >> is not true ;-) >> >> Christian > > I don't see the connection. > Deflate on OOM means "it's OK to deflate if you like, this won't > cause any harm". If you set this flag, you can't overcommit host too > agressively even with cooperative guests. The connection is that there is no fundamental issue being solved that requires the setting to be off or on. (both is ok with pros and cons) Keeping reclaim on oom off has of course the lowest impact as it is just todays behaviour - so we play safe. I personally can live fine with both defaults and in the end its up to you to decide as virtio maintainer. (my personal opinion to have deflate on OOM=yes, MUST_TELL_HOST=off still stands, though) Just keep in mind that you add an interface that we will drag along forever and that we might require all future user to add a statement to libvirt to do the right thing. So we better make up our mind which default value has the bigger downsides. If that decision making process comes to the conclusion to do it like this patch - fine with me. Christian
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c index f915c7b..d3f36f8 100644 --- a/hw/virtio/virtio-balloon.c +++ b/hw/virtio/virtio-balloon.c @@ -312,8 +312,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f) { - f |= (1 << VIRTIO_BALLOON_F_STATS_VQ); - return f; + VirtIOBalloon *dev = VIRTIO_BALLOON(vdev); + return f | (1u << VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features; } static void virtio_balloon_stat(void *opaque, BalloonInfo *info) @@ -423,6 +423,8 @@ static void virtio_balloon_instance_init(Object *obj) } static Property virtio_balloon_properties[] = { + DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features, + VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false), DEFINE_PROP_END_OF_LIST(), }; diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h index 4ab8f54..7f49b1f 100644 --- a/include/hw/virtio/virtio-balloon.h +++ b/include/hw/virtio/virtio-balloon.h @@ -36,6 +36,7 @@ typedef struct VirtIOBalloon { QEMUTimer *stats_timer; int64_t stats_last_update; int64_t stats_poll_interval; + uint32_t host_features; } VirtIOBalloon; #endif
Excessive virtio_balloon inflation can cause invocation of OOM-killer, when Linux is under severe memory pressure. Various mechanisms are responsible for correct virtio_balloon memory management. Nevertheless it is often the case that these control tools does not have enough time to react on fast changing memory load. As a result OS runs out of memory and invokes OOM-killer. The balancing of memory by use of the virtio balloon should not cause the termination of processes while there are pages in the balloon. Now there is no way for virtio balloon driver to free memory at the last moment before some process get killed by OOM-killer. This does not provide a security breach as balloon itself is running inside Guest OS and is working in the cooperation with the host. Thus some improvements from Guest side should be considered as normal. To solve the problem, introduce a virtio_balloon callback which is expected to be called from the oom notifier call chain in out_of_memory() function. If virtio balloon could release some memory, it will make the system return and retry the allocation that forced the out of memory killer to run. This behavior should be enabled if and only if appropriate feature bit is set on the device. It is off by default. This functionality was recently merged into vanilla Linux. commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5 Author: Raushaniya Maksudova <rmaksudova@parallels.com> Date: Mon Nov 10 09:36:29 2014 +1030 This patch adds respective control bits into QEMU. It introduces deflate-on-oom option for balloon device which does the trick. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Raushaniya Maksudova <rmaksudova@virtuozzo.com> CC: Anthony Liguori <aliguori@amazon.com> CC: Michael S. Tsirkin <mst@redhat.com> --- hw/virtio/virtio-balloon.c | 6 ++++-- include/hw/virtio/virtio-balloon.h | 1 + 2 files changed, 5 insertions(+), 2 deletions(-)