
[RFC,0/2] s390: stop abusing memory_region_allocate_system_memory()

Message ID 20190729145229.4333-1-imammedo@redhat.com
Series s390: stop abusing memory_region_allocate_system_memory()

Message

Igor Mammedov July 29, 2019, 2:52 p.m. UTC
While looking into unifying guest RAM allocation to use hostmem backends
for initial RAM (especially when -mem-path is used) and retiring the
memory_region_allocate_system_memory() API, leaving only a single hostmem
backend, I was inspecting how it is currently used by boards, and it turns
out several boards abuse it by calling the function several times (despite
the documented contract forbidding it).

s390 is one such board: a KVM limitation on memslot size got propagated
into the board design, and memory_region_allocate_system_memory() was abused
to satisfy the KVM requirement for a maximum RAM chunk size where a memory
region alias would suffice.

Unfortunately, this memory_region_allocate_system_memory() usage created a
migration dependency: guest RAM is transferred in the migration stream as
several RAMBlocks if it is larger than KVM_SLOT_MAX_BYTES.

In order to replace these several RAM chunks with a single memdev, keep it
working with the KVM memslot size limit, and stay migration compatible, the
following was done:
   * [2/2] use memory region aliases to partition the hostmem backend RAM
           into KVM_SLOT_MAX_BYTES chunks, which should keep the KVM side
           working (see the sketch below)
   * [1/2] hacked memory region aliases (to RAM memory regions only) to have
           their own RAMBlocks pointing to RAM chunks owned by the aliased
           memory region. It's admittedly a hack, but it's relatively simple
           and allows board code to reshape the migration stream as necessary.

           I haven't tried to use migratable aliases on x86 machines, but with
           them it could be possible to drop the legacy RAM allocation and the
           compat knob (cd5ff8333a), dropping '-numa node,mem' completely even
           for old machine types.
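
For reference, the partitioning done in [2/2] boils down to something like
the following (a simplified sketch for illustration only, not the literal
patch; the function name and its arguments are made up here):

  static void s390_init_ram_aliases(MemoryRegion *sysmem,
                                    MemoryRegion *ram, hwaddr ram_size)
  {
      hwaddr offset, chunk;

      for (offset = 0; offset < ram_size; offset += chunk) {
          /* one alias (and hence one KVM memslot) per RAM chunk */
          MemoryRegion *alias = g_new(MemoryRegion, 1);

          chunk = MIN(ram_size - offset, KVM_SLOT_MAX_BYTES);
          memory_region_init_alias(alias, NULL, "ram-chunk", ram,
                                   offset, chunk);
          memory_region_add_subregion(sysmem, offset, alias);
      }
  }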

PS:
   Tested with ping-pong cross-version migration on an s390 machine
   (with a reduced KVM_SLOT_MAX_BYTES, since I don't have access to a
    large enough host).

Igor Mammedov (2):
  memory: make MemoryRegion alias migratable
  s390: do not call memory_region_allocate_system_memory() multiple
    times

 exec.c                     |  7 ++++---
 hw/s390x/s390-virtio-ccw.c | 20 +++++++++++++++-----
 memory.c                   |  5 +++++
 3 files changed, 24 insertions(+), 8 deletions(-)

Comments

Cornelia Huck July 29, 2019, 2:58 p.m. UTC | #1
On Mon, 29 Jul 2019 10:52:27 -0400
Igor Mammedov <imammedo@redhat.com> wrote:

cc:ing some folks for awareness.

> [...]
Christian Borntraeger July 30, 2019, 3:22 p.m. UTC | #2
I remember that you sent a similar patch a while ago and something broke on s390x.
Have you changed anything from the old patch set?

On 29.07.19 16:52, Igor Mammedov wrote:
> [...]
Igor Mammedov July 30, 2019, 3:49 p.m. UTC | #3
On Tue, 30 Jul 2019 17:22:01 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> I remember that you sent a similar patch a while ago and something broke on s390x.
> Have you changed anything from the old patch set?
Thanks for the reminder, I had totally forgotten about it.

It was "[PATCH v1 5/5] s390: do not call memory_region_allocate_system_memory() multiple times";
looking at the history now, it all comes back. This series is incomplete as is,
because the memory region aliases are merged back into one big memory section
once the flatview is rendered, so KVM gets too big a chunk of RAM and breaks.

So aliases solve only half of the problem (keeping the migration side working);
to fix the KVM side, I'd add splitting of memory sections into chunks in
kvm_set_phys_mem(), keeping the KVM specifics in KVM code only. The board would
only have to set a max size value, and the KVM code would use it for splitting.
I'll try to find that patch.
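
The rough shape of that splitting would be something like this (just a sketch
of the idea, not the actual patch; kvm_max_slot_size, kvm_register_one_slot()
and the ram host pointer are made-up placeholders):

  /* in kvm_set_phys_mem(), instead of registering one huge memslot */
  hwaddr start_addr = section->offset_within_address_space;
  ram_addr_t size = int128_get64(section->size);

  while (size) {
      hwaddr slot_size = MIN(size, kvm_max_slot_size);

      /* register one memslot covering [start_addr, start_addr + slot_size) */
      kvm_register_one_slot(kml, start_addr, slot_size, ram);

      start_addr += slot_size;
      ram += slot_size;
      size -= slot_size;
  }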

David Hildenbrand Aug. 2, 2019, 8:04 a.m. UTC | #4
On 29.07.19 16:52, Igor Mammedov wrote:
> [...]
>
> Unfortunately, this memory_region_allocate_system_memory() usage created a
> migration dependency: guest RAM is transferred in the migration stream as
> several RAMBlocks if it is larger than KVM_SLOT_MAX_BYTES.

So if I understand it correctly, we only call
memory_region_allocate_system_memory() in case the guest initial memory
size exceeds KVM_SLOT_MAX_BYTES (~8TB).

Do we *really* care about keeping migration working for systems that most
probably nobody (except Christian ;) ) really uses? (especially not in
production)

I am fine with keeping migration working if it's easy, but introducing hacks
(as described in the cover letter) for such obscure use cases - I don't know.

@Christian: Please prove me wrong. :)

David Hildenbrand Aug. 2, 2019, 8:23 a.m. UTC | #5
On 02.08.19 10:04, David Hildenbrand wrote:
> On 29.07.19 16:52, Igor Mammedov wrote:
>> [...]
> 
> So if I understand it correctly, we only call
> memory_region_allocate_system_memory() in case the guest initial memory
> size exceeds KVM_SLOT_MAX_BYTES (~8TB).
> 
> Do we *really* care about keeping migration working for systems that most
> probably nobody (except Christian ;) ) really uses? (especially not in
> production)
> 
> I am fine with keeping migration working if it's easy, but introducing hacks
> (as described in the cover letter) for such obscure use cases - I don't know.
> 
> @Christian: Please prove me wrong. :)

For reference:

https://access.redhat.com/articles/rhel-kvm-limits

RHEL 7/8 supports up to 2TB maximum memory in KVM guests.


https://www.suse.com/releasenotes/s390x/SUSE-SLES/15-SP1/

SLES 15 seems to support up to 4TB in a VM


If migration failing would mean "abort migration, continue on the migration
source", I could sleep well at night. (as long as nothing crashes)
David Hildenbrand Aug. 2, 2019, 8:26 a.m. UTC | #6
On 02.08.19 10:04, David Hildenbrand wrote:
> On 29.07.19 16:52, Igor Mammedov wrote:
>> [...]
> 
> So if I understand it correctly, we only call
> memory_region_allocate_system_memory() in case the guest initial memory
> size exceeds KVM_SLOT_MAX_BYTES (~8TB).

(to clarify, I meant: call it more than once)
Christian Borntraeger Aug. 2, 2019, 8:29 a.m. UTC | #7
On 02.08.19 10:04, David Hildenbrand wrote:
> On 29.07.19 16:52, Igor Mammedov wrote:
>> [...]
> 
> So if I understand it correctly, we only call
> memory_region_allocate_system_memory() in case the guest initial memory
> size exceeds KVM_SLOT_MAX_BYTES (~8TB).

We always call it. We just call it twice for > 8TB.
> 
> Do we *really* care about keeping migration working for systems that most
> probably nobody (except Christian ;) ) really uses? (especially not in
> production)
> 
> I am fine with keeping migration working if it's easy, but introducing hacks
> (as described in the cover letter) for such obscure use cases - I don't know.
> 
> @Christian: Please prove me wrong. :)

For the time being we can block migration for guests > 8TB if that helps (it
should not fail in a guest-killing fashion), but we should
1. continue to be able to migrate guests < 8TB
2. continue to be 

On the other hand, I find "and suddenly it fails if you go beyond this" really
unpleasant. So it would be interesting to see the next round of patches to
check how "hacky" those really are.
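
(For illustration, such a blocker could look roughly like the following - a
sketch against the existing migrate_add_blocker() API, not an actual patch;
the size check and the error message are made up, and a MachineState
*machine is assumed to be in scope:

  Error *blocker = NULL, *local_err = NULL;

  if (machine->ram_size > KVM_SLOT_MAX_BYTES) {
      error_setg(&blocker, "migration is not supported for guests > 8TB");
      if (migrate_add_blocker(blocker, &local_err) < 0) {
          error_report_err(local_err);
          error_free(blocker);
      }
  }

With the blocker registered, a migration attempt fails cleanly on the source
and the guest keeps running.)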

David Hildenbrand Aug. 2, 2019, 8:37 a.m. UTC | #8
On 02.08.19 10:29, Christian Borntraeger wrote:
> 
> 
> On 02.08.19 10:04, David Hildenbrand wrote:
>> On 29.07.19 16:52, Igor Mammedov wrote:
>>> [...]
>>
>> So if I understand it correctly, we only call
>> memory_region_allocate_system_memory() in case the guest initial memory
>> size exceeds KVM_SLOT_MAX_BYTES (~8TB).
> 
> We always call it. We just call it twice for > 8TB.

Yeah, that's what I meant.

>>
>> Do we *really* care about keeping migration working for systems that most
>> probably nobody (except Christian ;) ) really uses? (especially not in
>> production)
>>
>> I am fine with keeping migration working if it's easy, but introducing hacks
>> (as described in the cover letter) for such obscure use cases - I don't know.
>>
>> @Christian: Please prove me wrong. :)
> 
> For the time being we can block migration for guests > 8TB if that helps (it
> should not fail in a guest-killing fashion), but we should
> 1. continue to be able to migrate guests < 8TB
> 2. continue to be 
> 
> On the other hand, I find "and suddenly it fails if you go beyond this" really
> unpleasant. So it would be interesting to see the next round of patches to
> check how "hacky" those really are.

I mean migration will work perfectly fine once we have fixed it for new QEMU
versions. It's only the older QEMU versions to/from the fixed one.

Looking at the log I can see that this was introduced with v2.12.0.

I would document this in the next release notes: "Migration of unusually
big VMs (>= 8TB) will not work from/to previous QEMU versions (up to
v2.12; before that, starting such guests didn't even work)."
Igor Mammedov Aug. 2, 2019, 9:18 a.m. UTC | #9
On Fri, 2 Aug 2019 10:29:43 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 02.08.19 10:04, David Hildenbrand wrote:
> > On 29.07.19 16:52, Igor Mammedov wrote:  
> >> [...]
> >
> > So if I understand it correctly, we only call
> > memory_region_allocate_system_memory() in case the guest initial memory
> > size exceeds KVM_SLOT_MAX_BYTES (~8TB).
> 
> We always call it. We just call it twice for > 8TB.
> >
> > Do we *really* care about keeping migration working for systems that most
> > probably nobody (except Christian ;) ) really uses? (especially not in
> > production)
> >
> > I am fine with keeping migration working if it's easy, but introducing hacks
> > (as described in the cover letter) for such obscure use cases - I don't know.
> >
> > @Christian: Please prove me wrong. :)
> 
> For the time being we can block migration for guests > 8TB if that helps (it
> should not fail in a guest-killing fashion), but we should
> 1. continue to be able to migrate guests < 8TB
> 2. continue to be 
> 
> On the other hand, I find "and suddenly it fails if you go beyond this" really
> unpleasant. So it would be interesting to see the next round of patches to
> check how "hacky" those really are.

I've looked into cleaning up the "migratable aliases" approach; so far it
works fine (well, in the configurations I was able to test there were no
regressions on x86 machines, which use aliases quite a bit).
As I've written before, aliases could be used for x86 later, but that is
yet to be proven, so I'd prefer to delay this patch if possible.

However, I'd prefer intentionally breaking migration for >8TB guests over
keeping around a compat mode that isn't used in practice, in exchange for
simpler code.

As for the new round of patches (including the missing KVM fixup), I'm
going to post them now-ish for you to check out.

Christian Borntraeger Aug. 2, 2019, 10:24 a.m. UTC | #10
On 02.08.19 10:37, David Hildenbrand wrote:
> On 02.08.19 10:29, Christian Borntraeger wrote:
>>
>>
>> On 02.08.19 10:04, David Hildenbrand wrote:
>>> On 29.07.19 16:52, Igor Mammedov wrote:
>>>> [...]
>>>
>>> So if I understand it correctly, we only call
>>> memory_region_allocate_system_memory() in case the guest initial memory
>>> size exceeds KVM_SLOT_MAX_BYTES (~8TB).
>>
>> We always call it. We just call it twice for > 8TB.
> 
> Yeah, that's what I meant.
> 
>>>
>>> Do we *really* care about keeping migration working for systems that most
>>> probably nobody (except Christian ;) ) really uses? (especially not in
>>> production)
>>>
>>> I am fine with keeping migration working if it's easy, but introducing hacks
>>> (as described in the cover letter) for such obscure use cases - I don't know.
>>>
>>> @Christian: Please prove me wrong. :)
>>
>> For the time being we can block migration for guests > 8TB if that helps (it
>> should not fail in a guest-killing fashion), but we should
>> 1. continue to be able to migrate guests < 8TB
>> 2. continue to be 
>>
>> On the other hand, I find "and suddenly it fails if you go beyond this" really
>> unpleasant. So it would be interesting to see the next round of patches to
>> check how "hacky" those really are.
> 
> I mean migration will work perfectly fine once we have fixed it for new QEMU
> versions. It's only the older QEMU versions to/from the fixed one.

I think that would be fine. We just have to make sure that migration really
fails in a way that the system continues to run. 
> 
> Looking at the log I can see that this was introduced with v2.12.0.
> 
> I would document this in the next release notes: "Migration of unusually
> big VMs (>= 8TB) will not work from/to previous QEMU versions (up to
> v2.12; before that, starting such guests didn't even work)."

Yes, sounds good.