
[for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

Message ID 20190326035058.20634-1-david@gibson.dropbear.id.au
State New
Series [for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

Commit Message

David Gibson March 26, 2019, 3:50 a.m. UTC
qemu_getrampagesize() works out the minimum host page size backing any of the
guest's RAM.  This is required in a few places, such as for POWER8 PAPR KVM
guests, because limitations of the hardware virtualization mean the guest
can't use page sizes larger than the host pages backing its memory.

However, it currently checks against *every* memory backend, whether or not
it is actually mapped into guest memory at the moment.  This is incorrect.

This can cause a problem attempting to add memory to a POWER8 pseries KVM
guest which is configured to allow hugepages in the guest (e.g.
-machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage
memory, you can (correctly) create the memory backend; however, it will
(correctly) throw an error when you attempt to map that memory into the
guest by 'device_add'ing a pc-dimm.

What's not correct is that if you then reset the guest, a startup check
against qemu_getrampagesize() will cause a fatal error because of the new
memory object, even though it's not mapped into the guest.

This patch corrects the problem by adjusting find_max_supported_pagesize()
(called from qemu_getrampagesize() via object_child_foreach) to exclude
non-mapped memory backends.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

This is definitely a bug, but it's not a regression.  I'm not sure if
this is 4.0 material at this stage of the freeze or not.
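
For reference, here is a sketch of how find_max_supported_pagesize() reads with
this change applied, reconstructed from the hunk quoted below; the function tail
(returning 0 so object_child_foreach() keeps iterating) is assumed from context
rather than shown in the diff:

static int find_max_supported_pagesize(Object *obj, void *opaque)
{
    long *hpsize_min = opaque;

    if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
        long hpsize = host_memory_backend_pagesize(backend);

        /* Only mapped backends now influence the minimum page size. */
        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
            *hpsize_min = hpsize;
        }
    }

    return 0;
}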

Comments

Laurent Vivier March 26, 2019, 8:37 a.m. UTC | #1
On 26/03/2019 04:50, David Gibson wrote:
> qemu_getrampagesize() works out the minimum host page size backing any of
> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> guests, because limitations of the hardware virtualization mean the guest
> can't use pagesizes larger than the host pages backing its memory.
> 
> However, it currently checks against *every* memory backend, whether or not
> it is actually mapped into guest memory at the moment.  This is incorrect.
> 
> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> guest which is configured to allow hugepages in the guest (e.g.
> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> you can (correctly) create a memory backend, however it (correctly) will
> throw an error when you attempt to map that memory into the guest by
> 'device_add'ing a pc-dimm.
> 
> What's not correct is that if you then reset the guest a startup check
> against qemu_getrampagesize() will cause a fatal error because of the new
> memory object, even though it's not mapped into the guest.
> 
> This patch corrects the problem by adjusting find_max_supported_pagesize()
> (called from qemu_getrampagesize() via object_child_foreach) to exclude
> non-mapped memory backends.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  exec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> This is definitely a bug, but it's not a regression.  I'm not sure if
> this is 4.0 material at this stage of the freeze or not.
> 
> diff --git a/exec.c b/exec.c
> index 86a38d3b3b..6ab62f4eee 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1692,9 +1692,10 @@ static int find_max_supported_pagesize(Object *obj, void *opaque)
>      long *hpsize_min = opaque;
>  
>      if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
> -        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));
> +        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
> +        long hpsize = host_memory_backend_pagesize(backend);
>  
> -        if (hpsize < *hpsize_min) {
> +        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
>              *hpsize_min = hpsize;
>          }
>      }
> 

Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Igor Mammedov March 26, 2019, 2:08 p.m. UTC | #2
On Tue, 26 Mar 2019 14:50:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> qemu_getrampagesize() works out the minimum host page size backing any of
> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> guests, because limitations of the hardware virtualization mean the guest
> can't use pagesizes larger than the host pages backing its memory.
> 
> However, it currently checks against *every* memory backend, whether or not
> it is actually mapped into guest memory at the moment.  This is incorrect.
> 
> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> guest which is configured to allow hugepages in the guest (e.g.
> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> you can (correctly) create a memory backend, however it (correctly) will
> throw an error when you attempt to map that memory into the guest by
> 'device_add'ing a pc-dimm.
> 
> What's not correct is that if you then reset the guest a startup check
> against qemu_getrampagesize() will cause a fatal error because of the new
> memory object, even though it's not mapped into the guest.
I'd say that backend should be removed by the mgmt app since device_add failed,
instead of leaving it to hang around. (But a fatal error isn't nice behavior
on QEMU's part either.)

> 
> This patch corrects the problem by adjusting find_max_supported_pagesize()
> (called from qemu_getrampagesize() via object_child_foreach) to exclude
> non-mapped memory backends.
I'm not sure it's OK to do so. It depends on where qemu_getrampagesize()
is called from. For example, s390 calls it rather early when initializing KVM,
so there isn't anything mapped yet.

And once I replace -mem-path with a hostmem backend and drop
qemu_mempath_getpagesize(mem_path) (which, btw, isn't guaranteed to be mapped either),
this patch might lead to incorrect results for initial memory as well.

> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  exec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> This is definitely a bug, but it's not a regression.  I'm not sure if
> this is 4.0 material at this stage of the freeze or not.
> 
> diff --git a/exec.c b/exec.c
> index 86a38d3b3b..6ab62f4eee 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1692,9 +1692,10 @@ static int find_max_supported_pagesize(Object *obj, void *opaque)
>      long *hpsize_min = opaque;
>  
>      if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
> -        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));
> +        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
> +        long hpsize = host_memory_backend_pagesize(backend);
>  
> -        if (hpsize < *hpsize_min) {
> +        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
>              *hpsize_min = hpsize;
>          }
>      }
David Hildenbrand March 26, 2019, 5:02 p.m. UTC | #3
On 26.03.19 15:08, Igor Mammedov wrote:
> On Tue, 26 Mar 2019 14:50:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> qemu_getrampagesize() works out the minimum host page size backing any of
>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>> guests, because limitations of the hardware virtualization mean the guest
>> can't use pagesizes larger than the host pages backing its memory.
>>
>> However, it currently checks against *every* memory backend, whether or not
>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>
>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>> guest which is configured to allow hugepages in the guest (e.g.
>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>> you can (correctly) create a memory backend, however it (correctly) will
>> throw an error when you attempt to map that memory into the guest by
>> 'device_add'ing a pc-dimm.
>>
>> What's not correct is that if you then reset the guest a startup check
>> against qemu_getrampagesize() will cause a fatal error because of the new
>> memory object, even though it's not mapped into the guest.
> I'd say that backend should be remove by mgmt app since device_add failed
> instead of leaving it to hang around. (but fatal error either not a nice
> behavior on QEMU part)

Indeed, it should be removed. Depending on the options (huge pages with
prealloc?), host memory might be consumed for an unused backend. Undesired.

> 
>>
>> This patch corrects the problem by adjusting find_max_supported_pagesize()
>> (called from qemu_getrampagesize() via object_child_foreach) to exclude
>> non-mapped memory backends.
> I'm not sure about if it's ok do so. It depends on where from
> qemu_getrampagesize() is called. For example s390 calls it rather early
> when initializing KVM, so there isn't anything mapped yet.
> 
> And once I replace -mem-path with hostmem backend and drop
> qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> this patch might lead to incorrect results for initial memory as well.
> 
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> ---
>>  exec.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> This is definitely a bug, but it's not a regression.  I'm not sure if
>> this is 4.0 material at this stage of the freeze or not.
>>
>> diff --git a/exec.c b/exec.c
>> index 86a38d3b3b..6ab62f4eee 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1692,9 +1692,10 @@ static int find_max_supported_pagesize(Object *obj, void *opaque)
>>      long *hpsize_min = opaque;
>>  
>>      if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
>> -        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));
>> +        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
>> +        long hpsize = host_memory_backend_pagesize(backend);
>>  
>> -        if (hpsize < *hpsize_min) {
>> +        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
>>              *hpsize_min = hpsize;
>>          }
>>      }
> 
>
David Gibson March 27, 2019, 12:11 a.m. UTC | #4
On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:
> On Tue, 26 Mar 2019 14:50:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > qemu_getrampagesize() works out the minimum host page size backing any of
> > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > guests, because limitations of the hardware virtualization mean the guest
> > can't use pagesizes larger than the host pages backing its memory.
> > 
> > However, it currently checks against *every* memory backend, whether or not
> > it is actually mapped into guest memory at the moment.  This is incorrect.
> > 
> > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > guest which is configured to allow hugepages in the guest (e.g.
> > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > you can (correctly) create a memory backend, however it (correctly) will
> > throw an error when you attempt to map that memory into the guest by
> > 'device_add'ing a pc-dimm.
> > 
> > What's not correct is that if you then reset the guest a startup check
> > against qemu_getrampagesize() will cause a fatal error because of the new
> > memory object, even though it's not mapped into the guest.
> I'd say that backend should be remove by mgmt app since device_add failed
> instead of leaving it to hang around. (but fatal error either not a nice
> behavior on QEMU part)

I agree, but reset could be guest initiated, so even if management is
doing the right thing there's a window where it could break.

> > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > non-mapped memory backends.
> I'm not sure about if it's ok do so. It depends on where from
> qemu_getrampagesize() is called. For example s390 calls it rather early
> when initializing KVM, so there isn't anything mapped yet.

Oh.  Drat.

> And once I replace -mem-path with hostmem backend and drop
> qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> this patch might lead to incorrect results for initial memory as
> well.

Uh.. I don't immediately see why.
David Gibson March 27, 2019, 12:12 a.m. UTC | #5
On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
> On 26.03.19 15:08, Igor Mammedov wrote:
> > On Tue, 26 Mar 2019 14:50:58 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> >> qemu_getrampagesize() works out the minimum host page size backing any of
> >> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >> guests, because limitations of the hardware virtualization mean the guest
> >> can't use pagesizes larger than the host pages backing its memory.
> >>
> >> However, it currently checks against *every* memory backend, whether or not
> >> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>
> >> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >> guest which is configured to allow hugepages in the guest (e.g.
> >> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >> you can (correctly) create a memory backend, however it (correctly) will
> >> throw an error when you attempt to map that memory into the guest by
> >> 'device_add'ing a pc-dimm.
> >>
> >> What's not correct is that if you then reset the guest a startup check
> >> against qemu_getrampagesize() will cause a fatal error because of the new
> >> memory object, even though it's not mapped into the guest.
> > I'd say that backend should be remove by mgmt app since device_add failed
> > instead of leaving it to hang around. (but fatal error either not a nice
> > behavior on QEMU part)
> 
> Indeed, it should be removed. Depending on the options (huge pages with
> prealloc?) memory might be consumed for unused memory. Undesired.

Right, but if the guest initiates a reboot before the management gets
to that, we'll have a crash.
David Hildenbrand March 27, 2019, 8:10 a.m. UTC | #6
On 27.03.19 01:12, David Gibson wrote:
> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>> On 26.03.19 15:08, Igor Mammedov wrote:
>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>
>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>> guests, because limitations of the hardware virtualization mean the guest
>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>
>>>> However, it currently checks against *every* memory backend, whether or not
>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>
>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>> throw an error when you attempt to map that memory into the guest by
>>>> 'device_add'ing a pc-dimm.
>>>>
>>>> What's not correct is that if you then reset the guest a startup check
>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>> memory object, even though it's not mapped into the guest.
>>> I'd say that backend should be remove by mgmt app since device_add failed
>>> instead of leaving it to hang around. (but fatal error either not a nice
>>> behavior on QEMU part)
>>
>> Indeed, it should be removed. Depending on the options (huge pages with
>> prealloc?) memory might be consumed for unused memory. Undesired.
> 
> Right, but if the guest initiates a reboot before the management gets
> to that, we'll have a crash.
> 

Yes, I agree.

At least on s390x (extending on what Igor said):

mc->init() -> s390_memory_init() ->
memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()


ac->init_machine() -> kvm_arch_init() ->
kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()


And in vl.c

configure_accelerator(current_machine, argv[0]);
...
machine_run_board_init()

So memory is indeed not mapped before calling qemu_getrampagesize().


We *could* move the call to kvm_s390_configure_mempath_backing() to
s390_memory_init().

cap_hpage_1m is not needed before we create VCPUs, so this would work fine.

We could then eventually make qemu_getrampagesize() assert if no
backends are mapped at all, to catch other users that rely on this being
correct.
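
A rough sketch of that assert idea (simplified; the real qemu_getrampagesize()
in exec.c also considers -mem-path and the main RAM page size, so this only
illustrates where such a check could live):

long qemu_getrampagesize(void)
{
    long hpsize = LONG_MAX;
    Object *memdev_root = object_resolve_path("/objects", NULL);

    if (memdev_root) {
        object_child_foreach(memdev_root, find_max_supported_pagesize, &hpsize);
    }
    /* Proposed: no backend mapped yet means the caller ran too early. */
    assert(hpsize != LONG_MAX);
    return hpsize;
}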
Igor Mammedov March 27, 2019, 8:57 a.m. UTC | #7
On Wed, 27 Mar 2019 11:11:46 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:
> > On Tue, 26 Mar 2019 14:50:58 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > guests, because limitations of the hardware virtualization mean the guest
> > > can't use pagesizes larger than the host pages backing its memory.
> > > 
> > > However, it currently checks against *every* memory backend, whether or not
> > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > 
> > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > guest which is configured to allow hugepages in the guest (e.g.
> > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > you can (correctly) create a memory backend, however it (correctly) will
> > > throw an error when you attempt to map that memory into the guest by
> > > 'device_add'ing a pc-dimm.
> > > 
> > > What's not correct is that if you then reset the guest a startup check
> > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > memory object, even though it's not mapped into the guest.  
> > I'd say that backend should be remove by mgmt app since device_add failed
> > instead of leaving it to hang around. (but fatal error either not a nice
> > behavior on QEMU part)  
> 
> I agree, but reset could be guest initiated, so even if management is
> doing the right thing there's a window where it could break.

We could (long term) get rid of qemu_getrampagesize() (it's not really good
to push a machine/target-specific condition into a generic function) and move
the pagesize calculation closer to the machine. In that case the machine (spapr) knows
exactly when and what is mapped and can enumerate NOT backends but mapped
memory regions and/or pc-dimms (let's say we have a memory_region_page_size())
and apply whatever policy to the results.

> 
> > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > non-mapped memory backends.  
> > I'm not sure about if it's ok do so. It depends on where from
> > qemu_getrampagesize() is called. For example s390 calls it rather early
> > when initializing KVM, so there isn't anything mapped yet.  
> 
> Oh.  Drat.
> 
> > And once I replace -mem-path with hostmem backend and drop
> > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > this patch might lead to incorrect results for initial memory as
> > well.  
> 
> Uh.. I don't immediately see why.
It depends on when qemu_getrampagesize() is called with this patch.

Maybe qemu_getrampagesize() is not a good interface anymore?
I also see it being called directly from the device side in hw/vfio/spapr.c, where
in my opinion the value should be propagated from the machine, plus an ad hoc call
from the KVM parts.
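
To make the machine-side direction above a bit more concrete, a purely
hypothetical sketch (none of these identifiers exist in the tree;
memory_region_page_size() is the helper imagined above, and the list of
mapped regions would have to be tracked by the machine itself):

static long spapr_mapped_min_pagesize(MachineState *machine)
{
    long psize = LONG_MAX;
    MappedRamRegion *reg;                               /* hypothetical */

    /* hypothetical list of regions the machine has actually mapped */
    QTAILQ_FOREACH(reg, &machine->mapped_ram, next) {
        /* hypothetical helper imagined above */
        psize = MIN(psize, memory_region_page_size(reg->mr));
    }

    return psize == LONG_MAX ? getpagesize() : psize;
}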
David Gibson March 27, 2019, 9:03 a.m. UTC | #8
On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> On 27.03.19 01:12, David Gibson wrote:
> > On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
> >> On 26.03.19 15:08, Igor Mammedov wrote:
> >>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>
> >>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>> guests, because limitations of the hardware virtualization mean the guest
> >>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>
> >>>> However, it currently checks against *every* memory backend, whether or not
> >>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>
> >>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>> throw an error when you attempt to map that memory into the guest by
> >>>> 'device_add'ing a pc-dimm.
> >>>>
> >>>> What's not correct is that if you then reset the guest a startup check
> >>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>> memory object, even though it's not mapped into the guest.
> >>> I'd say that backend should be remove by mgmt app since device_add failed
> >>> instead of leaving it to hang around. (but fatal error either not a nice
> >>> behavior on QEMU part)
> >>
> >> Indeed, it should be removed. Depending on the options (huge pages with
> >> prealloc?) memory might be consumed for unused memory. Undesired.
> > 
> > Right, but if the guest initiates a reboot before the management gets
> > to that, we'll have a crash.
> > 
> 
> Yes, I agree.
> 
> At least on s390x (extending on what Igor said):
> 
> mc->init() -> s390_memory_init() ->
> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> 
> 
> ac->init_machine() -> kvm_arch_init() ->
> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> 
> 
> And in vl.c
> 
> configure_accelerator(current_machine, argv[0]);
> ...
> machine_run_board_init()
> 
> So memory is indeed not mapped before calling qemu_getrampagesize().
> 
> 
> We *could* move the call to kvm_s390_configure_mempath_backing() to
> s390_memory_init().
> 
> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> 
> We could than eventually make qemu_getrampagesize() asssert if no
> backends are mapped at all, to catch other user that rely on this being
> correct.

So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
and I'm pretty sure it's broken.  It will work in the case where
there's only one backend.  And if that's the default -mem-path rather
than an explicit memory backend then my patch won't break it any
further.

qemu_getrampagesize() returns the smallest host page size for any
memory backend.  That's what matters for ppc KVM (in several cases)
because we need certain things to be host-contiguous, not just
guest-contiguous.  Bigger host page sizes are fine for that purpose,
clearly.

AFAICT on s390 you're looking to determine if any backend is using
hugepages, because KVM may not support that.  The minimum host page
size isn't adequate to determine that, so qemu_getrampagesize() won't
tell you what you need.
David Gibson March 27, 2019, 9:07 a.m. UTC | #9
On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 11:11:46 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:
> > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > > guests, because limitations of the hardware virtualization mean the guest
> > > > can't use pagesizes larger than the host pages backing its memory.
> > > > 
> > > > However, it currently checks against *every* memory backend, whether or not
> > > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > > 
> > > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > > you can (correctly) create a memory backend, however it (correctly) will
> > > > throw an error when you attempt to map that memory into the guest by
> > > > 'device_add'ing a pc-dimm.
> > > > 
> > > > What's not correct is that if you then reset the guest a startup check
> > > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > > memory object, even though it's not mapped into the guest.  
> > > I'd say that backend should be remove by mgmt app since device_add failed
> > > instead of leaving it to hang around. (but fatal error either not a nice
> > > behavior on QEMU part)  
> > 
> > I agree, but reset could be guest initiated, so even if management is
> > doing the right thing there's a window where it could break.
> 
> We could (log term) get rid of qemu_getrampagesize() (it's not really good
> to push machine/target specific condition into generic function) and move
> pagesize calculation closer to machine. In this case machine (spapr) knows
> exactly when and what is mapped and can enumerate NOT backends but mapped
> memory regions and/or pc-dimms (lets say we have memory_region_page_size()
> and apply whatever policy to the results.

So.. it used to be in the machine-specific code, but was made generic
because it's used in the vfio code, where it is ppc-specific but not
tied into the machine-specific code in a way that lets us really
re-use it there.

> > > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > > non-mapped memory backends.  
> > > I'm not sure about if it's ok do so. It depends on where from
> > > qemu_getrampagesize() is called. For example s390 calls it rather early
> > > when initializing KVM, so there isn't anything mapped yet.  
> > 
> > Oh.  Drat.
> > 
> > > And once I replace -mem-path with hostmem backend and drop
> > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > > this patch might lead to incorrect results for initial memory as
> > > well.  
> > 
> > Uh.. I don't immediately see why.
> It depends on when qemu_getrampagesize() is called with this patch.
> 
> Maybe qemu_getrampagesize() is not a good interface anymore?
> I also see it being called directly from device side hw/vfio/spapr.c, which
> in my opinion should be propagated from machine,

Um.. I'm not sure what you mean by that.

> plus adhoc call
> from kvm parts.
Igor Mammedov March 27, 2019, 9:09 a.m. UTC | #10
On Wed, 27 Mar 2019 09:10:01 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 27.03.19 01:12, David Gibson wrote:
> > On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
> >> On 26.03.19 15:08, Igor Mammedov wrote:  
> >>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>  
> >>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>> guests, because limitations of the hardware virtualization mean the guest
> >>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>
> >>>> However, it currently checks against *every* memory backend, whether or not
> >>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>
> >>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>> throw an error when you attempt to map that memory into the guest by
> >>>> 'device_add'ing a pc-dimm.
> >>>>
> >>>> What's not correct is that if you then reset the guest a startup check
> >>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>> memory object, even though it's not mapped into the guest.  
> >>> I'd say that backend should be remove by mgmt app since device_add failed
> >>> instead of leaving it to hang around. (but fatal error either not a nice
> >>> behavior on QEMU part)  
> >>
> >> Indeed, it should be removed. Depending on the options (huge pages with
> >> prealloc?) memory might be consumed for unused memory. Undesired.  
> > 
> > Right, but if the guest initiates a reboot before the management gets
> > to that, we'll have a crash.
> >   
> 
> Yes, I agree.
> 
> At least on s390x (extending on what Igor said):
> 
> mc->init() -> s390_memory_init() ->
> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> 
> 
> ac->init_machine() -> kvm_arch_init() ->
> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> 
> 
> And in vl.c
> 
> configure_accelerator(current_machine, argv[0]);
Looking more at it, it seems s390 is 'broken' anyway.
We call qemu_getrampagesize() here with huge page backends on the CLI,
but memory backends are initialized later, in
 qemu_opts_foreach(..., object_create_delayed, ...),
so s390 doesn't take memory backends into account currently.

> ...
> machine_run_board_init()
> 
> So memory is indeed not mapped before calling qemu_getrampagesize().



> 
> 
> We *could* move the call to kvm_s390_configure_mempath_backing() to
> s390_memory_init().
> 
> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> 
> We could than eventually make qemu_getrampagesize() asssert if no
> backends are mapped at all, to catch other user that rely on this being
> correct.
Looks like a reasonable way to fix the immediate crash in 4.0, with a mandatory assert
(but see my other reply about getting rid of qemu_getrampagesize()).
Igor Mammedov March 27, 2019, 9:38 a.m. UTC | #11
On Wed, 27 Mar 2019 20:07:57 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 11:11:46 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > > > guests, because limitations of the hardware virtualization mean the guest
> > > > > can't use pagesizes larger than the host pages backing its memory.
> > > > > 
> > > > > However, it currently checks against *every* memory backend, whether or not
> > > > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > > > 
> > > > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > > > you can (correctly) create a memory backend, however it (correctly) will
> > > > > throw an error when you attempt to map that memory into the guest by
> > > > > 'device_add'ing a pc-dimm.
> > > > > 
> > > > > What's not correct is that if you then reset the guest a startup check
> > > > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > > > memory object, even though it's not mapped into the guest.    
> > > > I'd say that backend should be remove by mgmt app since device_add failed
> > > > instead of leaving it to hang around. (but fatal error either not a nice
> > > > behavior on QEMU part)    
> > > 
> > > I agree, but reset could be guest initiated, so even if management is
> > > doing the right thing there's a window where it could break.  
> > 
> > We could (log term) get rid of qemu_getrampagesize() (it's not really good
> > to push machine/target specific condition into generic function) and move
> > pagesize calculation closer to machine. In this case machine (spapr) knows
> > exactly when and what is mapped and can enumerate NOT backends but mapped
> > memory regions and/or pc-dimms (lets say we have memory_region_page_size()
> > and apply whatever policy to the results.  
> 
> So.. it used to be in the machine specific code, but was made generic
> because it's used in the vfio code.  Where it is ppc specific, but not
> keyed into the machine specific code in a way that we can really
> re-use it there.
It probably was the easiest way to hack things together, but then the API gets
generalized and misused, then tweaked for a specific machine use case,
and it only gets worse over time.

> 
> > > > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > > > non-mapped memory backends.    
> > > > I'm not sure about if it's ok do so. It depends on where from
> > > > qemu_getrampagesize() is called. For example s390 calls it rather early
> > > > when initializing KVM, so there isn't anything mapped yet.    
> > > 
> > > Oh.  Drat.
> > >   
> > > > And once I replace -mem-path with hostmem backend and drop
> > > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > > > this patch might lead to incorrect results for initial memory as
> > > > well.    
> > > 
> > > Uh.. I don't immediately see why.  
> > It depends on when qemu_getrampagesize() is called with this patch.
> > 
> > Maybe qemu_getrampagesize() is not a good interface anymore?
> > I also see it being called directly from device side hw/vfio/spapr.c, which
> > in my opinion should be propagated from machine,  
> 
> Um.. I'm not sure what you mean by that.
It would be better if the machine set a page-size property on the vfio device
when it's created.

> 
> > plus adhoc call
> > from kvm parts.  
>
David Gibson March 27, 2019, 12:03 p.m. UTC | #12
On Wed, Mar 27, 2019 at 10:38:38AM +0100, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 20:07:57 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > > On Wed, 27 Mar 2019 11:11:46 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > > > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > > > > guests, because limitations of the hardware virtualization mean the guest
> > > > > > can't use pagesizes larger than the host pages backing its memory.
> > > > > > 
> > > > > > However, it currently checks against *every* memory backend, whether or not
> > > > > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > > > > 
> > > > > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > > > > you can (correctly) create a memory backend, however it (correctly) will
> > > > > > throw an error when you attempt to map that memory into the guest by
> > > > > > 'device_add'ing a pc-dimm.
> > > > > > 
> > > > > > What's not correct is that if you then reset the guest a startup check
> > > > > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > > > > memory object, even though it's not mapped into the guest.    
> > > > > I'd say that backend should be remove by mgmt app since device_add failed
> > > > > instead of leaving it to hang around. (but fatal error either not a nice
> > > > > behavior on QEMU part)    
> > > > 
> > > > I agree, but reset could be guest initiated, so even if management is
> > > > doing the right thing there's a window where it could break.  
> > > 
> > > We could (log term) get rid of qemu_getrampagesize() (it's not really good
> > > to push machine/target specific condition into generic function) and move
> > > pagesize calculation closer to machine. In this case machine (spapr) knows
> > > exactly when and what is mapped and can enumerate NOT backends but mapped
> > > memory regions and/or pc-dimms (lets say we have memory_region_page_size()
> > > and apply whatever policy to the results.  
> > 
> > So.. it used to be in the machine specific code, but was made generic
> > because it's used in the vfio code.  Where it is ppc specific, but not
> > keyed into the machine specific code in a way that we can really
> > re-use it there.
> It probably was the easiest way to hack things together, but then API gets
> generalized and misused and then tweaked to specific machine usecase
> and it gets only worse over time.

I don't really know what you mean by that.  In both the (main) ppc and
VFIO cases the semantics we want from getrampagesize() really are the
same: what's the smallest chunk of guest-contiguous memory we can rely
on to also be host-contiguous.

> > > > > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > > > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > > > > non-mapped memory backends.    
> > > > > I'm not sure about if it's ok do so. It depends on where from
> > > > > qemu_getrampagesize() is called. For example s390 calls it rather early
> > > > > when initializing KVM, so there isn't anything mapped yet.    
> > > > 
> > > > Oh.  Drat.
> > > >   
> > > > > And once I replace -mem-path with hostmem backend and drop
> > > > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > > > > this patch might lead to incorrect results for initial memory as
> > > > > well.    
> > > > 
> > > > Uh.. I don't immediately see why.  
> > > It depends on when qemu_getrampagesize() is called with this patch.
> > > 
> > > Maybe qemu_getrampagesize() is not a good interface anymore?
> > > I also see it being called directly from device side hw/vfio/spapr.c, which
> > > in my opinion should be propagated from machine,  
> > 
> > Um.. I'm not sure what you mean by that.
> It would be better if machine would set a page size property on vfio device
> when it's created.

No, that doesn't work.  The limitation doesn't originate with the
machine, it originates with the *host*.  I mean, in practice it'll
nearly always be KVM and so a machine matching the host, but it
doesn't actually have to be that way.

It is possible to run an x86 (or ARM) guest with TCG using a VFIO
device on a POWER host, and that would need to use the POWER VFIO
driver.  You'd need to line up a bunch of details to make it work, and
it'd be kind of pointless, but it's possible.
David Hildenbrand March 27, 2019, 1:19 p.m. UTC | #13
On 27.03.19 10:03, David Gibson wrote:
> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
>> On 27.03.19 01:12, David Gibson wrote:
>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>>>> On 26.03.19 15:08, Igor Mammedov wrote:
>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>
>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>
>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>
>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>> 'device_add'ing a pc-dimm.
>>>>>>
>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>> memory object, even though it's not mapped into the guest.
>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>> behavior on QEMU part)
>>>>
>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>> prealloc?) memory might be consumed for unused memory. Undesired.
>>>
>>> Right, but if the guest initiates a reboot before the management gets
>>> to that, we'll have a crash.
>>>
>>
>> Yes, I agree.
>>
>> At least on s390x (extending on what Igor said):
>>
>> mc->init() -> s390_memory_init() ->
>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>
>>
>> ac->init_machine() -> kvm_arch_init() ->
>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>
>>
>> And in vl.c
>>
>> configure_accelerator(current_machine, argv[0]);
>> ...
>> machine_run_board_init()
>>
>> So memory is indeed not mapped before calling qemu_getrampagesize().
>>
>>
>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>> s390_memory_init().
>>
>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>
>> We could than eventually make qemu_getrampagesize() asssert if no
>> backends are mapped at all, to catch other user that rely on this being
>> correct.
> 
> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> and I'm pretty sure it's broken.  It will work in the case where
> there's only one backend.  And if that's the default -mem-path rather
> than an explicit memory backend then my patch won't break it any
> further.

It works for the current scenarios, where you only have one (maximum
two) backings of the same kind. Your patch would break that.

> 
> qemu_getrampagesize() returns the smallest host page size for any
> memory backend.  That's what matters for pcc KVM (in several cases)
> because we need certain things to be host-contiguous, not just
> guest-contiguous.  Bigger host page sizes are fine for that purpose,
> clearly.
> 
> AFAICT on s390 you're looking to determine if any backend is using
> hugepages, because KVM may not support that.  The minimum host page
> size isn't adequate to determine that, so qemu_getrampagesize() won't
> tell you what you need.

Well, as long as we don't support DIMMs or anything like that it works
perfectly fine. But yes, it is far from beautiful.

First of all, I'll prepare a patch to do the call from a different
context. Then we can fine-tune to use something other than
qemu_getrampagesize().
David Hildenbrand March 27, 2019, 1:44 p.m. UTC | #14
On 27.03.19 10:09, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 09:10:01 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 27.03.19 01:12, David Gibson wrote:
>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
>>>> On 26.03.19 15:08, Igor Mammedov wrote:  
>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>  
>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>
>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>
>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>> 'device_add'ing a pc-dimm.
>>>>>>
>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>> memory object, even though it's not mapped into the guest.  
>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>> behavior on QEMU part)  
>>>>
>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>> prealloc?) memory might be consumed for unused memory. Undesired.  
>>>
>>> Right, but if the guest initiates a reboot before the management gets
>>> to that, we'll have a crash.
>>>   
>>
>> Yes, I agree.
>>
>> At least on s390x (extending on what Igor said):
>>
>> mc->init() -> s390_memory_init() ->
>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>
>>
>> ac->init_machine() -> kvm_arch_init() ->
>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>
>>
>> And in vl.c
>>
>> configure_accelerator(current_machine, argv[0]);
> Looking more at it, it is seems s390 is 'broken' anyways.
> We call qemu_getrampagesize() here with huge page backends on CLI
> but memory-backends are initialized later
>  qemu_opts_foreach(..., object_create_delayed, ...)
> so s390 doesn't take into account memory backends currently
> 
>> ...
>> machine_run_board_init()
>>
>> So memory is indeed not mapped before calling qemu_getrampagesize().
> 
> 
> 
>>
>>
>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>> s390_memory_init().
>>
>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>
>> We could than eventually make qemu_getrampagesize() asssert if no
>> backends are mapped at all, to catch other user that rely on this being
>> correct.
> Looks like a reasonable way to fix immediate crash in 4.0 with mandatory assert
> (but see my other reply, about getting rid of qemu_getrampagesize())
> 

I'll send a patch to move the call for s390x. We can then decide how to
proceed with qemu_getrampagesize() in general.
David Hildenbrand March 27, 2019, 2:01 p.m. UTC | #15
On 27.03.19 10:09, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 09:10:01 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 27.03.19 01:12, David Gibson wrote:
>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
>>>> On 26.03.19 15:08, Igor Mammedov wrote:  
>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>  
>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>
>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>
>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>> 'device_add'ing a pc-dimm.
>>>>>>
>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>> memory object, even though it's not mapped into the guest.  
>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>> behavior on QEMU part)  
>>>>
>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>> prealloc?) memory might be consumed for unused memory. Undesired.  
>>>
>>> Right, but if the guest initiates a reboot before the management gets
>>> to that, we'll have a crash.
>>>   
>>
>> Yes, I agree.
>>
>> At least on s390x (extending on what Igor said):
>>
>> mc->init() -> s390_memory_init() ->
>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>
>>
>> ac->init_machine() -> kvm_arch_init() ->
>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>
>>
>> And in vl.c
>>
>> configure_accelerator(current_machine, argv[0]);
> Looking more at it, it is seems s390 is 'broken' anyways.
> We call qemu_getrampagesize() here with huge page backends on CLI
> but memory-backends are initialized later
>  qemu_opts_foreach(..., object_create_delayed, ...)
> so s390 doesn't take into account memory backends currently

BTW that might indeed be true; we only check against --mem-path.
David Hildenbrand March 27, 2019, 2:22 p.m. UTC | #16
On 27.03.19 10:03, David Gibson wrote:
> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
>> On 27.03.19 01:12, David Gibson wrote:
>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>>>> On 26.03.19 15:08, Igor Mammedov wrote:
>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>
>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>
>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>
>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>> 'device_add'ing a pc-dimm.
>>>>>>
>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>> memory object, even though it's not mapped into the guest.
>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>> behavior on QEMU part)
>>>>
>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>> prealloc?) memory might be consumed for unused memory. Undesired.
>>>
>>> Right, but if the guest initiates a reboot before the management gets
>>> to that, we'll have a crash.
>>>
>>
>> Yes, I agree.
>>
>> At least on s390x (extending on what Igor said):
>>
>> mc->init() -> s390_memory_init() ->
>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>
>>
>> ac->init_machine() -> kvm_arch_init() ->
>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>
>>
>> And in vl.c
>>
>> configure_accelerator(current_machine, argv[0]);
>> ...
>> machine_run_board_init()
>>
>> So memory is indeed not mapped before calling qemu_getrampagesize().
>>
>>
>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>> s390_memory_init().
>>
>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>
>> We could than eventually make qemu_getrampagesize() asssert if no
>> backends are mapped at all, to catch other user that rely on this being
>> correct.
> 
> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> and I'm pretty sure it's broken.  It will work in the case where
> there's only one backend.  And if that's the default -mem-path rather
> than an explicit memory backend then my patch won't break it any
> further.

On second look, I think I get your point.

1. Why on earth does "find_max_supported_pagesize" find the "minimum
page size"? What kind of nasty stuff is this?

2. qemu_mempath_getpagesize() is not affected by your patch and that
seems to be the only thing used on s390x for now.

I sent a patch to move the call on s390x. But we really have to detect
the maximum page size (what find_max_supported_pagesize promises), not
the minimum page size.

> 
> qemu_getrampagesize() returns the smallest host page size for any
> memory backend.  That's what matters for pcc KVM (in several cases)
> because we need certain things to be host-contiguous, not just
> guest-contiguous.  Bigger host page sizes are fine for that purpose,
> clearly.
> 
> AFAICT on s390 you're looking to determine if any backend is using
> hugepages, because KVM may not support that.  The minimum host page
> size isn't adequate to determine that, so qemu_getrampagesize() won't
> tell you what you need.
> 

Indeed.
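
A minimal sketch of what a maximum-oriented helper could look like (not in the
tree; it mirrors the existing callback but tracks the largest backend page size,
which is what the s390 hugepage check actually cares about):

static int find_largest_backend_pagesize(Object *obj, void *opaque)
{
    long *hpsize_max = opaque;

    if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));

        /* Track the maximum, not the minimum, backend page size. */
        if (hpsize > *hpsize_max) {
            *hpsize_max = hpsize;
        }
    }

    return 0;
}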
Igor Mammedov March 27, 2019, 5:08 p.m. UTC | #17
On Wed, 27 Mar 2019 15:01:37 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 27.03.19 10:09, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 09:10:01 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
> >>>> On 26.03.19 15:08, Igor Mammedov wrote:  
> >>>>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>  
> >>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>>>> guests, because limitations of the hardware virtualization mean the guest
> >>>>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>>>
> >>>>>> However, it currently checks against *every* memory backend, whether or not
> >>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>>>
> >>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>>>> throw an error when you attempt to map that memory into the guest by
> >>>>>> 'device_add'ing a pc-dimm.
> >>>>>>
> >>>>>> What's not correct is that if you then reset the guest a startup check
> >>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>>>> memory object, even though it's not mapped into the guest.  
> >>>>> I'd say that backend should be remove by mgmt app since device_add failed
> >>>>> instead of leaving it to hang around. (but fatal error either not a nice
> >>>>> behavior on QEMU part)  
> >>>>
> >>>> Indeed, it should be removed. Depending on the options (huge pages with
> >>>> prealloc?) memory might be consumed for unused memory. Undesired.  
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>   
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> > Looking more at it, it is seems s390 is 'broken' anyways.
> > We call qemu_getrampagesize() here with huge page backends on CLI
> > but memory-backends are initialized later
> >  qemu_opts_foreach(..., object_create_delayed, ...)
> > so s390 doesn't take into account memory backends currently
> 
> BTW that might indeed be true, we only check against --mem-path.

It's possible to break it with '-numa node,memdev=...' since we don't really have
anything to block that call chain for s390 (though I'd argue it's an invalid use of
the CLI for s390, it's effectively a '-mem-path on steroids' alternative)
David Hildenbrand March 27, 2019, 5:19 p.m. UTC | #18
On 27.03.19 18:08, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 15:01:37 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 27.03.19 10:09, Igor Mammedov wrote:
>>> On Wed, 27 Mar 2019 09:10:01 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 27.03.19 01:12, David Gibson wrote:
>>>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
>>>>>> On 26.03.19 15:08, Igor Mammedov wrote:  
>>>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>>  
>>>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>>>
>>>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>>>
>>>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>>>> 'device_add'ing a pc-dimm.
>>>>>>>>
>>>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>>>> memory object, even though it's not mapped into the guest.  
>>>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>>>> behavior on QEMU part)  
>>>>>>
>>>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>>>> prealloc?) memory might be consumed for unused memory. Undesired.  
>>>>>
>>>>> Right, but if the guest initiates a reboot before the management gets
>>>>> to that, we'll have a crash.
>>>>>   
>>>>
>>>> Yes, I agree.
>>>>
>>>> At least on s390x (extending on what Igor said):
>>>>
>>>> mc->init() -> s390_memory_init() ->
>>>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>>>
>>>>
>>>> ac->init_machine() -> kvm_arch_init() ->
>>>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>>>
>>>>
>>>> And in vl.c
>>>>
>>>> configure_accelerator(current_machine, argv[0]);
>>> Looking more at it, it is seems s390 is 'broken' anyways.
>>> We call qemu_getrampagesize() here with huge page backends on CLI
>>> but memory-backends are initialized later
>>>  qemu_opts_foreach(..., object_create_delayed, ...)
>>> so s390 doesn't take into account memory backends currently
>>
>> BTW that might indeed be true, we only check against --mem-path.
> 
> It's possible to break it with '-numa node,memdev=...' since we don't really have
> anything to block that call chain for s390 (but I'd argue it's invalid use of CLI
> for s390 but it's effectively -mem-path on steroids alternative)
> 

I remember that -numa on s390x is completely blocked, but my mind might
be playing tricks on me. Anyhow, detecting the biggest page size also has
to be fixed.
Igor Mammedov March 27, 2019, 5:24 p.m. UTC | #19
On Wed, 27 Mar 2019 23:03:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Mar 27, 2019 at 10:38:38AM +0100, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 20:07:57 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > > > On Wed, 27 Mar 2019 11:11:46 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >   
> > > > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >     
> > > > > > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > > > > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > > > > > guests, because limitations of the hardware virtualization mean the guest
> > > > > > > can't use pagesizes larger than the host pages backing its memory.
> > > > > > > 
> > > > > > > However, it currently checks against *every* memory backend, whether or not
> > > > > > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > > > > > 
> > > > > > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > > > > > you can (correctly) create a memory backend, however it (correctly) will
> > > > > > > throw an error when you attempt to map that memory into the guest by
> > > > > > > 'device_add'ing a pc-dimm.
> > > > > > > 
> > > > > > > What's not correct is that if you then reset the guest a startup check
> > > > > > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > > > > > memory object, even though it's not mapped into the guest.    
> > > > > > I'd say that backend should be remove by mgmt app since device_add failed
> > > > > > instead of leaving it to hang around. (but fatal error either not a nice
> > > > > > behavior on QEMU part)    
> > > > > 
> > > > > I agree, but reset could be guest initiated, so even if management is
> > > > > doing the right thing there's a window where it could break.  
> > > > 
> > > > We could (log term) get rid of qemu_getrampagesize() (it's not really good
> > > > to push machine/target specific condition into generic function) and move
> > > > pagesize calculation closer to machine. In this case machine (spapr) knows
> > > > exactly when and what is mapped and can enumerate NOT backends but mapped
> > > > memory regions and/or pc-dimms (lets say we have memory_region_page_size()
> > > > and apply whatever policy to the results.  
> > > 
> > > So.. it used to be in the machine specific code, but was made generic
> > > because it's used in the vfio code.  Where it is ppc specific, but not
> > > keyed into the machine specific code in a way that we can really
> > > re-use it there.
> > It probably was the easiest way to hack things together, but then API gets
> > generalized and misused and then tweaked to specific machine usecase
> > and it gets only worse over time.
> 
> I don't really know what you mean by that.  In both the (main) ppc and
> VFIO cases the semantics we want from getrampagesize() really are the
> same: what's the smallest chunk of guest-contiguous memory we can rely
> on to also be host-contiguous.
"smallest chunk of guest-contiguous memory" is frontend abstraction while
the later "host-contiguous" is backend's one. But qemu_getrampagesize()
operates though on backend side (which doesn't have to represent guest RAM).
Shouldn't we look for guest RAM from frontend side (to be sure we are dealing with guest RAM)
and then find out corresponding host side attributes from there?

btw:
the above doesn't mean that we should block your simpler fix for 4.0.
So for this patch, with David's s390 patch included as a precondition:

Reviewed-by: Igor Mammedov <imammedo@redhat.com>


> 
> > > > > > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > > > > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > > > > > non-mapped memory backends.    
> > > > > > I'm not sure about if it's ok do so. It depends on where from
> > > > > > qemu_getrampagesize() is called. For example s390 calls it rather early
> > > > > > when initializing KVM, so there isn't anything mapped yet.    
> > > > > 
> > > > > Oh.  Drat.
> > > > >   
> > > > > > And once I replace -mem-path with hostmem backend and drop
> > > > > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > > > > > this patch might lead to incorrect results for initial memory as
> > > > > > well.    
> > > > > 
> > > > > Uh.. I don't immediately see why.  
> > > > It depends on when qemu_getrampagesize() is called with this patch.
> > > > 
> > > > Maybe qemu_getrampagesize() is not a good interface anymore?
> > > > I also see it being called directly from device side hw/vfio/spapr.c, which
> > > > in my opinion should be propagated from machine,  
> > > 
> > > Um.. I'm not sure what you mean by that.
> > It would be better if machine would set a page size property on vfio device
> > when it's created.
> 
> No, that doesn't work.  The limitation doesn't originate with the
> machine, it originates with the *host*.  I mean, in practice it'll
> nearly always be KVM and so a machine matching the host, but it
> doesn't actually have to be that way.
> 
> It is possible to run an x86 (or ARM) guest with TCG using a VFIO
> device on a POWER host, and that would need to use the POWER VFIO
> driver.  You'd need to line up a bunch of details to make it work, and
> it'd be kind of pointless, but it's possible.
>
David Gibson March 28, 2019, 12:27 a.m. UTC | #20
On Wed, Mar 27, 2019 at 02:19:41PM +0100, David Hildenbrand wrote:
> On 27.03.19 10:03, David Gibson wrote:
> > On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
> >>>> On 26.03.19 15:08, Igor Mammedov wrote:
> >>>>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>
> >>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>>>> guests, because limitations of the hardware virtualization mean the guest
> >>>>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>>>
> >>>>>> However, it currently checks against *every* memory backend, whether or not
> >>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>>>
> >>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>>>> throw an error when you attempt to map that memory into the guest by
> >>>>>> 'device_add'ing a pc-dimm.
> >>>>>>
> >>>>>> What's not correct is that if you then reset the guest a startup check
> >>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>>>> memory object, even though it's not mapped into the guest.
> >>>>> I'd say that backend should be remove by mgmt app since device_add failed
> >>>>> instead of leaving it to hang around. (but fatal error either not a nice
> >>>>> behavior on QEMU part)
> >>>>
> >>>> Indeed, it should be removed. Depending on the options (huge pages with
> >>>> prealloc?) memory might be consumed for unused memory. Undesired.
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> >> ...
> >> machine_run_board_init()
> >>
> >> So memory is indeed not mapped before calling qemu_getrampagesize().
> >>
> >>
> >> We *could* move the call to kvm_s390_configure_mempath_backing() to
> >> s390_memory_init().
> >>
> >> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> >>
> >> We could than eventually make qemu_getrampagesize() asssert if no
> >> backends are mapped at all, to catch other user that rely on this being
> >> correct.
> > 
> > So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> > and I'm pretty sure it's broken.  It will work in the case where
> > there's only one backend.  And if that's the default -mem-path rather
> > than an explicit memory backend then my patch won't break it any
> > further.
> 
> It works for the current scenarios, where you only have one (maximum
> two) backings of the same kind. Your patch would break that.

Actually it wouldn't.  My patch only affects checking of explicit
backend objects - checking of the base -mem-path implicit backend
remains the same.

> > qemu_getrampagesize() returns the smallest host page size for any
> > memory backend.  That's what matters for pcc KVM (in several cases)
> > because we need certain things to be host-contiguous, not just
> > guest-contiguous.  Bigger host page sizes are fine for that purpose,
> > clearly.
> > 
> > AFAICT on s390 you're looking to determine if any backend is using
> > hugepages, because KVM may not support that.  The minimum host page
> > size isn't adequate to determine that, so qemu_getrampagesize() won't
> > tell you what you need.
> 
> Well, as long as we don't support DIMMS or anything like that it works
> perfectly fine. But yes it is far from beautiful.
> 
> First of all, I'll prepare a patch to do the call from a different
> context. Then we can fine tune to using something else than
> qemu_getrampagesize()
>
David Gibson March 28, 2019, 12:39 a.m. UTC | #21
On Wed, Mar 27, 2019 at 06:24:17PM +0100, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 23:03:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Mar 27, 2019 at 10:38:38AM +0100, Igor Mammedov wrote:
> > > On Wed, 27 Mar 2019 20:07:57 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > > > > On Wed, 27 Mar 2019 11:11:46 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >   
> > > > > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >     
> > > > > > > > qemu_getrampagesize() works out the minimum host page size backing any of
> > > > > > > > guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> > > > > > > > guests, because limitations of the hardware virtualization mean the guest
> > > > > > > > can't use pagesizes larger than the host pages backing its memory.
> > > > > > > > 
> > > > > > > > However, it currently checks against *every* memory backend, whether or not
> > > > > > > > it is actually mapped into guest memory at the moment.  This is incorrect.
> > > > > > > > 
> > > > > > > > This can cause a problem attempting to add memory to a POWER8 pseries KVM
> > > > > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> > > > > > > > you can (correctly) create a memory backend, however it (correctly) will
> > > > > > > > throw an error when you attempt to map that memory into the guest by
> > > > > > > > 'device_add'ing a pc-dimm.
> > > > > > > > 
> > > > > > > > What's not correct is that if you then reset the guest a startup check
> > > > > > > > against qemu_getrampagesize() will cause a fatal error because of the new
> > > > > > > > memory object, even though it's not mapped into the guest.    
> > > > > > > I'd say that backend should be remove by mgmt app since device_add failed
> > > > > > > instead of leaving it to hang around. (but fatal error either not a nice
> > > > > > > behavior on QEMU part)    
> > > > > > 
> > > > > > I agree, but reset could be guest initiated, so even if management is
> > > > > > doing the right thing there's a window where it could break.  
> > > > > 
> > > > > We could (log term) get rid of qemu_getrampagesize() (it's not really good
> > > > > to push machine/target specific condition into generic function) and move
> > > > > pagesize calculation closer to machine. In this case machine (spapr) knows
> > > > > exactly when and what is mapped and can enumerate NOT backends but mapped
> > > > > memory regions and/or pc-dimms (lets say we have memory_region_page_size()
> > > > > and apply whatever policy to the results.  
> > > > 
> > > > So.. it used to be in the machine specific code, but was made generic
> > > > because it's used in the vfio code.  Where it is ppc specific, but not
> > > > keyed into the machine specific code in a way that we can really
> > > > re-use it there.
> > > It probably was the easiest way to hack things together, but then API gets
> > > generalized and misused and then tweaked to specific machine usecase
> > > and it gets only worse over time.
> > 
> > I don't really know what you mean by that.  In both the (main) ppc and
> > VFIO cases the semantics we want from getrampagesize() really are the
> > same: what's the smallest chunk of guest-contiguous memory we can rely
> > on to also be host-contiguous.
> "smallest chunk of guest-contiguous memory" is frontend abstraction while
> the later "host-contiguous" is backend's one. But qemu_getrampagesize()
> operates though on backend side (which doesn't have to represent guest RAM).
> Shouldn't we look for guest RAM from frontend side (to be sure we are dealing with guest RAM)
> and then find out corresponding host side attributes from there?

I'm not really sure what you have in mind there.

In cases where we have a specific guest address, we do try to go from
that to backend information.  Unfortunately in some cases we can't
know a guest address beforehand, so we have to calculate for the worst
case, which is what this is about.

> btw:
> above doesn't mean that we should block your simpler fix for 4.0.
> So for this patch with David's s390 patch as precondition included
> 
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> 
> 
> > 
> > > > > > > > This patch corrects the problem by adjusting find_max_supported_pagesize()
> > > > > > > > (called from qemu_getrampagesize() via object_child_foreach) to exclude
> > > > > > > > non-mapped memory backends.    
> > > > > > > I'm not sure about if it's ok do so. It depends on where from
> > > > > > > qemu_getrampagesize() is called. For example s390 calls it rather early
> > > > > > > when initializing KVM, so there isn't anything mapped yet.    
> > > > > > 
> > > > > > Oh.  Drat.
> > > > > >   
> > > > > > > And once I replace -mem-path with hostmem backend and drop
> > > > > > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to be mapped either/
> > > > > > > this patch might lead to incorrect results for initial memory as
> > > > > > > well.    
> > > > > > 
> > > > > > Uh.. I don't immediately see why.  
> > > > > It depends on when qemu_getrampagesize() is called with this patch.
> > > > > 
> > > > > Maybe qemu_getrampagesize() is not a good interface anymore?
> > > > > I also see it being called directly from device side hw/vfio/spapr.c, which
> > > > > in my opinion should be propagated from machine,  
> > > > 
> > > > Um.. I'm not sure what you mean by that.
> > > It would be better if machine would set a page size property on vfio device
> > > when it's created.
> > 
> > No, that doesn't work.  The limitation doesn't originate with the
> > machine, it originates with the *host*.  I mean, in practice it'll
> > nearly always be KVM and so a machine matching the host, but it
> > doesn't actually have to be that way.
> > 
> > It is possible to run an x86 (or ARM) guest with TCG using a VFIO
> > device on a POWER host, and that would need to use the POWER VFIO
> > driver.  You'd need to line up a bunch of details to make it work, and
> > it'd be kind of pointless, but it's possible.
> > 
>
David Gibson March 28, 2019, 1:18 a.m. UTC | #22
On Wed, Mar 27, 2019 at 03:22:58PM +0100, David Hildenbrand wrote:
> On 27.03.19 10:03, David Gibson wrote:
> > On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
> >>>> On 26.03.19 15:08, Igor Mammedov wrote:
> >>>>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>
> >>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>>>> guests, because limitations of the hardware virtualization mean the guest
> >>>>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>>>
> >>>>>> However, it currently checks against *every* memory backend, whether or not
> >>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>>>
> >>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>>>> throw an error when you attempt to map that memory into the guest by
> >>>>>> 'device_add'ing a pc-dimm.
> >>>>>>
> >>>>>> What's not correct is that if you then reset the guest a startup check
> >>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>>>> memory object, even though it's not mapped into the guest.
> >>>>> I'd say that backend should be remove by mgmt app since device_add failed
> >>>>> instead of leaving it to hang around. (but fatal error either not a nice
> >>>>> behavior on QEMU part)
> >>>>
> >>>> Indeed, it should be removed. Depending on the options (huge pages with
> >>>> prealloc?) memory might be consumed for unused memory. Undesired.
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> >> ...
> >> machine_run_board_init()
> >>
> >> So memory is indeed not mapped before calling qemu_getrampagesize().
> >>
> >>
> >> We *could* move the call to kvm_s390_configure_mempath_backing() to
> >> s390_memory_init().
> >>
> >> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> >>
> >> We could than eventually make qemu_getrampagesize() asssert if no
> >> backends are mapped at all, to catch other user that rely on this being
> >> correct.
> > 
> > So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> > and I'm pretty sure it's broken.  It will work in the case where
> > there's only one backend.  And if that's the default -mem-path rather
> > than an explicit memory backend then my patch won't break it any
> > further.
> 
> On the second look, I think I get your point.
> 
> 1. Why on earth does "find_max_supported_pagesize" find the "minimum
> page size". What kind of nasty stuff is this.

Ah, yeah, the naming is bad because of history.

The original use case for this comes from POWER (before POWER9), where,
because of the way MMU virtualization works, pages inserted into the
guest MMU view have to be host-contiguous: there's no 2nd-level
translation that lets them be broken into smaller host pages.

The upshot is that a KVM guest can only use large pages if it's backed
by large pages on the host.  We have to advertise the availability of
large pages to the guest at boot time though, and there's no way to
restrict it to certain parts of guest RAM.

So, this code path was finding the _maximum_ page size the guest could
use... which depends on the _minimum_ page size used on the host.
When this was moved to (partly) generic code we didn't think to
improve all the names.
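
To make that history concrete, here is a small self-contained model of the
calculation (illustration only, not QEMU code; the backend list and sizes
are made up): the largest page size we can let the guest use is the
smallest host page size backing any *mapped* guest RAM, which is also why
filtering out unmapped backends, as the patch does, gives the right answer.

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define KIB 1024L
#define MIB (1024L * KIB)

struct backend {
    const char *name;
    long host_pagesize;   /* page size of the host memory backing it */
    bool mapped;          /* actually plugged into guest RAM? */
};

/* Same idea as qemu_getrampagesize() with the fix applied: only mapped
 * backends contribute to the minimum. */
static long max_guest_pagesize(const struct backend *b, size_t n)
{
    long min_host = LONG_MAX;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!b[i].mapped) {
            continue;   /* the fix: ignore backends not mapped into the guest */
        }
        if (b[i].host_pagesize < min_host) {
            min_host = b[i].host_pagesize;
        }
    }
    return min_host;
}

int main(void)
{
    const struct backend backends[] = {
        { "main RAM, hugetlbfs", 16 * MIB, true },
        { "dimm0, hugetlbfs",    16 * MIB, true },
        { "leftover 4k backend",  4 * KIB, false },  /* failed device_add */
    };

    printf("guest may use pages up to %ld bytes\n",
           max_guest_pagesize(backends, 3));
    return 0;
}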

> 2. qemu_mempath_getpagesize() is not affected by your patch

Correct.

> and that
> seems to be the only thing used on s390x for now.

Uh.. what?

> I sent a patch to move the call on s390x. But we really have to detect
> the maximum page size (what find_max_supported_pagesize promises), not
> the minimum page size.

Well.. sort of.  In the ppc case it really is the minimum page size we
care about, in the sense that if some part of RAM has a larger page
size, that's fine - even if it's a weird size that we didn't expect.

IIUC for s390 the problem is that KVM doesn't necessarily support
putting large pages into the guest at all, and what large page sizes
it can put into the guest depends on the KVM version.

Finding the maximum backing page size answers that question only on
the assumption that a KVM which supports page size X will support all
smaller page sizes.  I imagine that's true in practice, but does it
have to be true in theory?  If not, it seems safer to me to explicitly
step through every (mapped) backend and individually check if it has a
supported pagesize, rather than only testing the max page size.

It also occurs to me: why does this logic need to be in qemu at all?
KVM must know what pagesizes it supports, and I assume it will throw
an error if you try to put something with the wrong size into a
memslot.  So can't qemu just report that error, rather than checking
the pagesizes itself?

> > qemu_getrampagesize() returns the smallest host page size for any
> > memory backend.  That's what matters for pcc KVM (in several cases)
> > because we need certain things to be host-contiguous, not just
> > guest-contiguous.  Bigger host page sizes are fine for that purpose,
> > clearly.
> > 
> > AFAICT on s390 you're looking to determine if any backend is using
> > hugepages, because KVM may not support that.  The minimum host page
> > size isn't adequate to determine that, so qemu_getrampagesize() won't
> > tell you what you need.
> 
> Indeed.
David Hildenbrand March 28, 2019, 8:53 a.m. UTC | #23
On 28.03.19 01:27, David Gibson wrote:
> On Wed, Mar 27, 2019 at 02:19:41PM +0100, David Hildenbrand wrote:
>> On 27.03.19 10:03, David Gibson wrote:
>>> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
>>>> On 27.03.19 01:12, David Gibson wrote:
>>>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>>>>>> On 26.03.19 15:08, Igor Mammedov wrote:
>>>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>>
>>>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>>>
>>>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>>>
>>>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>>>> 'device_add'ing a pc-dimm.
>>>>>>>>
>>>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>>>> memory object, even though it's not mapped into the guest.
>>>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>>>> behavior on QEMU part)
>>>>>>
>>>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>>>> prealloc?) memory might be consumed for unused memory. Undesired.
>>>>>
>>>>> Right, but if the guest initiates a reboot before the management gets
>>>>> to that, we'll have a crash.
>>>>>
>>>>
>>>> Yes, I agree.
>>>>
>>>> At least on s390x (extending on what Igor said):
>>>>
>>>> mc->init() -> s390_memory_init() ->
>>>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>>>
>>>>
>>>> ac->init_machine() -> kvm_arch_init() ->
>>>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>>>
>>>>
>>>> And in vl.c
>>>>
>>>> configure_accelerator(current_machine, argv[0]);
>>>> ...
>>>> machine_run_board_init()
>>>>
>>>> So memory is indeed not mapped before calling qemu_getrampagesize().
>>>>
>>>>
>>>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>>>> s390_memory_init().
>>>>
>>>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>>>
>>>> We could than eventually make qemu_getrampagesize() asssert if no
>>>> backends are mapped at all, to catch other user that rely on this being
>>>> correct.
>>>
>>> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
>>> and I'm pretty sure it's broken.  It will work in the case where
>>> there's only one backend.  And if that's the default -mem-path rather
>>> than an explicit memory backend then my patch won't break it any
>>> further.
>>
>> It works for the current scenarios, where you only have one (maximum
>> two) backings of the same kind. Your patch would break that.
> 
> Actually it wouldn't.  My patch only affects checking of explicit
> backend objects - checking of the base -mem-path implicit backend
> remains the same.

Yes, you're right.
David Hildenbrand March 28, 2019, 9:26 a.m. UTC | #24
On 28.03.19 02:18, David Gibson wrote:
> On Wed, Mar 27, 2019 at 03:22:58PM +0100, David Hildenbrand wrote:
>> On 27.03.19 10:03, David Gibson wrote:
>>> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
>>>> On 27.03.19 01:12, David Gibson wrote:
>>>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>>>>>> On 26.03.19 15:08, Igor Mammedov wrote:
>>>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>>
>>>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
>>>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
>>>>>>>> guests, because limitations of the hardware virtualization mean the guest
>>>>>>>> can't use pagesizes larger than the host pages backing its memory.
>>>>>>>>
>>>>>>>> However, it currently checks against *every* memory backend, whether or not
>>>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
>>>>>>>>
>>>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
>>>>>>>> guest which is configured to allow hugepages in the guest (e.g.
>>>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
>>>>>>>> you can (correctly) create a memory backend, however it (correctly) will
>>>>>>>> throw an error when you attempt to map that memory into the guest by
>>>>>>>> 'device_add'ing a pc-dimm.
>>>>>>>>
>>>>>>>> What's not correct is that if you then reset the guest a startup check
>>>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
>>>>>>>> memory object, even though it's not mapped into the guest.
>>>>>>> I'd say that backend should be remove by mgmt app since device_add failed
>>>>>>> instead of leaving it to hang around. (but fatal error either not a nice
>>>>>>> behavior on QEMU part)
>>>>>>
>>>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>>>> prealloc?) memory might be consumed for unused memory. Undesired.
>>>>>
>>>>> Right, but if the guest initiates a reboot before the management gets
>>>>> to that, we'll have a crash.
>>>>>
>>>>
>>>> Yes, I agree.
>>>>
>>>> At least on s390x (extending on what Igor said):
>>>>
>>>> mc->init() -> s390_memory_init() ->
>>>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>>>
>>>>
>>>> ac->init_machine() -> kvm_arch_init() ->
>>>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>>>
>>>>
>>>> And in vl.c
>>>>
>>>> configure_accelerator(current_machine, argv[0]);
>>>> ...
>>>> machine_run_board_init()
>>>>
>>>> So memory is indeed not mapped before calling qemu_getrampagesize().
>>>>
>>>>
>>>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>>>> s390_memory_init().
>>>>
>>>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>>>
>>>> We could than eventually make qemu_getrampagesize() asssert if no
>>>> backends are mapped at all, to catch other user that rely on this being
>>>> correct.
>>>
>>> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
>>> and I'm pretty sure it's broken.  It will work in the case where
>>> there's only one backend.  And if that's the default -mem-path rather
>>> than an explicit memory backend then my patch won't break it any
>>> further.
>>
>> On the second look, I think I get your point.
>>
>> 1. Why on earth does "find_max_supported_pagesize" find the "minimum
>> page size". What kind of nasty stuff is this.
> 
> Ah, yeah, the naming is bad because of history.
> 
> The original usecase of this was because on POWER (before POWER9) the
> way MMU virtualization works, pages inserted into the guest MMU view
> have to be host-contiguous: there's no 2nd level translation that lets
> them be broken into smaller host pages.
> 
> The upshot is that a KVM guest can only use large pages if it's backed
> by large pages on the host.  We have to advertise the availability of
> large pages to the guest at boot time though, and there's no way to
> restrict it to certain parts of guest RAM.
> 
> So, this code path was finding the _maximum_ page size the guest could
> use... which depends on the _minimum page_ size used on the host.
> When this was moved to (partly) generic code we didn't think to
> improve all the names.
> 
>> 2. qemu_mempath_getpagesize() is not affected by your patch
> 
> Correct.
> 
>> and that
>> seems to be the only thing used on s390x for now.
> 
> Uh.. what?
> 
>> I sent a patch to move the call on s390x. But we really have to detect
>> the maximum page size (what find_max_supported_pagesize promises), not
>> the minimum page size.
> 
> Well.. sort of.  In the ppc case it really is the minimum page size we
> care about, in the sense that if some part of RAM has a larger page
> size, that's fine - even if it's a weird size that we didn't expect.
> 
> IIUC for s390 the problem is that KVM doesn't necessarily support
> putting large pages into the guest at all, and what large page sizes
> it can put into the guest depends on the KVM version.

Yes.

> 
> Finding the maximum backing page size answers that question only on
> the assumption that a KVM which supports page size X will support all
> smaller page sizes.  I imagine that's true in practice, but does it
> have to be true in theory?  If not, it seems safer to me to explicitly
> step through every (mapped) backend and individually check if it has a
> supported pagesize, rather than only testing the max page size.


We only support 4k and 1MB page sizes in KVM. So for now the question
does not apply. Then there are 2GB pages in HW.

A HW that supports 2GB supports 1MB. The only way something else (e.g.
512KB pages) could be added would be by making massive changes to the
architecture. I don't consider this realistic - it would be like suddenly
changing the page size from 4k to 8k on HW that was never prepared for
it. However, if that ever happens, we'll get new KVM capabilities to
check/enable support.

E.g. for 2GB pages, we'll have another KVM capability to enable gigantic
pages. If 1MB pages were no longer supported by KVM, the capability
would not be indicated.

Once we support 2GB pages, we might have to think about what you
describe here, depending on what the KVM interface promises us. If the
interface promises "if 2GB is enabled, 1MB is enabled implicitly", we
are fine; otherwise we would have to check per mapped backend.

> 
> It also occurs to me: why does this logic need to be in qemu at all?
> KVM must know what pagesizes it supports, and I assume it will throw
> an error if you try to put something with the wrong size into a
> memslot.  So can't qemu just report that error, rather than checking
> the pagesizes itself?

There are multiple things to that

1. "KVM must know what pagesizes it supports"

Yes it does, and this is reflected via KVM capabilities (e.g.
KVM_CAP_S390_HPAGE_1M). To make use of
these capabilities, they have to be enabled by user space. Once we have
support for 2G pages, we'll have KVM_CAP_S390_HPAGE_2G.

When the capability is enabled, certain things have to be changed in KVM:
- CMMA can no longer be used (user space has to properly take care of that)
- Certain HW assists (PFMF interpretation, Storage Key Facility) have to
be disabled early.


2. "it will throw an error if you try to put something with the wrong
    size into a memslot"

An error will be reported when trying to map a huge page into the GMAP
(e.g. on a host page fault in virtualization mode). So not when the
memslot is configured, but during kvm_run.

Checking the memslot might be

a) complicated (check all VMAs)
b) a waste of time (many VMAs)
c) incorrect - the content of a memslot can change at any time (KVM
synchronous MMU). Think of someone wanting to remap some pages that are
part of a memslot using huge pages.

3. Can you elaborate on "So can't qemu just report that error, rather
than checking the pagesizes itself?" We effectively check against the
capabilities of KVM and the page size. Based on that, we report the
error in QEMU. Reporting an error after the guest has already started
and crashed during kvm_run due to a huge page is way too late.
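
Putting those three points together, here is a minimal standalone model of
the up-front check being described (illustration only, not QEMU code; the
capability handling and the backend list are invented stand-ins): user
space learns from KVM capabilities which huge page sizes it may enable,
and every *mapped* backend's page size is validated against that set
before the guest starts, because a bad page size would otherwise only
surface as a fault during kvm_run.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define KIB 1024LL
#define MIB (1024LL * KIB)
#define GIB (1024LL * MIB)

struct backend {
    const char *name;
    long long pagesize;
    bool mapped;
};

/* Stand-in for the KVM capability query/enable step: 4k always works,
 * 1M only if an HPAGE_1M-style capability could be enabled, and there
 * is no capability for gigantic pages yet in this model. */
static bool kvm_pagesize_supported(long long pagesize, bool cap_hpage_1m)
{
    if (pagesize == 4 * KIB) {
        return true;
    }
    if (pagesize == 1 * MIB) {
        return cap_hpage_1m;
    }
    return false;
}

/* Check every mapped backend individually, up front, so the problem is
 * reported before the guest runs instead of faulting later in kvm_run. */
static int validate_backends(const struct backend *b, size_t n,
                             bool cap_hpage_1m)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!b[i].mapped) {
            continue;
        }
        if (!kvm_pagesize_supported(b[i].pagesize, cap_hpage_1m)) {
            fprintf(stderr, "backend '%s': %lld byte pages not supported\n",
                    b[i].name, b[i].pagesize);
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    const struct backend backends[] = {
        { "mem0", 1 * MIB, true },
        { "mem1", 2 * GIB, true },   /* rejected: no capability for 2G yet */
    };

    return validate_backends(backends, 2, true) ? 1 : 0;
}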
David Hildenbrand March 28, 2019, 9:51 a.m. UTC | #25
On 26.03.19 04:50, David Gibson wrote:
> qemu_getrampagesize() works out the minimum host page size backing any of
> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> guests, because limitations of the hardware virtualization mean the guest
> can't use pagesizes larger than the host pages backing its memory.
> 
> However, it currently checks against *every* memory backend, whether or not
> it is actually mapped into guest memory at the moment.  This is incorrect.
> 
> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> guest which is configured to allow hugepages in the guest (e.g.
> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> you can (correctly) create a memory backend, however it (correctly) will
> throw an error when you attempt to map that memory into the guest by
> 'device_add'ing a pc-dimm.
> 
> What's not correct is that if you then reset the guest a startup check
> against qemu_getrampagesize() will cause a fatal error because of the new
> memory object, even though it's not mapped into the guest.
> 
> This patch corrects the problem by adjusting find_max_supported_pagesize()
> (called from qemu_getrampagesize() via object_child_foreach) to exclude
> non-mapped memory backends.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  exec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> This is definitely a bug, but it's not a regression.  I'm not sure if
> this is 4.0 material at this stage of the freeze or not.
> 
> diff --git a/exec.c b/exec.c
> index 86a38d3b3b..6ab62f4eee 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1692,9 +1692,10 @@ static int find_max_supported_pagesize(Object *obj, void *opaque)
>      long *hpsize_min = opaque;
>  
>      if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
> -        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));
> +        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
> +        long hpsize = host_memory_backend_pagesize(backend);
>  
> -        if (hpsize < *hpsize_min) {
> +        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
>              *hpsize_min = hpsize;
>          }
>      }
> 

As discussed, from an s390x point of view this doesn't seem to break anything.

Acked-by: David Hildenbrand <david@redhat.com>
David Gibson March 29, 2019, 12:19 a.m. UTC | #26
On Thu, Mar 28, 2019 at 10:26:07AM +0100, David Hildenbrand wrote:
> On 28.03.19 02:18, David Gibson wrote:
> > On Wed, Mar 27, 2019 at 03:22:58PM +0100, David Hildenbrand wrote:
> >> On 27.03.19 10:03, David Gibson wrote:
> >>> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> >>>> On 27.03.19 01:12, David Gibson wrote:
> >>>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
> >>>>>> On 26.03.19 15:08, Igor Mammedov wrote:
> >>>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
> >>>>>>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>>
> >>>>>>>> qemu_getrampagesize() works out the minimum host page size backing any of
> >>>>>>>> guest RAM.  This is required in a few places, such as for POWER8 PAPR KVM
> >>>>>>>> guests, because limitations of the hardware virtualization mean the guest
> >>>>>>>> can't use pagesizes larger than the host pages backing its memory.
> >>>>>>>>
> >>>>>>>> However, it currently checks against *every* memory backend, whether or not
> >>>>>>>> it is actually mapped into guest memory at the moment.  This is incorrect.
> >>>>>>>>
> >>>>>>>> This can cause a problem attempting to add memory to a POWER8 pseries KVM
> >>>>>>>> guest which is configured to allow hugepages in the guest (e.g.
> >>>>>>>> -machine cap-hpt-max-page-size=16m).  If you attempt to add non-hugepage,
> >>>>>>>> you can (correctly) create a memory backend, however it (correctly) will
> >>>>>>>> throw an error when you attempt to map that memory into the guest by
> >>>>>>>> 'device_add'ing a pc-dimm.
> >>>>>>>>
> >>>>>>>> What's not correct is that if you then reset the guest a startup check
> >>>>>>>> against qemu_getrampagesize() will cause a fatal error because of the new
> >>>>>>>> memory object, even though it's not mapped into the guest.
> >>>>>>> I'd say that backend should be remove by mgmt app since device_add failed
> >>>>>>> instead of leaving it to hang around. (but fatal error either not a nice
> >>>>>>> behavior on QEMU part)
> >>>>>>
> >>>>>> Indeed, it should be removed. Depending on the options (huge pages with
> >>>>>> prealloc?) memory might be consumed for unused memory. Undesired.
> >>>>>
> >>>>> Right, but if the guest initiates a reboot before the management gets
> >>>>> to that, we'll have a crash.
> >>>>>
> >>>>
> >>>> Yes, I agree.
> >>>>
> >>>> At least on s390x (extending on what Igor said):
> >>>>
> >>>> mc->init() -> s390_memory_init() ->
> >>>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>>>
> >>>>
> >>>> ac->init_machine() -> kvm_arch_init() ->
> >>>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>>>
> >>>>
> >>>> And in vl.c
> >>>>
> >>>> configure_accelerator(current_machine, argv[0]);
> >>>> ...
> >>>> machine_run_board_init()
> >>>>
> >>>> So memory is indeed not mapped before calling qemu_getrampagesize().
> >>>>
> >>>>
> >>>> We *could* move the call to kvm_s390_configure_mempath_backing() to
> >>>> s390_memory_init().
> >>>>
> >>>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> >>>>
> >>>> We could than eventually make qemu_getrampagesize() asssert if no
> >>>> backends are mapped at all, to catch other user that rely on this being
> >>>> correct.
> >>>
> >>> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> >>> and I'm pretty sure it's broken.  It will work in the case where
> >>> there's only one backend.  And if that's the default -mem-path rather
> >>> than an explicit memory backend then my patch won't break it any
> >>> further.
> >>
> >> On the second look, I think I get your point.
> >>
> >> 1. Why on earth does "find_max_supported_pagesize" find the "minimum
> >> page size". What kind of nasty stuff is this.
> > 
> > Ah, yeah, the naming is bad because of history.
> > 
> > The original usecase of this was because on POWER (before POWER9) the
> > way MMU virtualization works, pages inserted into the guest MMU view
> > have to be host-contiguous: there's no 2nd level translation that lets
> > them be broken into smaller host pages.
> > 
> > The upshot is that a KVM guest can only use large pages if it's backed
> > by large pages on the host.  We have to advertise the availability of
> > large pages to the guest at boot time though, and there's no way to
> > restrict it to certain parts of guest RAM.
> > 
> > So, this code path was finding the _maximum_ page size the guest could
> > use... which depends on the _minimum page_ size used on the host.
> > When this was moved to (partly) generic code we didn't think to
> > improve all the names.
> > 
> >> 2. qemu_mempath_getpagesize() is not affected by your patch
> > 
> > Correct.
> > 
> >> and that
> >> seems to be the only thing used on s390x for now.
> > 
> > Uh.. what?
> > 
> >> I sent a patch to move the call on s390x. But we really have to detect
> >> the maximum page size (what find_max_supported_pagesize promises), not
> >> the minimum page size.
> > 
> > Well.. sort of.  In the ppc case it really is the minimum page size we
> > care about, in the sense that if some part of RAM has a larger page
> > size, that's fine - even if it's a weird size that we didn't expect.
> > 
> > IIUC for s390 the problem is that KVM doesn't necessarily support
> > putting large pages into the guest at all, and what large page sizes
> > it can put into the guest depends on the KVM version.
> 
> Yes.
> 
> > 
> > Finding the maximum backing page size answers that question only on
> > the assumption that a KVM which supports page size X will support all
> > smaller page sizes.  I imagine that's true in practice, but does it
> > have to be true in theory?  If not, it seems safer to me to explicitly
> > step through every (mapped) backend and individually check if it has a
> > supported pagesize, rather than only testing the max page size.
> 
> 
> We only support 4k and 1MB page sizes in KVM. So for now the question
> does not apply. Then there is 2GB pages in HW.
> 
> A HW that supports 2GB supports 1MB. The only way something else (e.g.
> 512KB pages) could be added would be by doing massive changes to the
> architecture. I don't consider this realistic. Like changing suddenly
> the page size from 4k to 8k on a HW that was never prepared for it.
> However, if that would ever happen, we'll get new KVM capabilities to
> check/enable support.
> 
> E.g. for 2GB pages, we'll have another KVM capability to enable gigantic
> pages. If 1MB pages would no longer supported by KVM, the capability
> would not be indicated.
> 
> Once we support 2GB pages, we'll might have think about what you
> describe here, depending on what the KVM interface promises us. If the
> interfaces promises "If 2GB are enabled, 1MB are enabled implicitly", we
> are fine, otherwise we would have to check per mapped backend.

I guess.  I'm generally in favour of checking explicitly for the
condition you need, rather than something that should be equivalent
based on a bunch of assumptions, even if those assumptions are pretty
solid.  At least if it's practical to do so, which explicitly
iterating through the backends seems like it would be here.

But, when it comes down to it, I don't really care that much which
solution you go with.

> > It also occurs to me: why does this logic need to be in qemu at all?
> > KVM must know what pagesizes it supports, and I assume it will throw
> > an error if you try to put something with the wrong size into a
> > memslot.  So can't qemu just report that error, rather than checking
> > the pagesizes itself?
> 
> There are multiple things to that
> 
> 1. "KVM must know what pagesizes it supports"
> 
> Yes it does, and this is reflected via KVM capabilities (e.g.
> KVM_CAP_S390_HPAGE_1M) . To make use of
> these capabilities, they have to be enabled by user space. Once we have
> support for 2G pages, we'll have KVM_CAP_S390_HPAGE_2G.
> 
> In case the capability is enabled, certain things have to be changed in KVM
> - CMMA can no longer be used (user space has to properly take care of that)
> - Certain HW assists (PFMF interpretation, Storage Key Facility) have to
> be disabled early.
> 
> 
> 2. "it will throw an error if you try to put something with the wrong
>     size into a memslot"
> 
> An error will be reported when trying to map a huge page into the GMAP
> (e.g. on a host page fault in virtualization mode). So not when the
> memslot is configured, but during kvm_run.

Ah, ok, if we don't get the error at set memslot time then yes that's
definitely something we'd need to check for in qemu in advance.

> Checking the memslot might be
> 
> a) complicated (check all VMAs)

Yeah, maybe.

> b) waste of time (many VMAs)

I doubt that's really the case, but it doesn't mater because..

> c) incorrect - the content of a memslot can change any time. (KVM
> synchronous MMU). Think of someone wanting to remap some pages part of a
> memslot using huge pages.

..good point.  Yeah, ok, that's not going to work.

> 3. Can you elaborate on "So can't qemu just report that error, rather
> than checking the pagesizes itself?" We effectively check against the
> capabilities of KVM and the page size. Based on that, we report the
> error in QEMU. Reporting an error after the guest has already started
> and crashed during kvm_run due to a huge page is way too late.

Agreed.
David Hildenbrand March 29, 2019, 7:45 a.m. UTC | #27
>> Once we support 2GB pages, we'll might have think about what you
>> describe here, depending on what the KVM interface promises us. If the
>> interfaces promises "If 2GB are enabled, 1MB are enabled implicitly", we
>> are fine, otherwise we would have to check per mapped backend.
> 
> I guess.  I'm generally in favour of checking explicitly for the
> condition you need, rather than something that should be equivalent
> based on a bunch of assumptions, even if those assumptions are pretty
> solid.  At least if it's practical to do so, which explicitly
> iterating through the backends seems like it would be here.
> 
> But, when it comes down to it, I don't really care that much which
> solution you go with.

I guess it will all be easier to handle once we have all RAM converted
to (internal) memory devices. No need to check for side conditions
(e.g. mem-path) or memory backends used for purposes other than RAM.

> 
>>> It also occurs to me: why does this logic need to be in qemu at all?
>>> KVM must know what pagesizes it supports, and I assume it will throw
>>> an error if you try to put something with the wrong size into a
>>> memslot.  So can't qemu just report that error, rather than checking
>>> the pagesizes itself?
>>
>> There are multiple things to that
>>
>> 1. "KVM must know what pagesizes it supports"
>>
>> Yes it does, and this is reflected via KVM capabilities (e.g.
>> KVM_CAP_S390_HPAGE_1M) . To make use of
>> these capabilities, they have to be enabled by user space. Once we have
>> support for 2G pages, we'll have KVM_CAP_S390_HPAGE_2G.
>>
>> In case the capability is enabled, certain things have to be changed in KVM
>> - CMMA can no longer be used (user space has to properly take care of that)
>> - Certain HW assists (PFMF interpretation, Storage Key Facility) have to
>> be disabled early.
>>
>>
>> 2. "it will throw an error if you try to put something with the wrong
>>     size into a memslot"
>>
>> An error will be reported when trying to map a huge page into the GMAP
>> (e.g. on a host page fault in virtualization mode). So not when the
>> memslot is configured, but during kvm_run.
> 
> Ah, ok, if we don't get the error at set memslot time then yes that's
> definitely something we'd need to check for in qemu in advance.
> 
>> Checking the memslot might be
>>
>> a) complicated (check all VMAs)
> 
> Yeah, maybe.
> 
>> b) waste of time (many VMAs)
> 
> I doubt that's really the case, but it doesn't mater because..
> 
>> c) incorrect - the content of a memslot can change any time. (KVM
>> synchronous MMU). Think of someone wanting to remap some pages part of a
>> memslot using huge pages.
> 
> ..good point.  Yeah, ok, that's not going to work.
> 
>> 3. Can you elaborate on "So can't qemu just report that error, rather
>> than checking the pagesizes itself?" We effectively check against the
>> capabilities of KVM and the page size. Based on that, we report the
>> error in QEMU. Reporting an error after the guest has already started
>> and crashed during kvm_run due to a huge page is way too late.
> 
> Agreed.
>
diff mbox series

Patch

diff --git a/exec.c b/exec.c
index 86a38d3b3b..6ab62f4eee 100644
--- a/exec.c
+++ b/exec.c
@@ -1692,9 +1692,10 @@  static int find_max_supported_pagesize(Object *obj, void *opaque)
     long *hpsize_min = opaque;
 
     if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
-        long hpsize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));
+        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+        long hpsize = host_memory_backend_pagesize(backend);
 
-        if (hpsize < *hpsize_min) {
+        if (host_memory_backend_is_mapped(backend) && (hpsize < *hpsize_min)) {
             *hpsize_min = hpsize;
         }
     }