[v4,01/14] memory-device: drop assert related to align and start of address space

Message ID 20180517081527.14410-2-david@redhat.com
State New
Series
  • MemoryDevice: use multi stage hotplug handlers

Commit Message

David Hildenbrand May 17, 2018, 8:15 a.m.
The start of the address space does not have to be aligned for the
search. Handle this case explicitly when starting the search for a new
address.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Igor Mammedov May 29, 2018, 1:27 p.m. | #1
On Thu, 17 May 2018 10:15:14 +0200
David Hildenbrand <david@redhat.com> wrote:

> The start of the address space does not have to be aligned for the
> search. Handle this case explicitly when starting the search for a new
> address.
That's true,
but the commit message doesn't explain why address_space_start
should be allowed to be non-aligned.

At least with this assert we would notice early that
a board allocated a misaligned address space.
I'd keep the assert unless there is a good reason to drop it.


> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/mem/memory-device.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> index 3e04f3954e..361d38bfc5 100644
> --- a/hw/mem/memory-device.c
> +++ b/hw/mem/memory-device.c
> @@ -116,7 +116,6 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>      address_space_start = ms->device_memory->base;
>      address_space_end = address_space_start +
>                          memory_region_size(&ms->device_memory->mr);
> -    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
>      g_assert(address_space_end >= address_space_start);
>  
>      memory_device_check_addable(ms, size, errp);
> @@ -149,7 +148,7 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>              return 0;
>          }
>      } else {
> -        new_addr = address_space_start;
> +        new_addr = QEMU_ALIGN_UP(address_space_start, align);
>      }
>  
>      /* find address range that will fit new memory device */
David Hildenbrand May 29, 2018, 4:02 p.m. | #2
On 29.05.2018 15:27, Igor Mammedov wrote:
> On Thu, 17 May 2018 10:15:14 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> The start of the address space does not have to be aligned for the
>> search. Handle this case explicitly when starting the search for a new
>> address.
> That's true,
> but commit message doesn't explain why address_space_start
> should be allowed to be non aligned.
> 
> At least with this assert we would notice early that
> board allocating misaligned address space.
> I'd keep the assert unless there is a good reason to drop it.

That reason might be that I can easily crash QEMU

 ./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object
memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M
-device pc-dimm,id=dimm1,memdev=mem0

ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion
failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)


> 
> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/mem/memory-device.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
>> index 3e04f3954e..361d38bfc5 100644
>> --- a/hw/mem/memory-device.c
>> +++ b/hw/mem/memory-device.c
>> @@ -116,7 +116,6 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>>      address_space_start = ms->device_memory->base;
>>      address_space_end = address_space_start +
>>                          memory_region_size(&ms->device_memory->mr);
>> -    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
>>      g_assert(address_space_end >= address_space_start);
>>  
>>      memory_device_check_addable(ms, size, errp);
>> @@ -149,7 +148,7 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>>              return 0;
>>          }
>>      } else {
>> -        new_addr = address_space_start;
>> +        new_addr = QEMU_ALIGN_UP(address_space_start, align);
>>      }
>>  
>>      /* find address range that will fit new memory device */
>
Igor Mammedov May 30, 2018, 12:57 p.m. | #3
On Tue, 29 May 2018 18:02:06 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 29.05.2018 15:27, Igor Mammedov wrote:
> > On Thu, 17 May 2018 10:15:14 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> The start of the address space does not have to be aligned for the
> >> search. Handle this case explicitly when starting the search for a new
> >> address.  
> > That's true,
> > but commit message doesn't explain why address_space_start
> > should be allowed to be non aligned.
> > 
> > At least with this assert we would notice early that
> > board allocating misaligned address space.
> > I'd keep the assert unless there is a good reason to drop it.  
> 
> That reason might be that I can easily crash QEMU
> 
>  ./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object
> memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M
> -device pc-dimm,id=dimm1,memdev=mem0
> 
> ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion
> failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)
it looks like a different issue.
As I remember, the user-visible 'align' property was added as duct tape
since we can't figure out the alignment for a DAX device on the host,
so it was left up to upper layers to magically figure that out.

However, we probably shouldn't allow an arbitrary alignment, or one
bigger than the max page size supported by the target machine/cpu
(i.e. the currently hardcoded address_space_start alignment),
as it creates unnecessary fragmentation and is not accounted for in the
size of the hotplug region (for x86 we count in an additional 1GB per memory device).

How about turning that assert into an error check that
inhibits plugging in devices with alignment values
larger than the address_space_start alignment?

 
> >   
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>  hw/mem/memory-device.c | 3 +--
> >>  1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> >> index 3e04f3954e..361d38bfc5 100644
> >> --- a/hw/mem/memory-device.c
> >> +++ b/hw/mem/memory-device.c
> >> @@ -116,7 +116,6 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
> >>      address_space_start = ms->device_memory->base;
> >>      address_space_end = address_space_start +
> >>                          memory_region_size(&ms->device_memory->mr);
> >> -    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
> >>      g_assert(address_space_end >= address_space_start);
> >>  
> >>      memory_device_check_addable(ms, size, errp);
> >> @@ -149,7 +148,7 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
> >>              return 0;
> >>          }
> >>      } else {
> >> -        new_addr = address_space_start;
> >> +        new_addr = QEMU_ALIGN_UP(address_space_start, align);
> >>      }
> >>  
> >>      /* find address range that will fit new memory device */  
> >   
> 
>
David Hildenbrand May 30, 2018, 2:06 p.m. | #4
On 30.05.2018 14:57, Igor Mammedov wrote:
> On Tue, 29 May 2018 18:02:06 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 29.05.2018 15:27, Igor Mammedov wrote:
>>> On Thu, 17 May 2018 10:15:14 +0200
>>> David Hildenbrand <david@redhat.com> wrote:
>>>   
>>>> The start of the address space does not have to be aligned for the
>>>> search. Handle this case explicitly when starting the search for a new
>>>> address.  
>>> That's true,
>>> but commit message doesn't explain why address_space_start
>>> should be allowed to be non aligned.
>>>
>>> At least with this assert we would notice early that
>>> board allocating misaligned address space.
>>> I'd keep the assert unless there is a good reason to drop it.  
>>
>> That reason might be that I can easily crash QEMU
>>
>>  ./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object
>> memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M
>> -device pc-dimm,id=dimm1,memdev=mem0
>>
>> ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion
>> failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)
> it looks like a different issue.
> As I remember user visible 'align' property was added as duct tape since
> we can't figure out alignment for DAX device no the host,
> so it was left upto upper layers to magically figure that out.
> 
> However we probably shouldn't allow arbitrary or bigger aligment
> than max page size supported by target machine/cpu
> (i.e. currently hardcoded address_space_start alignment),
> as it creates unnecessary fragmentation and not counted in size
> of hotplug region (for x86 we count in additional 1Gb per memory device).
> 
> How about turning that assert into error check that
> inhibits plugging in devices with alignment values
> larger than address_space_start alignment?


Let me explain a little bit why I don't like such restrictions (for
which I don't see a need yet):

virtio-mem devices will later have a certain block size (e.g. 4MB). I
want to give devices during resource assignment the possibility to
specify their alignment requirements.

For said virtio-mem device, this would e.g. be 4MB (see patches 10 and 14
for how this "get_align" call comes into play), because the addresses of
the memory blocks are all aligned to e.g. 4MB. This is what is
guaranteed by the device specification.

E.g. for DIMMs we might want to allow specifying the section size (e.g.
128MB on x86); otherwise e.g. Linux is not able to add all memory. (But
we should not hardcode this, as it is a Linux-specific requirement -
still, it would be nice to be able to specify it.)

So in general, a memory device might have some alignment that is to be
taken care of.

I don't understand right now why an upper limit on the alignment would
make sense at all. We can easily handle it during our search. And we
have to handle it either way during the search if we plug some device
with a strange size (e.g. a 1MB DIMM).

Of course, we might end up fragmenting guest physical memory, but that
is purely a setup issue (choosing sizes of devices + main memory
properly). I don't see a reason to error out (and of course also not to
assert out :) ).
Igor Mammedov May 31, 2018, 1:54 p.m. | #5
On Wed, 30 May 2018 16:06:59 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 30.05.2018 14:57, Igor Mammedov wrote:
> > On Tue, 29 May 2018 18:02:06 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 29.05.2018 15:27, Igor Mammedov wrote:  
> >>> On Thu, 17 May 2018 10:15:14 +0200
> >>> David Hildenbrand <david@redhat.com> wrote:
> >>>     
> >>>> The start of the address space does not have to be aligned for the
> >>>> search. Handle this case explicitly when starting the search for a new
> >>>> address.    
> >>> That's true,
> >>> but commit message doesn't explain why address_space_start
> >>> should be allowed to be non aligned.
> >>>
> >>> At least with this assert we would notice early that
> >>> board allocating misaligned address space.
> >>> I'd keep the assert unless there is a good reason to drop it.    
> >>
> >> That reason might be that I can easily crash QEMU
> >>
> >>  ./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object
> >> memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M
> >> -device pc-dimm,id=dimm1,memdev=mem0
> >>
> >> ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion
> >> failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)  
> > it looks like a different issue.
> > As I remember user visible 'align' property was added as duct tape since
> > we can't figure out alignment for DAX device no the host,
> > so it was left upto upper layers to magically figure that out.
> > 
> > However we probably shouldn't allow arbitrary or bigger aligment
> > than max page size supported by target machine/cpu
> > (i.e. currently hardcoded address_space_start alignment),
> > as it creates unnecessary fragmentation and not counted in size
> > of hotplug region (for x86 we count in additional 1Gb per memory device).
> > 
> > How about turning that assert into error check that
> > inhibits plugging in devices with alignment values
> > larger than address_space_start alignment?  
> 
> 
> Let me explain a little bit why I don't like such restrictions (for
> which I don't see a need yet):
(*) being conservative is good here because we can always relax restrictions
in the future if needed without breaking users, but we can't take away
something that's been shipped (and if we do, it typically
introduces a bunch of compat code to keep old machines working).
Also, besides giving us some freedom of movement in the future,
restrictions to some degree prevent users from misconfiguration.

> virtio-mem devices will later have a certain block size (e.g. 4MB). I
> want to give devices during resource assignment the possibility to
> specify their alignment requirements.
size and alignment are two different things here; alignment in our design
is dictated by the backing storage page size, and for performance reasons
HVA and GPA should be aligned on the same boundary. Users are free
to pick another GPA manually as long as it has the same alignment,
but for automatic placement we use the backend's alignment to make placement
as compact as possible while still keeping the GPA aligned with the HVA.

> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14
> of how this call "get_align" comes into play), because the addresses of
> the memory blocks are all aligned by e.g. 4MB. This is what is
> guaranteed by the device specification.
where does this 4MB magic come from, and why must a block be aligned
to this size?
 
> E.g. for DIMMs we might want to allow to specify the section size (e.g.
> 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but
> we should not hardcode this, as this is a Linux specific requirement -
> still it would be nice to specify)
true, it's guest specific and we do not have restrictions here.
The only restriction we have is that the size must be a multiple of
the backing storage page size (i.e. the alignment) so that the guest
is able to use the tail page.

> So in general, a memory device might have some alignment that is to be
> taken care of.
Do we really need to introduce a frontend-specific alignment?
I'd try to reuse the backend's one and go for the frontend's only in case we have to.

> I don't understand right now why an upper limit on the alignment would
> make sense at all. We can easily handle it during our search. And we
> have to handle it either way during the search, if we plug some device
> with strange sizes (e.g. 1MB DIMM).
> 
> Of course, we might end up fragmenting guest physical memory, but that
> is purely a setup issue (choosing sizes of devices + main memory
> properly). I don't see a reason to error out (and of course also not to
> assert out :) ).
Agreed about the assert, but I'd still prefer to error out there so that users
won't create insane configs and then complain (see below and *).

An upper alignment value is useful for proper sizing of the hotplug address space,
so that the user can plug #slots devices up to the #maxmem specified on the CLI.
It's still possible to misconfigure using manual placement, but the default
one just works: the user either consumes all memory (#maxmem) and/or all #slots.

There is no misunderstanding in case of error here, as it works the same as
on bare metal: one doesn't have a free place to put more memory, or all
the memory one asked for is already there.

So it might be that decoupling #slots from memory devices is a premature
action and we can still use it with virtio-mem/pmem.
David Hildenbrand June 4, 2018, 10:53 a.m. | #6
>> Let me explain a little bit why I don't like such restrictions (for
>> which I don't see a need yet):
> (*) being conservative is good here because we can always relax restrictions
> in future if it's needed without breaking users, but we can't take away
> something thing that's been shipped (and if we do it, it typically
> introduces a bunch of compat code to keep old machines working).
> Also beside of giving as some freedom of movement in the future,
> restrictions also to some degree prevent user from misconfiguration)

Right, but I consider a maximum on an alignment arbitrary (and even
difficult to implement; as you said, "max page size supported by target
machine/cpu", and there are even different huge page sizes, e.g. on x86).

So I'd really vote against introducing such checks.

And another point against introducing a check for the maximum alignment
would be: there might be users with a specified alignment > max page
size. Changing the behavior would break these setups (strange but true).

> 
>> virtio-mem devices will later have a certain block size (e.g. 4MB). I
>> want to give devices during resource assignment the possibility to
>> specify their alignment requirements.
> size and alignment are 2 diffrent things here, alignment in our design
> is dictated by backing storage page size and for performance reasons
> HVA and GPA should be aligned on the same boundary, users are free
> to pick another GPA manually as far as it has the same alignment.
> But for automatic placement we use backend's alignment to make placement
> as compact as possible but still keeping GPA aligned with HVA.
> 
>> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14
>> of how this call "get_align" comes into play), because the addresses of
>> the memory blocks are all aligned by e.g. 4MB. This is what is
>> guaranteed by the device specification.
> where does this 4Mb magic comes from and why block must be aligned
> on this size?

(knowing it is hard to get the idea without the current prototype in
place, but that will be fixed soon)

virtio-mem works in blocks. The block size is the size in which memory
can be plugged or unplugged by the guest. It also determines the
granularity (and therefore the bitmap) we have to use to keep track of
unplugged memory. It is configurable (e.g. for migration purposes), but
might also depend on the backend memory type. E.g. if huge pages are
used in the host, the huge page size defines the minimum block size. But
consider the last part a special case. Huge pages will not be supported
for now.

The block size also defines the alignment that has to be used by the
guest for plugging/unplugging (because that's how blocks get mapped to
the bitmap entries). So a virtio-mem device really needs a block-aligned
start address.

For now I use 4MB because it matches what guests (e.g. Linux) can
actually support and keeps the bitmap small. But as I said, it is
configurable. (-device virtio-mem,block_size=1MB, ...)

>  
>> E.g. for DIMMs we might want to allow to specify the section size (e.g.
>> 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but
>> we should not hardcode this, as this is a Linux specific requirement -
>> still it would be nice to specify)
> true, it's guest specific and we do not have restrictions here.
> The only restriction we have here is that size must be multiple of
> backing storage page size (i.e. alignment) so that guest would
> be able to use tail page.
> 
>> So in general, a memory device might have some alignment that is to be
>> taken care of.
> Do we really need introducing frontend specific alignment?
> I'd try reuse backend's one and go for frontend's only in case we have to.

I think so. But I consider this a special case for virtio-mem
(specified indirectly via block_size). For the remaining stuff, we might
be able to live with the memory backend alignment. And I agree that this
should be the default if possible.

> 
>> I don't understand right now why an upper limit on the alignment would
>> make sense at all. We can easily handle it during our search. And we
>> have to handle it either way during the search, if we plug some device
>> with strange sizes (e.g. 1MB DIMM).
>>
>> Of course, we might end up fragmenting guest physical memory, but that
>> is purely a setup issue (choosing sizes of devices + main memory
>> properly). I don't see a reason to error out (and of course also not to
>> assert out :) ).
> Agreed about assert, but I'd still prefer error out there so that users
> won't crate insane config and then complain (see below and *).
> 
> Upper alignment value is useful for proper sizing of hotplug address space,
> so that user could plug #slots devices upto #maxmem specified on CLI.
> It's still possible to misconfigure using manual placement, but default
> one just works, user either consumes all memory #maxmem and/or #slots.


Please note that it will still work in most cases. Only if people start
to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or
crazy block sizes for virtio-mem, we might have a fragmented guest
physical memory. This should usually not happen, but if so, we already
have error messages for reporting this "fragmented memory".

And I consider these "strange configs" similar to "strange manual
placement". Same result: fragmented memory, error out.

Personally, I don't think we have to guarantee that automatic placement
works in all scenarios that the user is able to come up with. It should
work in sane configurations, which is still the case.


> 
> There is no misunderstanding in case of error here as it works same as
> on baremetal, one doesn't have a free place to put more memory or all
> memory one asked for is already there.
> 
> So it might be that #slots decoupling from memory device a premature
> action and we can still use it with virtio-mem/pmem.
> 

I treat #slots on x86 as #acpi_slots; that's why I don't think they
apply here. I can see how they are used (on x86 only!) to size the
address space, but I consider this a "nice to have" feature that cannot
be expected to work in all possible scenarios.

E.g. powerpc already relies on sane user configs. #slots is not used to
size the guest address space. Similar things will apply on s390x.
Igor Mammedov June 7, 2018, 1:26 p.m. | #7
On Mon, 4 Jun 2018 12:53:42 +0200
David Hildenbrand <david@redhat.com> wrote:

> >> Let me explain a little bit why I don't like such restrictions (for
> >> which I don't see a need yet):  
> > (*) being conservative is good here because we can always relax restrictions
> > in future if it's needed without breaking users, but we can't take away
> > something thing that's been shipped (and if we do it, it typically
> > introduces a bunch of compat code to keep old machines working).
> > Also beside of giving as some freedom of movement in the future,
> > restrictions also to some degree prevent user from misconfiguration)  
> 
> Right, but I consider a maximum on an alignment arbitrary (and even
> difficult to implement - as you said, "max page size supported by target
> machine/cpu" - there are even different huge page sizes e.g. on x86).
> 
> So I'd really vote against introducing such checks.
> 
> And another point against introducing a check for the maimum alignment
> would be: There might be users with a specified alignment > max page
> size. Changing the behavior will break these setups (strange but true).
The check is already there implicitly as an assert, and you're removing it
with this patch.
What I'm suggesting is to drop this patch or replace the assert with
a graceful error.
If you have an actual use case that requires this check to be dropped,
then the commit message should describe it properly and the patch should
be part of a series that introduces that requirement.

> >> virtio-mem devices will later have a certain block size (e.g. 4MB). I
> >> want to give devices during resource assignment the possibility to
> >> specify their alignment requirements.  
> > size and alignment are 2 diffrent things here, alignment in our design
> > is dictated by backing storage page size and for performance reasons
> > HVA and GPA should be aligned on the same boundary, users are free
> > to pick another GPA manually as far as it has the same alignment.
> > But for automatic placement we use backend's alignment to make placement
> > as compact as possible but still keeping GPA aligned with HVA.
> >   
> >> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14
> >> of how this call "get_align" comes into play), because the addresses of
> >> the memory blocks are all aligned by e.g. 4MB. This is what is
> >> guaranteed by the device specification.  
> > where does this 4Mb magic comes from and why block must be aligned
> > on this size?  
> 
> (knowing it is hard to get the idea without the current prototype in
> place, but that will be fixed soon)
> 
> virtio-mem works in blocks. The block size is the size in which memory
> can be plugged or unplugged by the guest. It also determines the
> granularity (and therefore the bitmap) we have to use to keep track of
> unplugged memory. It is configurable (e.g. for migration purposes), but
> might also depend on the backend memory type. E.g. if huge pages are
> used in the host, the huge page size defines the minimum block size. But
> consider the last part a special case. Huge pages will not be supported
> for now.
> 
> The block size also defines the alignment that has to be used by the
> guest for plugging/unplugging (because that's how blocks gets mapped to
> the bitmap entries). So a virtio-mem device really needs a block-aligned
> start address,
> 
> For now I use 4MB because it matches what guests (e.g. Linux) can
> actually support and keeps the bitmap small. But as I said, it is
> configurable. (-device virtio-mem,block_size=1MB, ...)
> 
> >    
> >> E.g. for DIMMs we might want to allow to specify the section size (e.g.
> >> 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but
> >> we should not hardcode this, as this is a Linux specific requirement -
> >> still it would be nice to specify)  
> > true, it's guest specific and we do not have restrictions here.
> > The only restriction we have here is that size must be multiple of
> > backing storage page size (i.e. alignment) so that guest would
> > be able to use tail page.
> >   
> >> So in general, a memory device might have some alignment that is to be
> >> taken care of.  
> > Do we really need introducing frontend specific alignment?
> > I'd try reuse backend's one and go for frontend's only in case we have to.  
> 
> I think so. But I consider this a special case for virtio-mem.
> (specified indirectly via block_size). For the remaining stuff, we might
> be able to live with the memory backend aligment. And I agree that this
> should be the default if possible.
So make block_size a multiple of the backing storage alignment (i.e. page size),
and we should keep GPA mapping (picking an address to map the region at)
based on the backend's alignment. virtio-mem would then probably even work
with huge pages, at the cost of increased granularity.

  
> >> I don't understand right now why an upper limit on the alignment would
> >> make sense at all. We can easily handle it during our search. And we
> >> have to handle it either way during the search, if we plug some device
> >> with strange sizes (e.g. 1MB DIMM).
> >>
> >> Of course, we might end up fragmenting guest physical memory, but that
> >> is purely a setup issue (choosing sizes of devices + main memory
> >> properly). I don't see a reason to error out (and of course also not to
> >> assert out :) ).  
> > Agreed about assert, but I'd still prefer error out there so that users
> > won't crate insane config and then complain (see below and *).
> > 
> > Upper alignment value is useful for proper sizing of hotplug address space,
> > so that user could plug #slots devices upto #maxmem specified on CLI.
> > It's still possible to misconfigure using manual placement, but default
> > one just works, user either consumes all memory #maxmem and/or #slots.  
> 
> 
> Please not that it will still work in most cases. Only if people start
> to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or
> crazy block sizes for virtio-mem, we might have a fragmented guest
> physical memory. This should  usually not happen, but if so, we already
> have error messages for reporting this "fragmented memory".
> 
> And I consider these "strange configs" similar as "strange manual
> placement". Same result: fragmented memory, error out.
> 
> Personally, I don't think we have to guarantee that automatic placement
> works in all scenarios that the user is able to come up with. It should
> work in sane configurations, which is still that case.
It works now on x86 and I'd wish to keep it this way.
 
> > There is no misunderstanding in case of error here as it works same as
> > on baremetal, one doesn't have a free place to put more memory or all
> > memory one asked for is already there.
> > 
> > So it might be that #slots decoupling from memory device a premature
> > action and we can still use it with virtio-mem/pmem.
> >
> 
> I treat #slots on x86 as #acpi_slots, that's why I don't think they
> apply here. I can see how they are used (on x86 only!) to size the
> address space - but I consider this a "nice to have feature" that should
> not be supposed to work in all possible scenarios.
A slot is more of a user-space concept, as it is on real hardware where you
stick in a board with memory, so it's easily understood by users.
On the ACPI side it's just an implementation detail; we can go beyond that
if needed without any adverse effects.
The restriction mostly comes from the KVM side (the # of memory slots) and vhost.

> E.g. powerpc already relies on sane user configs. #slots is not used to
> size the guest address space. Similar things will apply on s390x.
They both could benefit from it the same way as x86: providing a
properly sized area, so users won't have to scratch their heads
trying to understand why they can't plug more memory.
David Hildenbrand June 7, 2018, 2:12 p.m. | #8
>> And another point against introducing a check for the maimum alignment
>> would be: There might be users with a specified alignment > max page
>> size. Changing the behavior will break these setups (strange but true).
> check is already there implicitly as assert, and you're removing it
> with this patch.
> What I'm suggesting is to drop this patch or replace assert with
> graceful error.
> If you have actual usecase that requires this check being dropped,
> then commit message should ddescribe it properly and the patch should
> be a part of series that introduces that requirement.

On x86 hotplug base is aligned to 1GB
On ppc hotplug base is aligned to 1GB

So what you're suggesting is to bail out in case we have an alignment
that the hotplug base does not satisfy (i.e. turning this assert into a
check like "alignment too big").

Then we would also have to install such a base alignment on s390x.
Otherwise, if somebody specifies -m 511, adding something with an
alignment of e.g. 4MB will fail.

> 
>>>> virtio-mem devices will later have a certain block size (e.g. 4MB). I
>>>> want to give devices during resource assignment the possibility to
>>>> specify their alignment requirements.  
>>> size and alignment are 2 diffrent things here, alignment in our design
>>> is dictated by backing storage page size and for performance reasons
>>> HVA and GPA should be aligned on the same boundary, users are free
>>> to pick another GPA manually as far as it has the same alignment.
>>> But for automatic placement we use backend's alignment to make placement
>>> as compact as possible but still keeping GPA aligned with HVA.
>>>   
>>>> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14
>>>> of how this call "get_align" comes into play), because the addresses of
>>>> the memory blocks are all aligned by e.g. 4MB. This is what is
>>>> guaranteed by the device specification.  
>>> where does this 4MB magic come from and why must the block be aligned
>>> on this size?
>>
>> (knowing it is hard to get the idea without the current prototype in
>> place, but that will be fixed soon)
>>
>> virtio-mem works in blocks. The block size is the size in which memory
>> can be plugged or unplugged by the guest. It also determines the
>> granularity (and therefore the bitmap) we have to use to keep track of
>> unplugged memory. It is configurable (e.g. for migration purposes), but
>> might also depend on the backend memory type. E.g. if huge pages are
>> used in the host, the huge page size defines the minimum block size. But
>> consider the last part a special case. Huge pages will not be supported
>> for now.
>>
>> The block size also defines the alignment that has to be used by the
>> guest for plugging/unplugging (because that's how blocks get mapped to
>> the bitmap entries). So a virtio-mem device really needs a block-aligned
>> start address.
>>
>> For now I use 4MB because it matches what guests (e.g. Linux) can
>> actually support and keeps the bitmap small. But as I said, it is
>> configurable. (-device virtio-mem,block_size=1MB, ...)
>>
>>>    
>>>> E.g. for DIMMs we might want to allow to specify the section size (e.g.
>>>> 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but
>>>> we should not hardcode this, as this is a Linux specific requirement -
>>>> still it would be nice to specify)  
>>> true, it's guest specific and we do not have restrictions here.
>>> The only restriction we have here is that the size must be a multiple
>>> of the backing storage page size (i.e. alignment) so that the guest
>>> would be able to use the tail page.
>>>   
>>>> So in general, a memory device might have some alignment that is to be
>>>> taken care of.  
>>> Do we really need to introduce a frontend-specific alignment?
>>> I'd try to reuse the backend's one and go for the frontend's only in case we have to.  
>>
>> I think so. But I consider this a special case for virtio-mem
>> (specified indirectly via block_size). For the remaining stuff, we might
>> be able to live with the memory backend alignment. And I agree that this
>> should be the default if possible.
> So if you make block_size a multiple of the backing storage alignment (i.e. page size),
> then we should keep GPA mapping (picking an address to map the region) based on the
> backend's alignment. virtio-mem would probably even work with huge pages, at the cost
> of increased granularity.
>

block_size will always be a multiple of backing storage alignment (well,
for now huge pages will not be supported, so the 4MB > page_size always
holds).

However, using an alignment of the backing storage (e.g. page size) is
not enough.

Using a block_size of 1GB for virtio-mem is very unrealistic, so I think
we can ignore this for now. (meaning, getting errors if we implement
said alignment check)

>   
>>>> I don't understand right now why an upper limit on the alignment would
>>>> make sense at all. We can easily handle it during our search. And we
>>>> have to handle it either way during the search, if we plug some device
>>>> with strange sizes (e.g. 1MB DIMM).
>>>>
>>>> Of course, we might end up fragmenting guest physical memory, but that
>>>> is purely a setup issue (choosing sizes of devices + main memory
>>>> properly). I don't see a reason to error out (and of course also not to
>>>> assert out :) ).  
>>> Agreed about assert, but I'd still prefer to error out there so that users
>>> won't create insane configs and then complain (see below and *).
>>>
>>> Upper alignment value is useful for proper sizing of hotplug address space,
>>> so that the user could plug #slots devices up to #maxmem specified on the CLI.
>>> It's still possible to misconfigure using manual placement, but default
>>> one just works, user either consumes all memory #maxmem and/or #slots.  
>>
>>
>> Please note that it will still work in most cases. Only if people start
>> to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or
>> crazy block sizes for virtio-mem, we might end up with fragmented guest
>> physical memory. This should usually not happen, but if so, we already
>> have error messages for reporting this "fragmented memory".
>>
>> And I consider these "strange configs" similar as "strange manual
>> placement". Same result: fragmented memory, error out.
>>
>> Personally, I don't think we have to guarantee that automatic placement
>> works in all scenarios that the user is able to come up with. It should
>> work in sane configurations, which is still that case.
> It works now on x86 and I'd wish to keep it this way.

And it keeps working when only using DIMMs/PCDIMMs (slots). Which is
totally fine from my point of view.

>  
>>> There is no misunderstanding in case of error here as it works same as
>>> on baremetal, one doesn't have a free place to put more memory or all
>>> memory one asked for is already there.
>>>
>>> So it might be that #slots decoupling from memory device a premature
>>> action and we can still use it with virtio-mem/pmem.
>>>
>>
>> I treat #slots on x86 as #acpi_slots, that's why I don't think they
>> apply here. I can see how they are used (on x86 only!) to size the
>> address space - but I consider this a "nice to have" feature that should
>> not be expected to work in all possible scenarios.
> slot is more of a user-facing concept, as on real hw where you stick in a
> board with memory, so it's easily understood by users.
> On the ACPI side it's just an implementation detail, we can go beyond that
> if needed without any adverse effects.
> The restriction mostly comes from the KVM side on the # of memory slots and vhost.

KVM memory slots are a completely different concept than the slots
parameter. The problem is: DIMMs use it as if it were "acpi_slots".

> 
>> E.g. powerpc already relies on sane user configs. #slots is not used to
>> size the guest address space. Similar things will apply on s390x.
> They both could benefit from it the same way as x86 providing
> properly sized area, so users won't have to scratch their head
> trying to understand why they can't plug more memory.

Fragmented memory is more than enough as an error message IMHO.

Patch

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 3e04f3954e..361d38bfc5 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -116,7 +116,6 @@  uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
     address_space_start = ms->device_memory->base;
     address_space_end = address_space_start +
                         memory_region_size(&ms->device_memory->mr);
-    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
     g_assert(address_space_end >= address_space_start);
 
     memory_device_check_addable(ms, size, errp);
@@ -149,7 +148,7 @@  uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
             return 0;
         }
     } else {
-        new_addr = address_space_start;
+        new_addr = QEMU_ALIGN_UP(address_space_start, align);
     }
 
     /* find address range that will fit new memory device */