Message ID: 20180517081527.14410-2-david@redhat.com
State: New
Series: MemoryDevice: use multi stage hotplug handlers
On Thu, 17 May 2018 10:15:14 +0200 David Hildenbrand <david@redhat.com> wrote:

> The start of the address space does not have to be aligned for the search. Handle this case explicitly when starting the search for a new address.

That's true, but the commit message doesn't explain why address_space_start should be allowed to be non-aligned.

At least with this assert we would notice early that a board allocates a misaligned address space. I'd keep the assert unless there is a good reason to drop it.

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/mem/memory-device.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
> index 3e04f3954e..361d38bfc5 100644
> --- a/hw/mem/memory-device.c
> +++ b/hw/mem/memory-device.c
> @@ -116,7 +116,6 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>      address_space_start = ms->device_memory->base;
>      address_space_end = address_space_start +
>                          memory_region_size(&ms->device_memory->mr);
> -    g_assert(QEMU_ALIGN_UP(address_space_start, align) == address_space_start);
>      g_assert(address_space_end >= address_space_start);
>
>      memory_device_check_addable(ms, size, errp);
> @@ -149,7 +148,7 @@ uint64_t memory_device_get_free_addr(MachineState *ms, const uint64_t *hint,
>              return 0;
>          }
>      } else {
> -        new_addr = address_space_start;
> +        new_addr = QEMU_ALIGN_UP(address_space_start, align);
>      }
>
>      /* find address range that will fit new memory device */
On 29.05.2018 15:27, Igor Mammedov wrote:
> On Thu, 17 May 2018 10:15:14 +0200 David Hildenbrand <david@redhat.com> wrote:
>
>> The start of the address space does not have to be aligned for the search. Handle this case explicitly when starting the search for a new address.
> That's true, but commit message doesn't explain why address_space_start should be allowed to be non aligned.
>
> At least with this assert we would notice early that board allocating misaligned address space. I'd keep the assert unless there is a good reason to drop it.

That reason might be that I can easily crash QEMU:

./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M -device pc-dimm,id=dimm1,memdev=mem0

ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)

>> [...]
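For reference, QEMU_ALIGN_UP rounds a value up to the next multiple of the alignment. A standalone model shows why an 8192M backend align can trip the removed assert: the concrete base value below is illustrative (a 1 GiB-aligned but not 8 GiB-aligned address), not the exact address QEMU computes for the reproducer.

```c
#include <stdint.h>

/* Standalone model of QEMU's QEMU_ALIGN_UP() macro. */
#define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

/* Illustrative hotplug region base: 1 GiB-aligned, as the x86 board
 * guarantees, but NOT aligned to the 8 GiB backend alignment. */
static const uint64_t region_base = 5ULL << 30;   /* 5 GiB (hypothetical) */
static const uint64_t dev_align   = 8ULL << 30;   /* align=8192M           */

/* The removed assert demanded ALIGN_UP(region_base, dev_align) ==
 * region_base, which fails for these values.  The patch instead only
 * aligns the point where the search for a free address begins: */
static uint64_t search_start(void)
{
    return ALIGN_UP(region_base, dev_align);      /* first aligned candidate */
}
```

With these values the search simply starts at 8 GiB instead of aborting.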
On Tue, 29 May 2018 18:02:06 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 29.05.2018 15:27, Igor Mammedov wrote:
>> At least with this assert we would notice early that board allocating misaligned address space. I'd keep the assert unless there is a good reason to drop it.
>
> That reason might be that I can easily crash QEMU
>
> ./x86_64-softmmu/qemu-system-x86_64 -m 256M,maxmem=20G,slots=2 -object memory-backend-file,id=mem0,size=8192M,mem-path=/dev/zero,align=8192M -device pc-dimm,id=dimm1,memdev=mem0
>
> ERROR:hw/mem/memory-device.c:146:memory_device_get_free_addr: assertion failed: (QEMU_ALIGN_UP(address_space_start, align) == address_space_start)

It looks like a different issue. As I remember, the user-visible 'align' property was added as duct tape since we can't figure out the alignment for a DAX device on the host, so it was left up to the upper layers to magically figure that out.

However, we probably shouldn't allow arbitrary alignments, or alignments bigger than the max page size supported by the target machine/cpu (i.e. the currently hardcoded address_space_start alignment), as that creates unnecessary fragmentation and is not counted in the size of the hotplug region (for x86 we count in an additional 1Gb per memory device).

How about turning that assert into an error check that inhibits plugging in devices with alignment values larger than the address_space_start alignment?
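Igor's suggestion might look roughly like the following standalone sketch. This is a model, not QEMU code: the function name is hypothetical, and error reporting is stubbed with stderr and a return code where the real function would use error_setg() on its Error **errp parameter.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

/* Model of turning the g_assert() into a graceful error: refuse device
 * alignments larger than the alignment of the device memory region's
 * base itself.  Returns 0 on success, -1 on error. */
static int check_device_align(uint64_t as_start, uint64_t dev_align)
{
    if (ALIGN_UP(as_start, dev_align) != as_start) {
        fprintf(stderr,
                "device alignment 0x%" PRIx64 " exceeds the alignment of "
                "the device memory region at 0x%" PRIx64 "\n",
                dev_align, as_start);
        return -1;   /* the real code would set *errp and bail out */
    }
    return 0;
}
```

A 4 GiB-based region would then accept a device with 1 GiB alignment but reject one with 8 GiB alignment, instead of asserting.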
> [...]
On 30.05.2018 14:57, Igor Mammedov wrote:
> On Tue, 29 May 2018 18:02:06 +0200 David Hildenbrand <david@redhat.com> wrote:
>
>> [...]
>>
>> That reason might be that I can easily crash QEMU [...]
> it looks like a different issue. As I remember user visible 'align' property was added as duct tape since we can't figure out alignment for DAX device no the host, so it was left upto upper layers to magically figure that out.
>
> However we probably shouldn't allow arbitrary or bigger aligment than max page size supported by target machine/cpu (i.e. currently hardcoded address_space_start alignment), as it creates unnecessary fragmentation and not counted in size of hotplug region (for x86 we count in additional 1Gb per memory device).
>
> How about turning that assert into error check that inhibits plugging in devices with alignment values larger than address_space_start alignment?

Let me explain a little bit why I don't like such restrictions (for which I don't see a need yet):

virtio-mem devices will later have a certain block size (e.g. 4MB). I want to give devices during resource assignment the possibility to specify their alignment requirements. For said virtio-mem device, this would e.g. be 4MB (see patches 10 and 14 for how this call, "get_align", comes into play), because the addresses of the memory blocks are all aligned by e.g. 4MB. This is what is guaranteed by the device specification.

E.g. for DIMMs we might want to allow specifying the section size (e.g. 128MB on x86), otherwise e.g. Linux is not able to add all memory. (But we should not hardcode this, as this is a Linux-specific requirement - still, it would be nice to be able to specify it.)

So in general, a memory device might have some alignment that is to be taken care of. I don't understand right now why an upper limit on the alignment would make sense at all. We can easily handle it during our search. And we have to handle it either way during the search, if we plug some device with strange sizes (e.g. a 1MB DIMM).

Of course, we might end up fragmenting guest physical memory, but that is purely a setup issue (choosing the sizes of devices + main memory properly). I don't see a reason to error out (and of course also not to assert out :) ).
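The search David describes, handling an arbitrary per-device alignment, can be sketched as a simplified standalone model of memory_device_get_free_addr(): a sorted list of used ranges and first-fit placement (the real QEMU code walks the memory device list instead; names here are illustrative).

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

struct range { uint64_t start, size; };

/* First-fit search for a free, 'align'-aligned range of 'size' bytes
 * inside [as_start, as_end).  'used' must be sorted by start address
 * and non-overlapping.  Returns 0 on failure (fragmented or full). */
static uint64_t find_free_addr(uint64_t as_start, uint64_t as_end,
                               uint64_t size, uint64_t align,
                               const struct range *used, int n_used)
{
    /* The fix under discussion: align the search start explicitly
     * instead of asserting that as_start is already aligned. */
    uint64_t addr = ALIGN_UP(as_start, align);

    for (int i = 0; i < n_used; i++) {
        if (addr + size <= used[i].start) {
            break;                 /* fits in the gap before used[i] */
        }
        if (addr < used[i].start + used[i].size) {
            /* overlap: continue searching past this device */
            addr = ALIGN_UP(used[i].start + used[i].size, align);
        }
    }
    return (addr + size <= as_end) ? addr : 0;
}
```

E.g. with a region starting at 5 MiB and a 4 MiB alignment, the search starts at 8 MiB; if a device already occupies [8 MiB, 12 MiB), the next aligned candidate is 12 MiB.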
On Wed, 30 May 2018 16:06:59 +0200 David Hildenbrand <david@redhat.com> wrote:

> [...]
> Let me explain a little bit why I don't like such restrictions (for which I don't see a need yet):
(*) Being conservative is good here because we can always relax restrictions in the future if needed, without breaking users, but we can't take away something that has been shipped (and if we do, it typically introduces a bunch of compat code to keep old machines working). Also, besides giving us some freedom of movement in the future, restrictions to some degree prevent the user from misconfiguration.

> virtio-mem devices will later have a certain block size (e.g. 4MB). I want to give devices during resource assignment the possibility to specify their alignment requirements.
Size and alignment are 2 different things here. Alignment in our design is dictated by the backing storage page size, and for performance reasons HVA and GPA should be aligned on the same boundary; users are free to pick another GPA manually as long as it has the same alignment. But for automatic placement we use the backend's alignment to make placement as compact as possible while still keeping GPA aligned with HVA.

> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14 of how this call "get_align" comes into play), because the addresses of the memory blocks are all aligned by e.g. 4MB. This is what is guaranteed by the device specification.
Where does this 4MB magic come from, and why must blocks be aligned to this size?

> E.g. for DIMMs we might want to allow to specify the section size (e.g. 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but we should not hardcode this, as this is a Linux specific requirement - still it would be nice to specify)
True, it's guest specific and we do not have restrictions here. The only restriction we have is that the size must be a multiple of the backing storage page size (i.e. the alignment) so that the guest is able to use the tail page.
> So in general, a memory device might have some alignment that is to be taken care of.
Do we really need to introduce frontend-specific alignment? I'd try to reuse the backend's, and go for the frontend's only in case we have to.

> I don't understand right now why an upper limit on the alignment would make sense at all. We can easily handle it during our search. And we have to handle it either way during the search, if we plug some device with strange sizes (e.g. 1MB DIMM).
>
> Of course, we might end up fragmenting guest physical memory, but that is purely a setup issue (choosing sizes of devices + main memory properly). I don't see a reason to error out (and of course also not to assert out :) ).
Agreed about the assert, but I'd still prefer to error out there so that users won't create insane configs and then complain (see below and *).

An upper alignment value is useful for proper sizing of the hotplug address space, so that the user can plug #slots devices up to the #maxmem specified on the CLI. It's still possible to misconfigure using manual placement, but the default placement just works: the user either consumes all memory (#maxmem) and/or all #slots. There is no misunderstanding in case of an error here, as it works the same as on bare metal: one doesn't have a free place to put more memory, or all the memory one asked for is already there.

So it might be that decoupling #slots from memory devices was a premature action and we can still use it with virtio-mem/pmem.
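The sizing Igor refers to can be illustrated with a standalone model. The 1 GiB-per-slot padding mirrors what he describes for x86 earlier in the thread ("we count in additional 1Gb per memory device"); treat the helper name and the exact formula as illustrative, not as the verbatim board code.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Model of sizing the hotplug (device memory) region: room for the
 * pluggable memory itself (maxmem - initial RAM) plus, as on x86,
 * 1 GiB of worst-case alignment padding per slot, so that #slots
 * devices totalling #maxmem are guaranteed to fit with automatic
 * placement despite per-device alignment gaps. */
static uint64_t device_mem_size(uint64_t maxmem, uint64_t ram_size,
                                unsigned slots)
{
    return (maxmem - ram_size) + (uint64_t)slots * GiB;
}
```

For the reproducer's -m 256M,maxmem=20G,slots=2 this yields a 21.75 GiB region: large alignments eat into that padding, which is why uncapped alignment can break the "plug up to #maxmem" guarantee.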
>> Let me explain a little bit why I don't like such restrictions (for which I don't see a need yet):
> (*) being conservative is good here because we can always relax restrictions in future if it's needed without breaking users, but we can't take away something thing that's been shipped (and if we do it, it typically introduces a bunch of compat code to keep old machines working). Also beside of giving as some freedom of movement in the future, restrictions also to some degree prevent user from misconfiguration)

Right, but I consider a maximum on an alignment arbitrary (and even difficult to implement - as you said, "max page size supported by target machine/cpu" - there are even different huge page sizes, e.g. on x86). So I'd really vote against introducing such checks.

And another point against introducing a check for the maximum alignment: there might be users with a specified alignment > max page size. Changing the behavior would break these setups (strange but true).

>> virtio-mem devices will later have a certain block size (e.g. 4MB). I want to give devices during resource assignment the possibility to specify their alignment requirements.
> size and alignment are 2 diffrent things here, alignment in our design is dictated by backing storage page size and for performance reasons HVA and GPA should be aligned on the same boundary, users are free to pick another GPA manually as far as it has the same alignment. But for automatic placement we use backend's alignment to make placement as compact as possible but still keeping GPA aligned with HVA.
>
>> For said virtio-mem device, this would e.g. be 4MB. (see patch 10 and 14 of how this call "get_align" comes into play), because the addresses of the memory blocks are all aligned by e.g. 4MB. This is what is guaranteed by the device specification.
> where does this 4Mb magic comes from and why block must be aligned on this size?
(Knowing it is hard to get the idea without the current prototype in place, but that will be fixed soon.)

virtio-mem works in blocks. The block size is the size in which memory can be plugged or unplugged by the guest. It also determines the granularity (and therefore the bitmap) we have to use to keep track of unplugged memory. It is configurable (e.g. for migration purposes), but might also depend on the backend memory type. E.g. if huge pages are used in the host, the huge page size defines the minimum block size. But consider the last part a special case; huge pages will not be supported for now.

The block size also defines the alignment that has to be used by the guest for plugging/unplugging (because that's how blocks get mapped to the bitmap entries). So a virtio-mem device really needs a block-aligned start address.

For now I use 4MB because it matches what guests (e.g. Linux) can actually support and keeps the bitmap small. But as I said, it is configurable (-device virtio-mem,block_size=1MB, ...).

>> E.g. for DIMMs we might want to allow to specify the section size (e.g. 128MB on x86), otherwise e.g. Linux is not able to add all memory. (but we should not hardcode this, as this is a Linux specific requirement - still it would be nice to specify)
> true, it's guest specific and we do not have restrictions here. The only restriction we have here is that size must be multiple of backing storage page size (i.e. alignment) so that guest would be able to use tail page.
>
>> So in general, a memory device might have some alignment that is to be taken care of.
> Do we really need introducing frontend specific alignment? I'd try reuse backend's one and go for frontend's only in case we have to.

I think so. But I consider this a special case for virtio-mem (specified indirectly via block_size). For the remaining stuff, we might be able to live with the memory backend alignment. And I agree that this should be the default if possible.
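David's point about why virtio-mem needs a block-aligned start address can be made concrete: plug/unplug state is tracked per block in a bitmap, so the mapping from a guest-physical address to a bit index is only well-defined if the device's base is itself block-aligned. This is a hypothetical sketch, not the actual virtio-mem code.

```c
#include <stdint.h>

#define BLOCK_SIZE (4ULL << 20)   /* 4 MiB: the (configurable) block size */

/* Map a guest-physical address inside the device to its bitmap bit.
 * Only well-defined if dev_base is BLOCK_SIZE-aligned; otherwise a
 * block-aligned plug/unplug request from the guest would straddle
 * two bitmap entries. */
static uint64_t addr_to_bit(uint64_t dev_base, uint64_t addr)
{
    return (addr - dev_base) / BLOCK_SIZE;
}
```

E.g. with a device based at 4 GiB, the block at 4 GiB + 12 MiB maps to bit 3; shift the base by anything that is not a multiple of 4 MiB and the block boundaries no longer line up with the bits.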
>> I don't understand right now why an upper limit on the alignment would make sense at all. We can easily handle it during our search. And we have to handle it either way during the search, if we plug some device with strange sizes (e.g. 1MB DIMM).
>>
>> Of course, we might end up fragmenting guest physical memory, but that is purely a setup issue (choosing sizes of devices + main memory properly). I don't see a reason to error out (and of course also not to assert out :) ).
> Agreed about assert, but I'd still prefer error out there so that users won't crate insane config and then complain (see below and *).
>
> Upper alignment value is useful for proper sizing of hotplug address space, so that user could plug #slots devices upto #maxmem specified on CLI. It's still possible to misconfigure using manual placement, but default one just works, user either consumes all memory #maxmem and/or #slots.

Please note that it will still work in most cases. Only if people start to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or crazy block sizes for virtio-mem might we end up with fragmented guest physical memory. This should usually not happen, but if so, we already have error messages for reporting this "fragmented memory".

And I consider these "strange configs" similar to "strange manual placement". Same result: fragmented memory, error out.

Personally, I don't think we have to guarantee that automatic placement works in all scenarios that the user is able to come up with. It should work in sane configurations, which is still the case.

> There is no misunderstanding in case of error here as it works same as on baremetal, one doesn't have a free place to put more memory or all memory one asked for is already there.
>
> So it might be that #slots decoupling from memory device a premature action and we can still use it with virtio-mem/pmem.

I treat #slots on x86 as #acpi_slots, that's why I don't think they apply here.
I can see how they are used (on x86 only!) to size the address space - but I consider this a "nice to have" feature that is not supposed to work in all possible scenarios. E.g. powerpc already relies on sane user configs: #slots is not used there to size the guest address space. Similar things will apply on s390x.
On Mon, 4 Jun 2018 12:53:42 +0200 David Hildenbrand <david@redhat.com> wrote:

> >> Let me explain a little bit why I don't like such restrictions (for which I don't see a need yet):
> > (*) being conservative is good here because we can always relax restrictions in future if it's needed without breaking users, but we can't take away something thing that's been shipped (and if we do it, it typically introduces a bunch of compat code to keep old machines working). Also beside of giving as some freedom of movement in the future, restrictions also to some degree prevent user from misconfiguration)
>
> Right, but I consider a maximum on an alignment arbitrary (and even difficult to implement - as you said, "max page size supported by target machine/cpu" - there are even different huge page sizes e.g. on x86).
>
> So I'd really vote against introducing such checks.
>
> And another point against introducing a check for the maimum alignment would be: There might be users with a specified alignment > max page size. Changing the behavior will break these setups (strange but true).

The check is already there, implicitly, as an assert, and you're removing it with this patch. What I'm suggesting is to drop this patch, or to replace the assert with a graceful error. If you have an actual use case that requires this check to be dropped, then the commit message should describe it properly, and the patch should be part of the series that introduces that requirement.

> >> virtio-mem devices will later have a certain block size (e.g. 4MB). I want to give devices during resource assignment the possibility to specify their alignment requirements.
> > size and alignment are 2 diffrent things here, alignment in our design is dictated by backing storage page size and for performance reasons HVA and GPA should be aligned on the same boundary, users are free to pick another GPA manually as far as it has the same alignment.
> [...]
> > The only restriction we have here is that size must be multiple of backing storage page size (i.e. alignment) so that guest would be able to use tail page.
>
> >> So in general, a memory device might have some alignment that is to be taken care of.
> > Do we really need introducing frontend specific alignment? I'd try reuse backend's one and go for frontend's only in case we have to.
>
> I think so. But I consider this a special case for virtio-mem. (specified indirectly via block_size). For the remaining stuff, we might be able to live with the memory backend aligment. And I agree that this should be the default if possible.

So if you make block_size a multiple of the backing storage alignment (i.e. page size), then we should keep the GPA mapping (picking the address to map the region at) based on the backend's alignment. virtio-mem would probably even work with huge pages, at the cost of increased granularity.

> >> I don't understand right now why an upper limit on the alignment would make sense at all. We can easily handle it during our search. And we have to handle it either way during the search, if we plug some device with strange sizes (e.g. 1MB DIMM).
> >>
> >> Of course, we might end up fragmenting guest physical memory, but that is purely a setup issue (choosing sizes of devices + main memory properly). I don't see a reason to error out (and of course also not to assert out :) ).
> > Agreed about assert, but I'd still prefer error out there so that users won't crate insane config and then complain (see below and *).
> >
> > Upper alignment value is useful for proper sizing of hotplug address space, so that user could plug #slots devices upto #maxmem specified on CLI. It's still possible to misconfigure using manual placement, but default one just works, user either consumes all memory #maxmem and/or #slots.
>
> Please not that it will still work in most cases. Only if people start to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or crazy block sizes for virtio-mem, we might have a fragmented guest physical memory. This should usually not happen, but if so, we already have error messages for reporting this "fragmented memory".
>
> And I consider these "strange configs" similar as "strange manual placement". Same result: fragmented memory, error out.
>
> Personally, I don't think we have to guarantee that automatic placement works in all scenarios that the user is able to come up with. It should work in sane configurations, which is still that case.

It works now on x86, and I'd wish to keep it this way.

> > There is no misunderstanding in case of error here as it works same as on baremetal, one doesn't have a free place to put more memory or all memory one asked for is already there.
> >
> > So it might be that #slots decoupling from memory device a premature action and we can still use it with virtio-mem/pmem.
>
> I treat #slots on x86 as #acpi_slots, that's why I don't think they apply here. I can see how they are used (on x86 only!) to size the address space - but I consider this a "nice to have feature" that should not be supposed to work in all possible scenarios.

A slot is more of a user-space concept, as on real hardware, where you stick in a plank with memory, so it's easily understood by users. On the ACPI side it's just an implementation detail; we can go beyond that if needed, without any adverse effects. The restriction mostly comes from the KVM side (the number of memory slots) and from vhost.

> E.g. powerpc already relies on sane user configs. #slots is not used to size the guest address space. Similar things will apply on s390x.

They both could benefit from it the same way as x86 does, by providing a properly sized area, so users won't have to scratch their heads trying to understand why they can't plug more memory.
>> And another point against introducing a check for the maimum alignment would be: There might be users with a specified alignment > max page size. Changing the behavior will break these setups (strange but true).
> check is already there implicitly as assert, and you're removing it with this patch.
> What I'm suggesting is to drop this patch or replace assert with graceful error.
> If you have actual usecase that requires this check being dropped, then commit message should ddescribe it properly and the patch should be a part of series that introduces that requirement.

On x86, the hotplug base is aligned to 1GB.
On ppc, the hotplug base is aligned to 1GB.

So what you're suggesting is to bail out in case we have an alignment that exceeds this base alignment (i.e. turning this assert into a check like "alignment too big").

Then we would have to also install such a base alignment on s390x. Otherwise, if somebody specifies -m 511, adding something with an alignment of e.g. 4MB will fail.
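David's -m 511 example can be checked with the same alignment arithmetic. A standalone sketch, assuming a board without an enforced base alignment places the hotplug base directly at the end of a 511 MiB initial RAM (the helper names are hypothetical):

```c
#include <stdint.h>

#define MiB (1ULL << 20)
#define ALIGN_UP(n, m) (((n) + (m) - 1) / (m) * (m))

/* The proposed "alignment too big" check: only accept devices whose
 * alignment already divides the hotplug base.  Returns 1 if accepted. */
static int base_alignment_ok(uint64_t base, uint64_t dev_align)
{
    return ALIGN_UP(base, dev_align) == base;
}

/* The patch's alternative: just start the search at the next aligned
 * address, wasting at most (dev_align - 1) bytes. */
static uint64_t aligned_search_start(uint64_t base, uint64_t dev_align)
{
    return ALIGN_UP(base, dev_align);
}
```

With base = 511 MiB and a 4 MiB device alignment, the check rejects the device outright, while aligning the search start merely skips 1 MiB up to 512 MiB.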
>> [...]
>
>> I think so. But I consider this a special case for virtio-mem.
>> (specified indirectly via block_size).
>> For the remaining stuff, we might be able to live with the memory
>> backend alignment. And I agree that this should be the default if
>> possible.
> So if you make block_size a multiple of the backing storage alignment
> (i.e. page size), then we should keep the GPA mapping (picking an
> address to map the region) based on the backend's alignment. virtio-mem
> would probably even work with huge pages at the cost of increased
> granularity.

block_size will always be a multiple of the backing storage alignment
(well, for now huge pages will not be supported, so the
4MB > page_size assumption always holds). However, using an alignment of
the backing storage (e.g. page size) is not enough.

Using a block_size of 1GB for virtio-mem is very unrealistic, so I think
we can ignore this for now. (meaning, getting errors if we implement
said alignment check)

>
>>>> I don't understand right now why an upper limit on the alignment would
>>>> make sense at all. We can easily handle it during our search. And we
>>>> have to handle it either way during the search, if we plug some device
>>>> with strange sizes (e.g. 1MB DIMM).
>>>>
>>>> Of course, we might end up fragmenting guest physical memory, but that
>>>> is purely a setup issue (choosing sizes of devices + main memory
>>>> properly). I don't see a reason to error out (and of course also not to
>>>> assert out :) ).
>>> Agreed about the assert, but I'd still prefer to error out there so
>>> that users won't create insane configs and then complain (see below
>>> and *).
>>>
>>> An upper alignment value is useful for proper sizing of the hotplug
>>> address space, so that the user could plug #slots devices up to #maxmem
>>> specified on the CLI.
>>> It's still possible to misconfigure using manual placement, but the
>>> default one just works; the user either consumes all memory #maxmem
>>> and/or all #slots.
>>
>>
>> Please note that it will still work in most cases. Only if people start
>> to define crazy alignments (like I did), crazy DIMM sizes (e.g. 1MB) or
>> crazy block sizes for virtio-mem, we might have a fragmented guest
>> physical memory. This should usually not happen, but if so, we already
>> have error messages for reporting this "fragmented memory".
>>
>> And I consider these "strange configs" similar to "strange manual
>> placement". Same result: fragmented memory, error out.
>>
>> Personally, I don't think we have to guarantee that automatic placement
>> works in all scenarios that the user is able to come up with. It should
>> work in sane configurations, which is still the case.
> It works now on x86 and I'd wish to keep it this way.

And it keeps working when only using DIMMs/PCDIMMs (slots). Which is
totally fine from my point of view.

>
>>> There is no misunderstanding in case of error here as it works the
>>> same as on bare metal; one doesn't have a free place to put more
>>> memory, or all the memory one asked for is already there.
>>>
>>> So it might be that decoupling #slots from memory devices was a
>>> premature action and we can still use it with virtio-mem/pmem.
>>>
>>
>> I treat #slots on x86 as #acpi_slots, that's why I don't think they
>> apply here. I can see how they are used (on x86 only!) to size the
>> address space - but I consider this a "nice to have" feature that should
>> not be supposed to work in all possible scenarios.
> A slot is more of a user space concept, as on real hw where you stick in
> a plank of memory, so it's easily understood by users.
> On the ACPI side it's just an implementation detail; we can go beyond
> that if needed without any adverse effects.
> The restriction mostly comes from the KVM side (# of memory slots) and
> from vhost.

KVM memory slots are a completely different concept than the slots
parameter. The problem is: DIMMs use this as if it would be
"#acpi_slots".

>
>> E.g. powerpc already relies on sane user configs. #slots is not used to
>> size the guest address space. Similar things will apply on s390x.
> They both could benefit from it the same way as x86, providing a
> properly sized area, so users won't have to scratch their heads
> trying to understand why they can't plug more memory.

Fragmented memory is more than enough as an error message IMHO.