Message-ID: 1445852815-85168-1-git-send-email-imammedo@redhat.com
On Mon, Oct 26, 2015 at 10:46:55AM +0100, Igor Mammedov wrote:
> commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
> regressed memory hot-unplug for linux guests triggering
> following BUGON
> =====
> kernel BUG at mm/memory_hotplug.c:703!
> ...
> [<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
> [<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
> [<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
> ===
> BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
> ===
>
> reson for it is that x86-64 linux guest supports memory
> hotplug in chunks of 128Mb and memory section also should
> be 128Mb aligned.
> However gaps forced between 128Mb DIMMs with backend's
> natural alignment of 2Mb make the 2nd and following
> DIMMs not being aligned on 128Mb boundary as it was
> originally. To fix regression enforce minimal 128Mb
> alignment like it was done for PPC.
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>

So our temporary work around is creating more trouble. I'm inclined to
just revert aa8580cd and df0acded19 with it.

> ---
> PS:
> PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS which
> is arch dependent so this is fix for x86-64 target only.
> If anyone cares obout 32bit guests, it should also be fine
> for x86-32 which has 64Mb memory sections/alignment.

Like 32 bit guests are unheard of? This does not inspire confidence at
all. So I dug in linux guest code:

#ifdef CONFIG_X86_32
# ifdef CONFIG_X86_PAE
#  define SECTION_SIZE_BITS     29
#  define MAX_PHYSADDR_BITS     36
#  define MAX_PHYSMEM_BITS      36
# else
#  define SECTION_SIZE_BITS     26
#  define MAX_PHYSADDR_BITS     32
#  define MAX_PHYSMEM_BITS      32
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
# define MAX_PHYSADDR_BITS      44
# define MAX_PHYSMEM_BITS       46
#endif

Looks like PAE needs more alignment. And it looks like 128 is arbitrary
here. So we are tying ourselves to specific guest quirks. All this just
looks wrong to me.
> ---
>  hw/i386/pc.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 3d958ba..0f7cf7c 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1610,6 +1610,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
>      }
>  }
>
> +#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
> +

This kind of comment doesn't really help.

>  static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>                           DeviceState *dev, Error **errp)
>  {
> @@ -1624,6 +1626,16 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>
>      if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
>          align = memory_region_get_alignment(mr);
> +        /*
> +         * Linux x64 guests expect 128Mb aligned DIMM,

this implies no other guest cares. which isn't true.

> +         * but this change

which change?

> causes memory layout change

change compared to what?

> so
> +         * for compatibility

compatibility with what?

> apply 128Mb alignment only
> +         * when forced gaps are enabled since it is the cause
> +         * of misalignment.

Which makes no sense, sorry.
Can it be misaligned for some other reason?
If not, why limit to this case?

> +         */
> +        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
> +            align = PC_MIN_DIMM_ALIGNMENT;
> +        }
>      }
>
>      if (!pcms->acpi_dev) {

All this sounds pretty fragile. How about we revert the inter dimm gap
thing for 2.4? It's just a work around, this is piling work arounds on
top of work arounds.

> --
> 1.8.3.1
On Mon, 26 Oct 2015 12:28:21 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 26, 2015 at 10:46:55AM +0100, Igor Mammedov wrote:
> > commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
> > regressed memory hot-unplug for linux guests triggering
> > following BUGON
> > =====
> > kernel BUG at mm/memory_hotplug.c:703!
> > ...
> > [<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
> > [<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
> > [<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
> > ===
> > BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
> > ===
> >
> > reson for it is that x86-64 linux guest supports memory
> > hotplug in chunks of 128Mb and memory section also should
> > be 128Mb aligned.
> > However gaps forced between 128Mb DIMMs with backend's
> > natural alignment of 2Mb make the 2nd and following
> > DIMMs not being aligned on 128Mb boundary as it was
> > originally. To fix regression enforce minimal 128Mb
> > alignment like it was done for PPC.
> >
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
>
> So our temporary work around is creating more trouble. I'm inclined to just
> revert aa8580cd and df0acded19 with it.
>
> > ---
> > PS:
> > PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS which
> > is arch dependent so this is fix for x86-64 target only.
> > If anyone cares obout 32bit guests, it should also be fine
> > for x86-32 which has 64Mb memory sections/alignment.
>
> Like 32 bit guests are unheard of? This does not inspire confidence at all.
>
> So I dug in linux guest code:
>
> #ifdef CONFIG_X86_32
> # ifdef CONFIG_X86_PAE
> #  define SECTION_SIZE_BITS     29
> #  define MAX_PHYSADDR_BITS     36
> #  define MAX_PHYSMEM_BITS      36
> # else
> #  define SECTION_SIZE_BITS     26
> #  define MAX_PHYSADDR_BITS     32
> #  define MAX_PHYSMEM_BITS      32
> # endif
> #else /* CONFIG_X86_32 */
> # define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
> # define MAX_PHYSADDR_BITS      44
> # define MAX_PHYSMEM_BITS       46
> #endif
>
> Looks like PAE needs more alignment.
> And it looks like 128 is arbitrary here.

I'll amend it to 512Mb.

>
> So we are tying ourselves to specific guest quirks.
> All this just looks wrong to me.

On real hardware I've seen, slots were aligned to 1Gb, which is also
arbitrary but happens to work with existing linux/windows OSes.
There is no point in making broken hardware, or in this case making
QEMU intentionally broken (misaligned) for linux guests, when we can
accommodate it.

>
> > ---
> >  hw/i386/pc.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index 3d958ba..0f7cf7c 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -1610,6 +1610,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
> >      }
> >  }
> >
> > +#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
> > +
>
> This kind of comment doesn't really help.

Sorry, forgot to add a comment here; how about:

/* minimal DIMM alignment is derived from
 * arch/x86/include/asm/sparsemem.h:SECTION_SIZE_BITS
 * from the PAE variant as the largest required alignment
 * among PAE/x32/x64 arch-s.
 */

> > static void pc_dimm_plug(HotplugHandler *hotplug_dev,
> >                          DeviceState *dev, Error **errp)
> > {
> > @@ -1624,6 +1626,16 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
> >
> >     if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
> >         align = memory_region_get_alignment(mr);
> > +        /*
> > +         * Linux x64 guests expect 128Mb aligned DIMM,
>
> this implies no other guest cares. which isn't true.
>
> > +         * but this change
>
> which change?
>
> > causes memory layout change
>
> change compared to what?
>
> > so
> > +         * for compatibility
>
> compatibility with what?
>
> > apply 128Mb alignment only
> > +         * when forced gaps are enabled since it is the cause
> > +         * of misalignment.
>
> Which makes no sense, sorry.
>
> Can it be misaligned for some other reason?
>
> If not, why limit to this case?

A 2.4 machine will have DIMMs plugged in with the backend's alignment,
which is 2Mb for memory-backend-ram, so forcing a minimal 128Mb
alignment would change the DIMMs' GPAs and break migration from an old
QEMU 2.4 machine to a new QEMU 2.4 machine type.
Since the gaps introduced the regression, it makes sense to condition
the alignment adjustment on the inter_dimm_gap property so it would not
affect machine types prior to 2.5.

How about following comment:

/* Linux guests expect 512/64/128Mb alignment for PAE/x32/x64 arch
 * respectively. Windows works fine with 2Mb. To make sure that
 * memory hotplug would work with above flavors of Linux set
 * minimal alignment to 512Mb (i.e. PAE arch).
 * Enforcing minimal alignment bigger than that of a backend
 * (2Mb for memory-backend-ram) changes GPAs of the 2nd and
 * following DIMMs which breaks migration for older machine types,
 * so force minimal alignment only since 2.5 machine type and do not
 * apply it to older machine types (prior 2.5)
 */

> > +         */
> > +        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
> > +            align = PC_MIN_DIMM_ALIGNMENT;
> > +        }
> >     }
> >
> >     if (!pcms->acpi_dev) {
>
> All this sounds pretty fragile.
> How about we revert the inter dimm gap
> thing for 2.4? It's just a work around, this is piling work arounds on
> top of work arounds.

Yep it's workaround but it works around QEMU's broken virtio
implementation in a simple way without need for guest side changes.

Without foreseeable virtio fix it makes memory hotplug unusable and even
more so if there were a virtio fix it won't fix old guests since you've
said that virtio fix would require changes of both QEMU and guest sides.

> > --
> > 1.8.3.1
>
> How about following comment:
> /* Linux guests expect 512/64/128Mb alignment for PAE/x32/x64 arch
>  * respectively. Windows works fine with 2Mb. To make sure that
>  * memory hotplug would work with above flavors of Linux set
>  * minimal alignment to 512Mb (i.e. PAE arch).
>  * Enforcing minimal alignment bigger than that of a backend
>  * (2Mb for memory-backend-ram) changes GPAs of the 2nd and
>  * following DIMMs which breaks migration for older machine types,
>  * so force minimal alignment only since 2.5 machine type and do not
>  * apply it to older machine types (prior 2.5)
>  */
>

PAE guests are working with 128Mb alignment just fine, at least
3.8...3.16 I`ve tried previously, have you seen a counter-example?
On Mon, 26 Oct 2015 21:42:05 +0300
Andrey Korolyov <andrey@xdel.ru> wrote:

> > How about following comment:
> > /* Linux guests expect 512/64/128Mb alignment for PAE/x32/x64 arch
> >  * respectively. Windows works fine with 2Mb. To make sure that
> >  * memory hotplug would work with above flavors of Linux set
> >  * minimal alignment to 512Mb (i.e. PAE arch).
> >  * Enforcing minimal alignment bigger than that of a backend
> >  * (2Mb for memory-backend-ram) changes GPAs of the 2nd and
> >  * following DIMMs which breaks migration for older machine types,
> >  * so force minimal alignment only since 2.5 machine type and do not
> >  * apply it to older machine types (prior 2.5)
> >  */
> >
>
> PAE guests are working with 128Mb alignment just fine, at least
> 3.8...3.16 I`ve tried previously, have you seen a counter-example?

Issue is seen only during unplug when DIMM is not aligned to
SECTION_SIZE_BITS.
On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> Yep it's workaround but it works around QEMU's broken virtio
> implementation in a simple way without need for guest side changes.
>
> Without foreseeable virtio fix it makes memory hotplug unusable and even
> more so if there were a virtio fix it won't fix old guests since you've
> said that virtio fix would require changes of both QEMU and guest sides.

What makes it not foreseeable? Apparently only the fact that we have a
work-around in place so no one works on it. I can code it up pretty
quickly, but I'm flat out of time for testing as I'm going on vacation
soon, and hard freeze is pretty close.

GPA space is kind of cheap, but wasting it in chunks of 512M seems way
too aggressive.
On Tue, 27 Oct 2015 10:31:21 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > Yep it's workaround but it works around QEMU's broken virtio
> > implementation in a simple way without need for guest side changes.
> >
> > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > more so if there were a virtio fix it won't fix old guests since you've
> > said that virtio fix would require changes of both QEMU and guest sides.
>
> What makes it not foreseeable?
> Apparently only the fact that we have a work-around in place so no one
> works on it. I can code it up pretty quickly, but I'm flat out of time
> for testing as I'm going on vacation soon, and hard freeze is pretty
> close.

I can lend a hand for testing part.

>
> GPA space is kind of cheap, but wasting it in chunks of 512M
> seems way too aggressive.

hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
actually wasting anything here.
On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 10:31:21 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > Yep it's workaround but it works around QEMU's broken virtio
> > > implementation in a simple way without need for guest side changes.
> > >
> > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > more so if there were a virtio fix it won't fix old guests since you've
> > > said that virtio fix would require changes of both QEMU and guest sides.
> >
> > What makes it not foreseeable?
> > Apparently only the fact that we have a work-around in place so no one
> > works on it. I can code it up pretty quickly, but I'm flat out of time
> > for testing as I'm going on vacation soon, and hard freeze is pretty
> > close.
> I can lend a hand for testing part.
>
> >
> > GPA space is kind of cheap, but wasting it in chunks of 512M
> > seems way too aggressive.
> hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> actually wasting anything here.
>

If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
It's too much either way.
On Tue, 27 Oct 2015 10:53:08 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 10:31:21 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > implementation in a simple way without need for guest side changes.
> > > >
> > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > said that virtio fix would require changes of both QEMU and guest sides.
> > >
> > > What makes it not foreseeable?
> > > Apparently only the fact that we have a work-around in place so no one
> > > works on it. I can code it up pretty quickly, but I'm flat out of time
> > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > close.
> > I can lend a hand for testing part.
> >
> > >
> > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > seems way too aggressive.
> > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > actually wasting anything here.
> >
>
> If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> It's too much either way.

minimum would be 512, and if backend is 1Gb-hugepage gap will be
backend's natural alignment (i.e. 1Gb).
We always reserve extra 1Gb per slot for max possible alignment
so that we could map 1Gb hugepage aligned. It's over reserve
but we don't know in advance what will be plugged in slot and
so we always reserve max possible alignment.
On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 10:53:08 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > implementation in a simple way without need for guest side changes.
> > > > >
> > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > >
> > > > What makes it not foreseeable?
> > > > Apparently only the fact that we have a work-around in place so no one
> > > > works on it. I can code it up pretty quickly, but I'm flat out of time
> > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > close.
> > > I can lend a hand for testing part.
> > >
> > > >
> > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > seems way too aggressive.
> > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > actually wasting anything here.
> > >
> >
> > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > It's too much either way.
> minimum would be 512, and if backend is 1Gb-hugepage gap will be
> backend's natural alignment (i.e. 1Gb).

Is backend configuration even allowed to affect the machine ABI? We need
to be able to change backend configuration when migrating the VM to
another host.
On Tue, 27 Oct 2015 14:36:35 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 10:53:08 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >
> > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > implementation in a simple way without need for guest side changes.
> > > > > >
> > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > >
> > > > > What makes it not foreseeable?
> > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > works on it. I can code it up pretty quickly, but I'm flat out of time
> > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > close.
> > > > I can lend a hand for testing part.
> > > >
> > > > >
> > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > seems way too aggressive.
> > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > actually wasting anything here.
> > > >
> > >
> > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > It's too much either way.
> > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > backend's natural alignment (i.e. 1Gb).
>
> Is backend configuration even allowed to affect the machine ABI? We need
> to be able to change backend configuration when migrating the VM to
> another host.

for now, one has to use the same type of backend on both sides
i.e. if source uses 1Gb huge pages backend then target also
need to use it.

We could change this for the next machine type to always force
max alignment (1Gb), then it would be possible to change
between backends with different alignments.
(CCing Michal and libvir-list, so libvirt team is aware of this
restriction)

On Thu, Oct 29, 2015 at 02:36:37PM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 14:36:35 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
>
> > On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > > On Tue, 27 Oct 2015 10:53:08 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >
> > > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >
> > > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > > implementation in a simple way without need for guest side changes.
> > > > > > >
> > > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > > >
> > > > > > What makes it not foreseeable?
> > > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > > works on it. I can code it up pretty quickly, but I'm flat out of time
> > > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > > close.
> > > > > I can lend a hand for testing part.
> > > > >
> > > > > >
> > > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > > seems way too aggressive.
> > > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > > actually wasting anything here.
> > > > >
> > > >
> > > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > > It's too much either way.
> > > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > > backend's natural alignment (i.e. 1Gb).
> >
> > Is backend configuration even allowed to affect the machine ABI? We need
> > to be able to change backend configuration when migrating the VM to
> > another host.
> for now, one has to use the same type of backend on both sides
> i.e. if source uses 1Gb huge pages backend then target also
> need to use it.
>

The page size of the backend don't even depend on QEMU arguments, but on
the kernel command-line or hugetlbfs mount options. So it's possible to
have exactly the same QEMU command-line on source and destination (with
an explicit versioned machine-type), and get a VM that can't be
migrated? That means we are breaking our guarantees about migration and
guest ABI.

> We could change this for the next machine type to always force
> max alignment (1Gb), then it would be possible to change
> between backends with different alignments.

I'm not sure what's the best solution here. If always using 1GB is too
aggressive, we could require management to ask for an explicit alignment
as a -machine option if they know they will need a specific backend page
size.

BTW, are you talking about the behavior introduced by
aa8580cddf011e8cedcf87f7a0fdea7549fc4704 ("pc: memhp: force gaps between
DIMM's GPA") only, or the backend page size was already affecting GPA
allocation before that commit?
On Thu, 29 Oct 2015 16:16:57 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> (CCing Michal and libvir-list, so libvirt team is aware of this
> restriction)
>
> On Thu, Oct 29, 2015 at 02:36:37PM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 14:36:35 -0200
> > Eduardo Habkost <ehabkost@redhat.com> wrote:
> >
> > > On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > > > On Tue, 27 Oct 2015 10:53:08 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >
> > > > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > >
> > > > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > > > implementation in a simple way without need for guest side changes.
> > > > > > > >
> > > > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > > > >
> > > > > > > What makes it not foreseeable?
> > > > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > > > works on it. I can code it up pretty quickly, but I'm flat out of time
> > > > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > > > close.
> > > > > > I can lend a hand for testing part.
> > > > > >
> > > > > > >
> > > > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > > > seems way too aggressive.
> > > > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > > > actually wasting anything here.
> > > > > >
> > > > >
> > > > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > > > It's too much either way.
> > > > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > > > backend's natural alignment (i.e. 1Gb).
> > >
> > > Is backend configuration even allowed to affect the machine ABI? We need
> > > to be able to change backend configuration when migrating the VM to
> > > another host.
> > for now, one has to use the same type of backend on both sides
> > i.e. if source uses 1Gb huge pages backend then target also
> > need to use it.
> >
>
> The page size of the backend don't even depend on QEMU arguments, but on
> the kernel command-line or hugetlbfs mount options. So it's possible to
> have exactly the same QEMU command-line on source and destination (with
> an explicit versioned machine-type), and get a VM that can't be
> migrated? That means we are breaking our guarantees about migration and
> guest ABI.
>
> > We could change this for the next machine type to always force
> > max alignment (1Gb), then it would be possible to change
> > between backends with different alignments.
>
> I'm not sure what's the best solution here. If always using 1GB is too
> aggressive, we could require management to ask for an explicit alignment
> as a -machine option if they know they will need a specific backend page
> size.
>
> BTW, are you talking about the behavior introduced by
> aa8580cddf011e8cedcf87f7a0fdea7549fc4704 ("pc: memhp: force gaps between
> DIMM's GPA") only, or the backend page size was already affecting GPA
> allocation before that commit?

Backend alignment was there since the beginning; we always over-reserve
1GB per slot since we don't know in advance what alignment a hotplugged
backend would require.
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 3d958ba..0f7cf7c 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1610,6 +1610,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
     }
 }
 
+#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
+
 static void pc_dimm_plug(HotplugHandler *hotplug_dev,
                          DeviceState *dev, Error **errp)
 {
@@ -1624,6 +1626,16 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
 
     if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
         align = memory_region_get_alignment(mr);
+        /*
+         * Linux x64 guests expect 128Mb aligned DIMM,
+         * but this change causes memory layout change so
+         * for compatibility apply 128Mb alignment only
+         * when forced gaps are enabled since it is the cause
+         * of misalignment.
+         */
+        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
+            align = PC_MIN_DIMM_ALIGNMENT;
+        }
     }
 
     if (!pcms->acpi_dev) {
commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
regressed memory hot-unplug for linux guests triggering
following BUG_ON
=====
kernel BUG at mm/memory_hotplug.c:703!
...
[<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
[<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
[<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
===
BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
===

The reason for it is that an x86-64 linux guest supports memory
hotplug in chunks of 128Mb and memory sections also should
be 128Mb aligned.
However, gaps forced between 128Mb DIMMs with a backend's
natural alignment of 2Mb leave the 2nd and following
DIMMs not aligned on a 128Mb boundary as they were
originally. To fix the regression, enforce a minimal 128Mb
alignment like it was done for PPC.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
PS:
PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS which
is arch dependent, so this is a fix for the x86-64 target only.
If anyone cares about 32bit guests, it should also be fine
for x86-32 which has 64Mb memory sections/alignment.
---
 hw/i386/pc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)