
[v2] pc: memhp: enforce minimal 128Mb alignment for pc-dimm

Message ID 1445852815-85168-1-git-send-email-imammedo@redhat.com
State New

Commit Message

Igor Mammedov Oct. 26, 2015, 9:46 a.m. UTC
Commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
regressed memory hot-unplug for Linux guests, triggering
the following BUG_ON:
 =====
 kernel BUG at mm/memory_hotplug.c:703!
 ...
 [<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
 [<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
 [<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
 ===
    BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
 ===

The reason is that x86-64 Linux guests support memory
hotplug in chunks of 128Mb, and memory sections must also
be 128Mb aligned.
However, the gaps forced between 128Mb DIMMs, combined with
the backend's natural alignment of 2Mb, leave the 2nd and
following DIMMs no longer aligned on a 128Mb boundary as
they were originally. To fix the regression, enforce a
minimal 128Mb alignment as was done for PPC.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
PS:
  PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS, which
  is arch-dependent, so this is a fix for the x86-64 target only.
  If anyone cares about 32-bit guests, it should also be fine
  for x86-32, which has 64Mb memory sections/alignment.
---
 hw/i386/pc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
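
To illustrate the misalignment described in the commit message, here is a minimal
standalone sketch (not part of the patch; the base address, DIMM size and gap value
are made up for illustration) showing how rounding the next DIMM's base up to the
backend's 2Mb alignment after a forced gap loses the 128Mb alignment that guest
memory sections require:

#include <stdint.h>
#include <stdio.h>

#define MiB (1ULL << 20)

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

int main(void)
{
    uint64_t base = 4096 * MiB;  /* start of hotplug area, 128Mb aligned      */
    uint64_t size = 128 * MiB;   /* DIMM size                                 */
    uint64_t gap  = 1 * MiB;     /* forced inter-DIMM gap (illustrative only) */

    uint64_t dimm1 = round_up(base, 2 * MiB);
    uint64_t dimm2 = round_up(dimm1 + size + gap, 2 * MiB);

    /* dimm1 stays on a 128Mb boundary, dimm2 does not; a Linux guest then
     * hits BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK) when unplugging it. */
    printf("dimm1 offset into its 128Mb section: %llu MiB\n",
           (unsigned long long)(dimm1 % (128 * MiB) / MiB));
    printf("dimm2 offset into its 128Mb section: %llu MiB\n",
           (unsigned long long)(dimm2 % (128 * MiB) / MiB));
    return 0;
}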

Comments

Michael S. Tsirkin Oct. 26, 2015, 10:28 a.m. UTC | #1
On Mon, Oct 26, 2015 at 10:46:55AM +0100, Igor Mammedov wrote:
> commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
> regressed memory hot-unplug for linux guests triggering
> following BUGON
>  =====
>  kernel BUG at mm/memory_hotplug.c:703!
>  ...
>  [<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
>  [<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
>  [<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
>  ===
>     BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
>  ===
> 
> reson for it is that x86-64 linux guest supports memory
> hotplug in chunks of 128Mb and memory section also should
> be 128Mb aligned.
> However gaps forced between 128Mb DIMMs with backend's
> natural alignment of 2Mb make the 2nd and following
> DIMMs not being aligned on 128Mb boundary as it was
> originally. To fix regression enforce minimal 128Mb
> alignment like it was done for PPC.
> 
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>

So our temporary workaround is creating more trouble.  I'm inclined to just
revert aa8580cd and df0acded19 along with it.

> ---
> PS:
>   PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS which
>   is arch dependent so this is fix for x86-64 target only.
>   If anyone cares obout 32bit guests, it should also be fine
>   for x86-32 which has 64Mb memory sections/alignment.

Like 32 bit guests are unheard of?  This does not inspire confidence at all.


So I dug into the Linux guest code:

#ifdef CONFIG_X86_32
# ifdef CONFIG_X86_PAE
#  define SECTION_SIZE_BITS     29
#  define MAX_PHYSADDR_BITS     36
#  define MAX_PHYSMEM_BITS      36
# else
#  define SECTION_SIZE_BITS     26
#  define MAX_PHYSADDR_BITS     32
#  define MAX_PHYSMEM_BITS      32
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
# define MAX_PHYSADDR_BITS      44
# define MAX_PHYSMEM_BITS       46
#endif

Looks like PAE needs more alignment.
And it looks like 128 is arbitrary here.
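
For reference, the section sizes these constants imply (a quick derivation of
1 << SECTION_SIZE_BITS, not something spelled out in the thread):

#define MiB (1ULL << 20)
/* memory section size = 1ULL << SECTION_SIZE_BITS */
_Static_assert((1ULL << 26) ==  64 * MiB, "x86-32, non-PAE");
_Static_assert((1ULL << 27) == 128 * MiB, "x86-64");
_Static_assert((1ULL << 29) == 512 * MiB, "x86-32, PAE");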

So we are tying ourselves to specific guest quirks.
All this just looks wrong to me.


> ---
>  hw/i386/pc.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 3d958ba..0f7cf7c 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1610,6 +1610,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
>      }
>  }
>  
> +#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
> +

This kind of comment doesn't really help.

>  static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>                           DeviceState *dev, Error **errp)
>  {
> @@ -1624,6 +1626,16 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
>  
>      if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
>          align = memory_region_get_alignment(mr);
> +        /*
> +         * Linux x64 guests expect 128Mb aligned DIMM,

This implies no other guest cares, which isn't true.

> +         * but this change

which change?

> causes memory layout change

change compared to what?

> so
> +         * for compatibility

compatibility with what?

> apply 128Mb alignment only
> +         * when forced gaps are enabled since it is the cause
> +         * of misalignment.

Which makes no sense, sorry.

Can it be misaligned for some other reason?

If not, why limit to this case?

> +         */
> +        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
> +            align = PC_MIN_DIMM_ALIGNMENT;
> +        }
>      }
>  
>      if (!pcms->acpi_dev) {

All this sounds pretty fragile. How about we revert the inter-DIMM gap
thing for 2.4? It's just a workaround; this is piling workarounds on
top of workarounds.

> -- 
> 1.8.3.1
Igor Mammedov Oct. 26, 2015, 1:24 p.m. UTC | #2
On Mon, 26 Oct 2015 12:28:21 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 26, 2015 at 10:46:55AM +0100, Igor Mammedov wrote:
> > commit aa8580cd "pc: memhp: force gaps between DIMM's GPA"
> > regressed memory hot-unplug for linux guests triggering
> > following BUGON
> >  =====
> >  kernel BUG at mm/memory_hotplug.c:703!
> >  ...
> >  [<ffffffff81385fa7>] acpi_memory_device_remove+0x79/0xa5
> >  [<ffffffff81357818>] acpi_bus_trim+0x5a/0x8d
> >  [<ffffffff81359026>] acpi_device_hotplug+0x1b7/0x418
> >  ===
> >     BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
> >  ===
> > 
> > reson for it is that x86-64 linux guest supports memory
> > hotplug in chunks of 128Mb and memory section also should
> > be 128Mb aligned.
> > However gaps forced between 128Mb DIMMs with backend's
> > natural alignment of 2Mb make the 2nd and following
> > DIMMs not being aligned on 128Mb boundary as it was
> > originally. To fix regression enforce minimal 128Mb
> > alignment like it was done for PPC.
> > 
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> 
> So our temporary work around is creating more trouble.  I'm inclined to just
> revert aa8580cd and df0acded19 with it.
> 
> > ---
> > PS:
> >   PAGE_SECTION_MASK is derived from SECTION_SIZE_BITS which
> >   is arch dependent so this is fix for x86-64 target only.
> >   If anyone cares obout 32bit guests, it should also be fine
> >   for x86-32 which has 64Mb memory sections/alignment.
> 
> Like 32 bit guests are unheard of?  This does not inspire confidence at all.
> 
> 
> So I dug in linux guest code:
> 
> #ifdef CONFIG_X86_32
> # ifdef CONFIG_X86_PAE
> #  define SECTION_SIZE_BITS     29
> #  define MAX_PHYSADDR_BITS     36
> #  define MAX_PHYSMEM_BITS      36
> # else
> #  define SECTION_SIZE_BITS     26
> #  define MAX_PHYSADDR_BITS     32
> #  define MAX_PHYSMEM_BITS      32
> # endif
> #else /* CONFIG_X86_32 */
> # define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
> # define MAX_PHYSADDR_BITS      44
> # define MAX_PHYSMEM_BITS       46
> #endif
> 
> Looks like PAE needs more alignment.
> And it looks like 128 is arbitrary here.
I'll amend it to 512Mb.


> 
> So we are tying ourselves to specific guest quirks.
> All this just looks wrong to me.
On real hardware I've seen, slots were aligned to 1Gb,
which is also arbitrary but happens to work with existing
Linux/Windows OSes.
There is no point in building broken hardware, or in this
case in making QEMU's alignment intentionally broken for
Linux guests, when we can accommodate them.

> 
> 
> > ---
> >  hw/i386/pc.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index 3d958ba..0f7cf7c 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -1610,6 +1610,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
> >      }
> >  }
> >  
> > +#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
> > +
> 
> This kind of comment doesn't really help.
Sorry, I forgot to add a comment here; how about:

/* minimal DIMM alignment is derived from
 * arch/x86/include/asm/sparsemem.h:SECTION_SIZE_BITS
 * of the PAE variant, as it is the largest required alignment
 * among the PAE/x32/x64 arches.
 */
> 
> >  static void pc_dimm_plug(HotplugHandler *hotplug_dev,
> >                           DeviceState *dev, Error **errp)
> >  {
> > @@ -1624,6 +1626,16 @@ static void pc_dimm_plug(HotplugHandler *hotplug_dev,
> >  
> >      if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
> >          align = memory_region_get_alignment(mr);
> > +        /*
> > +         * Linux x64 guests expect 128Mb aligned DIMM,
> 
> this implies no other guest cares. which isn't true.
> 
> > +         * but this change
> 
> which change?
> 
> > causes memory layout change
> 
> change compared to what?
> 
> > so
> > +         * for compatibility
> 
> compatibility with what?
> 
> > apply 128Mb alignment only
> > +         * when forced gaps are enabled since it is the cause
> > +         * of misalignment.
> 
> Which makes no sense, sorry.
> 
> Can it be misaligned for some other reason?
> 
> If not, why limit to this case?
A 2.4 machine has its DIMMs plugged in with the backend's alignment,
which is 2Mb for memory-backend-ram, so forcing a minimal 128Mb alignment
would change the DIMMs' GPAs and break migration of a 2.4 machine from an
old QEMU to a new one.

Since the gaps introduced the regression, it makes sense to condition the
alignment adjustment on the inter_dimm_gap property so that it does not
affect machine types prior to 2.5.

How about the following comment:
/* Linux guests expect 512/64/128Mb alignment for the PAE/x32/x64 archs
 * respectively. Windows works fine with 2Mb. To make sure that
 * memory hotplug works with the above flavors of Linux, set the
 * minimal alignment to 512Mb (i.e. the PAE case).
 * Enforcing a minimal alignment bigger than that of the backend
 * (2Mb for memory-backend-ram) changes the GPAs of the 2nd and
 * following DIMMs, which breaks migration for older machine types,
 * so force the minimal alignment only from the 2.5 machine type on and
 * do not apply it to older machine types (prior to 2.5).
 */
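
For concreteness, a sketch of how the hunk could look with the 512Mb minimum and
the comment above folded in (my paraphrase of the amendment under discussion, not
an actual v3 patch):

/* Largest memory-section alignment required by Linux guests: 512Mb for
 * x86 PAE (SECTION_SIZE_BITS == 29 in arch/x86/include/asm/sparsemem.h),
 * which also covers the x32 (64Mb) and x64 (128Mb) cases.
 */
#define PC_MIN_DIMM_ALIGNMENT (1ULL << 29)

    if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
        align = memory_region_get_alignment(mr);
        /* Raise the alignment only when forced inter-DIMM gaps are in use
         * (2.5+ machine types), so older machine types keep their original
         * GPA layout and migration to/from them keeps working.
         */
        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
            align = PC_MIN_DIMM_ALIGNMENT;
        }
    }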

> 
> > +         */
> > +        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
> > +            align = PC_MIN_DIMM_ALIGNMENT;
> > +        }
> >      }
> >  
> >      if (!pcms->acpi_dev) {
> 
> All this sounds pretty fragile. How about we revert the inter dimm gap
> thing for 2.4? It's just a work around, this is piling work arounds on
> top of work arounds.
Yep, it's a workaround, but it works around QEMU's broken virtio
implementation in a simple way, without the need for guest-side changes.

Without a foreseeable virtio fix, the bug makes memory hotplug unusable; and
even if there were a virtio fix, it wouldn't fix old guests, since you've
said such a fix would require changes on both the QEMU and guest sides.

> 
> > -- 
> > 1.8.3.1
>
Andrey Korolyov Oct. 26, 2015, 6:42 p.m. UTC | #3
> How about following comment:
> /* Linux guests expect 512/64/128Mb alignment for PAE/x32/x64 arch
>  * respectively. Windows works fine with 2Mb. To make sure that
>  * memory hotplug would work with above flavors of Linux set
>  * minimal alignment to 512Mb (i.e. PAE arch).
>  * Enforcing minimal alignment bigger than that of a backend
>  * (2Mb for memory-backend-ram) changes GPAs of the 2nd and
>  * following DIMMs which breaks migration for older machine types,
>  * so force minimal alignment only since 2.5 machine type and do not
>  * apply it to older machine types (prior 2.5)
>  */
>


PAE guests work just fine with 128Mb alignment, at least the
3.8...3.16 kernels I've tried previously; have you seen a counter-example?
Igor Mammedov Oct. 27, 2015, 7:39 a.m. UTC | #4
On Mon, 26 Oct 2015 21:42:05 +0300
Andrey Korolyov <andrey@xdel.ru> wrote:

> > How about following comment:
> > /* Linux guests expect 512/64/128Mb alignment for PAE/x32/x64 arch
> >  * respectively. Windows works fine with 2Mb. To make sure that
> >  * memory hotplug would work with above flavors of Linux set
> >  * minimal alignment to 512Mb (i.e. PAE arch).
> >  * Enforcing minimal alignment bigger than that of a backend
> >  * (2Mb for memory-backend-ram) changes GPAs of the 2nd and
> >  * following DIMMs which breaks migration for older machine types,
> >  * so force minimal alignment only since 2.5 machine type and do not
> >  * apply it to older machine types (prior 2.5)
> >  */
> >
> 
> 
> PAE guests are working with 128Mb alignment just fine, at least
> 3.8...3.16 I`ve tried previously, have you seen a counter-example?

The issue is seen only during unplug, when the DIMM is not aligned to
SECTION_SIZE_BITS.
Michael S. Tsirkin Oct. 27, 2015, 8:31 a.m. UTC | #5
On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> Yep it's workaround but it works around QEMU's broken virtio
> implementation in a simple way without need for guest side changes.
> 
> Without foreseeable virtio fix it makes memory hotplug unusable and even
> more so if there were a virtio fix it won't fix old guests since you've
> said that virtio fix would require changes of both QEMU and guest sides.

What makes it not foreseeable?
Apparently only the fact that we have a workaround in place, so no one
works on it.  I can code it up pretty quickly, but I'm flat out of time
for testing, as I'm going on vacation soon, and hard freeze is pretty
close.

GPA space is kind of cheap, but wasting it in chunks of 512M
seems way too aggressive.
Igor Mammedov Oct. 27, 2015, 8:48 a.m. UTC | #6
On Tue, 27 Oct 2015 10:31:21 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > Yep it's workaround but it works around QEMU's broken virtio
> > implementation in a simple way without need for guest side changes.
> > 
> > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > more so if there were a virtio fix it won't fix old guests since you've
> > said that virtio fix would require changes of both QEMU and guest sides.
> 
> What makes it not foreseeable?
> Apparently only the fact that we have a work-around in place so no one
> works on it.  I can code it up pretty quickly, but I'm flat out of time
> for testing as I'm going on vacation soon, and hard freeze is pretty
> close.
I can lend a hand with the testing part.

> 
> GPA space is kind of cheap, but wasting it in chunks of 512M
> seems way too aggressive.
The hotplug region is sized with a 1Gb alignment reserve per DIMM, so we aren't
actually wasting anything here.
Michael S. Tsirkin Oct. 27, 2015, 8:53 a.m. UTC | #7
On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 10:31:21 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > Yep it's workaround but it works around QEMU's broken virtio
> > > implementation in a simple way without need for guest side changes.
> > > 
> > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > more so if there were a virtio fix it won't fix old guests since you've
> > > said that virtio fix would require changes of both QEMU and guest sides.
> > 
> > What makes it not foreseeable?
> > Apparently only the fact that we have a work-around in place so no one
> > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > for testing as I'm going on vacation soon, and hard freeze is pretty
> > close.
> I can lend a hand for testing part.
> 
> > 
> > GPA space is kind of cheap, but wasting it in chunks of 512M
> > seems way too aggressive.
> hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> actually wasting anything here.
>

If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
It's too much either way.
Igor Mammedov Oct. 27, 2015, 9:14 a.m. UTC | #8
On Tue, 27 Oct 2015 10:53:08 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 10:31:21 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > implementation in a simple way without need for guest side changes.
> > > > 
> > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > 
> > > What makes it not foreseeable?
> > > Apparently only the fact that we have a work-around in place so no one
> > > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > close.
> > I can lend a hand for testing part.
> > 
> > > 
> > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > seems way too aggressive.
> > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > actually wasting anything here.
> >
> 
> If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> It's too much either way.
The minimum would be 512Mb, and if the backend uses 1Gb hugepages the gap will
be the backend's natural alignment (i.e. 1Gb).

We always reserve an extra 1Gb per slot for the maximum possible alignment so
that a 1Gb hugepage can be mapped aligned. It's an over-reserve, but we don't
know in advance what will be plugged into a slot, so we always reserve for the
maximum possible alignment.
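
Roughly, the sizing logic being referred to in pc_memory_init() looks like this
(paraphrased from the 2.5-era code rather than quoted, so treat the exact field
names as approximate):

uint64_t hotplug_mem_size = machine->maxram_size - machine->ram_size;

if (pcmc->enforce_aligned_dimm) {
    /* Reserve a full 1Gb of extra address space per slot so that any
     * backend, including a 1Gb-hugepage one, can still be mapped aligned. */
    hotplug_mem_size += (1ULL << 30) * machine->ram_slots;
}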
Eduardo Habkost Oct. 27, 2015, 4:36 p.m. UTC | #9
On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 10:53:08 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > implementation in a simple way without need for guest side changes.
> > > > > 
> > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > 
> > > > What makes it not foreseeable?
> > > > Apparently only the fact that we have a work-around in place so no one
> > > > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > close.
> > > I can lend a hand for testing part.
> > > 
> > > > 
> > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > seems way too aggressive.
> > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > actually wasting anything here.
> > >
> > 
> > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > It's too much either way.
> minimum would be 512, and if backend is 1Gb-hugepage gap will be
> backend's natural alignment (i.e. 1Gb).

Is backend configuration even allowed to affect the machine ABI? We need
to be able to change backend configuration when migrating the VM to
another host.
Igor Mammedov Oct. 29, 2015, 1:36 p.m. UTC | #10
On Tue, 27 Oct 2015 14:36:35 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 10:53:08 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > implementation in a simple way without need for guest side changes.
> > > > > > 
> > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > > 
> > > > > What makes it not foreseeable?
> > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > close.
> > > > I can lend a hand for testing part.
> > > > 
> > > > > 
> > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > seems way too aggressive.
> > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > actually wasting anything here.
> > > >
> > > 
> > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > It's too much either way.
> > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > backend's natural alignment (i.e. 1Gb).
> 
> Is backend configuration even allowed to affect the machine ABI? We need
> to be able to change backend configuration when migrating the VM to
> another host.
For now, one has to use the same type of backend on both sides,
i.e. if the source uses a 1Gb hugepage backend then the target also
needs to use it.

We could change this for the next machine type to always force the
maximum alignment (1Gb); then it would be possible to switch
between backends with different alignments.
Eduardo Habkost Oct. 29, 2015, 6:16 p.m. UTC | #11
(CCing Michal and libvir-list, so libvirt team is aware of this
restriction)

On Thu, Oct 29, 2015 at 02:36:37PM +0100, Igor Mammedov wrote:
> On Tue, 27 Oct 2015 14:36:35 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
> > On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > > On Tue, 27 Oct 2015 10:53:08 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > 
> > > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > > implementation in a simple way without need for guest side changes.
> > > > > > > 
> > > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > > > 
> > > > > > What makes it not foreseeable?
> > > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > > close.
> > > > > I can lend a hand for testing part.
> > > > > 
> > > > > > 
> > > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > > seems way too aggressive.
> > > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > > actually wasting anything here.
> > > > >
> > > > 
> > > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > > It's too much either way.
> > > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > > backend's natural alignment (i.e. 1Gb).
> > 
> > Is backend configuration even allowed to affect the machine ABI? We need
> > to be able to change backend configuration when migrating the VM to
> > another host.
> for now, one has to use the same type of backend on both sides
> i.e. if source uses 1Gb huge pages backend then target also
> need to use it.
> 

The page size of the backend doesn't even depend on QEMU arguments, but on
the kernel command line or hugetlbfs mount options. So it's possible to
have exactly the same QEMU command line on source and destination (with
an explicit versioned machine type) and get a VM that can't be
migrated? That means we are breaking our guarantees about migration and
guest ABI.


> We could change this for the next machine type to always force
> max alignment (1Gb), then it would be possible to change
> between backends with different alignments.

I'm not sure what the best solution is here. If always using 1GB is too
aggressive, we could require management to ask for an explicit alignment
via a -machine option if they know they will need a specific backend page
size.

BTW, are you talking about the behavior introduced by
aa8580cddf011e8cedcf87f7a0fdea7549fc4704 ("pc: memhp: force gaps between
DIMM's GPA") only, or the backend page size was already affecting GPA
allocation before that commit?
Igor Mammedov Oct. 30, 2015, 1:07 p.m. UTC | #12
On Thu, 29 Oct 2015 16:16:57 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> (CCing Michal and libvir-list, so libvirt team is aware of this
> restriction)
> 
> On Thu, Oct 29, 2015 at 02:36:37PM +0100, Igor Mammedov wrote:
> > On Tue, 27 Oct 2015 14:36:35 -0200
> > Eduardo Habkost <ehabkost@redhat.com> wrote:
> > 
> > > On Tue, Oct 27, 2015 at 10:14:56AM +0100, Igor Mammedov wrote:
> > > > On Tue, 27 Oct 2015 10:53:08 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Tue, Oct 27, 2015 at 09:48:37AM +0100, Igor Mammedov wrote:
> > > > > > On Tue, 27 Oct 2015 10:31:21 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > 
> > > > > > > On Mon, Oct 26, 2015 at 02:24:32PM +0100, Igor Mammedov wrote:
> > > > > > > > Yep it's workaround but it works around QEMU's broken virtio
> > > > > > > > implementation in a simple way without need for guest side changes.
> > > > > > > > 
> > > > > > > > Without foreseeable virtio fix it makes memory hotplug unusable and even
> > > > > > > > more so if there were a virtio fix it won't fix old guests since you've
> > > > > > > > said that virtio fix would require changes of both QEMU and guest sides.
> > > > > > > 
> > > > > > > What makes it not foreseeable?
> > > > > > > Apparently only the fact that we have a work-around in place so no one
> > > > > > > works on it.  I can code it up pretty quickly, but I'm flat out of time
> > > > > > > for testing as I'm going on vacation soon, and hard freeze is pretty
> > > > > > > close.
> > > > > > I can lend a hand for testing part.
> > > > > > 
> > > > > > > 
> > > > > > > GPA space is kind of cheap, but wasting it in chunks of 512M
> > > > > > > seems way too aggressive.
> > > > > > hotplug region is sized with 1Gb alignment reserve per DIMM so we aren't
> > > > > > actually wasting anything here.
> > > > > >
> > > > > 
> > > > > If I allocate two 1G DIMMs, what will be the gap size? 512M? 1G?
> > > > > It's too much either way.
> > > > minimum would be 512, and if backend is 1Gb-hugepage gap will be
> > > > backend's natural alignment (i.e. 1Gb).
> > > 
> > > Is backend configuration even allowed to affect the machine ABI? We need
> > > to be able to change backend configuration when migrating the VM to
> > > another host.
> > for now, one has to use the same type of backend on both sides
> > i.e. if source uses 1Gb huge pages backend then target also
> > need to use it.
> > 
> 
> The page size of the backend don't even depend on QEMU arguments, but on
> the kernel command-line or hugetlbfs mount options. So it's possible to
> have exactly the same QEMU command-line on source and destination (with
> an explicit versioned machine-type), and get a VM that can't be
> migrated? That means we are breaking our guarantees about migration and
> guest ABI.
> 
> 
> > We could change this for the next machine type to always force
> > max alignment (1Gb), then it would be possible to change
> > between backends with different alignments.
> 
> I'm not sure what's the best solution here. If always using 1GB is too
> aggressive, we could require management to ask for an explicit alignment
> as a -machine option if they know they will need a specific backend page
> size.
> 
> BTW, are you talking about the behavior introduced by
> aa8580cddf011e8cedcf87f7a0fdea7549fc4704 ("pc: memhp: force gaps between
> DIMM's GPA") only, or the backend page size was already affecting GPA
> allocation before that commit?
Backend alignment has been there since the beginning;
we always over-reserve 1GB per slot since we don't know in advance what
alignment a hotplugged backend will require.

Patch

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 3d958ba..0f7cf7c 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1610,6 +1610,8 @@  void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name)
     }
 }
 
+#define PC_MIN_DIMM_ALIGNMENT (1ULL << 27) /* 128Mb */
+
 static void pc_dimm_plug(HotplugHandler *hotplug_dev,
                          DeviceState *dev, Error **errp)
 {
@@ -1624,6 +1626,16 @@  static void pc_dimm_plug(HotplugHandler *hotplug_dev,
 
     if (memory_region_get_alignment(mr) && pcms->enforce_aligned_dimm) {
         align = memory_region_get_alignment(mr);
+        /*
+         * Linux x64 guests expect 128Mb aligned DIMM,
+         * but this change causes memory layout change so
+         * for compatibility apply 128Mb alignment only
+         * when forced gaps are enabled since it is the cause
+         * of misalignment.
+         */
+        if (pcmc->inter_dimm_gap && align < PC_MIN_DIMM_ALIGNMENT) {
+            align = PC_MIN_DIMM_ALIGNMENT;
+        }
     }
 
     if (!pcms->acpi_dev) {
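
For completeness, one way to exercise this path and inspect the resulting DIMM
GPAs is to hot-plug a DIMM from the monitor (a sketch; the sizes and IDs are
arbitrary):

$ qemu-system-x86_64 -enable-kvm -m 4G,slots=4,maxmem=16G ...

(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu) info memory-devices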