
[10/16] dimm: add busy slot check and slot auto-allocation

Message ID 1374596592-7027-11-git-send-email-imammedo@redhat.com
State New

Commit Message

Igor Mammedov July 23, 2013, 4:23 p.m. UTC
- if slot property is not specified on -device/device_add command,
treat default value as request for assigning DimmDevice to
the first free slot.

- if slot is provided with -device/device_add command, attempt to
use it or fail command if it's already occupied.
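The two rules above amount to a first-free-slot allocator with a busy check. A rough sketch of the logic (hypothetical helper names and a fixed slot count, not the actual patch code):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 8
#define SLOT_AUTO (-1)   /* default: caller did not specify a slot */

static bool slot_busy[MAX_SLOTS];

/* Return the slot assigned to the DIMM, or -1 on error
 * (requested slot occupied/out of range, or no free slot left). */
static int dimm_assign_slot(int requested)
{
    if (requested == SLOT_AUTO) {
        /* no slot given: take the first free one */
        for (int i = 0; i < MAX_SLOTS; i++) {
            if (!slot_busy[i]) {
                slot_busy[i] = true;
                return i;
            }
        }
        return -1;
    }
    if (requested < 0 || requested >= MAX_SLOTS || slot_busy[requested]) {
        return -1;   /* fail the command: slot invalid or occupied */
    }
    slot_busy[requested] = true;
    return requested;
}
```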

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/mem-hotplug/dimm.c         |   51 ++++++++++++++++++++++++++++++++++++++++-
 include/hw/mem-hotplug/dimm.h |    5 ++++
 2 files changed, 55 insertions(+), 1 deletions(-)

Comments

Paolo Bonzini July 23, 2013, 5:09 p.m. UTC | #1
Il 23/07/2013 18:23, Igor Mammedov ha scritto:
> - if slot property is not specified on -device/device_add command,
> treat default value as request for assigning DimmDevice to
> the first free slot.

Even with "-m" instead of "-numa mem", I think this is problematic
because we still need to separate the host and guest parts of the DIMM
device.  "-numa mem" (or the QMP command that Wanlong added) will be
necessary to allocate memory on the host side before adding a DIMM.

So slots will have three states: free (created with "-m"), allocated (a
free slot moves to this state with "-numa mem...,populated=no" when
migrating, or with the QMP command for regular hotplug), populated (an
allocated slot moves to this state with "-device dimm").

You would be able to plug a DIMM only into an allocated slot, and the
size will be specified on the slot rather than the DIMM device.
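A minimal sketch of the three-state scheme described above (the state and function names are illustrative, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

/* Slot states in the proposed model. */
typedef enum {
    SLOT_FREE,        /* created with "-m" */
    SLOT_ALLOCATED,   /* host memory reserved via "-numa mem" or QMP */
    SLOT_POPULATED,   /* guest DIMM plugged with "-device dimm" */
} SlotState;

/* "-numa mem"/QMP allocation: free -> allocated */
static bool slot_allocate(SlotState *s)
{
    if (*s != SLOT_FREE) {
        return false;
    }
    *s = SLOT_ALLOCATED;
    return true;
}

/* "-device dimm": only an allocated slot can be populated. */
static bool slot_populate(SlotState *s)
{
    if (*s != SLOT_ALLOCATED) {
        return false;
    }
    *s = SLOT_POPULATED;
    return true;
}
```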

In general, I don't think free slots should be managed by the DimmBus,
and host vs. guest separation should be there even if we accept your
"-m" extension (doesn't look bad at all, I must say).

Paolo
Igor Mammedov July 24, 2013, 8:36 a.m. UTC | #2
On Tue, 23 Jul 2013 19:09:26 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 23/07/2013 18:23, Igor Mammedov ha scritto:
> > - if slot property is not specified on -device/device_add command,
> > treat default value as request for assigning DimmDevice to
> > the first free slot.
> 
> Even with "-m" instead of "-numa mem", I think this is problematic
> because we still need to separate the host and guest parts of the DIMM
> device.  "-numa mem" (or the QMP command that Wanlong added) will be
> necessary to allocate memory on the host side before adding a DIMM.
Why not do the host allocation part at the same time the DIMM is added? Is
there a real need for a separate DIMM device?

I'm probably missing something, but the -numa mem option and co. aside, what
problem couldn't be solved during DIMM device initialization and would
require a split DIMM device?

> 
> So slots will have three states: free (created with "-m"), allocated (a
> free slot moves to this state with "-numa mem...,populated=no" when
> migrating, or with the QMP command for regular hotplug), populated (an
> allocated slot moves to this state with "-device dimm").
> 
> You would be able to plug a DIMM only into an allocated slot, and the
> size will be specified on the slot rather than the DIMM device.
The 'slot' property is there only for migration's sake, to provide a stable
numeric ID for the QEMU<->ACPI BIOS interface. It's not used for any other
purpose and wasn't intended for any other usage.

On bare metal, a slot has nothing to do with the size of the plugged-in
DIMM, so why would we model it another way if that only brings problems
(predefined size, allocated vs. free states, etc.)? I think a slot should be
either free or busy.


> 
> In general, I don't think free slots should be managed by the DimmBus,
> and host vs. guest separation should be there even if we accept your
> "-m" extension (doesn't look bad at all, I must say).
> 
> Paolo
Paolo Bonzini July 24, 2013, 9:41 a.m. UTC | #3
Il 24/07/2013 10:36, Igor Mammedov ha scritto:
> On Tue, 23 Jul 2013 19:09:26 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Il 23/07/2013 18:23, Igor Mammedov ha scritto:
>>> - if slot property is not specified on -device/device_add command,
>>> treat default value as request for assigning DimmDevice to
>>> the first free slot.
>>
>> Even with "-m" instead of "-numa mem", I think this is problematic
>> because we still need to separate the host and guest parts of the DIMM
>> device.  "-numa mem" (or the QMP command that Wanlong added) will be
>> necessary to allocate memory on the host side before adding a DIMM.
> why not do host allocation part at the same time when DIMM is added, is
> there a real need to separate DIMM device?
> 
> I probably miss something but -numa mem option and co aside what problem
> couldn't be solved during DIMM device initialization and would require
> a split DIMM device?

Because otherwise, every option we add to "-numa mem" will have to be
added to "-device dimm".  For example,

   -device dimm,policy=interleave

makes no sense to me.

In fact, this is no different from having to do drive_add or netdev_add
before device_add.  First you tell QEMU about the host resources to use,
then you add the guest device and bind the device to those resources.

>> So slots will have three states: free (created with "-m"), allocated (a
>> free slot moves to this state with "-numa mem...,populated=no" when
>> migrating, or with the QMP command for regular hotplug), populated (an
>> allocated slot moves to this state with "-device dimm").
>>
>> You would be able to plug a DIMM only into an allocated slot, and the
>> size will be specified on the slot rather than the DIMM device.
> 'slot' property is there only for migration sake to provide stable
> numeric ID for QEMU<->ACPI BIOS interface. It's not used for any other
> purpose and wasn't intended for any other usage..

How would you otherwise refer to the memory you want to affect in a
set-mem-policy monitor command?

> on baremetal slot has noting to do with size of plugged in DIMM,

On baremetal slots also belong to a specific NUMA node, for what it's
worth.  There are going to be differences with baremetal no matter what.

> why we
> would model it other way if it only brings problems: like predefined size,

It doesn't have to be predefined.  In the previous discussions (and also
based on Vasilis and Hu Tao's implementations) I assumed predefined slot
sizes.  Now I understand the benefit of having a simpler command-line
with "-m", but then in return you need three slot states instead of just
unpopulated/populated.

So you'd just do

   set-mem-policy 0 size=2G      # free->allocated
   device_add dimm,slotid=0      # allocated->populated

to hotplug a 2G DIMM.  And you'll be able to pin it to host NUMA nodes,
and assign it to guest NUMA nodes, like this:

   set-mem-policy 0 size=2G,nodeid=1,policy=membind host-nodes=0-1
   device_add dimm,slotid=0

Again, this is the same as drive_add/device_add.

Paolo

> allocated, free etc. I think slot should be either free or busy.
> 
> 
>>
>> In general, I don't think free slots should be managed by the DimmBus,
>> and host vs. guest separation should be there even if we accept your
>> "-m" extension (doesn't look bad at all, I must say).
>>
>> Paolo
>
Igor Mammedov July 24, 2013, 11:34 a.m. UTC | #4
On Wed, 24 Jul 2013 11:41:04 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 24/07/2013 10:36, Igor Mammedov ha scritto:
> > On Tue, 23 Jul 2013 19:09:26 +0200
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> >> Il 23/07/2013 18:23, Igor Mammedov ha scritto:
> >>> - if slot property is not specified on -device/device_add command,
> >>> treat default value as request for assigning DimmDevice to
> >>> the first free slot.
> >>
> >> Even with "-m" instead of "-numa mem", I think this is problematic
> >> because we still need to separate the host and guest parts of the DIMM
> >> device.  "-numa mem" (or the QMP command that Wanlong added) will be
> >> necessary to allocate memory on the host side before adding a DIMM.
> > why not do host allocation part at the same time when DIMM is added, is
> > there a real need to separate DIMM device?
> > 
> > I probably miss something but -numa mem option and co aside what problem
> > couldn't be solved during DIMM device initialization and would require
> > a split DIMM device?
> 
> Because otherwise, every option we add to "-numa mem" will have to be
> added to "-device dimm".  For example,
> 
>    -device dimm,policy=interleave
If it's a feature of the DIMM device, sure; if it is not, let's find a
better place for it. See below for an alternative approach.

> 
> makes no sense to me.
> 
> In fact, this is no different from having to do drive_add or netdev_add
> before device_add.  First you tell QEMU about the host resources to use,
> then you add the guest device and bind the device to those resources.
> 
> >> So slots will have three states: free (created with "-m"), allocated (a
> >> free slot moves to this state with "-numa mem...,populated=no" when
> >> migrating, or with the QMP command for regular hotplug), populated (an
> >> allocated slot moves to this state with "-device dimm").
> >>
> >> You would be able to plug a DIMM only into an allocated slot, and the
> >> size will be specified on the slot rather than the DIMM device.
> > 'slot' property is there only for migration sake to provide stable
> > numeric ID for QEMU<->ACPI BIOS interface. It's not used for any other
> > purpose and wasn't intended for any other usage..
> 
> How would you otherwise refer to the memory you want to affect in a
> set-mem-policy monitor command?
It could be the 'id' property, or even better a QOM path.

> 
> > on baremetal slot has noting to do with size of plugged in DIMM,
> 
> On baremetal slots also belong to a specific NUMA node, for what it's
> worth.  There are going to be differences with baremetal no matter what.
Sure, we can deviate here, but I don't see the full picture yet, so I'm
trying to find justification for it first and asking questions. Maybe a
better solution will be found.

> 
> > why we
> > would model it other way if it only brings problems: like predefined size,
> 
> It doesn't have to be predefined.  In the previous discussions (and also
> based on Vasilis and Hu Tao's implementations) I assumed predefined slot
> sizes.  Now I understand the benefit of having a simpler command-line
> with "-m", but then in return you need three slot states instead of just
> unpopulated/populated.
> 
> So you'd just do
> 
>    set-mem-policy 0 size=2G      # free->allocated
>    device_add dimm,slotid=0      # allocated->populated
> 
> to hotplug a 2G DIMM.  And you'll be able to pin it to host NUMA nodes,
> and assign it to guest NUMA nodes, like this:
> 
>    set-mem-policy 0 size=2G,nodeid=1,policy=membind host-nodes=0-1
>    device_add dimm,slotid=0
Do policy and other -numa mem properties belong to a particular DIMM device
or rather to a particular NUMA node?

How about the following idea: a guest node maps to a specific host node;
then, when we plug a DIMM, the guest node provides the policy information
to the creator of the DIMM device (via DimmBus and/or mhc), which allocates
memory, applies policies, and binds the new memory to a specific host node.
That would eliminate the two-stage approach.

In this case the DIMM device only needs to specify where it's plugged in,
using the 'node' property (now a number, but it could become a QOM path to a
NUMA node object).

Ideally it would be a QOM hierarchy:

/nodeX/@dimmbus/dimm_device

where even the 'node' property would become obsolete; just specify the right
bus to attach the DIMM device to.
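A toy sketch of that hierarchy idea (all types and names hypothetical): when each guest NUMA node owns its own DIMM bus, the device's node follows from the bus it sits on, so no explicit 'node' property is needed:

```c
#include <assert.h>

/* Hypothetical types illustrating "/nodeX/@dimmbus/dimm_device":
 * each guest NUMA node owns its own DIMM bus. */
typedef struct DimmBus {
    int node;                 /* the node this bus belongs to */
} DimmBus;

typedef struct NumaNode {
    int id;
    DimmBus dimmbus;          /* per-node bus */
} NumaNode;

typedef struct DimmDevice {
    DimmBus *bus;             /* set when the device is attached */
} DimmDevice;

static void numa_node_init(NumaNode *n, int id)
{
    n->id = id;
    n->dimmbus.node = id;
}

/* The node is derived from the bus the DIMM was attached to,
 * making an explicit 'node' property on the device redundant. */
static int dimm_get_node(const DimmDevice *d)
{
    return d->bus->node;
}
```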

PS:
we need a similar QOM hierarchy for CPUs as well, to sort out the
-numa cpus=ids mess.

> 
> Again, this is the same as drive_add/device_add.
> 
> Paolo
> 
> > allocated, free etc. I think slot should be either free or busy.
> > 
> > 
> >>
> >> In general, I don't think free slots should be managed by the DimmBus,
> >> and host vs. guest separation should be there even if we accept your
> >> "-m" extension (doesn't look bad at all, I must say).
> >>
> >> Paolo
> > 
>
Paolo Bonzini July 24, 2013, 12:41 p.m. UTC | #5
Il 24/07/2013 13:34, Igor Mammedov ha scritto:
> On Wed, 24 Jul 2013 11:41:04 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Il 24/07/2013 10:36, Igor Mammedov ha scritto:
>>> On Tue, 23 Jul 2013 19:09:26 +0200
>>> Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>> Il 23/07/2013 18:23, Igor Mammedov ha scritto:
>>>>> - if slot property is not specified on -device/device_add command,
>>>>> treat default value as request for assigning DimmDevice to
>>>>> the first free slot.
>>>>
>>>> Even with "-m" instead of "-numa mem", I think this is problematic
>>>> because we still need to separate the host and guest parts of the DIMM
>>>> device.  "-numa mem" (or the QMP command that Wanlong added) will be
>>>> necessary to allocate memory on the host side before adding a DIMM.
>>> why not do host allocation part at the same time when DIMM is added, is
>>> there a real need to separate DIMM device?
>>>
>>> I probably miss something but -numa mem option and co aside what problem
>>> couldn't be solved during DIMM device initialization and would require
>>> a split DIMM device?
>>
>> Because otherwise, every option we add to "-numa mem" will have to be
>> added to "-device dimm".  For example,
>>
>>    -device dimm,policy=interleave
> if it's feature of DIMM device sure, if it is not lets find a better
> place for it. See below for an alternative approach.
> 
>>
>> makes no sense to me.
>>
>> In fact, this is no different from having to do drive_add or netdev_add
>> before device_add.  First you tell QEMU about the host resources to use,
>> then you add the guest device and bind the device to those resources.
>>
>>>> So slots will have three states: free (created with "-m"), allocated (a
>>>> free slot moves to this state with "-numa mem...,populated=no" when
>>>> migrating, or with the QMP command for regular hotplug), populated (an
>>>> allocated slot moves to this state with "-device dimm").
>>>>
>>>> You would be able to plug a DIMM only into an allocated slot, and the
>>>> size will be specified on the slot rather than the DIMM device.
>>> 'slot' property is there only for migration sake to provide stable
>>> numeric ID for QEMU<->ACPI BIOS interface. It's not used for any other
>>> purpose and wasn't intended for any other usage..
>>
>> How would you otherwise refer to the memory you want to affect in a
>> set-mem-policy monitor command?
> could be 'id' property or even better a QOM path
> 
>>
>>> on baremetal slot has noting to do with size of plugged in DIMM,
>>
>> On baremetal slots also belong to a specific NUMA node, for what it's
>> worth.  There are going to be differences with baremetal no matter what.
> sure we can deviate here, but I don't see full picture yet so I'm trying
> to find justification for it first and asking questions. Maybe a better
> solution will be found.
> 
>>
>>> why we
>>> would model it other way if it only brings problems: like predefined size,
>>
>> It doesn't have to be predefined.  In the previous discussions (and also
>> based on Vasilis and Hu Tao's implementations) I assumed predefined slot
>> sizes.  Now I understand the benefit of having a simpler command-line
>> with "-m", but then in return you need three slot states instead of just
>> unpopulated/populated.
>>
>> So you'd just do
>>
>>    set-mem-policy 0 size=2G      # free->allocated
>>    device_add dimm,slotid=0      # allocated->populated
>>
>> to hotplug a 2G DIMM.  And you'll be able to pin it to host NUMA nodes,
>> and assign it to guest NUMA nodes, like this:
>>
>>    set-mem-policy 0 size=2G,nodeid=1,policy=membind host-nodes=0-1
>>    device_add dimm,slotid=0
> Do policy and other -numa mem properties belong to a particular DIMM device
> or rather to a particular NUMA node?
> 
> How about following idea: guest-node maps to a specific host-node, then
> when we plug DIMM, guest node provides information on policies and whatever
> to the creator of DIMM device (via DimmBus and/or mhc) which allocates
> memory, applies policies and binds new memory to a specific host node.
> That would eliminate 2 stage approach.

It makes sense.  My main worry is not to deviate from what we've been
doing for drives and netdevs (because that's a proven design).  Both
"-numa mem" and this proposal satisfy that goal.

I originally proposed "-numa mem" because Vasilis and Hu's patches were
relying on specifying predefined sizes for all slots.  So "-numa mem"
was a good fit for both memory hotplug (done Hu's way) and NUMA policy.
It also simplified the command line, which had a lot of "mem-" prefixed
options.

With the approach you suggest it may not be necessary at all, and we can
go back to just "-numa
node,cpus=0,mem=1G,mem-policy=membind,mem-hostnodes=0-1,cpu-hostnodes=0"
or something like that.

Whether it is workable depends on what granularity Wanlong/Hu want.

There may be some scenarios where per-slot policies make sense.  For
example, imagine that in general you want memory to be bound to the
corresponding host node.  It turns out some nodes are now fully
committed and others are free, and you need more memory on a VM.  You
can hotplug that memory without really caring about binding and
momentarily suffer some performance loss.

I agree that specifying the policy on every hotplug complicates
management and may be overkill.  But then, most guests are not NUMA at
all and you would hardly perceive the difference, you would just have to
separate

    set-mem-policy 0 size=2G
    device_add dimm,slot=0

instead of

    device_add dimm,slot,size=2G

which is not a big chore.

> in this case DIMM device only needs to specify where it's plugged in, using
> 'node' property (now number but could become QOM path to NUMA node object).

Yeah, then it's the same as the id.

Paolo

> Ideally it would be QOM hierarchy:
> 
> /nodeX/@dimmbus/dimm_device
> where even 'node' property would become obsolete, just specify right
> bus to attach DIMM device to.
> 
> PS:
> we need a similar QOM hierarchy for CPUs as well to sort out
> -numa cpus=ids mess.
> 
>>
>> Again, this is the same as drive_add/device_add.
>>
>> Paolo
>>
>>> allocated, free etc. I think slot should be either free or busy.
>>>
>>>
>>>>
>>>> In general, I don't think free slots should be managed by the DimmBus,
>>>> and host vs. guest separation should be there even if we accept your
>>>> "-m" extension (doesn't look bad at all, I must say).
>>>>
>>>> Paolo
>>>
>>
>
Igor Mammedov July 26, 2013, 7:38 a.m. UTC | #6
On Wed, 24 Jul 2013 14:41:36 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 24/07/2013 13:34, Igor Mammedov ha scritto:
> > On Wed, 24 Jul 2013 11:41:04 +0200
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> >> Il 24/07/2013 10:36, Igor Mammedov ha scritto:
> >>> On Tue, 23 Jul 2013 19:09:26 +0200
> >>> Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>
> >>>> Il 23/07/2013 18:23, Igor Mammedov ha scritto:
> >>>>> - if slot property is not specified on -device/device_add command,
> >>>>> treat default value as request for assigning DimmDevice to
> >>>>> the first free slot.
> >>>>
> >>>> Even with "-m" instead of "-numa mem", I think this is problematic
> >>>> because we still need to separate the host and guest parts of the DIMM
> >>>> device.  "-numa mem" (or the QMP command that Wanlong added) will be
> >>>> necessary to allocate memory on the host side before adding a DIMM.
> >>> why not do host allocation part at the same time when DIMM is added, is
> >>> there a real need to separate DIMM device?
> >>>
> >>> I probably miss something but -numa mem option and co aside what problem
> >>> couldn't be solved during DIMM device initialization and would require
> >>> a split DIMM device?
> >>
> >> Because otherwise, every option we add to "-numa mem" will have to be
> >> added to "-device dimm".  For example,
> >>
> >>    -device dimm,policy=interleave
> > if it's feature of DIMM device sure, if it is not lets find a better
> > place for it. See below for an alternative approach.
> > 
> >>
> >> makes no sense to me.
> >>
> >> In fact, this is no different from having to do drive_add or netdev_add
> >> before device_add.  First you tell QEMU about the host resources to use,
> >> then you add the guest device and bind the device to those resources.
> >>
> >>>> So slots will have three states: free (created with "-m"), allocated (a
> >>>> free slot moves to this state with "-numa mem...,populated=no" when
> >>>> migrating, or with the QMP command for regular hotplug), populated (an
> >>>> allocated slot moves to this state with "-device dimm").
> >>>>
> >>>> You would be able to plug a DIMM only into an allocated slot, and the
> >>>> size will be specified on the slot rather than the DIMM device.
> >>> 'slot' property is there only for migration sake to provide stable
> >>> numeric ID for QEMU<->ACPI BIOS interface. It's not used for any other
> >>> purpose and wasn't intended for any other usage..
> >>
> >> How would you otherwise refer to the memory you want to affect in a
> >> set-mem-policy monitor command?
> > could be 'id' property or even better a QOM path
> > 
> >>
> >>> on baremetal slot has noting to do with size of plugged in DIMM,
> >>
> >> On baremetal slots also belong to a specific NUMA node, for what it's
> >> worth.  There are going to be differences with baremetal no matter what.
> > sure we can deviate here, but I don't see full picture yet so I'm trying
> > to find justification for it first and asking questions. Maybe a better
> > solution will be found.
> > 
> >>
> >>> why we
> >>> would model it other way if it only brings problems: like predefined size,
> >>
> >> It doesn't have to be predefined.  In the previous discussions (and also
> >> based on Vasilis and Hu Tao's implementations) I assumed predefined slot
> >> sizes.  Now I understand the benefit of having a simpler command-line
> >> with "-m", but then in return you need three slot states instead of just
> >> unpopulated/populated.
> >>
> >> So you'd just do
> >>
> >>    set-mem-policy 0 size=2G      # free->allocated
> >>    device_add dimm,slotid=0      # allocated->populated
> >>
> >> to hotplug a 2G DIMM.  And you'll be able to pin it to host NUMA nodes,
> >> and assign it to guest NUMA nodes, like this:
> >>
> >>    set-mem-policy 0 size=2G,nodeid=1,policy=membind host-nodes=0-1
> >>    device_add dimm,slotid=0
> > Do policy and other -numa mem properties belong to a particular DIMM device
> > or rather to a particular NUMA node?
> > 
> > How about following idea: guest-node maps to a specific host-node, then
> > when we plug DIMM, guest node provides information on policies and whatever
> > to the creator of DIMM device (via DimmBus and/or mhc) which allocates
> > memory, applies policies and binds new memory to a specific host node.
> > That would eliminate 2 stage approach.
> 
> It makes sense.  My main worry is not to deviate from what we've been
> doing for drives and netdevs (because that's a proven design).  Both
> "-numa mem" and this proposal satisfy that goal.
> 
> I originally proposed "-numa mem" because Vasilis and Hu's patches were
> relying on specifying predefined sizes for all slots.  So "-numa mem"
> was a good fit for both memory hotplug (done Hu's way) and NUMA policy.
>  It also simplified the command line which had a lot of "mem-" prefixed
> options.
> 
> With the approach you suggest it may not be necessary at all, and we can
> go back to just "-numa
> node,cpus=0,mem=1G,mem-policy=membind,mem-hostnodes=0-1,cpu-hostnodes=0"
> or something like that.
> 
> Whether it is workable, it depends on what granularity Wanlong/Hu want.
> 
> There may be some scenarios where per-slot policies make sense.  For
> example, imagine that in general you want memory to be bound to the
> corresponding host node.  It turns out some nodes are now fully
> committed and others are free, and you need more memory on a VM.  You
> can hotplug that memory without really caring about binding and
> momentarily suffer some performance loss.
Perhaps denying the memory add and suggesting migration to a node with more
memory would be the right approach; otherwise the user is bound to be hit by
the cross-node penalty.

> 
> I agree that specifying the policy on every hotplug complicates
> management and may be overkill.  But then, most guests are not NUMA at
> all and you would hardly perceive the difference, you would just have to
> separate
> 
>     set-mem-policy 0 size=2G
>     device_add dimm,slot=0
I'm confused: size is an inherent property of the generic DimmDevice, while
policies are NUMA specifics of the node.
If there is a need for per-DIMM policies, I'd store the NUMA-specific policy
inside the object that implements it (allocates memory), and apply it to the
DIMM when it's realized.

    set-mem-policy nodeid=X,mem-hostnodes=Z,dev=dimmY
    device_add dimm,id=dimmY,size=2G,node=X

> instead of
> 
>     device_add dimm,slot,size=2G
> 
> which is not a big chore.
> 
> > in this case DIMM device only needs to specify where it's plugged in, using
> > 'node' property (now number but could become QOM path to NUMA node object).
> 
> Yeah, then it's the same as the id.
> 
> Paolo
> 
> > Ideally it would be QOM hierarchy:
> > 
> > /nodeX/@dimmbus/dimm_device
> > where even 'node' property would become obsolete, just specify right
> > bus to attach DIMM device to.
> > 
> > PS:
> > we need a similar QOM hierarchy for CPUs as well to sort out
> > -numa cpus=ids mess.
> > 
> >>
> >> Again, this is the same as drive_add/device_add.
> >>
> >> Paolo
> >>
> >>> allocated, free etc. I think slot should be either free or busy.
> >>>
> >>>
> >>>>
> >>>> In general, I don't think free slots should be managed by the DimmBus,
> >>>> and host vs. guest separation should be there even if we accept your
> >>>> "-m" extension (doesn't look bad at all, I must say).
> >>>>
> >>>> Paolo
> >>>
> >>
> > 
> 
>
Paolo Bonzini July 26, 2013, 9:26 a.m. UTC | #7
Il 26/07/2013 09:38, Igor Mammedov ha scritto:
> Perhaps denying memory add and suggesting node migration to a node with
> more memory would be right approach, otherwise user is bound to be hit by
> cross node penalty.

Or better, the user can first change the policy from "bind" to
"preferred", and then hotplug.

>> > I agree that specifying the policy on every hotplug complicates
>> > management and may be overkill.  But then, most guests are not NUMA at
>> > all and you would hardly perceive the difference, you would just have to
>> > separate
>> > 
>> >     set-mem-policy 0 size=2G
>> >     device_add dimm,slot=0
> I'm confused, size is inherent property of generic DimmDevice and policies
> are NUMA specific of node.

No, the size is not a property of the DimmDevice, it is a property of
the host object that backs the DimmDevice.  This is like the size is not
a property of a disk, but rather of the image that backs it.

Now, there may be a good reason to remove the host/guest distinction in
the case of memory, but I'm still not sure this is the case.

In the case of NUMA policies, I talked to Andrea Arcangeli and indeed
his suggestion was to have a single policy per guest node (typically
bind or preferred if guest node size <= host node size, interleave if
guest node size > host node size).

However, there are other properties of the memory object (e.g. hugetlbfs
path) that could be customized for every slot.

Paolo

> If there is need for per DIMM policies, I'd store NUMA specific policy inside
> object that implements it (allocates memory), and apply them to DIMM when
> it's realized.
> 
>     set-mem-policy nodeid=X,mem-hostnodes=Z,dev=dimmY
>     device_add dimm,id=dimmY,size=2G,node=X
>
Igor Mammedov July 26, 2013, 12:51 p.m. UTC | #8
On Fri, 26 Jul 2013 11:26:16 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 26/07/2013 09:38, Igor Mammedov ha scritto:
> > Perhaps denying memory add and suggesting node migration to a node with
> > more memory would be right approach, otherwise user is bound to be hit by
> > cross node penalty.
> 
> Or better, the user can first change the policy from "bind" to
> "preferred", and then hotplug.
> 
> >> > I agree that specifying the policy on every hotplug complicates
> >> > management and may be overkill.  But then, most guests are not NUMA at
> >> > all and you would hardly perceive the difference, you would just have to
> >> > separate
> >> > 
> >> >     set-mem-policy 0 size=2G
> >> >     device_add dimm,slot=0
> > I'm confused, size is inherent property of generic DimmDevice and policies
> > are NUMA specific of node.
> 
> No, the size is not a property of the DimmDevice, it is a property of
> the host object that backs the DimmDevice.  This is like the size is not
> a property of a disk, but rather of the image that backs it.
> 
> Now, there may be a good reason to remove the host/guest distinction in
> the case of memory, but I'm still not sure this is the case.
I don't like the split model in this case because it complicates the
interface and confuses the user. On bare metal, when the user adds a DIMM,
it definitely has a size property. So when the user adds a DIMM device like:

     set-mem-policy 0 size=2G,somehostprop=y
     device_add dimm,slot=0

it's not very clear/intuitive what 'size' relates to. On the contrary:

     set-mem-policy 0 somehostprop=y
     device_add dimm,slot=0,size=2G

clearly says that we are adding a 2G DIMM, and the allocator of the memory
that backs the DIMM takes care of host-specific details, isolating them from
the DIMM device model (which resembles the bare-metal one as much as possible).

> In the case of NUMA policies, I talked to Andrea Arcangeli and indeed
> his suggestion was to have a single policy per guest node (typically
> bind or preferred if guest node size <= host node size, interleave if
> guest node size > host node size).
With the current implementation we could inherit DimmBusNumaAware from DimmBus
and replace the allocator to do that, without touching the DIMM device model
at all.

> 
> However, there are other properties of the memory object (e.g. hugetlbfs
> path) that could be customized for every slot.
Why not keep this mapping in the object that allocates memory (the DimmBus)
and allow it to apply them to the allocated memory when the DIMM device is
being added?

It would still be a split model, but the DimmDevice interface and
implementation would stay stable and the same (i.e. device_add
dimm,id,size,slot,start,node) whether we use per-DIMM host-specific
policies, NUMA node policies, or don't care about them at all.

> Paolo
> 
> > If there is need for per DIMM policies, I'd store NUMA specific policy inside
> > object that implements it (allocates memory), and apply them to DIMM when
> > it's realized.
> > 
> >     set-mem-policy nodeid=X,mem-hostnodes=Z,dev=dimmY
> >     device_add dimm,id=dimmY,size=2G,node=X
> > 
> 
>
Paolo Bonzini July 26, 2013, 2:37 p.m. UTC | #9
Il 26/07/2013 14:51, Igor Mammedov ha scritto:
> On Fri, 26 Jul 2013 11:26:16 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Il 26/07/2013 09:38, Igor Mammedov ha scritto:
>>> Perhaps denying memory add and suggesting node migration to a node with
>>> more memory would be right approach, otherwise user is bound to be hit by
>>> cross node penalty.
>>
>> Or better, the user can first change the policy from "bind" to
>> "preferred", and then hotplug.
>>
>>>>> I agree that specifying the policy on every hotplug complicates
>>>>> management and may be overkill.  But then, most guests are not NUMA at
>>>>> all and you would hardly perceive the difference, you would just have to
>>>>> separate
>>>>>
>>>>>     set-mem-policy 0 size=2G
>>>>>     device_add dimm,slot=0
>>> I'm confused, size is inherent property of generic DimmDevice and policies
>>> are NUMA specific of node.
>>
>> No, the size is not a property of the DimmDevice, it is a property of
>> the host object that backs the DimmDevice.  This is like the size is not
>> a property of a disk, but rather of the image that backs it.
>>
>> Now, there may be a good reason to remove the host/guest distinction in
>> the case of memory, but I'm still not sure this is the case.
> I don't like split model in this case because it's complicates interface
> and confuses user. On bare-metal when user adds DIMM, it definitely has
> size property. So when user adds DIMM device like:
> 
>      set-mem-policy 0 size=2G,somehostprop=y
>      device_add dimm,slot=0
> 
> it's not very clear/intuitive to what 'size' relates to. On contrary:
> 
>      set-mem-policy 0 somehostprop=y
>      device_add dimm,slot=0,size=2G
> 
> clearly says that we are adding 2Gb DIMM and allocator of memory that
> bakes up DIMM, takes care about host specific details isolating them
> from DIMM device model (which resembles baremetal one as much as possible).

How is this different from

    drive-add id=foo,aio=native
    device-add virtio-block,drive=foo,file=/vm/foo.img

We clearly do not do that, we put the file in the drive-add.

>> In the case of NUMA policies, I talked to Andrea Arcangeli and indeed
>> his suggestion was to have a single policy per guest node (typically
>> bind or preferred if guest node size <= host node size, interleave if
>> guest node size > host node size).
> with current implementation we could inherit DimmBusNumaAware from DimmBus,
> and replace allocator to do that, not touching DIMM device model at all.

The guest bus should surely not be aware of NUMA policies in the host.

Remember here the guest may have just one node, but you want it bound to
one particular node of the host.

>> However, there are other properties of the memory object (e.g. hugetlbfs
>> path) that could be customized for every slot.
> 
> Why not keep this mapping in object that allocates memory (DimmBus) and
> allow it apply them to allocated memory when DIMM device is being added?

It's not DimmBus that allocates memory.  Allocating memory is a host
action, and even if it is triggered by DimmBus, it is not _done_ by DimmBus.

> It still would be a split model but DimmDevice interface and implementation
> would stay stable and the same (i.e. device_add dimm,id,size,slot,start,node)
> whether we use per DIMM host specific policies, NUMA node policies or not care
> about them at all.

I'm not sure how.  Clean separation of host vs. guest properties and
actions is the only thing that makes DimmDevice stable.

In fact, the same separation is present in the real world as well.  The
DimmDevice is not the physical memory chips, it is just a guest-visible
representation of it.  Size is a property of the physical memory chips.

Paolo
Andreas Färber Aug. 3, 2013, 1:56 p.m. UTC | #10
Am 26.07.2013 16:37, schrieb Paolo Bonzini:
> Il 26/07/2013 14:51, Igor Mammedov ha scritto:
>> On Fri, 26 Jul 2013 11:26:16 +0200
>> Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 26/07/2013 09:38, Igor Mammedov ha scritto:
>>>>>> I agree that specifying the policy on every hotplug complicates
>>>>>> management and may be overkill.  But then, most guests are not NUMA at
>>>>>> all and you would hardly perceive the difference, you would just have to
>>>>>> separate
>>>>>>
>>>>>>     set-mem-policy 0 size=2G
>>>>>>     device_add dimm,slot=0
>>>> I'm confused, size is inherent property of generic DimmDevice and policies
>>>> are NUMA specific of node.
>>>
>>> No, the size is not a property of the DimmDevice, it is a property of
>>> the host object that backs the DimmDevice.  This is like the size is not
>>> a property of a disk, but rather of the image that backs it.
>>>
>>> Now, there may be a good reason to remove the host/guest distinction in
>>> the case of memory, but I'm still not sure this is the case.
>> I don't like split model in this case because it's complicates interface
>> and confuses user. On bare-metal when user adds DIMM, it definitely has
>> size property. So when user adds DIMM device like:
>>
>>      set-mem-policy 0 size=2G,somehostprop=y
>>      device_add dimm,slot=0
>>
>> it's not very clear/intuitive to what 'size' relates to. On contrary:
>>
>>      set-mem-policy 0 somehostprop=y
>>      device_add dimm,slot=0,size=2G
>>
>> clearly says that we are adding 2Gb DIMM and allocator of memory that
>> bakes up DIMM, takes care about host specific details isolating them
>> from DIMM device model (which resembles baremetal one as much as possible).
> 
> How is this different from
> 
>     drive-add id=foo,aio=native
>     device-add virtio-block,drive=foo,file=/vm/foo.img
> 
> We clearly do not do that, we put the file in the drive-add.

The difference is that a user can understand drive-add, whereas
set-mem-policy in this use case is really hard to grasp as a prereq. If
it were
  memory-add id=foo,size=2G
  device-add dimm,slot=0,mem=foo
that may be different, but that is unhandy implementation-wise because
we assume initial RAM to be in one chunk and want to partition it into
guest NUMA nodes IIUC. Or has that changed?

I don't care if we name the device DIMM or MemoryBar; the point is it should
represent the physical thing one plugs into a physical board, to match
what admins do with physical servers. And that physical bar has a size,
as Igor points out. It should not just represent some virtual
hot-plugged policy thing.

The drive is separate from the device because we can exchange the CD
media while leaving the device connected to ATA/AHCI/SCSI (and it allows
us to keep file, cache, etc. properties in a central place). Admins buy
servers/boards and memory bars; they won't solder chips onto them nor
change the NUMA configuration of the board while running, so we should
consider it immutable once created. If we unplug physical DIMMs, the
data can be considered lost (ignoring deep cooling tricks etc.), so no
point to transfer memory data from one DIMM to another.

Would we seriously want to exchange the memory a DIMM is backed by while
leaving the DIMM plugged? Rather the entity that owns the DIMM slots (as
opposed to DIMMs) has the guest NUMA policy configured for the
unpopulated slots. The slot has a maximum size and the DIMM has a size
less or equal to slot's maximum size.

I just wonder whether we need a DimmBus at all? Because if we get the
slot specified as in your examples then we could just set some dimm[n]
link<DIMM> on realize (question is where). We had a discussion once
about a missing callback when a link property is set to a new value;
not sure if that has been resolved in the meantime?
Can't think of a situation where we would have multiple DimmBus'es, only
some cases where it's on a SoC or SoM and not directly on the mainboard,
i.e. not /machine/dimm[n].
Code seems to be copying ICC bus, but ICC bus was added because APIC
needed a hot-pluggable bus to replace SysBus.

Hadn't Anthony and Ping Fang factored out a memory controller on the
i440fx? Patch 9 seems to add the bus directly to the i440fx PCI device.

Patch 13 seems to use a Notifier for PIIX4 ACPI rather than having the
bus handle hotplug itself and bus / memory controller communicating with
ACPI as necessary (layering violation). For the CPU we initially had no
bus at all to handle such event (we still don't outside x86).

What I am not finding in this series is a translation from -m to initial
non-hotplugged DIMMs?

Some random other remarks on the series:
* s/"dimmbus"/"dimm-bus"/g
* s/klass/oc/g -- there were loud protests against k* (class reserved)
* gtk-doc markup for parent fields is missing in header
* s/qdev/parent_obj/, s/qbus/parent_obj/ -- don't invite to use them
* s/dc/dbc/g -- dc is for DeviceClass
* DimmBusClass::register_memory() is never overridden? In that case it
could well be just a static function called from the realizefn - the only
distinction I can think of is how to allocate the MemoryRegion. A
caller-settable DimmBus::slot_plugged() for i440fx to notify piix4 would
seem to solve the issue of the Notifier elegantly. PCIBus has such hooks.

Regards,
Andreas
Markus Armbruster Aug. 6, 2013, 7:13 a.m. UTC | #11
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 26/07/2013 14:51, Igor Mammedov ha scritto:
>> On Fri, 26 Jul 2013 11:26:16 +0200
>> Paolo Bonzini <pbonzini@redhat.com> wrote:
>> 
>>> Il 26/07/2013 09:38, Igor Mammedov ha scritto:
>>>> Perhaps denying memory add and suggesting node migration to a node with
>>>> more memory would be right approach, otherwise user is bound to be hit by
>>>> cross node penalty.
>>>
>>> Or better, the user can first change the policy from "bind" to
>>> "preferred", and then hotplug.
>>>
>>>>>> I agree that specifying the policy on every hotplug complicates
>>>>>> management and may be overkill.  But then, most guests are not NUMA at
>>>>>> all and you would hardly perceive the difference, you would just have to
>>>>>> separate
>>>>>>
>>>>>>     set-mem-policy 0 size=2G
>>>>>>     device_add dimm,slot=0
>>>> I'm confused, size is inherent property of generic DimmDevice and policies
>>>> are NUMA specific of node.
>>>
>>> No, the size is not a property of the DimmDevice, it is a property of
>>> the host object that backs the DimmDevice.  This is like the size is not
>>> a property of a disk, but rather of the image that backs it.

If you excuse my pedantry: size *is* a property of the block device
model (guest part, a.k.a. frontend).  We just choose to derive it from
the image size (host part, a.k.a. backend) instead of making it
separately configurable.

If we wanted to make it separately configurable (overriding the implicit
image size), I'd certainly argue to make it a device model property.

>>> Now, there may be a good reason to remove the host/guest distinction in
>>> the case of memory, but I'm still not sure this is the case.
>> I don't like split model in this case because it's complicates interface
>> and confuses user. On bare-metal when user adds DIMM, it definitely has
>> size property. So when user adds DIMM device like:
>> 
>>      set-mem-policy 0 size=2G,somehostprop=y
>>      device_add dimm,slot=0
>> 
>> it's not very clear/intuitive to what 'size' relates to. On contrary:
>> 
>>      set-mem-policy 0 somehostprop=y
>>      device_add dimm,slot=0,size=2G
>> 
>> clearly says that we are adding 2Gb DIMM and allocator of memory that
>> bakes up DIMM, takes care about host specific details isolating them
>> from DIMM device model (which resembles baremetal one as much as possible).
>
> How is this different from
>
>     drive-add id=foo,aio=native
>     device-add virtio-block,drive=foo,file=/vm/foo.img
>
> We clearly do not do that, we put the file in the drive-add.

Yes, because the image file is a backend property, not a frontend
property.

>>> In the case of NUMA policies, I talked to Andrea Arcangeli and indeed
>>> his suggestion was to have a single policy per guest node (typically
>>> bind or preferred if guest node size <= host node size, interleave if
>>> guest node size > host node size).
>> with current implementation we could inherit DimmBusNumaAware from DimmBus,
>> and replace allocator to do that, not touching DIMM device model at all.
>
> The guest bus should surely not be aware of NUMA policies in the host.
>
> Remember here the guest may have just one node, but you want it bound to
> one particular node of the host.
>
>>> However, there are other properties of the memory object (e.g. hugetlbfs
>>> path) that could be customized for every slot.
>> 
>> Why not keep this mapping in object that allocates memory (DimmBus) and
>> allow it apply them to allocated memory when DIMM device is being added?
>
> It's not DimmBus that allocates memory.  Allocating memory is a host
> action, and even if it is triggered by DimmBus, it is not _done_ by DimmBus.
>
>> It still would be a split model but DimmDevice interface and implementation
>> would stay stable and the same (i.e. device_add dimm,id,size,slot,start,node)
>> whether we use per DIMM host specific policies, NUMA node policies or not care
>> about them at all.
>
> I'm not sure how.  Clean separation of host vs. guest properties and
> actions is the only thing that makes DimmDevice stable.
>
> In fact, the same separation is present in the real world as well.  The
> DimmDevice is not the physical memory chips, it is just a guest-visible
> representation of it.  Size is a property of the physical memory chips.

Without fully understanding the issues in this specific case: clean
separation of host vs. guest matters.  Old code didn't care, and we
spent a lot of sweat on cleaning it up, dancing to the merry backward
compatibility polka all the way.
Igor Mammedov Sept. 11, 2013, 3:12 p.m. UTC | #12
On Sat, 03 Aug 2013 15:56:50 +0200
Andreas Färber <afaerber@suse.de> wrote:

> Am 26.07.2013 16:37, schrieb Paolo Bonzini:
> > Il 26/07/2013 14:51, Igor Mammedov ha scritto:
> >> On Fri, 26 Jul 2013 11:26:16 +0200
> >> Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> Il 26/07/2013 09:38, Igor Mammedov ha scritto:
> >>>>>> I agree that specifying the policy on every hotplug complicates
> >>>>>> management and may be overkill.  But then, most guests are not NUMA at
> >>>>>> all and you would hardly perceive the difference, you would just have to
> >>>>>> separate
> >>>>>>
> >>>>>>     set-mem-policy 0 size=2G
> >>>>>>     device_add dimm,slot=0
> >>>> I'm confused, size is inherent property of generic DimmDevice and policies
> >>>> are NUMA specific of node.
> >>>
> >>> No, the size is not a property of the DimmDevice, it is a property of
> >>> the host object that backs the DimmDevice.  This is like the size is not
> >>> a property of a disk, but rather of the image that backs it.
> >>>
> >>> Now, there may be a good reason to remove the host/guest distinction in
> >>> the case of memory, but I'm still not sure this is the case.
> >> I don't like split model in this case because it's complicates interface
> >> and confuses user. On bare-metal when user adds DIMM, it definitely has
> >> size property. So when user adds DIMM device like:
> >>
> >>      set-mem-policy 0 size=2G,somehostprop=y
> >>      device_add dimm,slot=0
> >>
> >> it's not very clear/intuitive to what 'size' relates to. On contrary:
> >>
> >>      set-mem-policy 0 somehostprop=y
> >>      device_add dimm,slot=0,size=2G
> >>
> >> clearly says that we are adding 2Gb DIMM and allocator of memory that
> >> bakes up DIMM, takes care about host specific details isolating them
> >> from DIMM device model (which resembles baremetal one as much as possible).
> > 
> > How is this different from
> > 
> >     drive-add id=foo,aio=native
> >     device-add virtio-block,drive=foo,file=/vm/foo.img
> > 
> > We clearly do not do that, we put the file in the drive-add.
> 
> The difference is that a user can understand drive-add, whereas
> set-mem-policy in this use case is really hard to grasp as a prereq. If
> it were
>   memory-add id=foo,size=2G
>   device-add dimm,slot=0,mem=foo
> that may be different, but that is unhandy implementation-wise because
> we assume initial RAM to be in one chunk and want to partition it into
> guest NUMA nodes IIUC. Or has that changed?
> 
> I don't care if we name the device DIMM or MemoryBar, point is it should
> represent the physical thing one plugs into a physical board, to match
> what admins do with physical servers. And that physical bar has a size,
> as Igor points out. It should not just represent some virtual
> hot-plugged policy thing.
> 
> The drive is separate from the device because we can exchange the CD
> media while leaving the device connected to ATA/AHCI/SCSI (and it allows
> us to keep file, cache, etc. properties in a central place). Admins buy
> servers/boards and memory bars, they won't solder chips onto it nor
> change the NUMA configuration of the board while running, so we should
> consider it immutable once created. If we unplug physical DIMMs, the
> data can be considered lost (ignoring deep cooling tricks etc.), so no
> point to transfer memory data from one DIMM to another.
> 
> Would we seriously want to exchange the memory a DIMM is backed by while
> leaving the DIMM plugged? Rather the entity that owns the DIMM slots (as
> opposed to DIMMs) has the guest NUMA policy configured for the
> unpopulated slots. The slot has a maximum size and the DIMM has a size
> less or equal to slot's maximum size.
Maximum size per slot is a hardware limitation; we can ignore it in the virtual
hardware case. I've added the 'slot' option only for the migration case, so we
could replicate the environment on the receiving side (it's the interface
connecting QEMU and the ACPI objects in the guest).
The thing is, even if a slot were an object providing NUMA policies etc.,
I'd like to keep it dynamic, specifying such properties at runtime rather
than at startup time, to keep it flexible and avoid unnecessary limitations,
which leads again to backend/frontend separation.

If DimmDevice is separated into backend/frontend parts, then for simplicity
the 'size' option could go to the memory-add command as well.

So, taking the previous feedback into account, would the following interface be acceptable?

memory-add id=foo,size=x,numa-node=y
device-add dimm,slot=z,mem=foo

memory-add would:
 1. allocate host memory
 2. apply policy for a specified node

device-add would create a DimmDevice which:
 1. will own the memory region
 2. will have properties:
     slot - addresses a specific ACPI memory device through the
            QEMU/ACPI OSPM layers [default: auto-selected]
     start - address where the memory is mapped [default: auto-selected]
 3. will serve as a proxy object for the back-end size and proximity properties

> I just wonder whether we need a DimmBus at all? Because if we get the
> slot specified as in your examples then we could just set some dimm[n]
> link<DIMM> on realize (question is where). We had a discussion once
> about a missing callback when a link property is set to a new value, not
> sure if that has been resolved meantime?
> Can't think of a situation where we would have multiple DimmBus'es, only
> some cases where it's on a SoC or SoM and not directly on the mainboard,
> i.e. not /machine/dimm[n].
> Code seems to be copying ICC bus, but ICC bus was added because APIC
> needed a hot-pluggable bus to replace SysBus.
The bus here is not only a source of implicit links but also a gateway that
provides the address space for mapping DimmDevice-s in an acceptable way.

I've tried /machine/link<dimm> looking for a simpler solution, but the result
was more hackish, so I dropped the idea and returned to the original bus
approach with its established usage pattern.

> 
> Hadn't Anthony and Ping Fang factored out a memory controller on the
> i440fx? Patch 9 seems to add the bus directly to the i440fx PCI device.
Not upstream so far; there was an ignored RFC by Hu Tao:
[http://lists.nongnu.org/archive/html/qemu-devel/2013-06/msg03500.html]

This series provides memory hotplug in the least invasive way, and it would be
trivial to move to a factored-out memory controller once that is upstream.

> Patch 13 seems to use a Notifier for PIIX4 ACPI rather than having the
> bus handle hotplug itself and bus / memory controller communicating with
> ACPI as necessary (layering violation). For the CPU we initially had no
> bus at all to handle such event (we still don't outside x86).
I'll look at it.

> 
> What I am not finding in this series is a translation from -m to initial
> non-hotplugged DIMMs?
It wasn't a goal of this series, since it would require the memory controller
to be factored out first; I didn't want to increase the mess in the current code.

> 
> Some random other remarks on the series:
> * s/"dimmbus"/"dimm-bus"/g
> * s/klass/oc/g -- there were loud protests against k* (class reserved)
> * gtk-doc markup for parent fields is missing in header
> * s/qdev/parent_obj/, s/qbus/parent_obj/ -- don't invite to use them
> * s/dc/dbc/g -- dc is for DeviceClass
> * DimmBusClass::register_memory() is never overwritten? In that case it
> could well be just a static function called from realizefn - only
It's not much over-engineering, and it keeps things object-oriented.

> disctinction I can think of is how to allocate MemoryRegion. A
> caller-settable DimmBus::slot_plugged() for i440fx to notify piix4 would
> seem to solve the issue of the Notifier elegantly. PCIBus has such hooks.
Thanks for the tip, I'll certainly try it.

> 
> Regards,
> Andreas
>
Patch

diff --git a/hw/mem-hotplug/dimm.c b/hw/mem-hotplug/dimm.c
index 63c2c8e..c716906 100644
--- a/hw/mem-hotplug/dimm.c
+++ b/hw/mem-hotplug/dimm.c
@@ -20,6 +20,7 @@ 
 
 #include "hw/mem-hotplug/dimm.h"
 #include "qemu/config-file.h"
+#include "qemu/bitmap.h"
 
 static void dimm_bus_initfn(Object *obj)
 {
@@ -27,6 +28,45 @@  static void dimm_bus_initfn(Object *obj)
 
     b->allow_hotplug = true;
 }
+
+static int dimm_bus_slot2bitmap(DeviceState *dev, void *opaque)
+{
+    unsigned long *bitmap = opaque;
+    BusClass *bc = BUS_GET_CLASS(qdev_get_parent_bus(dev));
+    DimmDevice *d = DIMM(dev);
+
+    if (dev->realized) { /* count only realized DIMMs */
+        g_assert(d->slot < bc->max_dev);
+        set_bit(d->slot, bitmap);
+    }
+    return 0;
+}
+
+static int dimm_bus_get_free_slot(DimmBus *bus, const int *hint, Error **errp)
+{
+    BusClass *bc = BUS_GET_CLASS(bus);
+    unsigned long *bitmap = bitmap_new(bc->max_dev);
+    int slot = 0;
+
+    qbus_walk_children(BUS(bus), dimm_bus_slot2bitmap, NULL, bitmap);
+
+    /* check if requested slot is not occupied */
+    if (hint) {
+        if (!test_bit(*hint, bitmap)) {
+            slot = *hint;
+        } else {
+            error_setg(errp, "slot %d is busy", *hint);
+        }
+        goto out;
+    }
+
+    /* search for free slot */
+    slot = find_first_zero_bit(bitmap, bc->max_dev);
+out:
+    g_free(bitmap);
+    return slot;
+}
+
 static void dimm_bus_register_memory(DimmBus *bus, DimmDevice *dimm,
                                      Error **errp)
 {
@@ -44,6 +84,7 @@  static void dimm_bus_class_init(ObjectClass *klass, void *data)
         bc->max_dev = qemu_opt_get_number(opts, "slots", 0);
     }
     dc->register_memory = dimm_bus_register_memory;
+    dc->get_free_slot = dimm_bus_get_free_slot;
 }
 
 static const TypeInfo dimm_bus_info = {
@@ -59,7 +100,7 @@  static Property dimm_properties[] = {
     DEFINE_PROP_UINT64("start", DimmDevice, start, 0),
     DEFINE_PROP_SIZE("size", DimmDevice, size, DEFAULT_DIMMSIZE),
     DEFINE_PROP_UINT32("node", DimmDevice, node, 0),
-    DEFINE_PROP_INT32("slot", DimmDevice, slot, 0),
+    DEFINE_PROP_INT32("slot", DimmDevice, slot, -1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -69,6 +110,7 @@  static void dimm_realize(DeviceState *dev, Error **errp)
     DimmBus *bus = DIMM_BUS(qdev_get_parent_bus(dev));
     BusClass *bc = BUS_GET_CLASS(bus);
     DimmBusClass *dc = DIMM_BUS_GET_CLASS(bus);
+    int *slot_hint;
 
     if (!dev->id) {
         error_setg(errp, "missing 'id' property");
@@ -79,6 +121,13 @@  static void dimm_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "maximum allowed slot is: %d", bc->max_dev - 1);
         return;
     }
+    g_assert(dc->get_free_slot);
+    slot_hint = dimm->slot < 0 ? NULL : &dimm->slot;
+    dimm->slot = dc->get_free_slot(bus, slot_hint, errp);
+    if (error_is_set(errp)) {
+        return;
+    }
+
 
     memory_region_init_ram(&dimm->mr, dev->id, dimm->size);
 
diff --git a/include/hw/mem-hotplug/dimm.h b/include/hw/mem-hotplug/dimm.h
index 84d6ba6..d8d11a3 100644
--- a/include/hw/mem-hotplug/dimm.h
+++ b/include/hw/mem-hotplug/dimm.h
@@ -35,6 +35,7 @@ 
  * @size: amount of memory mapped at @start.
  * @node: numa node to which @DimmDevice is attached.
  * @slot: slot number into which @DimmDevice is plugged in.
+ * Default value: -1, means that slot is auto-allocated.
  */
 typedef struct DimmDevice {
     DeviceState qdev;
@@ -69,11 +70,15 @@  typedef struct DimmBus {
 
 /**
  * DimmBusClass:
+ * @get_free_slot: returns a not occupied slot number. If @hint is provided,
+ * it tries to return slot specified by @hint if it's not busy or returns
+ * error in @errp.
  * @register_memory: map @DimmDevice into hot-plugable address space
  */
 typedef struct DimmBusClass {
     BusClass parent_class;
 
+    int (*get_free_slot)(DimmBus *bus, const int *hint, Error **errp);
     void (*register_memory)(DimmBus *bus, DimmDevice *dimm, Error **errp);
 } DimmBusClass;