Patchwork [10/10] qdev: fix create in place obj's life cycle problem

login
register
mail settings
Submitter pingfan liu
Date Aug. 24, 2012, 9:49 a.m.
Message ID <1345801763-24227-11-git-send-email-qemulist@gmail.com>
Download mbox | patch
Permalink /patch/179810/
State New
Headers show

Comments

pingfan liu - Aug. 24, 2012, 9:49 a.m.
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Scene:
  obja lies in objA, when objA's ref->0, it will be freed,
but at that time obja can still be in use.

The real example is:
typedef struct PCIIDEState {
    PCIDevice dev;
    IDEBus bus[2]; --> create in place
    .....
}

When without big lock protection for mmio-dispatch, we will hold
obj's refcnt. So memory_region_init_io() will replace the third para
"void *opaque" with "Object *obj".
With this patch, we can protect PCIIDEState from disappearing during
mmio-dispatch hold the IDEBus->ref.

And the ref circle has been broken when calling qdev_delete_subtree().

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 hw/qdev.c |    2 ++
 hw/qdev.h |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)
Paolo Bonzini - Aug. 24, 2012, 2:42 p.m.
Il 24/08/2012 11:49, Liu Ping Fan ha scritto:
> With this patch, we can protect PCIIDEState from disappearing during
> mmio-dispatch hold the IDEBus->ref.

I don't see why MMIO dispatch should hold the IDEBus ref rather than the
PCIIDEState.

In the case of the PIIX, the BARs are set up by the PCIIDEState in
bmdma_setup_bar (called by bmdma_setup_bar).

Also, containment may happen just as well for devices, not buses.  Why
isn't it a problem in that case?  It looks like you're papering over a
different bug.

Paolo

> And the ref circle has been broken when calling qdev_delete_subtree().
> 
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  hw/qdev.c |    2 ++
>  hw/qdev.h |    1 +
>  2 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/qdev.c b/hw/qdev.c
> index e2339a1..b09ebbf 100644
> --- a/hw/qdev.c
> +++ b/hw/qdev.c
> @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char *typename,
>  {
>      object_initialize(bus, typename);
>  
> +    bus->overlap = parent;
> +    object_ref(OBJECT(bus->overlap));
>      bus->parent = parent;
>      bus->name = name ? g_strdup(name) : NULL;
>      qbus_realize(bus);
> diff --git a/hw/qdev.h b/hw/qdev.h
> index 182cfa5..9bc5783 100644
> --- a/hw/qdev.h
> +++ b/hw/qdev.h
> @@ -117,6 +117,7 @@ struct BusState {
>      int allow_hotplug;
>      bool qom_allocated;
>      bool glib_allocated;
> +    DeviceState *overlap;
>      int max_index;
>      QTAILQ_HEAD(ChildrenHead, BusChild) children;
>      QLIST_ENTRY(BusState) sibling;
> -- 1.7.4.4
pingfan liu - Aug. 25, 2012, 7:42 a.m.
On Fri, Aug 24, 2012 at 10:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 24/08/2012 11:49, Liu Ping Fan ha scritto:
>> With this patch, we can protect PCIIDEState from disappearing during
>> mmio-dispatch hold the IDEBus->ref.
>
> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
> PCIIDEState.
>
When transfer memory_region_init_io()  3rd para from void* opaque to
Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
situation, in order to let MemoryRegionOps tell between them, we
should pass PCIIDEState->bus[0], bus[1] separately.

> In the case of the PIIX, the BARs are set up by the PCIIDEState in
> bmdma_setup_bar (called by bmdma_setup_bar).
>
Supposing we have convert  PCIIDEState->bmdma[0]/[1] to Object. And in
mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not
prevent  PCIIDEState->refcnt=0, and then the whole object disappear!

Thanks and regards,
pingfan


> Also, containment may happen just as well for devices, not buses.  Why
> isn't it a problem in that case?  It looks like you're papering over a
> different bug.
>
> Paolo
>
>> And the ref circle has been broken when calling qdev_delete_subtree().
>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>>  hw/qdev.c |    2 ++
>>  hw/qdev.h |    1 +
>>  2 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/hw/qdev.c b/hw/qdev.c
>> index e2339a1..b09ebbf 100644
>> --- a/hw/qdev.c
>> +++ b/hw/qdev.c
>> @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char *typename,
>>  {
>>      object_initialize(bus, typename);
>>
>> +    bus->overlap = parent;
>> +    object_ref(OBJECT(bus->overlap));
>>      bus->parent = parent;
>>      bus->name = name ? g_strdup(name) : NULL;
>>      qbus_realize(bus);
>> diff --git a/hw/qdev.h b/hw/qdev.h
>> index 182cfa5..9bc5783 100644
>> --- a/hw/qdev.h
>> +++ b/hw/qdev.h
>> @@ -117,6 +117,7 @@ struct BusState {
>>      int allow_hotplug;
>>      bool qom_allocated;
>>      bool glib_allocated;
>> +    DeviceState *overlap;
>>      int max_index;
>>      QTAILQ_HEAD(ChildrenHead, BusChild) children;
>>      QLIST_ENTRY(BusState) sibling;
>> -- 1.7.4.4
>
Paolo Bonzini - Aug. 27, 2012, 7:01 a.m.
Il 25/08/2012 09:42, liu ping fan ha scritto:
>> >
>> > I don't see why MMIO dispatch should hold the IDEBus ref rather than the
>> > PCIIDEState.
>> >
> When transfer memory_region_init_io()  3rd para from void* opaque to
> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
> situation, in order to let MemoryRegionOps tell between them, we
> should pass PCIIDEState->bus[0], bus[1] separately.

The rule should be that the obj is the object that you want referenced,
and that should be the PCIIDEState.

But this is anyway moot because it only applies to objects that are
converted to use unlocked dispatch.  This likely will not be the case
for IDE.

Paolo

>> > In the case of the PIIX, the BARs are set up by the PCIIDEState in
>> > bmdma_setup_bar (called by bmdma_setup_bar).
>> >
> Supposing we have convert  PCIIDEState->bmdma[0]/[1] to Object. And in
> mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not
> prevent  PCIIDEState->refcnt=0, and then the whole object disappear!
Jan Kiszka - Aug. 27, 2012, 7:47 a.m.
On 2012-08-27 09:01, Paolo Bonzini wrote:
> Il 25/08/2012 09:42, liu ping fan ha scritto:
>>>>
>>>> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
>>>> PCIIDEState.
>>>>
>> When transfer memory_region_init_io()  3rd para from void* opaque to
>> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
>> situation, in order to let MemoryRegionOps tell between them, we
>> should pass PCIIDEState->bus[0], bus[1] separately.
> 
> The rule should be that the obj is the object that you want referenced,
> and that should be the PCIIDEState.
> 
> But this is anyway moot because it only applies to objects that are
> converted to use unlocked dispatch.  This likely will not be the case
> for IDE.

BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
dispatching - that device objects are the wrong target for reference
counting. We keep memory regions in our dispatching tables (PIO
dispatching needs some refactoring for this), and those regions need
protection for BQL-free use. Devices can't pass away as long as the have
referenced regions, memory region deregistration services will have to
take care of this.

I'm currently not using reference counting at all, I'm enforcing that
only BQL-protected regions can be deregistered.

Also note that there seems to be another misconception in the
discussions: deregistration is not only bound to device unplug. It also
happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
strong indicator that we should worry about individual memory regions,
not devices.

Jan
pingfan liu - Aug. 27, 2012, 8:17 a.m.
On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2012-08-27 09:01, Paolo Bonzini wrote:
>> Il 25/08/2012 09:42, liu ping fan ha scritto:
>>>>>
>>>>> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
>>>>> PCIIDEState.
>>>>>
>>> When transfer memory_region_init_io()  3rd para from void* opaque to
>>> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
>>> situation, in order to let MemoryRegionOps tell between them, we
>>> should pass PCIIDEState->bus[0], bus[1] separately.
>>
>> The rule should be that the obj is the object that you want referenced,
>> and that should be the PCIIDEState.
>>
>> But this is anyway moot because it only applies to objects that are
>> converted to use unlocked dispatch.  This likely will not be the case
>> for IDE.
>
> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
> dispatching - that device objects are the wrong target for reference

Hi Jan, thanks for reminder, but could you explain it more detail?
mmio dispatch table holds 1 ref for device, before releasing this
ref,( When unplugging, we detach all the device's mr from memory, then
drop the ref. So I think that no leak will be exposed by mr  and it is
safe to use device as target for reference.

> counting. We keep memory regions in our dispatching tables (PIO
> dispatching needs some refactoring for this), and those regions need
> protection for BQL-free use. Devices can't pass away as long as the have
Yes, it is right. Device can pass away only after mr removed from
dispatching tables

Thanx pingfan
> referenced regions, memory region deregistration services will have to
> take care of this.
>

> I'm currently not using reference counting at all, I'm enforcing that
> only BQL-protected regions can be deregistered.
>
> Also note that there seems to be another misconception in the
> discussions: deregistration is not only bound to device unplug. It also
> happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
> strong indicator that we should worry about individual memory regions,
> not devices.
>
> Jan
>
>
Jan Kiszka - Aug. 27, 2012, 8:27 a.m.
On 2012-08-27 10:17, liu ping fan wrote:
> On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2012-08-27 09:01, Paolo Bonzini wrote:
>>> Il 25/08/2012 09:42, liu ping fan ha scritto:
>>>>>>
>>>>>> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
>>>>>> PCIIDEState.
>>>>>>
>>>> When transfer memory_region_init_io()  3rd para from void* opaque to
>>>> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
>>>> situation, in order to let MemoryRegionOps tell between them, we
>>>> should pass PCIIDEState->bus[0], bus[1] separately.
>>>
>>> The rule should be that the obj is the object that you want referenced,
>>> and that should be the PCIIDEState.
>>>
>>> But this is anyway moot because it only applies to objects that are
>>> converted to use unlocked dispatch.  This likely will not be the case
>>> for IDE.
>>
>> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
>> dispatching - that device objects are the wrong target for reference
> 
> Hi Jan, thanks for reminder, but could you explain it more detail?
> mmio dispatch table holds 1 ref for device, before releasing this
> ref,( When unplugging, we detach all the device's mr from memory, then
> drop the ref. So I think that no leak will be exposed by mr  and it is
> safe to use device as target for reference.

It would be a mistake to assume that memory regions can only be embedded
in device objects. Memory regions can be reconfigured or dynamically
added/removed (see e.g. portio lists) - there is no "device" in this
sentence. Regions are stored in the dispatching table, they will first
of all be touched without holding the BQL. So their content has to be
stable in that period, and it is the proper abstraction, IMHO, to focus
on their life cycle management and attach all the rest to them.

> 
>> counting. We keep memory regions in our dispatching tables (PIO
>> dispatching needs some refactoring for this), and those regions need
>> protection for BQL-free use. Devices can't pass away as long as the have
> Yes, it is right. Device can pass away only after mr removed from
> dispatching tables

Great, then you don't have to worry about device objects in the context
of dispatching.

Jan
Anthony Liguori - Aug. 27, 2012, 1:19 p.m.
Liu Ping Fan <qemulist@gmail.com> writes:

> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Scene:
>   obja lies in objA, when objA's ref->0, it will be freed,
> but at that time obja can still be in use.
>
> The real example is:
> typedef struct PCIIDEState {
>     PCIDevice dev;
>     IDEBus bus[2]; --> create in place
>     .....
> }
>
> When without big lock protection for mmio-dispatch, we will hold
> obj's refcnt. So memory_region_init_io() will replace the third para
> "void *opaque" with "Object *obj".
> With this patch, we can protect PCIIDEState from disappearing during
> mmio-dispatch hold the IDEBus->ref.
>
> And the ref circle has been broken when calling qdev_delete_subtree().
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

I think this is solving the wrong problem.  There are many, many
dependencies a device may have on other devices.  Memory allocation
isn't the only one.

The problem is that we want to make sure that a device doesn't "go away"
while an MMIO dispatch is happening.  This is easy to solve without
touching referencing counting.

The device will hold a lock while the MMIO is being dispatched.  The
delete path simply needs to acquire that same lock.  This will ensure
that a delete operation cannot finish while MMIO is still in flight.

Regarding deleting a device, not all devices are capable of being
deleted and specifically, devices that are composed within the memory of
another device cannot be directly deleted (they can only be deleted
as part of their parent's destruction).

Regards,

Anthony Liguori

> ---
>  hw/qdev.c |    2 ++
>  hw/qdev.h |    1 +
>  2 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/hw/qdev.c b/hw/qdev.c
> index e2339a1..b09ebbf 100644
> --- a/hw/qdev.c
> +++ b/hw/qdev.c
> @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char *typename,
>  {
>      object_initialize(bus, typename);
>  
> +    bus->overlap = parent;
> +    object_ref(OBJECT(bus->overlap));
>      bus->parent = parent;
>      bus->name = name ? g_strdup(name) : NULL;
>      qbus_realize(bus);
> diff --git a/hw/qdev.h b/hw/qdev.h
> index 182cfa5..9bc5783 100644
> --- a/hw/qdev.h
> +++ b/hw/qdev.h
> @@ -117,6 +117,7 @@ struct BusState {
>      int allow_hotplug;
>      bool qom_allocated;
>      bool glib_allocated;
> +    DeviceState *overlap;
>      int max_index;
>      QTAILQ_HEAD(ChildrenHead, BusChild) children;
>      QLIST_ENTRY(BusState) sibling;
> -- 
> 1.7.4.4
Jan Kiszka - Aug. 27, 2012, 3:02 p.m.
On 2012-08-27 15:19, Anthony Liguori wrote:
> Liu Ping Fan <qemulist@gmail.com> writes:
> 
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Scene:
>>   obja lies in objA, when objA's ref->0, it will be freed,
>> but at that time obja can still be in use.
>>
>> The real example is:
>> typedef struct PCIIDEState {
>>     PCIDevice dev;
>>     IDEBus bus[2]; --> create in place
>>     .....
>> }
>>
>> When without big lock protection for mmio-dispatch, we will hold
>> obj's refcnt. So memory_region_init_io() will replace the third para
>> "void *opaque" with "Object *obj".
>> With this patch, we can protect PCIIDEState from disappearing during
>> mmio-dispatch hold the IDEBus->ref.
>>
>> And the ref circle has been broken when calling qdev_delete_subtree().
>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> I think this is solving the wrong problem.  There are many, many
> dependencies a device may have on other devices.  Memory allocation
> isn't the only one.
> 
> The problem is that we want to make sure that a device doesn't "go away"
> while an MMIO dispatch is happening.  This is easy to solve without
> touching referencing counting.
> 
> The device will hold a lock while the MMIO is being dispatched.  The
> delete path simply needs to acquire that same lock.  This will ensure
> that a delete operation cannot finish while MMIO is still in flight.

That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
without any device-specific locking, e.g. just to read a simple register
value. So we will need reference counting (for devices using private
locks), but on the "front-line" object: the memory region. That region
will block its owner from disappearing by waiting on dispatch when
someone tries to unregister it.

Also note that "holding a lock" is easily said but will be more tricky
in practice. Quite a significant share of our code will continue to run
under BQL, even for devices with their own locks. Init/cleanup functions
will likely fall into this category, simply because the surrounding
logic is hard to convert into fine-grained locking and is also not
performance critical. At the same time, we can't take BQL -> device-lock
as we have to support device-lock -> BQL ordering for (slow-path) calls
into BQL-protected areas while holding a per-device lock (e.g. device
mapping changes).

Jan
Anthony Liguori - Aug. 27, 2012, 3:14 p.m.
Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 2012-08-27 15:19, Anthony Liguori wrote:
>> Liu Ping Fan <qemulist@gmail.com> writes:
>> 
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> Scene:
>>>   obja lies in objA, when objA's ref->0, it will be freed,
>>> but at that time obja can still be in use.
>>>
>>> The real example is:
>>> typedef struct PCIIDEState {
>>>     PCIDevice dev;
>>>     IDEBus bus[2]; --> create in place
>>>     .....
>>> }
>>>
>>> When without big lock protection for mmio-dispatch, we will hold
>>> obj's refcnt. So memory_region_init_io() will replace the third para
>>> "void *opaque" with "Object *obj".
>>> With this patch, we can protect PCIIDEState from disappearing during
>>> mmio-dispatch hold the IDEBus->ref.
>>>
>>> And the ref circle has been broken when calling qdev_delete_subtree().
>>>
>>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> 
>> I think this is solving the wrong problem.  There are many, many
>> dependencies a device may have on other devices.  Memory allocation
>> isn't the only one.
>> 
>> The problem is that we want to make sure that a device doesn't "go away"
>> while an MMIO dispatch is happening.  This is easy to solve without
>> touching referencing counting.
>> 
>> The device will hold a lock while the MMIO is being dispatched.  The
>> delete path simply needs to acquire that same lock.  This will ensure
>> that a delete operation cannot finish while MMIO is still in flight.
>
> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
> without any device-specific locking, e.g. just to read a simple register
> value. So we will need reference counting

But then you'll need to acquire a lock to take the reference/remove the
reference which sort of defeats the purpose of trying to fast path.

> (for devices using private
> locks), but on the "front-line" object: the memory region. That region
> will block its owner from disappearing by waiting on dispatch when
> someone tries to unregister it.
>
> Also note that "holding a lock" is easily said but will be more tricky
> in practice. Quite a significant share of our code will continue to run
> under BQL, even for devices with their own locks. Init/cleanup functions
> will likely fall into this category,

I'm not sure I'm convinced of this--but it's hard to tell until we
really start converting.

BTW, I'm pretty sure we have to tackle main loop functions first before
we try to convert any devices off the BQL.

Regards,

Anthony Liguori

> simply because the surrounding
> logic is hard to convert into fine-grained locking and is also not
> performance critical. At the same time, we can't take BQL -> device-lock
> as we have to support device-lock -> BQL ordering for (slow-path) calls
> into BQL-protected areas while holding a per-device lock (e.g. device
> mapping changes).


>
> Jan
>
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux
Jan Kiszka - Aug. 27, 2012, 3:26 p.m.
On 2012-08-27 17:14, Anthony Liguori wrote:
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 2012-08-27 15:19, Anthony Liguori wrote:
>>> Liu Ping Fan <qemulist@gmail.com> writes:
>>>
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> Scene:
>>>>   obja lies in objA, when objA's ref->0, it will be freed,
>>>> but at that time obja can still be in use.
>>>>
>>>> The real example is:
>>>> typedef struct PCIIDEState {
>>>>     PCIDevice dev;
>>>>     IDEBus bus[2]; --> create in place
>>>>     .....
>>>> }
>>>>
>>>> When without big lock protection for mmio-dispatch, we will hold
>>>> obj's refcnt. So memory_region_init_io() will replace the third para
>>>> "void *opaque" with "Object *obj".
>>>> With this patch, we can protect PCIIDEState from disappearing during
>>>> mmio-dispatch hold the IDEBus->ref.
>>>>
>>>> And the ref circle has been broken when calling qdev_delete_subtree().
>>>>
>>>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> I think this is solving the wrong problem.  There are many, many
>>> dependencies a device may have on other devices.  Memory allocation
>>> isn't the only one.
>>>
>>> The problem is that we want to make sure that a device doesn't "go away"
>>> while an MMIO dispatch is happening.  This is easy to solve without
>>> touching referencing counting.
>>>
>>> The device will hold a lock while the MMIO is being dispatched.  The
>>> delete path simply needs to acquire that same lock.  This will ensure
>>> that a delete operation cannot finish while MMIO is still in flight.
>>
>> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
>> without any device-specific locking, e.g. just to read a simple register
>> value. So we will need reference counting
> 
> But then you'll need to acquire a lock to take the reference/remove the
> reference which sort of defeats the purpose of trying to fast path.

Atomic ops? RCU? This problem won't be solved for the first time.

> 
>> (for devices using private
>> locks), but on the "front-line" object: the memory region. That region
>> will block its owner from disappearing by waiting on dispatch when
>> someone tries to unregister it.
>>
>> Also note that "holding a lock" is easily said but will be more tricky
>> in practice. Quite a significant share of our code will continue to run
>> under BQL, even for devices with their own locks. Init/cleanup functions
>> will likely fall into this category,
> 
> I'm not sure I'm convinced of this--but it's hard to tell until we
> really start converting.
> 
> BTW, I'm pretty sure we have to tackle main loop functions first before
> we try to convert any devices off the BQL.

I'm sure we should leave existing code alone wherever possible, focusing
on providing alternative versions for those paths that matter. Example:
Most timers are fine under BQL. But some sensitive devices (RTC or HPET
as clock source) will want their own timers. So the approach is to
instantiate a separate, also prioritizeable instance of the timer
subsystem for them and be done.

We won't convert QEMU in a day, but we surely want results before the
last corner is refactored (which would take years, at best).

Jan
Anthony Liguori - Aug. 27, 2012, 4:24 p.m.
Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 2012-08-27 17:14, Anthony Liguori wrote:
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> On 2012-08-27 15:19, Anthony Liguori wrote:
>>>> Liu Ping Fan <qemulist@gmail.com> writes:
>>>>
>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> Scene:
>>>>>   obja lies in objA, when objA's ref->0, it will be freed,
>>>>> but at that time obja can still be in use.
>>>>>
>>>>> The real example is:
>>>>> typedef struct PCIIDEState {
>>>>>     PCIDevice dev;
>>>>>     IDEBus bus[2]; --> create in place
>>>>>     .....
>>>>> }
>>>>>
>>>>> When without big lock protection for mmio-dispatch, we will hold
>>>>> obj's refcnt. So memory_region_init_io() will replace the third para
>>>>> "void *opaque" with "Object *obj".
>>>>> With this patch, we can protect PCIIDEState from disappearing during
>>>>> mmio-dispatch hold the IDEBus->ref.
>>>>>
>>>>> And the ref circle has been broken when calling qdev_delete_subtree().
>>>>>
>>>>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> I think this is solving the wrong problem.  There are many, many
>>>> dependencies a device may have on other devices.  Memory allocation
>>>> isn't the only one.
>>>>
>>>> The problem is that we want to make sure that a device doesn't "go away"
>>>> while an MMIO dispatch is happening.  This is easy to solve without
>>>> touching referencing counting.
>>>>
>>>> The device will hold a lock while the MMIO is being dispatched.  The
>>>> delete path simply needs to acquire that same lock.  This will ensure
>>>> that a delete operation cannot finish while MMIO is still in flight.
>>>
>>> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
>>> without any device-specific locking, e.g. just to read a simple register
>>> value. So we will need reference counting
>> 
>> But then you'll need to acquire a lock to take the reference/remove the
>> reference which sort of defeats the purpose of trying to fast path.
>
> Atomic ops? RCU? This problem won't be solved for the first time.

Yes, there are ways to do this, but you add a fair bit of complication.

It's much simplier to make objects not have any locks and enforce all
callers to protect access.  IOW, noone except the device gets to inc/dec
reference counts.

>>> (for devices using private
>>> locks), but on the "front-line" object: the memory region. That region
>>> will block its owner from disappearing by waiting on dispatch when
>>> someone tries to unregister it.
>>>
>>> Also note that "holding a lock" is easily said but will be more tricky
>>> in practice. Quite a significant share of our code will continue to run
>>> under BQL, even for devices with their own locks. Init/cleanup functions
>>> will likely fall into this category,
>> 
>> I'm not sure I'm convinced of this--but it's hard to tell until we
>> really start converting.
>> 
>> BTW, I'm pretty sure we have to tackle main loop functions first before
>> we try to convert any devices off the BQL.
>
> I'm sure we should leave existing code alone wherever possible, focusing
> on providing alternative versions for those paths that matter. Example:
> Most timers are fine under BQL. But some sensitive devices (RTC or HPET
> as clock source) will want their own timers. So the approach is to
> instantiate a separate, also prioritizeable instance of the timer
> subsystem for them and be done.

I disagree.  I think we conver the timer subsystem to be lockless and
then let some devices acquire the BQL during dispatch.

And we have a nice thread-aware main loop available to us--glib.  We
don't need to reinvent the wheel.

Regards,

Anthony Liguori

>
> We won't convert QEMU in a day, but we surely want results before the
> last corner is refactored (which would take years, at best).
>
> Jan
>
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux
Jan Kiszka - Aug. 27, 2012, 4:59 p.m.
On 2012-08-27 18:24, Anthony Liguori wrote:
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 2012-08-27 17:14, Anthony Liguori wrote:
>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>
>>>> On 2012-08-27 15:19, Anthony Liguori wrote:
>>>>> Liu Ping Fan <qemulist@gmail.com> writes:
>>>>>
>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>
>>>>>> Scene:
>>>>>>   obja lies in objA, when objA's ref->0, it will be freed,
>>>>>> but at that time obja can still be in use.
>>>>>>
>>>>>> The real example is:
>>>>>> typedef struct PCIIDEState {
>>>>>>     PCIDevice dev;
>>>>>>     IDEBus bus[2]; --> create in place
>>>>>>     .....
>>>>>> }
>>>>>>
>>>>>> When without big lock protection for mmio-dispatch, we will hold
>>>>>> obj's refcnt. So memory_region_init_io() will replace the third para
>>>>>> "void *opaque" with "Object *obj".
>>>>>> With this patch, we can protect PCIIDEState from disappearing during
>>>>>> mmio-dispatch hold the IDEBus->ref.
>>>>>>
>>>>>> And the ref circle has been broken when calling qdev_delete_subtree().
>>>>>>
>>>>>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> I think this is solving the wrong problem.  There are many, many
>>>>> dependencies a device may have on other devices.  Memory allocation
>>>>> isn't the only one.
>>>>>
>>>>> The problem is that we want to make sure that a device doesn't "go away"
>>>>> while an MMIO dispatch is happening.  This is easy to solve without
>>>>> touching referencing counting.
>>>>>
>>>>> The device will hold a lock while the MMIO is being dispatched.  The
>>>>> delete path simply needs to acquire that same lock.  This will ensure
>>>>> that a delete operation cannot finish while MMIO is still in flight.
>>>>
>>>> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
>>>> without any device-specific locking, e.g. just to read a simple register
>>>> value. So we will need reference counting
>>>
>>> But then you'll need to acquire a lock to take the reference/remove the
>>> reference which sort of defeats the purpose of trying to fast path.
>>
>> Atomic ops? RCU? This problem won't be solved for the first time.
> 
> Yes, there are ways to do this, but you add a fair bit of complication.
> 
> It's much simplier to make objects not have any locks and enforce all
> callers to protect access.  IOW, noone except the device gets to inc/dec
> reference counts.

Pulling locks out of the devices may appear simpler on first sight but
won't be in practice. Objects can issue requests to other objects. Will
they hold its lock while taking the target's lock? Lock management is a
per-device thing, not only for performance reasons.

But I guess we will have to compare working prototypes. Many concepts
look nice on the paper, but when you implement them or even try to run
them, the interesting problems show up. Specifically those related to a
smooth transition phase.

> 
>>>> (for devices using private
>>>> locks), but on the "front-line" object: the memory region. That region
>>>> will block its owner from disappearing by waiting on dispatch when
>>>> someone tries to unregister it.
>>>>
>>>> Also note that "holding a lock" is easily said but will be more tricky
>>>> in practice. Quite a significant share of our code will continue to run
>>>> under BQL, even for devices with their own locks. Init/cleanup functions
>>>> will likely fall into this category,
>>>
>>> I'm not sure I'm convinced of this--but it's hard to tell until we
>>> really start converting.
>>>
>>> BTW, I'm pretty sure we have to tackle main loop functions first before
>>> we try to convert any devices off the BQL.
>>
>> I'm sure we should leave existing code alone wherever possible, focusing
>> on providing alternative versions for those paths that matter. Example:
>> Most timers are fine under BQL. But some sensitive devices (RTC or HPET
>> as clock source) will want their own timers. So the approach is to
>> instantiate a separate, also prioritizeable instance of the timer
>> subsystem for them and be done.
> 
> I disagree.  I think we conver the timer subsystem to be lockless and
> then let some devices acquire the BQL during dispatch.

The timer subsystem is lockless in this sense already. Its user holds
the necessary lock (BQL so far). I'm just making it multi-instance so
that the time-critical users don't need to worry about other, low-prio
timers. In that sense, I guess we are describing the same end result.

Jan
Avi Kivity - Aug. 27, 2012, 5:09 p.m.
On 08/27/2012 12:47 AM, Jan Kiszka wrote:
> On 2012-08-27 09:01, Paolo Bonzini wrote:
> > Il 25/08/2012 09:42, liu ping fan ha scritto:
> >>>>
> >>>> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
> >>>> PCIIDEState.
> >>>>
> >> When transfer memory_region_init_io()  3rd para from void* opaque to
> >> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
> >> situation, in order to let MemoryRegionOps tell between them, we
> >> should pass PCIIDEState->bus[0], bus[1] separately.
> > 
> > The rule should be that the obj is the object that you want referenced,
> > and that should be the PCIIDEState.
> > 
> > But this is anyway moot because it only applies to objects that are
> > converted to use unlocked dispatch.  This likely will not be the case
> > for IDE.
>
> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
> dispatching - that device objects are the wrong target for reference
> counting. We keep memory regions in our dispatching tables (PIO
> dispatching needs some refactoring for this), and those regions need
> protection for BQL-free use. Devices can't pass away as long as the have
> referenced regions, memory region deregistration services will have to
> take care of this.
>
> I'm currently not using reference counting at all, I'm enforcing that
> only BQL-protected regions can be deregistered.

That's a pretty harsh constraint.

> Also note that there seems to be another misconception in the
> discussions: deregistration is not only bound to device unplug. It also
> happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
> strong indicator that we should worry about individual memory regions,
> not devices.
>
>

Deregistration is fine, the problem is destruction.
Jan Kiszka - Aug. 27, 2012, 5:14 p.m.
On 2012-08-27 19:09, Avi Kivity wrote:
> On 08/27/2012 12:47 AM, Jan Kiszka wrote:
>> On 2012-08-27 09:01, Paolo Bonzini wrote:
>>> Il 25/08/2012 09:42, liu ping fan ha scritto:
>>>>>>
>>>>>> I don't see why MMIO dispatch should hold the IDEBus ref rather than the
>>>>>> PCIIDEState.
>>>>>>
>>>> When transfer memory_region_init_io()  3rd para from void* opaque to
>>>> Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
>>>> situation, in order to let MemoryRegionOps tell between them, we
>>>> should pass PCIIDEState->bus[0], bus[1] separately.
>>>
>>> The rule should be that the obj is the object that you want referenced,
>>> and that should be the PCIIDEState.
>>>
>>> But this is anyway moot because it only applies to objects that are
>>> converted to use unlocked dispatch.  This likely will not be the case
>>> for IDE.
>>
>> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
>> dispatching - that device objects are the wrong target for reference
>> counting. We keep memory regions in our dispatching tables (PIO
>> dispatching needs some refactoring for this), and those regions need
>> protection for BQL-free use. Devices can't pass away as long as the have
>> referenced regions, memory region deregistration services will have to
>> take care of this.
>>
>> I'm currently not using reference counting at all, I'm enforcing that
>> only BQL-protected regions can be deregistered.
> 
> That's a pretty harsh constraint.

Of course. It's a first step, not a long-term restriction. But we have
to do small steps.

> 
>> Also note that there seems to be another misconception in the
>> discussions: deregistration is not only bound to device unplug. It also
>> happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
>> strong indicator that we should worry about individual memory regions,
>> not devices.
>>
>>
> 
> Deregistration is fine, the problem is destruction.
> 

It isn't as you access memory region states that can change after
deregistration. Devices can remove memory regions from the mapping,
alter and then reinsert them. The last to steps must not happen while
anyone is still using a reference to that region.

Jan
Avi Kivity - Aug. 27, 2012, 6:09 p.m.
On 08/27/2012 10:14 AM, Jan Kiszka wrote:
> > 
> > Deregistration is fine, the problem is destruction.
> > 
>
> It isn't as you access memory region states that can change after
> deregistration. Devices can remove memory regions from the mapping,
> alter and then reinsert them. The last to steps must not happen while
> anyone is still using a reference to that region.
>

Why not?  If the guest is accessing an mmio region while reconfiguring
it in a way that changes its meaning, either the previous or the next
meaning is valid.

It is true that memory_region_set_enabled(..., false) will become weaker
as a result.  Code will have to be prepared for that.
Jan Kiszka - Aug. 27, 2012, 6:17 p.m.
On 2012-08-27 20:09, Avi Kivity wrote:
> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>
>>> Deregistration is fine, the problem is destruction.
>>>
>>
>> It isn't as you access memory region states that can change after
>> deregistration. Devices can remove memory regions from the mapping,
>> alter and then reinsert them. The last to steps must not happen while
>> anyone is still using a reference to that region.
>>
> 
> Why not?  If the guest is accessing an mmio region while reconfiguring
> it in a way that changes its meaning, either the previous or the next
> meaning is valid.

If the memory region owner sets the content to zero or even releases it
(nothing states a memory region can only live inside a device
structure), we will crash. Restricting how a memory region can be
created and handled after it was once registered somewhere is an
unnatural interface, waiting to cause subtle bugs.

Jan
Avi Kivity - Aug. 27, 2012, 6:20 p.m.
On 08/27/2012 11:17 AM, Jan Kiszka wrote:
> On 2012-08-27 20:09, Avi Kivity wrote:
> > On 08/27/2012 10:14 AM, Jan Kiszka wrote:
> >>>
> >>> Deregistration is fine, the problem is destruction.
> >>>
> >>
> >> It isn't as you access memory region states that can change after
> >> deregistration. Devices can remove memory regions from the mapping,
> >> alter and then reinsert them. The last to steps must not happen while
> >> anyone is still using a reference to that region.
> >>
> > 
> > Why not?  If the guest is accessing an mmio region while reconfiguring
> > it in a way that changes its meaning, either the previous or the next
> > meaning is valid.
>
> If the memory region owner sets the content to zero or even releases it
> (nothing states a memory region can only live inside a device
> structure), we will crash. Restricting how a memory region can be
> created and handled after it was once registered somewhere is an
> unnatural interface, waiting to cause subtle bugs.

Using an Object * allows the simple case to be really simple (object ==
device) and the hard cases to be doable.

What would you suggest as a better interface?
Avi Kivity - Aug. 27, 2012, 6:27 p.m.
On 08/27/2012 06:19 AM, Anthony Liguori wrote:
> Liu Ping Fan <qemulist@gmail.com> writes:
>
> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >
> > Scene:
> >   obja lies in objA, when objA's ref->0, it will be freed,
> > but at that time obja can still be in use.
> >
> > The real example is:
> > typedef struct PCIIDEState {
> >     PCIDevice dev;
> >     IDEBus bus[2]; --> create in place
> >     .....
> > }
> >
> > When without big lock protection for mmio-dispatch, we will hold
> > obj's refcnt. So memory_region_init_io() will replace the third para
> > "void *opaque" with "Object *obj".
> > With this patch, we can protect PCIIDEState from disappearing during
> > mmio-dispatch hold the IDEBus->ref.
> >
> > And the ref circle has been broken when calling qdev_delete_subtree().
> >
> > Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> I think this is solving the wrong problem.  There are many, many
> dependencies a device may have on other devices.  Memory allocation
> isn't the only one.
>
> The problem is that we want to make sure that a device doesn't "go away"
> while an MMIO dispatch is happening.  This is easy to solve without
> touching referencing counting.
>
> The device will hold a lock while the MMIO is being dispatched.  The
> delete path simply needs to acquire that same lock.  This will ensure
> that a delete operation cannot finish while MMIO is still in flight.

Where does the lock live?  If it lives in the device, then we can't
acquire it during mmio dispatch (the device may have gone away).  If it
lives outside the device, we need to to tell the memory core about it. 
I would really like to avoid adding additional state; I think the curent
opaque should serve both to provide state to the callbacks and for life
cycle management.

> Regarding deleting a device, not all devices are capable of being
> deleted and specifically, devices that are composed within the memory of
> another device cannot be directly deleted (they can only be deleted
> as part of their parent's destruction).

This doesn't help at all.  We have devices that can be deleted, and we
would like to make the devices use the unlocked path, so we have to
account for it.
Avi Kivity - Aug. 27, 2012, 6:35 p.m.
On 08/27/2012 09:24 AM, Anthony Liguori wrote:
> >
> > I'm sure we should leave existing code alone wherever possible, focusing
> > on providing alternative versions for those paths that matter. Example:
> > Most timers are fine under BQL. But some sensitive devices (RTC or HPET
> > as clock source) will want their own timers. So the approach is to
> > instantiate a separate, also prioritizeable instance of the timer
> > subsystem for them and be done.
>
> I disagree.  I think we conver the timer subsystem to be lockless and
> then let some devices acquire the BQL during dispatch.

I agree with your disagreement but disagree with the rest.  The timer
subsystem should have its own internal locking that callers will not be
aware of.  Requiring devices to acquire the bql will lead to deadlocks.

Note that fine-grained locking timers will also require reference
counting: you want to call the timer expiration callback after releasing
the timer subsystem lock, so you need to make sure the callback does not
go away.

Linux manages without it (hrtimer_interrupt), so maybe we can too.
Jan Kiszka - Aug. 27, 2012, 6:39 p.m.
On 2012-08-27 20:20, Avi Kivity wrote:
> On 08/27/2012 11:17 AM, Jan Kiszka wrote:
>> On 2012-08-27 20:09, Avi Kivity wrote:
>>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>>>
>>>>> Deregistration is fine, the problem is destruction.
>>>>>
>>>>
>>>> It isn't as you access memory region states that can change after
>>>> deregistration. Devices can remove memory regions from the mapping,
>>>> alter and then reinsert them. The last to steps must not happen while
>>>> anyone is still using a reference to that region.
>>>>
>>>
>>> Why not?  If the guest is accessing an mmio region while reconfiguring
>>> it in a way that changes its meaning, either the previous or the next
>>> meaning is valid.
>>
>> If the memory region owner sets the content to zero or even releases it
>> (nothing states a memory region can only live inside a device
>> structure), we will crash. Restricting how a memory region can be
>> created and handled after it was once registered somewhere is an
>> unnatural interface, waiting to cause subtle bugs.
> 
> Using an Object * allows the simple case to be really simple (object ==
> device) and the hard cases to be doable.
> 
> What would you suggest as a better interface?

To protect the life cycle of the object we manage in the memory layer:
regions. We don't manage devices there. If there is any implementation
benefit in having a full QOM object, then make memory regions objects.

I simply don't like this indirection, having the memory layer pick up
the opaque value of the region and interpret it. Even worse, apply
restrictions on how the dispatched objects, the regions, have to be
treated because of this.

Also, using memory regions to control the locking behaviour allows for
more fine-grained control. A device may expose certain regions with
self-managed locking while others, less time critical ones, can still be
handled under BQL for simplicity reasons.

Example: regions that translate MMIO to PIO (alpha-pci.c, every user of
isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these
regions must not be protected by the BQL anymore. At the same time, we
may not want to convert the device exposing the region to its own
locking scheme (yet). And we surely don't want to take the per-device
lock for this state-less PIO dispatching.

Jan
Avi Kivity - Aug. 27, 2012, 6:52 p.m.
On 08/27/2012 11:39 AM, Jan Kiszka wrote:
> On 2012-08-27 20:20, Avi Kivity wrote:
> > On 08/27/2012 11:17 AM, Jan Kiszka wrote:
> >> On 2012-08-27 20:09, Avi Kivity wrote:
> >>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
> >>>>>
> >>>>> Deregistration is fine, the problem is destruction.
> >>>>>
> >>>>
> >>>> It isn't as you access memory region states that can change after
> >>>> deregistration. Devices can remove memory regions from the mapping,
> >>>> alter and then reinsert them. The last to steps must not happen while
> >>>> anyone is still using a reference to that region.
> >>>>
> >>>
> >>> Why not?  If the guest is accessing an mmio region while reconfiguring
> >>> it in a way that changes its meaning, either the previous or the next
> >>> meaning is valid.
> >>
> >> If the memory region owner sets the content to zero or even releases it
> >> (nothing states a memory region can only live inside a device
> >> structure), we will crash. Restricting how a memory region can be
> >> created and handled after it was once registered somewhere is an
> >> unnatural interface, waiting to cause subtle bugs.
> > 
> > Using an Object * allows the simple case to be really simple (object ==
> > device) and the hard cases to be doable.
> > 
> > What would you suggest as a better interface?
>
> To protect the life cycle of the object we manage in the memory layer:
> regions. We don't manage devices there. If there is any implementation
> benefit in having a full QOM object, then make memory regions objects.

I see and sympathise with this point of view.

The idea to use opaque and convert it to Object * is that often a device
has multiple regions but no interest in managing them separately.  So
managing regions will cause the device model authors to create
boilerplate code to push region management to device management.

I had this experience with the first iterations of the region interface,
where instead of opaque we had MemoryRegion *.  Device code would use
container_of() in the simple case, but the non-simple case caused lots
of pain.  I believe it is the natural interface, but in practice it
turned out to cause too much churn.

> I simply don't like this indirection, having the memory layer pick up
> the opaque value of the region and interpret it. 

That's just an intermediate step.  The idea is that eventually opaque
will be changed to Object *.

> Even worse, apply
> restrictions on how the dispatched objects, the regions, have to be
> treated because of this.

Please elaborate.

> Also, using memory regions to control the locking behaviour allows for
> more fine-grained control. A device may expose certain regions with
> self-managed locking while others, less time critical ones, can still be
> handled under BQL for simplicity reasons.

You can still aquire the bql in one callback and your local lock in
another if you choose (but that would be most confusing IMO).

> Example: regions that translate MMIO to PIO (alpha-pci.c, every user of
> isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these
> regions must not be protected by the BQL anymore. At the same time, we
> may not want to convert the device exposing the region to its own
> locking scheme (yet). And we surely don't want to take the per-device
> lock for this state-less PIO dispatching.

We're diverging, but eventually PIO dispatch will share the same code as
mmio dispatch.  PIO-to-MMIO bridges will simply be containers or
aliases, they will not call cpu_inb() and relatives.
Anthony Liguori - Aug. 27, 2012, 7:17 p.m.
Avi Kivity <avi@redhat.com> writes:

> On 08/27/2012 09:24 AM, Anthony Liguori wrote:
>> >
>> > I'm sure we should leave existing code alone wherever possible, focusing
>> > on providing alternative versions for those paths that matter. Example:
>> > Most timers are fine under BQL. But some sensitive devices (RTC or HPET
>> > as clock source) will want their own timers. So the approach is to
>> > instantiate a separate, also prioritizeable instance of the timer
>> > subsystem for them and be done.
>>
>> I disagree.  I think we conver the timer subsystem to be lockless and
>> then let some devices acquire the BQL during dispatch.
>
> I agree with your disagreement but disagree with the rest.  The timer
> subsystem should have its own internal locking that callers will not be
> aware of.  Requiring devices to acquire the bql will lead to
> deadlocks.

Err, I completely mispoke there.

I meant, to say, we should convert the timer subsystem to be re-entrant
and then let some devices acquire the BQL during dispatch.

Existing users of the time API should get the BQL acquired automagically
for them.  The new API would not hold the BQL during dispatch but the
old API, implemented in terms of the new API, would acquire it on behalf
of the caller.

As a rule, if a device relies on the BQL, it must use the old APIs.  If
a device uses the new APIs, it must not acquire the BQL.

> Note that fine-grained locking timers will also require reference
> counting: you want to call the timer expiration callback after releasing
> the timer subsystem lock, so you need to make sure the callback does not
> go away.

glib simply uses a single main loop lock to deal with all of this.  I
think that's absolutely the right model.

It's best to start this conversion using very coarse locking.  There's
no need to start with ultra fine grain locking.

Regards,

Anthony Liguori

> Linux manages without it (hrtimer_interrupt), so maybe we can too.
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
Jan Kiszka - Aug. 27, 2012, 7:22 p.m.
On 2012-08-27 21:17, Anthony Liguori wrote:
> Avi Kivity <avi@redhat.com> writes:
> 
>> On 08/27/2012 09:24 AM, Anthony Liguori wrote:
>>>>
>>>> I'm sure we should leave existing code alone wherever possible, focusing
>>>> on providing alternative versions for those paths that matter. Example:
>>>> Most timers are fine under BQL. But some sensitive devices (RTC or HPET
>>>> as clock source) will want their own timers. So the approach is to
>>>> instantiate a separate, also prioritizeable instance of the timer
>>>> subsystem for them and be done.
>>>
>>> I disagree.  I think we conver the timer subsystem to be lockless and
>>> then let some devices acquire the BQL during dispatch.
>>
>> I agree with your disagreement but disagree with the rest.  The timer
>> subsystem should have its own internal locking that callers will not be
>> aware of.  Requiring devices to acquire the bql will lead to
>> deadlocks.
> 
> Err, I completely mispoke there.
> 
> I meant, to say, we should convert the timer subsystem to be re-entrant
> and then let some devices acquire the BQL during dispatch.
> 
> Existing users of the time API should get the BQL acquired automagically
> for them.  The new API would not hold the BQL during dispatch but the
> old API, implemented in terms of the new API, would acquire it on behalf
> of the caller.
> 
> As a rule, if a device relies on the BQL, it must use the old APIs.  If
> a device uses the new APIs, it must not acquire the BQL.

This makes sense.

> 
>> Note that fine-grained locking timers will also require reference
>> counting: you want to call the timer expiration callback after releasing
>> the timer subsystem lock, so you need to make sure the callback does not
>> go away.
> 
> glib simply uses a single main loop lock to deal with all of this.  I
> think that's absolutely the right model.

That depends on what is under the lock. Also, relying on locking of
external libraries is a no-go for realtime as they usually do not care
about priorities and inversion avoidance.

Jan
Jan Kiszka - Aug. 27, 2012, 7:38 p.m.
On 2012-08-27 20:52, Avi Kivity wrote:
> On 08/27/2012 11:39 AM, Jan Kiszka wrote:
>> On 2012-08-27 20:20, Avi Kivity wrote:
>>> On 08/27/2012 11:17 AM, Jan Kiszka wrote:
>>>> On 2012-08-27 20:09, Avi Kivity wrote:
>>>>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>>>>>
>>>>>>> Deregistration is fine, the problem is destruction.
>>>>>>>
>>>>>>
>>>>>> It isn't as you access memory region states that can change after
>>>>>> deregistration. Devices can remove memory regions from the mapping,
>>>>>> alter and then reinsert them. The last to steps must not happen while
>>>>>> anyone is still using a reference to that region.
>>>>>>
>>>>>
>>>>> Why not?  If the guest is accessing an mmio region while reconfiguring
>>>>> it in a way that changes its meaning, either the previous or the next
>>>>> meaning is valid.
>>>>
>>>> If the memory region owner sets the content to zero or even releases it
>>>> (nothing states a memory region can only live inside a device
>>>> structure), we will crash. Restricting how a memory region can be
>>>> created and handled after it was once registered somewhere is an
>>>> unnatural interface, waiting to cause subtle bugs.
>>>
>>> Using an Object * allows the simple case to be really simple (object ==
>>> device) and the hard cases to be doable.
>>>
>>> What would you suggest as a better interface?
>>
>> To protect the life cycle of the object we manage in the memory layer:
>> regions. We don't manage devices there. If there is any implementation
>> benefit in having a full QOM object, then make memory regions objects.
> 
> I see and sympathise with this point of view.
> 
> The idea to use opaque and convert it to Object * is that often a device
> has multiple regions but no interest in managing them separately.  So
> managing regions will cause the device model authors to create
> boilerplate code to push region management to device management.
> 
> I had this experience with the first iterations of the region interface,
> where instead of opaque we had MemoryRegion *.  Device code would use
> container_of() in the simple case, but the non-simple case caused lots
> of pain.  I believe it is the natural interface, but in practice it
> turned out to cause too much churn.
> 
>> I simply don't like this indirection, having the memory layer pick up
>> the opaque value of the region and interpret it. 
> 
> That's just an intermediate step.  The idea is that eventually opaque
> will be changed to Object *.
> 
>> Even worse, apply
>> restrictions on how the dispatched objects, the regions, have to be
>> treated because of this.
> 
> Please elaborate.

The fact that you can't manipulate a memory region object arbitrarily
after removing it from the mapping because you track the references to
the object that the region points to, not the region itself. The region
remains in use by the dispatching layer and potentially the called
device, even after deregistration.

Jan
Avi Kivity - Aug. 27, 2012, 8:53 p.m.
On 08/27/2012 12:38 PM, Jan Kiszka wrote:
> >> Even worse, apply
> >> restrictions on how the dispatched objects, the regions, have to be
> >> treated because of this.
> > 
> > Please elaborate.
>
> The fact that you can't manipulate a memory region object arbitrarily
> after removing it from the mapping because you track the references to
> the object that the region points to, not the region itself. The region
> remains in use by the dispatching layer and potentially the called
> device, even after deregistration.

That object will be a container_of() the region, usually literally but
sometimes only in spirit.  Reference counting the regions means they
cannot be embedded into other objects any more.

We can probably figure out a way to flush out accesses.  After switching
to rcu, for example, all we need is synchronize_rcu() in a
non-deadlocking place.  But my bet is that it will not be needed.
Avi Kivity - Aug. 27, 2012, 8:58 p.m.
On 08/27/2012 12:17 PM, Anthony Liguori wrote:
> Avi Kivity <avi@redhat.com> writes:
>
> > On 08/27/2012 09:24 AM, Anthony Liguori wrote:
> >> >
> >> > I'm sure we should leave existing code alone wherever possible, focusing
> >> > on providing alternative versions for those paths that matter. Example:
> >> > Most timers are fine under BQL. But some sensitive devices (RTC or HPET
> >> > as clock source) will want their own timers. So the approach is to
> >> > instantiate a separate, also prioritizeable instance of the timer
> >> > subsystem for them and be done.
> >>
> >> I disagree.  I think we conver the timer subsystem to be lockless and
> >> then let some devices acquire the BQL during dispatch.
> >
> > I agree with your disagreement but disagree with the rest.  The timer
> > subsystem should have its own internal locking that callers will not be
> > aware of.  Requiring devices to acquire the bql will lead to
> > deadlocks.
>
> Err, I completely mispoke there.
>
> I meant, to say, we should convert the timer subsystem to be re-entrant
> and then let some devices acquire the BQL during dispatch.

That would be all devices, with converted devices not acquiring it as
they are converted.

> Existing users of the time API should get the BQL acquired automagically
> for them.  The new API would not hold the BQL during dispatch but the
> old API, implemented in terms of the new API, would acquire it on behalf
> of the caller.
>
> As a rule, if a device relies on the BQL, it must use the old APIs.  If
> a device uses the new APIs, it must not acquire the BQL.

This exactly mirrors the plan with memory region conversion.

> > Note that fine-grained locking timers will also require reference
> > counting: you want to call the timer expiration callback after releasing
> > the timer subsystem lock, so you need to make sure the callback does not
> > go away.
>
> glib simply uses a single main loop lock to deal with all of this.  I
> think that's absolutely the right model.
>
> It's best to start this conversion using very coarse locking.  There's
> no need to start with ultra fine grain locking.

How does it work?  You have to drop this main loop lock to dispatch the
callback, so the data structure you use for dispatch is no longer protected.
Paolo Bonzini - Aug. 27, 2012, 9:34 p.m.
Il 27/08/2012 22:58, Avi Kivity ha scritto:
>> > It's best to start this conversion using very coarse locking.  There's
>> > no need to start with ultra fine grain locking.
> How does it work?  You have to drop this main loop lock to dispatch the
> callback, so the data structure you use for dispatch is no longer protected.

It is protected by a mixture of reference counting, thread-local storage and
careful coding to take care of re-entrancy.

I'm not too sure that we can use it, though.  Mostly because of the reasons Jan
mentioned (it may also not be precise enough, but we can fix that with our own
GSource wrapping POSIX real-time timers or Linux timerfds).

Paolo
Jan Kiszka - Aug. 28, 2012, 1:01 a.m.
On 2012-08-27 22:53, Avi Kivity wrote:
> On 08/27/2012 12:38 PM, Jan Kiszka wrote:
>>>> Even worse, apply
>>>> restrictions on how the dispatched objects, the regions, have to be
>>>> treated because of this.
>>>
>>> Please elaborate.
>>
>> The fact that you can't manipulate a memory region object arbitrarily
>> after removing it from the mapping because you track the references to
>> the object that the region points to, not the region itself. The region
>> remains in use by the dispatching layer and potentially the called
>> device, even after deregistration.
> 
> That object will be a container_of() the region, usually literally but
> sometimes only in spirit.  Reference counting the regions means they
> cannot be embedded into other objects any more.

I cannot follow the logic of the last sentence. Reference counting just
means that we track if there are users left, not necessarily that we
perform asynchronous releases. We can simply wait for those users to
complete.

> 
> We can probably figure out a way to flush out accesses.  After switching
> to rcu, for example, all we need is synchronize_rcu() in a
> non-deadlocking place.  But my bet is that it will not be needed.

If you properly flush out accesses, you don't need to track at device
level anymore - and mess with abstraction layers. That's my whole point.

Jan
pingfan liu - Aug. 28, 2012, 3:09 a.m.
On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-08-27 20:52, Avi Kivity wrote:
>> On 08/27/2012 11:39 AM, Jan Kiszka wrote:
>>> On 2012-08-27 20:20, Avi Kivity wrote:
>>>> On 08/27/2012 11:17 AM, Jan Kiszka wrote:
>>>>> On 2012-08-27 20:09, Avi Kivity wrote:
>>>>>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>>>>>>
>>>>>>>> Deregistration is fine, the problem is destruction.
>>>>>>>>
>>>>>>>
>>>>>>> It isn't as you access memory region states that can change after
>>>>>>> deregistration. Devices can remove memory regions from the mapping,
>>>>>>> alter and then reinsert them. The last to steps must not happen while
>>>>>>> anyone is still using a reference to that region.
>>>>>>>
>>>>>>
>>>>>> Why not?  If the guest is accessing an mmio region while reconfiguring
>>>>>> it in a way that changes its meaning, either the previous or the next
>>>>>> meaning is valid.
>>>>>
>>>>> If the memory region owner sets the content to zero or even releases it
>>>>> (nothing states a memory region can only live inside a device
>>>>> structure), we will crash. Restricting how a memory region can be
>>>>> created and handled after it was once registered somewhere is an
>>>>> unnatural interface, waiting to cause subtle bugs.
>>>>
>>>> Using an Object * allows the simple case to be really simple (object ==
>>>> device) and the hard cases to be doable.
>>>>
>>>> What would you suggest as a better interface?
>>>
>>> To protect the life cycle of the object we manage in the memory layer:
>>> regions. We don't manage devices there. If there is any implementation
>>> benefit in having a full QOM object, then make memory regions objects.
>>
>> I see and sympathise with this point of view.
>>
>> The idea to use opaque and convert it to Object * is that often a device
>> has multiple regions but no interest in managing them separately.  So
>> managing regions will cause the device model authors to create
>> boilerplate code to push region management to device management.
>>
>> I had this experience with the first iterations of the region interface,
>> where instead of opaque we had MemoryRegion *.  Device code would use
>> container_of() in the simple case, but the non-simple case caused lots
>> of pain.  I believe it is the natural interface, but in practice it
>> turned out to cause too much churn.
>>
>>> I simply don't like this indirection, having the memory layer pick up
>>> the opaque value of the region and interpret it.
>>
>> That's just an intermediate step.  The idea is that eventually opaque
>> will be changed to Object *.
>>
>>> Even worse, apply
>>> restrictions on how the dispatched objects, the regions, have to be
>>> treated because of this.
>>
>> Please elaborate.
>
> The fact that you can't manipulate a memory region object arbitrarily
> after removing it from the mapping because you track the references to
> the object that the region points to, not the region itself. The region
> remains in use by the dispatching layer and potentially the called
> device, even after deregistration.

Why can it in use by dispatching layer? Memory region just servers as
interface for mem lookup, when dispatching in mr's MemoryRegionOps,
why do we need to access it?  Which means that read/write can cause
register/deregister mr?  Example?
Also I think MemoryRegionOps will eventually operate on Object (mr is
only interface). MRs is only represent of Object in the view of
memory(so their life cycle can be managed by Object). And there could
be some intermediate mr like those lie in subpage_t which are not
based on Object. But we can walk down to the terminal point and
extract the real Object,
so when dispatching, we only need to ensure that Object and its direct
mr are alive.

Regards,
pingfan
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux
pingfan liu - Aug. 28, 2012, 3:38 a.m.
On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan <qemulist@gmail.com> wrote:
> On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-08-27 20:52, Avi Kivity wrote:
>>> On 08/27/2012 11:39 AM, Jan Kiszka wrote:
>>>> On 2012-08-27 20:20, Avi Kivity wrote:
>>>>> On 08/27/2012 11:17 AM, Jan Kiszka wrote:
>>>>>> On 2012-08-27 20:09, Avi Kivity wrote:
>>>>>>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>>>>>>>
>>>>>>>>> Deregistration is fine, the problem is destruction.
>>>>>>>>>
>>>>>>>>
>>>>>>>> It isn't as you access memory region states that can change after
>>>>>>>> deregistration. Devices can remove memory regions from the mapping,
>>>>>>>> alter and then reinsert them. The last to steps must not happen while
>>>>>>>> anyone is still using a reference to that region.
>>>>>>>>
>>>>>>>
>>>>>>> Why not?  If the guest is accessing an mmio region while reconfiguring
>>>>>>> it in a way that changes its meaning, either the previous or the next
>>>>>>> meaning is valid.
>>>>>>
>>>>>> If the memory region owner sets the content to zero or even releases it
>>>>>> (nothing states a memory region can only live inside a device
>>>>>> structure), we will crash. Restricting how a memory region can be
>>>>>> created and handled after it was once registered somewhere is an
>>>>>> unnatural interface, waiting to cause subtle bugs.
>>>>>
>>>>> Using an Object * allows the simple case to be really simple (object ==
>>>>> device) and the hard cases to be doable.
>>>>>
>>>>> What would you suggest as a better interface?
>>>>
>>>> To protect the life cycle of the object we manage in the memory layer:
>>>> regions. We don't manage devices there. If there is any implementation
>>>> benefit in having a full QOM object, then make memory regions objects.
>>>
>>> I see and sympathise with this point of view.
>>>
>>> The idea to use opaque and convert it to Object * is that often a device
>>> has multiple regions but no interest in managing them separately.  So
>>> managing regions will cause the device model authors to create
>>> boilerplate code to push region management to device management.
>>>
>>> I had this experience with the first iterations of the region interface,
>>> where instead of opaque we had MemoryRegion *.  Device code would use
>>> container_of() in the simple case, but the non-simple case caused lots
>>> of pain.  I believe it is the natural interface, but in practice it
>>> turned out to cause too much churn.
>>>
>>>> I simply don't like this indirection, having the memory layer pick up
>>>> the opaque value of the region and interpret it.
>>>
>>> That's just an intermediate step.  The idea is that eventually opaque
>>> will be changed to Object *.
>>>
>>>> Even worse, apply
>>>> restrictions on how the dispatched objects, the regions, have to be
>>>> treated because of this.
>>>
>>> Please elaborate.
>>
>> The fact that you can't manipulate a memory region object arbitrarily
>> after removing it from the mapping because you track the references to
>> the object that the region points to, not the region itself. The region
>> remains in use by the dispatching layer and potentially the called
>> device, even after deregistration.
>
> Why can it in use by dispatching layer? Memory region just servers as
> interface for mem lookup, when dispatching in mr's MemoryRegionOps,
> why do we need to access it?  Which means that read/write can cause
> register/deregister mr?  Example?

I found example in pci_update_mappings(), but but my following point
still stand.
> Also I think MemoryRegionOps will eventually operate on Object (mr is
> only interface). MRs is only represent of Object in the view of
> memory(so their life cycle can be managed by Object). And there could
> be some intermediate mr like those lie in subpage_t which are not
> based on Object. But we can walk down to the terminal point and
> extract the real Object,
> so when dispatching, we only need to ensure that Object and its direct
> mr are alive.
>
> Regards,
> pingfan
>>
>> Jan
>>
>> --
>> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
>> Corporate Competence Center Embedded Linux
Jan Kiszka - Aug. 28, 2012, 9:42 a.m.
On 2012-08-28 05:38, liu ping fan wrote:
> On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan <qemulist@gmail.com> wrote:
>> On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> On 2012-08-27 20:52, Avi Kivity wrote:
>>>> On 08/27/2012 11:39 AM, Jan Kiszka wrote:
>>>>> On 2012-08-27 20:20, Avi Kivity wrote:
>>>>>> On 08/27/2012 11:17 AM, Jan Kiszka wrote:
>>>>>>> On 2012-08-27 20:09, Avi Kivity wrote:
>>>>>>>> On 08/27/2012 10:14 AM, Jan Kiszka wrote:
>>>>>>>>>>
>>>>>>>>>> Deregistration is fine, the problem is destruction.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It isn't as you access memory region states that can change after
>>>>>>>>> deregistration. Devices can remove memory regions from the mapping,
>>>>>>>>> alter and then reinsert them. The last to steps must not happen while
>>>>>>>>> anyone is still using a reference to that region.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why not?  If the guest is accessing an mmio region while reconfiguring
>>>>>>>> it in a way that changes its meaning, either the previous or the next
>>>>>>>> meaning is valid.
>>>>>>>
>>>>>>> If the memory region owner sets the content to zero or even releases it
>>>>>>> (nothing states a memory region can only live inside a device
>>>>>>> structure), we will crash. Restricting how a memory region can be
>>>>>>> created and handled after it was once registered somewhere is an
>>>>>>> unnatural interface, waiting to cause subtle bugs.
>>>>>>
>>>>>> Using an Object * allows the simple case to be really simple (object ==
>>>>>> device) and the hard cases to be doable.
>>>>>>
>>>>>> What would you suggest as a better interface?
>>>>>
>>>>> To protect the life cycle of the object we manage in the memory layer:
>>>>> regions. We don't manage devices there. If there is any implementation
>>>>> benefit in having a full QOM object, then make memory regions objects.
>>>>
>>>> I see and sympathise with this point of view.
>>>>
>>>> The idea to use opaque and convert it to Object * is that often a device
>>>> has multiple regions but no interest in managing them separately.  So
>>>> managing regions will cause the device model authors to create
>>>> boilerplate code to push region management to device management.
>>>>
>>>> I had this experience with the first iterations of the region interface,
>>>> where instead of opaque we had MemoryRegion *.  Device code would use
>>>> container_of() in the simple case, but the non-simple case caused lots
>>>> of pain.  I believe it is the natural interface, but in practice it
>>>> turned out to cause too much churn.
>>>>
>>>>> I simply don't like this indirection, having the memory layer pick up
>>>>> the opaque value of the region and interpret it.
>>>>
>>>> That's just an intermediate step.  The idea is that eventually opaque
>>>> will be changed to Object *.
>>>>
>>>>> Even worse, apply
>>>>> restrictions on how the dispatched objects, the regions, have to be
>>>>> treated because of this.
>>>>
>>>> Please elaborate.
>>>
>>> The fact that you can't manipulate a memory region object arbitrarily
>>> after removing it from the mapping because you track the references to
>>> the object that the region points to, not the region itself. The region
>>> remains in use by the dispatching layer and potentially the called
>>> device, even after deregistration.
>>
>> Why can it in use by dispatching layer? Memory region just servers as
>> interface for mem lookup, when dispatching in mr's MemoryRegionOps,
>> why do we need to access it?  Which means that read/write can cause
>> register/deregister mr?  Example?
> 
> I found example in pci_update_mappings(), but but my following point
> still stand.
>> Also I think MemoryRegionOps will eventually operate on Object (mr is
>> only interface). MRs is only represent of Object in the view of
>> memory(so their life cycle can be managed by Object). And there could
>> be some intermediate mr like those lie in subpage_t which are not
>> based on Object. But we can walk down to the terminal point and
>> extract the real Object,
>> so when dispatching, we only need to ensure that Object and its direct
>> mr are alive.

Let's not talk about devices or MMIO dispatching. I think the problem is
way more generic, and we will face it multiple times in QEMU. Besides
MMIO and PIO dispatching, it will haunt us for file or event handlers,
any kind of callbacks etc.

Context A                               Context B
---------                               ---------
                                        object = lookup()
deregister(object)
modify(object) -> invalid state
...                                     use(object)
modify(object) -> valid state
register(object)

And with "object" I'm not talking about QOM but any data structure.

Jan
Paolo Bonzini - Aug. 28, 2012, 10:05 a.m.
Il 28/08/2012 11:42, Jan Kiszka ha scritto:
> Context A                               Context B
> ---------                               ---------
>                                         object = lookup()
> deregister(object)
> modify(object) -> invalid state
> ...                                     use(object)
> modify(object) -> valid state
> register(object)
> 
> And with "object" I'm not talking about QOM but any data structure.

If you want to avoid locks, the only way to do so is RCU.  Then you do
not modify object at all.

Paolo
Avi Kivity - Aug. 29, 2012, 5:13 p.m.
On 08/27/2012 06:01 PM, Jan Kiszka wrote:
> On 2012-08-27 22:53, Avi Kivity wrote:
> > On 08/27/2012 12:38 PM, Jan Kiszka wrote:
> >>>> Even worse, apply
> >>>> restrictions on how the dispatched objects, the regions, have to be
> >>>> treated because of this.
> >>>
> >>> Please elaborate.
> >>
> >> The fact that you can't manipulate a memory region object arbitrarily
> >> after removing it from the mapping because you track the references to
> >> the object that the region points to, not the region itself. The region
> >> remains in use by the dispatching layer and potentially the called
> >> device, even after deregistration.
> > 
> > That object will be a container_of() the region, usually literally but
> > sometimes only in spirit.  Reference counting the regions means they
> > cannot be embedded into other objects any more.
>
> I cannot follow the logic of the last sentence. Reference counting just
> means that we track if there are users left, not necessarily that we
> perform asynchronous releases. We can simply wait for those users to
> complete.

I don't see how.  Suppose you add a reference count to MemoryRegion. 
How do you delay its containing object's destructor from running?  Do
you iterate over all member MemoryRegion and examine their reference counts?

Usually a reference count controls the lifetime of the reference counted
object, what are you suggesting here?

> > 
> > We can probably figure out a way to flush out accesses.  After switching
> > to rcu, for example, all we need is synchronize_rcu() in a
> > non-deadlocking place.  But my bet is that it will not be needed.
>
> If you properly flush out accesses, you don't need to track at device
> level anymore - and mess with abstraction layers. That's my whole point.

To flush out an access you need either rwlock_write_lock() or
synchronize_rcu() (depending on the implementation).  But neither of
these can be run from an rcu read-side critical section or
rwlock_read_lock().

You could defer the change to a bottom half, but if the hardware demands
that the change be complete before returning, that doesn't work.
Jan Kiszka - Aug. 29, 2012, 5:21 p.m.
On 2012-08-29 19:13, Avi Kivity wrote:
> On 08/27/2012 06:01 PM, Jan Kiszka wrote:
>> On 2012-08-27 22:53, Avi Kivity wrote:
>>> On 08/27/2012 12:38 PM, Jan Kiszka wrote:
>>>>>> Even worse, apply
>>>>>> restrictions on how the dispatched objects, the regions, have to be
>>>>>> treated because of this.
>>>>>
>>>>> Please elaborate.
>>>>
>>>> The fact that you can't manipulate a memory region object arbitrarily
>>>> after removing it from the mapping because you track the references to
>>>> the object that the region points to, not the region itself. The region
>>>> remains in use by the dispatching layer and potentially the called
>>>> device, even after deregistration.
>>>
>>> That object will be a container_of() the region, usually literally but
>>> sometimes only in spirit.  Reference counting the regions means they
>>> cannot be embedded into other objects any more.
>>
>> I cannot follow the logic of the last sentence. Reference counting just
>> means that we track if there are users left, not necessarily that we
>> perform asynchronous releases. We can simply wait for those users to
>> complete.
> 
> I don't see how.  Suppose you add a reference count to MemoryRegion. 
> How do you delay its containing object's destructor from running?  Do
> you iterate over all member MemoryRegion and examine their reference counts?

memory_region_transaction_commit (or calls that trigger this) will have
to wait. That is required as the caller may start fiddling with the
region right afterward.

> 
> Usually a reference count controls the lifetime of the reference counted
> object, what are you suggesting here?

To synchronize on reference count going to zero. Or all readers leaving
the read-side critical sections.

> 
>>>
>>> We can probably figure out a way to flush out accesses.  After switching
>>> to rcu, for example, all we need is synchronize_rcu() in a
>>> non-deadlocking place.  But my bet is that it will not be needed.
>>
>> If you properly flush out accesses, you don't need to track at device
>> level anymore - and mess with abstraction layers. That's my whole point.
> 
> To flush out an access you need either rwlock_write_lock() or
> synchronize_rcu() (depending on the implementation).  But neither of
> these can be run from an rcu read-side critical section or
> rwlock_read_lock().
> 
> You could defer the change to a bottom half, but if the hardware demands
> that the change be complete before returning, that doesn't work.

Right, therefore synchronous transactions.

Jan
Avi Kivity - Aug. 29, 2012, 5:23 p.m.
On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>
> Let's not talk about devices or MMIO dispatching. I think the problem is
> way more generic, and we will face it multiple times in QEMU. 

The problem exists outside qemu as well.  It is one of the reasons for
the popularity of garbage collection IMO, and the use of reference
counting when GC is not available.

This pattern is even documented in
Documentation/DocBook/kernel-locking.tmpl:

@@ -104,12 +114,11 @@
 struct object *cache_find(int id)
 {
         struct object *obj;
-        unsigned long flags;

-        spin_lock_irqsave(&amp;cache_lock, flags);
+        rcu_read_lock();
         obj = __cache_find(id);
         if (obj)
                 object_get(obj);
-        spin_unlock_irqrestore(&amp;cache_lock, flags);
+        rcu_read_unlock();
  

Of course that doesn't mean we should use it, but at least it indicates
that it is a standard pattern.  With MemoryRegion the pattern is
changed, since MemoryRegion is a thunk, not the object we're really
dispatching to.

> Besides
> MMIO and PIO dispatching, it will haunt us for file or event handlers,
> any kind of callbacks etc.
>
> Context A                               Context B
> ---------                               ---------
>                                         object = lookup()
> deregister(object)
> modify(object) -> invalid state
> ...                                     use(object)
> modify(object) -> valid state
> register(object)
>
> And with "object" I'm not talking about QOM but any data structure.
>


Context B
---------
rcu_read_lock()
object = lookup()
if (object) {
    ref(object)
}
rcu_read_unlock()

use(object)

unref(object)

(substitute locking scheme to conform to taste and/or dietary
restrictions)
Avi Kivity - Aug. 29, 2012, 5:27 p.m.
On 08/29/2012 10:21 AM, Jan Kiszka wrote:
> On 2012-08-29 19:13, Avi Kivity wrote:
> > On 08/27/2012 06:01 PM, Jan Kiszka wrote:
> >> On 2012-08-27 22:53, Avi Kivity wrote:
> >>> On 08/27/2012 12:38 PM, Jan Kiszka wrote:
> >>>>>> Even worse, apply
> >>>>>> restrictions on how the dispatched objects, the regions, have to be
> >>>>>> treated because of this.
> >>>>>
> >>>>> Please elaborate.
> >>>>
> >>>> The fact that you can't manipulate a memory region object arbitrarily
> >>>> after removing it from the mapping because you track the references to
> >>>> the object that the region points to, not the region itself. The region
> >>>> remains in use by the dispatching layer and potentially the called
> >>>> device, even after deregistration.
> >>>
> >>> That object will be a container_of() the region, usually literally but
> >>> sometimes only in spirit.  Reference counting the regions means they
> >>> cannot be embedded into other objects any more.
> >>
> >> I cannot follow the logic of the last sentence. Reference counting just
> >> means that we track if there are users left, not necessarily that we
> >> perform asynchronous releases. We can simply wait for those users to
> >> complete.
> > 
> > I don't see how.  Suppose you add a reference count to MemoryRegion. 
> > How do you delay its containing object's destructor from running?  Do
> > you iterate over all member MemoryRegion and examine their reference counts?
>
> memory_region_transaction_commit (or calls that trigger this) will have
> to wait. That is required as the caller may start fiddling with the
> region right afterward.

However, it may be running within an mmio dispatch, so it may wait forever.

Ignoring this, how does it wait? sleep()? or wait for a completion?

> > 
> > Usually a reference count controls the lifetime of the reference counted
> > object, what are you suggesting here?
>
> To synchronize on reference count going to zero. 

Quite unorthodox.  Do you have real life examples?

> Or all readers leaving
> the read-side critical sections.

This is rcu.  But again, we may be in an rcu read-side critical section
while needing to wait.  In fact this was what I suggested in the first
place, until Marcelo pointed out the deadlock, so we came up with the
refcount.

>
> > 
> >>>
> >>> We can probably figure out a way to flush out accesses.  After switching
> >>> to rcu, for example, all we need is synchronize_rcu() in a
> >>> non-deadlocking place.  But my bet is that it will not be needed.
> >>
> >> If you properly flush out accesses, you don't need to track at device
> >> level anymore - and mess with abstraction layers. That's my whole point.
> > 
> > To flush out an access you need either rwlock_write_lock() or
> > synchronize_rcu() (depending on the implementation).  But neither of
> > these can be run from an rcu read-side critical section or
> > rwlock_read_lock().
> > 
> > You could defer the change to a bottom half, but if the hardware demands
> > that the change be complete before returning, that doesn't work.
>
> Right, therefore synchronous transactions.

I don't follow.  Synchronous transactions mean you can't
synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
the source of the problem!
Jan Kiszka - Aug. 29, 2012, 5:30 p.m.
On 2012-08-29 19:23, Avi Kivity wrote:
> On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>>
>> Let's not talk about devices or MMIO dispatching. I think the problem is
>> way more generic, and we will face it multiple times in QEMU. 
> 
> The problem exists outside qemu as well.  It is one of the reasons for
> the popularity of garbage collection IMO, and the use of reference
> counting when GC is not available.
> 
> This pattern is even documented in
> Documentation/DocBook/kernel-locking.tmpl:
> 
> @@ -104,12 +114,11 @@
>  struct object *cache_find(int id)
>  {
>          struct object *obj;
> -        unsigned long flags;
> 
> -        spin_lock_irqsave(&amp;cache_lock, flags);
> +        rcu_read_lock();
>          obj = __cache_find(id);
>          if (obj)
>                  object_get(obj);
> -        spin_unlock_irqrestore(&amp;cache_lock, flags);
> +        rcu_read_unlock();
>   
> 
> Of course that doesn't mean we should use it, but at least it indicates
> that it is a standard pattern.  With MemoryRegion the pattern is
> changed, since MemoryRegion is a thunk, not the object we're really
> dispatching to.

We are dispatching according to the memory region (parameters, op
handlers, opaques). If we end up in device object is not decided at this
level. A memory region describes a dispatchable area - not to confuse
with a device that may only partially be able to receive such requests.

> 
>> Besides
>> MMIO and PIO dispatching, it will haunt us for file or event handlers,
>> any kind of callbacks etc.
>>
>> Context A                               Context B
>> ---------                               ---------
>>                                         object = lookup()
>> deregister(object)
>> modify(object) -> invalid state
>> ...                                     use(object)
>> modify(object) -> valid state
>> register(object)
>>
>> And with "object" I'm not talking about QOM but any data structure.
>>
> 
> 
> Context B
> ---------
> rcu_read_lock()
> object = lookup()
> if (object) {
>     ref(object)
> }
> rcu_read_unlock()
> 
> use(object)
> 
> unref(object)
> 
> (substitute locking scheme to conform to taste and/or dietary
> restrictions)   
> 

+ wait for refcount(object) == 0 in deregister(object). That's what I'm
proposing.

Jan
Avi Kivity - Aug. 29, 2012, 5:40 p.m.
On 08/29/2012 10:30 AM, Jan Kiszka wrote:
> On 2012-08-29 19:23, Avi Kivity wrote:
> > On 08/28/2012 02:42 AM, Jan Kiszka wrote:
> >>
> >> Let's not talk about devices or MMIO dispatching. I think the problem is
> >> way more generic, and we will face it multiple times in QEMU. 
> > 
> > The problem exists outside qemu as well.  It is one of the reasons for
> > the popularity of garbage collection IMO, and the use of reference
> > counting when GC is not available.
> > 
> > This pattern is even documented in
> > Documentation/DocBook/kernel-locking.tmpl:
> > 
> > @@ -104,12 +114,11 @@
> >  struct object *cache_find(int id)
> >  {
> >          struct object *obj;
> > -        unsigned long flags;
> > 
> > -        spin_lock_irqsave(&amp;cache_lock, flags);
> > +        rcu_read_lock();
> >          obj = __cache_find(id);
> >          if (obj)
> >                  object_get(obj);
> > -        spin_unlock_irqrestore(&amp;cache_lock, flags);
> > +        rcu_read_unlock();
> >   
> > 
> > Of course that doesn't mean we should use it, but at least it indicates
> > that it is a standard pattern.  With MemoryRegion the pattern is
> > changed, since MemoryRegion is a thunk, not the object we're really
> > dispatching to.
>
> We are dispatching according to the memory region (parameters, op
> handlers, opaques). If we end up in device object is not decided at this
> level. A memory region describes a dispatchable area - not to confuse
> with a device that may only partially be able to receive such requests.

Correct.

Let's experiment with refcounting MemoryRegion.  We can move the entire
contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
refcounted MemoryRegionImpl:

struct MemoryRegion {
    struct MemoryRegionImpl *impl;
};

struct MemoryRegionImpl {
    atomic int refs;
    ...
};

The memory core can then store MemoryRegion copies (with elevated
refcounts) instead of pointers.  Devices can destroy MemoryRegions at
any time, the implementation will not go away.  However, what of the
opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.

One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
lock, examines the ->enabled member, and bails out if it is cleared. 
The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
clears ->enabled,  releases the lock, and drops the reference.

The advantage to this scheme is that all changes are localized to the
memory core, no need for a full sweep.  It is a little complicated, but
we may be able to simplify it (or find another one).

>
> > 
> >> Besides
> >> MMIO and PIO dispatching, it will haunt us for file or event handlers,
> >> any kind of callbacks etc.
> >>
> >> Context A                               Context B
> >> ---------                               ---------
> >>                                         object = lookup()
> >> deregister(object)
> >> modify(object) -> invalid state
> >> ...                                     use(object)
> >> modify(object) -> valid state
> >> register(object)
> >>
> >> And with "object" I'm not talking about QOM but any data structure.
> >>
> > 
> > 
> > Context B
> > ---------
> > rcu_read_lock()
> > object = lookup()
> > if (object) {
> >     ref(object)
> > }
> > rcu_read_unlock()
> > 
> > use(object)
> > 
> > unref(object)
> > 
> > (substitute locking scheme to conform to taste and/or dietary
> > restrictions)   
> > 
>
> + wait for refcount(object) == 0 in deregister(object). That's what I'm
> proposing.

Consider timer_del() called from a timer callback.  It's often not doable.
Jan Kiszka - Aug. 29, 2012, 5:41 p.m.
On 2012-08-29 19:27, Avi Kivity wrote:
> On 08/29/2012 10:21 AM, Jan Kiszka wrote:
>> On 2012-08-29 19:13, Avi Kivity wrote:
>>> On 08/27/2012 06:01 PM, Jan Kiszka wrote:
>>>> On 2012-08-27 22:53, Avi Kivity wrote:
>>>>> On 08/27/2012 12:38 PM, Jan Kiszka wrote:
>>>>>>>> Even worse, apply
>>>>>>>> restrictions on how the dispatched objects, the regions, have to be
>>>>>>>> treated because of this.
>>>>>>>
>>>>>>> Please elaborate.
>>>>>>
>>>>>> The fact that you can't manipulate a memory region object arbitrarily
>>>>>> after removing it from the mapping because you track the references to
>>>>>> the object that the region points to, not the region itself. The region
>>>>>> remains in use by the dispatching layer and potentially the called
>>>>>> device, even after deregistration.
>>>>>
>>>>> That object will be a container_of() the region, usually literally but
>>>>> sometimes only in spirit.  Reference counting the regions means they
>>>>> cannot be embedded into other objects any more.
>>>>
>>>> I cannot follow the logic of the last sentence. Reference counting just
>>>> means that we track if there are users left, not necessarily that we
>>>> perform asynchronous releases. We can simply wait for those users to
>>>> complete.
>>>
>>> I don't see how.  Suppose you add a reference count to MemoryRegion. 
>>> How do you delay its containing object's destructor from running?  Do
>>> you iterate over all member MemoryRegion and examine their reference counts?
>>
>> memory_region_transaction_commit (or calls that trigger this) will have
>> to wait. That is required as the caller may start fiddling with the
>> region right afterward.
> 
> However, it may be running within an mmio dispatch, so it may wait forever.

We won't be able to wait for devices that can acquire the same lock we
hold while waiting, of course. That's why a lock ordering device-lock ->
BQL (the one we hold over memory region changes) won't be allowed. I was
wrong on this before: We must not take course-grained locks while
holding fine-grained ones, only the other way around. Pretty obvious,
specifically for realtime / low-latency.

> 
> Ignoring this, how does it wait? sleep()? or wait for a completion?

Ideally via completion.

> 
>>>
>>> Usually a reference count controls the lifetime of the reference counted
>>> object, what are you suggesting here?
>>
>> To synchronize on reference count going to zero. 
> 
> Quite unorthodox.  Do you have real life examples?

synchronize_rcu.

> 
>> Or all readers leaving
>> the read-side critical sections.
> 
> This is rcu.  But again, we may be in an rcu read-side critical section
> while needing to wait.  In fact this was what I suggested in the first
> place, until Marcelo pointed out the deadlock, so we came up with the
> refcount.

We must avoid that deadlock scenario. I bended my brain around it, and I
think that is the only answer. We can if we apply clear rules regarding
locking and BQL-protected services to those devices that will actually
use fine-grained locking, and there only for those paths that take
per-device locks. I'm pretty sure we won't get far with an
"all-or-nothing" model where every device uses a private lock.

> 
>>
>>>
>>>>>
>>>>> We can probably figure out a way to flush out accesses.  After switching
>>>>> to rcu, for example, all we need is synchronize_rcu() in a
>>>>> non-deadlocking place.  But my bet is that it will not be needed.
>>>>
>>>> If you properly flush out accesses, you don't need to track at device
>>>> level anymore - and mess with abstraction layers. That's my whole point.
>>>
>>> To flush out an access you need either rwlock_write_lock() or
>>> synchronize_rcu() (depending on the implementation).  But neither of
>>> these can be run from an rcu read-side critical section or
>>> rwlock_read_lock().
>>>
>>> You could defer the change to a bottom half, but if the hardware demands
>>> that the change be complete before returning, that doesn't work.
>>
>> Right, therefore synchronous transactions.
> 
> I don't follow.  Synchronous transactions mean you can't
> synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
> the source of the problem!

Our current device models assume synchronous transactions on the memory
layer, actually on all layers. The proposals I saw so far try to change
this. But I'm very skeptical that this would succeed, due to the
conversion effort and due to the fact that it would give us a tricky  to
use API.

Jan
Jan Kiszka - Aug. 29, 2012, 5:49 p.m.
On 2012-08-29 19:40, Avi Kivity wrote:
> On 08/29/2012 10:30 AM, Jan Kiszka wrote:
>> On 2012-08-29 19:23, Avi Kivity wrote:
>>> On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>>>>
>>>> Let's not talk about devices or MMIO dispatching. I think the problem is
>>>> way more generic, and we will face it multiple times in QEMU. 
>>>
>>> The problem exists outside qemu as well.  It is one of the reasons for
>>> the popularity of garbage collection IMO, and the use of reference
>>> counting when GC is not available.
>>>
>>> This pattern is even documented in
>>> Documentation/DocBook/kernel-locking.tmpl:
>>>
>>> @@ -104,12 +114,11 @@
>>>  struct object *cache_find(int id)
>>>  {
>>>          struct object *obj;
>>> -        unsigned long flags;
>>>
>>> -        spin_lock_irqsave(&amp;cache_lock, flags);
>>> +        rcu_read_lock();
>>>          obj = __cache_find(id);
>>>          if (obj)
>>>                  object_get(obj);
>>> -        spin_unlock_irqrestore(&amp;cache_lock, flags);
>>> +        rcu_read_unlock();
>>>   
>>>
>>> Of course that doesn't mean we should use it, but at least it indicates
>>> that it is a standard pattern.  With MemoryRegion the pattern is
>>> changed, since MemoryRegion is a thunk, not the object we're really
>>> dispatching to.
>>
>> We are dispatching according to the memory region (parameters, op
>> handlers, opaques). If we end up in device object is not decided at this
>> level. A memory region describes a dispatchable area - not to confuse
>> with a device that may only partially be able to receive such requests.
> 
> Correct.
> 
> Let's experiment with refcounting MemoryRegion.  We can move the entire
> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
> refcounted MemoryRegionImpl:
> 
> struct MemoryRegion {
>     struct MemoryRegionImpl *impl;
> };
> 
> struct MemoryRegionImpl {
>     atomic int refs;
>     ...
> };
> 
> The memory core can then store MemoryRegion copies (with elevated
> refcounts) instead of pointers.  Devices can destroy MemoryRegions at
> any time, the implementation will not go away.  However, what of the
> opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
> 
> One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
> lock, examines the ->enabled member, and bails out if it is cleared. 
> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
> clears ->enabled,  releases the lock, and drops the reference.

That means holding the MemoryRegionImpl lock across the handler call?

> 
> The advantage to this scheme is that all changes are localized to the
> memory core, no need for a full sweep.  It is a little complicated, but
> we may be able to simplify it (or find another one).

May work. We just need to detect if memory region tries to delete itself
from its handler to prevent the deadlock.

> 
>>
>>>
>>>> Besides
>>>> MMIO and PIO dispatching, it will haunt us for file or event handlers,
>>>> any kind of callbacks etc.
>>>>
>>>> Context A                               Context B
>>>> ---------                               ---------
>>>>                                         object = lookup()
>>>> deregister(object)
>>>> modify(object) -> invalid state
>>>> ...                                     use(object)
>>>> modify(object) -> valid state
>>>> register(object)
>>>>
>>>> And with "object" I'm not talking about QOM but any data structure.
>>>>
>>>
>>>
>>> Context B
>>> ---------
>>> rcu_read_lock()
>>> object = lookup()
>>> if (object) {
>>>     ref(object)
>>> }
>>> rcu_read_unlock()
>>>
>>> use(object)
>>>
>>> unref(object)
>>>
>>> (substitute locking scheme to conform to taste and/or dietary
>>> restrictions)   
>>>
>>
>> + wait for refcount(object) == 0 in deregister(object). That's what I'm
>> proposing.
> 
> Consider timer_del() called from a timer callback.  It's often not doable.

This is inherently synchronous already (when working against the same
alarm timer backend). We can detect this in timer_del and skip waiting,
as in the above scenario.

Jan
pingfan liu - Aug. 30, 2012, 5:54 a.m.
On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity <avi@redhat.com> wrote:
> On 08/29/2012 10:30 AM, Jan Kiszka wrote:
>> On 2012-08-29 19:23, Avi Kivity wrote:
>> > On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>> >>
>> >> Let's not talk about devices or MMIO dispatching. I think the problem is
>> >> way more generic, and we will face it multiple times in QEMU.
>> >
>> > The problem exists outside qemu as well.  It is one of the reasons for
>> > the popularity of garbage collection IMO, and the use of reference
>> > counting when GC is not available.
>> >
>> > This pattern is even documented in
>> > Documentation/DocBook/kernel-locking.tmpl:
>> >
>> > @@ -104,12 +114,11 @@
>> >  struct object *cache_find(int id)
>> >  {
>> >          struct object *obj;
>> > -        unsigned long flags;
>> >
>> > -        spin_lock_irqsave(&amp;cache_lock, flags);
>> > +        rcu_read_lock();
>> >          obj = __cache_find(id);
>> >          if (obj)
>> >                  object_get(obj);
>> > -        spin_unlock_irqrestore(&amp;cache_lock, flags);
>> > +        rcu_read_unlock();
>> >
>> >
>> > Of course that doesn't mean we should use it, but at least it indicates
>> > that it is a standard pattern.  With MemoryRegion the pattern is
>> > changed, since MemoryRegion is a thunk, not the object we're really
>> > dispatching to.
>>
>> We are dispatching according to the memory region (parameters, op
>> handlers, opaques). If we end up in device object is not decided at this
>> level. A memory region describes a dispatchable area - not to confuse
>> with a device that may only partially be able to receive such requests.
>
But I think the meaning of the memory region is for dispatching. If no
dispatching associated with mr, why need it exist in the system?  And
could you elaborate that who will be the ref holder of mr?

> Correct.
>
> Let's experiment with refcounting MemoryRegion.  We can move the entire
> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
> refcounted MemoryRegionImpl:
>
> struct MemoryRegion {
>     struct MemoryRegionImpl *impl;
> };
>
> struct MemoryRegionImpl {
>     atomic int refs;
>     ...
> };
>
> The memory core can then store MemoryRegion copies (with elevated
> refcounts) instead of pointers.  Devices can destroy MemoryRegions at
> any time, the implementation will not go away.  However, what of the
> opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
>
> One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
> lock, examines the ->enabled member, and bails out if it is cleared.
> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
> clears ->enabled,  releases the lock, and drops the reference.
>
Assume we can get the host objectX of opaque, then we can just
object_ref(objectX). So dispatching will not face the hard nut to
handle opaque.

Regards,
pingfan
> The advantage to this scheme is that all changes are localized to the
> memory core, no need for a full sweep.  It is a little complicated, but
> we may be able to simplify it (or find another one).
>
>>
>> >
>> >> Besides
>> >> MMIO and PIO dispatching, it will haunt us for file or event handlers,
>> >> any kind of callbacks etc.
>> >>
>> >> Context A                               Context B
>> >> ---------                               ---------
>> >>                                         object = lookup()
>> >> deregister(object)
>> >> modify(object) -> invalid state
>> >> ...                                     use(object)
>> >> modify(object) -> valid state
>> >> register(object)
>> >>
>> >> And with "object" I'm not talking about QOM but any data structure.
>> >>
>> >
>> >
>> > Context B
>> > ---------
>> > rcu_read_lock()
>> > object = lookup()
>> > if (object) {
>> >     ref(object)
>> > }
>> > rcu_read_unlock()
>> >
>> > use(object)
>> >
>> > unref(object)
>> >
>> > (substitute locking scheme to conform to taste and/or dietary
>> > restrictions)
>> >
>>
>> + wait for refcount(object) == 0 in deregister(object). That's what I'm
>> proposing.
>
> Consider timer_del() called from a timer callback.  It's often not doable.
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
Jan Kiszka - Aug. 30, 2012, 7:08 a.m.
On 2012-08-30 07:54, liu ping fan wrote:
> On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity <avi@redhat.com> wrote:
>> On 08/29/2012 10:30 AM, Jan Kiszka wrote:
>>> On 2012-08-29 19:23, Avi Kivity wrote:
>>>> On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>>>>>
>>>>> Let's not talk about devices or MMIO dispatching. I think the problem is
>>>>> way more generic, and we will face it multiple times in QEMU.
>>>>
>>>> The problem exists outside qemu as well.  It is one of the reasons for
>>>> the popularity of garbage collection IMO, and the use of reference
>>>> counting when GC is not available.
>>>>
>>>> This pattern is even documented in
>>>> Documentation/DocBook/kernel-locking.tmpl:
>>>>
>>>> @@ -104,12 +114,11 @@
>>>>  struct object *cache_find(int id)
>>>>  {
>>>>          struct object *obj;
>>>> -        unsigned long flags;
>>>>
>>>> -        spin_lock_irqsave(&amp;cache_lock, flags);
>>>> +        rcu_read_lock();
>>>>          obj = __cache_find(id);
>>>>          if (obj)
>>>>                  object_get(obj);
>>>> -        spin_unlock_irqrestore(&amp;cache_lock, flags);
>>>> +        rcu_read_unlock();
>>>>
>>>>
>>>> Of course that doesn't mean we should use it, but at least it indicates
>>>> that it is a standard pattern.  With MemoryRegion the pattern is
>>>> changed, since MemoryRegion is a thunk, not the object we're really
>>>> dispatching to.
>>>
>>> We are dispatching according to the memory region (parameters, op
>>> handlers, opaques). If we end up in device object is not decided at this
>>> level. A memory region describes a dispatchable area - not to confuse
>>> with a device that may only partially be able to receive such requests.
>>
> But I think the meaning of the memory region is for dispatching. If no
> dispatching associated with mr, why need it exist in the system?

Where did I say that memory regions should no longer be used for
dispatching? The point is to keep the clean layer separation between
memory regions and device objects instead of merging them together.

Given

       Object
       ^    ^
       |    |
Region 1    Region 2

you protect the object during dispatch, implicitly (and that is bad)
requiring that no region must change in that period. I say what rather
needs protection are the regions so that Region 2 can pass away and
maybe reappear independent of Region 1. And: I won't need to know the
type of that object the regions are referring to in this model. That's
the difference.

>  And
> could you elaborate that who will be the ref holder of mr?

The memory subsystem while running a memory region handler. If that will
be a reference counter or a per-region lock like Avi suggested, we still
need to find out.

Jan
pingfan liu - Aug. 30, 2012, 7:47 a.m.
On Thu, Aug 30, 2012 at 3:08 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-08-30 07:54, liu ping fan wrote:
>> On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity <avi@redhat.com> wrote:
>>> On 08/29/2012 10:30 AM, Jan Kiszka wrote:
>>>> On 2012-08-29 19:23, Avi Kivity wrote:
>>>>> On 08/28/2012 02:42 AM, Jan Kiszka wrote:
>>>>>>
>>>>>> Let's not talk about devices or MMIO dispatching. I think the problem is
>>>>>> way more generic, and we will face it multiple times in QEMU.
>>>>>
>>>>> The problem exists outside qemu as well.  It is one of the reasons for
>>>>> the popularity of garbage collection IMO, and the use of reference
>>>>> counting when GC is not available.
>>>>>
>>>>> This pattern is even documented in
>>>>> Documentation/DocBook/kernel-locking.tmpl:
>>>>>
>>>>> @@ -104,12 +114,11 @@
>>>>>  struct object *cache_find(int id)
>>>>>  {
>>>>>          struct object *obj;
>>>>> -        unsigned long flags;
>>>>>
>>>>> -        spin_lock_irqsave(&amp;cache_lock, flags);
>>>>> +        rcu_read_lock();
>>>>>          obj = __cache_find(id);
>>>>>          if (obj)
>>>>>                  object_get(obj);
>>>>> -        spin_unlock_irqrestore(&amp;cache_lock, flags);
>>>>> +        rcu_read_unlock();
>>>>>
>>>>>
>>>>> Of course that doesn't mean we should use it, but at least it indicates
>>>>> that it is a standard pattern.  With MemoryRegion the pattern is
>>>>> changed, since MemoryRegion is a thunk, not the object we're really
>>>>> dispatching to.
>>>>
>>>> We are dispatching according to the memory region (parameters, op
>>>> handlers, opaques). If we end up in device object is not decided at this
>>>> level. A memory region describes a dispatchable area - not to confuse
>>>> with a device that may only partially be able to receive such requests.
>>>
>> But I think the meaning of the memory region is for dispatching. If no
>> dispatching associated with mr, why need it exist in the system?
>
> Where did I say that memory regions should no longer be used for
> dispatching? The point is to keep the clean layer separation between
> memory regions and device objects instead of merging them together.
>
> Given
>
>        Object
>        ^    ^
>        |    |
> Region 1    Region 2
>
> you protect the object during dispatch, implicitly (and that is bad)
> requiring that no region must change in that period. I say what rather
> needs protection are the regions so that Region 2 can pass away and
> maybe reappear independent of Region 1. And: I won't need to know the

OK, I see, this is a strong reason.

Regards,
pingfan

> type of that object the regions are referring to in this model. That's
> the difference.
>
>>  And
>> could you elaborate that who will be the ref holder of mr?
>
> The memory subsystem while running a memory region handler. If that will
> be a reference counter or a per-region lock like Avi suggested, we still
> need to find out.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux
Avi Kivity - Sept. 1, 2012, 8:31 a.m.
On 08/29/2012 10:49 AM, Jan Kiszka wrote:
> > 
> > Let's experiment with refcounting MemoryRegion.  We can move the entire
> > contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
> > MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
> > refcounted MemoryRegionImpl:
> > 
> > struct MemoryRegion {
> >     struct MemoryRegionImpl *impl;
> > };
> > 
> > struct MemoryRegionImpl {
> >     atomic int refs;
> >     ...
> > };
> > 
> > The memory core can then store MemoryRegion copies (with elevated
> > refcounts) instead of pointers.  Devices can destroy MemoryRegions at
> > any time, the implementation will not go away.  However, what of the
> > opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
> > 
> > One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
> > lock, examines the ->enabled member, and bails out if it is cleared. 
> > The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
> > clears ->enabled,  releases the lock, and drops the reference.
>
> That means holding the MemoryRegionImpl lock across the handler call?

Blech.  As you said on the other side of this thread, we must not take a
coarse grained lock within a fine grained one, and
MemoryRegionImpl::lock is as fine as they get.

> > 
> > The advantage to this scheme is that all changes are localized to the
> > memory core, no need for a full sweep.  It is a little complicated, but
> > we may be able to simplify it (or find another one).
>
> May work. We just need to detect if memory region tries to delete itself
> from its handler to prevent the deadlock.

Those types of hacks are fragile.  IMO it just demonstrates what I said
earlier (then tried to disprove with this): if we call an opaque's
method, we must refcount or otherwise lock that opaque.  No way around it.

>
> > 
> >>
> >>>
> >>>> Besides
> >>>> MMIO and PIO dispatching, it will haunt us for file or event handlers,
> >>>> any kind of callbacks etc.
> >>>>
> >>>> Context A                               Context B
> >>>> ---------                               ---------
> >>>>                                         object = lookup()
> >>>> deregister(object)
> >>>> modify(object) -> invalid state
> >>>> ...                                     use(object)
> >>>> modify(object) -> valid state
> >>>> register(object)
> >>>>
> >>>> And with "object" I'm not talking about QOM but any data structure.
> >>>>
> >>>
> >>>
> >>> Context B
> >>> ---------
> >>> rcu_read_lock()
> >>> object = lookup()
> >>> if (object) {
> >>>     ref(object)
> >>> }
> >>> rcu_read_unlock()
> >>>
> >>> use(object)
> >>>
> >>> unref(object)
> >>>
> >>> (substitute locking scheme to conform to taste and/or dietary
> >>> restrictions)   
> >>>
> >>
> >> + wait for refcount(object) == 0 in deregister(object). That's what I'm
> >> proposing.
> > 
> > Consider timer_del() called from a timer callback.  It's often not doable.
>
> This is inherently synchronous already (when working against the same
> alarm timer backend). We can detect this in timer_del and skip waiting,
> as in the above scenario.

It can always be broken.  The timer callback takes the device lock to
update the device.  The device mmio handler, holding the device lock,
takes the timer lock to timer_mod.  Deadlock.
Avi Kivity - Sept. 1, 2012, 8:46 a.m.
On 08/30/2012 12:08 AM, Jan Kiszka wrote:
> >>>
> >>> We are dispatching according to the memory region (parameters, op
> >>> handlers, opaques). If we end up in device object is not decided at this
> >>> level. A memory region describes a dispatchable area - not to confuse
> >>> with a device that may only partially be able to receive such requests.
> >>
> > But I think the meaning of the memory region is for dispatching. If no
> > dispatching associated with mr, why need it exist in the system?
>
> Where did I say that memory regions should no longer be used for
> dispatching? The point is to keep the clean layer separation between
> memory regions and device objects instead of merging them together.

Believe me, that's exactly how I feel.  I guess no author of a module
wants to see it swallowed by a larger framework and see its design
completely changed.  Modules should be decoupled as much as possible. 
But I don't see a better way yet.

>
> Given
>
>        Object
>        ^    ^
>        |    |
> Region 1    Region 2
>
> you protect the object during dispatch, implicitly (and that is bad)
> requiring that no region must change in that period.

We only protect the object against removal.  After object_ref() has run,
the region may still be disabled.

>  I say what rather
> needs protection are the regions so that Region 2 can pass away and
> maybe reappear independent of Region 1.

Region 2 can go away until the device-supplied dispatch code takes the
device lock.  If the device chooses to use finer-grained locking, it can
allow region 2 (or even region 1) to change during dispatch.  The only
requirement is that region 1 is not destroyed.

>  And: I won't need to know the
> type of that object the regions are referring to in this model. That's
> the difference.

Sorry, I lost the reference.  What model?  Are you referring to my
broken MemoryRegionImpl split?

> >  And
> > could you elaborate that who will be the ref holder of mr?
>
> The memory subsystem while running a memory region handler. If that will
> be a reference counter or a per-region lock like Avi suggested, we still
> need to find out.
>

If we make the refcount/lock internal to the region, we must remove the
opaque, since the region won't protect it.

Replacing the opaque with container_of(mr) doesn't help, since we can't
refcount mr, only mr->impl.

We could externalize the refcounting and push it into device code.  This
means:

   memory_region_init_io(&s->mem, dev)

   ...

   object_ref(dev)
   memory_region_add_subregion(..., &dev->mr)

   ...

   memory_region_del_subregion(..., &dev->mr)  // implied flush
   object_unref(dev)

er, this must be wrong, since it's so simple
Jan Kiszka - Sept. 1, 2012, 8:57 a.m.
On 2012-09-01 10:31, Avi Kivity wrote:
> On 08/29/2012 10:49 AM, Jan Kiszka wrote:
>>>
>>> Let's experiment with refcounting MemoryRegion.  We can move the entire
>>> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
>>> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
>>> refcounted MemoryRegionImpl:
>>>
>>> struct MemoryRegion {
>>>     struct MemoryRegionImpl *impl;
>>> };
>>>
>>> struct MemoryRegionImpl {
>>>     atomic int refs;
>>>     ...
>>> };
>>>
>>> The memory core can then store MemoryRegion copies (with elevated
>>> refcounts) instead of pointers.  Devices can destroy MemoryRegions at
>>> any time, the implementation will not go away.  However, what of the
>>> opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
>>>
>>> One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
>>> lock, examines the ->enabled member, and bails out if it is cleared. 
>>> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
>>> clears ->enabled,  releases the lock, and drops the reference.
>>
>> That means holding the MemoryRegionImpl lock across the handler call?
> 
> Blech.  As you said on the other side of this thread, we must not take a
> coarse grained lock within a fine grained one, and
> MemoryRegionImpl::lock is as fine as they get.

Not sure what you compare here. MemoryRegionImpl::lock would be per
memory region, so with finer scope than the BQL but with similar scope
like a per-device lock.

> 
>>>
>>> The advantage to this scheme is that all changes are localized to the
>>> memory core, no need for a full sweep.  It is a little complicated, but
>>> we may be able to simplify it (or find another one).
>>
>> May work. We just need to detect if memory region tries to delete itself
>> from its handler to prevent the deadlock.
> 
> Those types of hacks are fragile.  IMO it just demonstrates what I said
> earlier (then tried to disprove with this): if we call an opaque's
> method, we must refcount or otherwise lock that opaque.  No way around it.

But that still doesn't solve the problem that we need to lock down the
*state* of the opaque during dispatch /wrt to memory region changes.
Just ensuring its existence is not enough unless we declare memory
region transactions to be asynchronous - and adapt all users.

> 
>>
>>>
>>>>
>>>>>
>>>>>> Besides
>>>>>> MMIO and PIO dispatching, it will haunt us for file or event handlers,
>>>>>> any kind of callbacks etc.
>>>>>>
>>>>>> Context A                               Context B
>>>>>> ---------                               ---------
>>>>>>                                         object = lookup()
>>>>>> deregister(object)
>>>>>> modify(object) -> invalid state
>>>>>> ...                                     use(object)
>>>>>> modify(object) -> valid state
>>>>>> register(object)
>>>>>>
>>>>>> And with "object" I'm not talking about QOM but any data structure.
>>>>>>
>>>>>
>>>>>
>>>>> Context B
>>>>> ---------
>>>>> rcu_read_lock()
>>>>> object = lookup()
>>>>> if (object) {
>>>>>     ref(object)
>>>>> }
>>>>> rcu_read_unlock()
>>>>>
>>>>> use(object)
>>>>>
>>>>> unref(object)
>>>>>
>>>>> (substitute locking scheme to conform to taste and/or dietary
>>>>> restrictions)   
>>>>>
>>>>
>>>> + wait for refcount(object) == 0 in deregister(object). That's what I'm
>>>> proposing.
>>>
>>> Consider timer_del() called from a timer callback.  It's often not doable.
>>
>> This is inherently synchronous already (when working against the same
>> alarm timer backend). We can detect this in timer_del and skip waiting,
>> as in the above scenario.
> 
> It can always be broken.  The timer callback takes the device lock to
> update the device.  The device mmio handler, holding the device lock,
> takes the timer lock to timer_mod.  Deadlock.

Well, how is this solved in Linux? By waiting on the callback in
hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no
opaque in the hrtimer API).

Jan
Avi Kivity - Sept. 1, 2012, 9:30 a.m.
On 09/01/2012 01:57 AM, Jan Kiszka wrote:
> On 2012-09-01 10:31, Avi Kivity wrote:
> > On 08/29/2012 10:49 AM, Jan Kiszka wrote:
> >>>
> >>> Let's experiment with refcounting MemoryRegion.  We can move the entire
> >>> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
> >>> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
> >>> refcounted MemoryRegionImpl:
> >>>
> >>> struct MemoryRegion {
> >>>     struct MemoryRegionImpl *impl;
> >>> };
> >>>
> >>> struct MemoryRegionImpl {
> >>>     atomic int refs;
> >>>     ...
> >>> };
> >>>
> >>> The memory core can then store MemoryRegion copies (with elevated
> >>> refcounts) instead of pointers.  Devices can destroy MemoryRegions at
> >>> any time, the implementation will not go away.  However, what of the
> >>> opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
> >>>
> >>> One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
> >>> lock, examines the ->enabled member, and bails out if it is cleared. 
> >>> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
> >>> clears ->enabled,  releases the lock, and drops the reference.
> >>
> >> That means holding the MemoryRegionImpl lock across the handler call?
> > 
> > Blech.  As you said on the other side of this thread, we must not take a
> > coarse grained lock within a fine grained one, and
> > MemoryRegionImpl::lock is as fine as they get.
>
> Not sure what you compare here. MemoryRegionImpl::lock would be per
> memory region, so with finer scope than the BQL but with similar scope
> like a per-device lock.

The dispatch may acquire the bql (in fact most will, unless we convert
everything).  A design that doesn't allow this is broken.  Even the
device lock is more coarse grained (since it covers all regions) and is
taken in a deadlock scenario as region manipulation will be done under
the device lock.  You say we can detect this, but I dislike it.

> > 
> >>>
> >>> The advantage to this scheme is that all changes are localized to the
> >>> memory core, no need for a full sweep.  It is a little complicated, but
> >>> we may be able to simplify it (or find another one).
> >>
> >> May work. We just need to detect if memory region tries to delete itself
> >> from its handler to prevent the deadlock.
> > 
> > Those types of hacks are fragile.  IMO it just demonstrates what I said
> > earlier (then tried to disprove with this): if we call an opaque's
> > method, we must refcount or otherwise lock that opaque.  No way around it.
>
> But that still doesn't solve the problem that we need to lock down the
> *state* of the opaque during dispatch /wrt to memory region changes.
> Just ensuring its existence is not enough unless we declare memory
> region transactions to be asynchronous - and adapt all users.

I expect no users will need change.

Changing the region offset has no effect on dispatch (in fact the region
itself doesn't change).  Changing subregions is fine.  Disabling a
region concurrently with access is the only potential problem, but this
is rare, and I expect it to just work.  We will have to audit all users,
yes.

> >>>> + wait for refcount(object) == 0 in deregister(object). That's what I'm
> >>>> proposing.
> >>>
> >>> Consider timer_del() called from a timer callback.  It's often not doable.
> >>
> >> This is inherently synchronous already (when working against the same
> >> alarm timer backend). We can detect this in timer_del and skip waiting,
> >> as in the above scenario.
> > 
> > It can always be broken.  The timer callback takes the device lock to
> > update the device.  The device mmio handler, holding the device lock,
> > takes the timer lock to timer_mod.  Deadlock.
>
> Well, how is this solved in Linux? By waiting on the callback in
> hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no
> opaque in the hrtimer API).

I don't consider that busy loop an elegant solution.

The original proposal mirrors dcache.  You can perform operations on a
dentry even after it is unlinked.
pingfan liu - Sept. 3, 2012, 7:44 a.m.
On Sat, Sep 1, 2012 at 4:46 PM, Avi Kivity <avi@redhat.com> wrote:
> On 08/30/2012 12:08 AM, Jan Kiszka wrote:
>> >>>
>> >>> We are dispatching according to the memory region (parameters, op
>> >>> handlers, opaques). If we end up in device object is not decided at this
>> >>> level. A memory region describes a dispatchable area - not to confuse
>> >>> with a device that may only partially be able to receive such requests.
>> >>
>> > But I think the meaning of the memory region is for dispatching. If no
>> > dispatching associated with mr, why need it exist in the system?
>>
>> Where did I say that memory regions should no longer be used for
>> dispatching? The point is to keep the clean layer separation between
>> memory regions and device objects instead of merging them together.
>
> Believe me, that's exactly how I feel.  I guess no author of a module
> wants to see it swallowed by a larger framework and see its design
> completely changed.  Modules should be decoupled as much as possible.
> But I don't see a better way yet.
>
>>
>> Given
>>
>>        Object
>>        ^    ^
>>        |    |
>> Region 1    Region 2
>>
>> you protect the object during dispatch, implicitly (and that is bad)
>> requiring that no region must change in that period.
>
> We only protect the object against removal.  After object_ref() has run,
> the region may still be disabled.
>
>>  I say what rather
>> needs protection are the regions so that Region 2 can pass away and
>> maybe reappear independent of Region 1.
>
> Region 2 can go away until the device-supplied dispatch code takes the
> device lock.  If the device chooses to use finer-grained locking, it can
> allow region 2 (or even region 1) to change during dispatch.  The only
> requirement is that region 1 is not destroyed.
>
>>  And: I won't need to know the
>> type of that object the regions are referring to in this model. That's
>> the difference.
>
> Sorry, I lost the reference.  What model?  Are you referring to my
> broken MemoryRegionImpl split?
>
>> >  And
>> > could you elaborate that who will be the ref holder of mr?
>>
>> The memory subsystem while running a memory region handler. If that will
>> be a reference counter or a per-region lock like Avi suggested, we still
>> need to find out.
>>
>
> If we make the refcount/lock internal to the region, we must remove the
> opaque, since the region won't protect it.
>
> Replacing the opaque with container_of(mr) doesn't help, since we can't
> refcount mr, only mr->impl.
>
I think you mean if using MemoryRegionImpl, then at this level, we had
better not touch the refcnt of container_of(mr) and should face the
mr->impl->refcnt. Right?

> We could externalize the refcounting and push it into device code.  This
> means:
>
>    memory_region_init_io(&s->mem, dev)
>
>    ...
>
>    object_ref(dev)
>    memory_region_add_subregion(..., &dev->mr)
>
>    ...
>
>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>    object_unref(dev)
>
I think "object_ref(dev)" just another style to push
MemoryRegionOps::object() to device level.  And I think we can bypass
it.   The caller (unplug, pci-reconfig ) of
memory_region_del_subregion() ensure the @dev is valid.
If the implied flush is implemented in synchronize,  _del_subregion()
will guarantee no reader for @dev->mr and @dev exist any longer. So I
think we can save both object_ref/unref(dev) for memory system.
The only problem is that whether we can implement it as synchronous or
not,  is it possible that we can launch a  _del_subregion(mr-X) in
mr-X's dispatcher?

Regards,
pingfan

> er, this must be wrong, since it's so simple
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
Avi Kivity - Sept. 3, 2012, 8:52 a.m.
On 09/03/2012 10:44 AM, liu ping fan wrote:
>>>
>>
>> If we make the refcount/lock internal to the region, we must remove the
>> opaque, since the region won't protect it.
>>
>> Replacing the opaque with container_of(mr) doesn't help, since we can't
>> refcount mr, only mr->impl.
>>
> I think you mean if using MemoryRegionImpl, then at this level, we had
> better not touch the refcnt of container_of(mr) and should face the
> mr->impl->refcnt. Right?

I did not understand the second part, sorry.

>> We could externalize the refcounting and push it into device code.  This
>> means:
>>
>>    memory_region_init_io(&s->mem, dev)
>>
>>    ...
>>
>>    object_ref(dev)
>>    memory_region_add_subregion(..., &dev->mr)
>>
>>    ...
>>
>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>    object_unref(dev)
>>
> I think "object_ref(dev)" just another style to push
> MemoryRegionOps::object() to device level.  And I think we can bypass
> it.   The caller (unplug, pci-reconfig ) of
> memory_region_del_subregion() ensure the @dev is valid.
> If the implied flush is implemented in synchronize,  _del_subregion()
> will guarantee no reader for @dev->mr and @dev exist any longer. 

The above code has a deadlock.  memory_region_del_subregion() may be
called under the device lock (since it may be the result of mmio to the
device), and if the flush uses synchronized_rcu(), it will wait forever
for the read-side critical section to complete.

> So I
> think we can save both object_ref/unref(dev) for memory system.
> The only problem is that whether we can implement it as synchronous or
> not,  is it possible that we can launch a  _del_subregion(mr-X) in
> mr-X's dispatcher?

Yes.  Real cases exist.

What alternatives remain?
Avi Kivity - Sept. 3, 2012, 9:09 a.m.
On 08/29/2012 08:41 PM, Jan Kiszka wrote:
>>>
>>> memory_region_transaction_commit (or calls that trigger this) will have
>>> to wait. That is required as the caller may start fiddling with the
>>> region right afterward.
>> 
>> However, it may be running within an mmio dispatch, so it may wait forever.
> 
> We won't be able to wait for devices that can acquire the same lock we
> hold while waiting, of course. That's why a lock ordering device-lock ->
> BQL (the one we hold over memory region changes) won't be allowed. I was
> wrong on this before: We must not take course-grained locks while
> holding fine-grained ones, only the other way around. Pretty obvious,
> specifically for realtime / low-latency.

So how do we wait?

Detecting that we're in the same thread and allowing reentrancy in that
case in the option, but I can't say I like it much.

>> 
>> Ignoring this, how does it wait? sleep()? or wait for a completion?
> 
> Ideally via completion.

Then you have to manage the life cycle of the completion object.

> 
>> 
>>>>
>>>> Usually a reference count controls the lifetime of the reference counted
>>>> object, what are you suggesting here?
>>>
>>> To synchronize on reference count going to zero. 
>> 
>> Quite unorthodox.  Do you have real life examples?
> 
> synchronize_rcu.

This whole thing started because synchronize_rcu() will deadlock if
called from a read-side critical section.  The fix was taking the
reference, so that the read-side critical section is confined to the
lookup, not the call.  This way the lock ordering is always device->map.

We could schedule a bh or a thread and call synchronize_rcu() from
there, but then the commit is deferred to an unknown time, whereas we
need to guarantee that by the time the guest is reentered, the
transaction is committed.

> 
>> 
>>> Or all readers leaving
>>> the read-side critical sections.
>> 
>> This is rcu.  But again, we may be in an rcu read-side critical section
>> while needing to wait.  In fact this was what I suggested in the first
>> place, until Marcelo pointed out the deadlock, so we came up with the
>> refcount.
> 
> We must avoid that deadlock scenario. I bended my brain around it, and I
> think that is the only answer. We can if we apply clear rules regarding
> locking and BQL-protected services to those devices that will actually
> use fine-grained locking, and there only for those paths that take
> per-device locks. I'm pretty sure we won't get far with an
> "all-or-nothing" model where every device uses a private lock.

An all-or-nothing model is not proposed.  Unconverted devices will take
the BQL instead of the device lock (hopefully the BQL will be taken for
them during dispatch, so no code changes will be needed).

>> 
>> I don't follow.  Synchronous transactions mean you can't
>> synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
>> the source of the problem!
> 
> Our current device models assume synchronous transactions on the memory
> layer, actually on all layers. The proposals I saw so far try to change
> this. But I'm very skeptical that this would succeed, due to the
> conversion effort and due to the fact that it would give us a tricky  to
> use API.

In what way do transactions become asynchronous?

It's true that io callbacks can be called concurrently, but as soon as
they take the device lock, they are serialized.  The only relaxation is
that a callback may be called after a region has been disabled or
removed from view.
pingfan liu - Sept. 3, 2012, 10:06 a.m.
On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/03/2012 10:44 AM, liu ping fan wrote:
>>>>
>>>
>>> If we make the refcount/lock internal to the region, we must remove the
>>> opaque, since the region won't protect it.
>>>
>>> Replacing the opaque with container_of(mr) doesn't help, since we can't
>>> refcount mr, only mr->impl.
>>>
>> I think you mean if using MemoryRegionImpl, then at this level, we had
>> better not touch the refcnt of container_of(mr) and should face the
>> mr->impl->refcnt. Right?
>
> I did not understand the second part, sorry.
>
My understanding of "Replacing the opaque with container_of(mr)
doesn't help, since we can't  refcount mr, only
mr->impl." is that although Object_ref(container_of(mr)) can help us
to protect it from disappearing. But apparently it is not right place
to do it it in memory core.   Do I catch you meaning?

>>> We could externalize the refcounting and push it into device code.  This
>>> means:
>>>
>>>    memory_region_init_io(&s->mem, dev)
>>>
>>>    ...
>>>
>>>    object_ref(dev)
>>>    memory_region_add_subregion(..., &dev->mr)
>>>
>>>    ...
>>>
>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>    object_unref(dev)
>>>
>> I think "object_ref(dev)" just another style to push
>> MemoryRegionOps::object() to device level.  And I think we can bypass
>> it.   The caller (unplug, pci-reconfig ) of
>> memory_region_del_subregion() ensure the @dev is valid.
>> If the implied flush is implemented in synchronize,  _del_subregion()
>> will guarantee no reader for @dev->mr and @dev exist any longer.
>
> The above code has a deadlock.  memory_region_del_subregion() may be
> called under the device lock (since it may be the result of mmio to the
> device), and if the flush uses synchronized_rcu(), it will wait forever
> for the read-side critical section to complete.
>
But if _del_subregion() just wait for mr-X quiescent period, while
calling in mr-Y's read side critical section, then we can avoid
deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

>> So I
>> think we can save both object_ref/unref(dev) for memory system.
>> The only problem is that whether we can implement it as synchronous or
>> not,  is it possible that we can launch a  _del_subregion(mr-X) in
>> mr-X's dispatcher?
>
> Yes.  Real cases exist.

Oh, I find the sample code, then, the deadlock is unavoidable in this method.
>
> What alternatives remain?
>
I think a way out may be async+refcnt

Regards,
pingfan
> --
> error compiling committee.c: too many arguments to function
Avi Kivity - Sept. 3, 2012, 10:16 a.m.
On 09/03/2012 01:06 PM, liu ping fan wrote:
> On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/03/2012 10:44 AM, liu ping fan wrote:
>>>>>
>>>>
>>>> If we make the refcount/lock internal to the region, we must remove the
>>>> opaque, since the region won't protect it.
>>>>
>>>> Replacing the opaque with container_of(mr) doesn't help, since we can't
>>>> refcount mr, only mr->impl.
>>>>
>>> I think you mean if using MemoryRegionImpl, then at this level, we had
>>> better not touch the refcnt of container_of(mr) and should face the
>>> mr->impl->refcnt. Right?
>>
>> I did not understand the second part, sorry.
>>
> My understanding of "Replacing the opaque with container_of(mr)
> doesn't help, since we can't  refcount mr, only
> mr->impl." is that although Object_ref(container_of(mr)) can help us
> to protect it from disappearing. But apparently it is not right place
> to do it it in memory core.   Do I catch you meaning?

We cannot call container_of(mr) in the memory core, because the
parameters to container_of() are not known.  But that is not the main issue.

Something we can do is make MemoryRegionOps::object() take a mr
parameter instead of opaque.  MemoryRegionOps::object() then uses
container_of() to derive the object to ref.

(works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan,
this decouples Object from MemoryRegion at the cost of extra boilerplate
in client code, but it may be worthwhile as a temporary measure while we
gain more experience with this)

> 
>>>> We could externalize the refcounting and push it into device code.  This
>>>> means:
>>>>
>>>>    memory_region_init_io(&s->mem, dev)
>>>>
>>>>    ...
>>>>
>>>>    object_ref(dev)
>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>
>>>>    ...
>>>>
>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>    object_unref(dev)
>>>>
>>> I think "object_ref(dev)" just another style to push
>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>> it.   The caller (unplug, pci-reconfig ) of
>>> memory_region_del_subregion() ensure the @dev is valid.
>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>
>> The above code has a deadlock.  memory_region_del_subregion() may be
>> called under the device lock (since it may be the result of mmio to the
>> device), and if the flush uses synchronized_rcu(), it will wait forever
>> for the read-side critical section to complete.
>>
> But if _del_subregion() just wait for mr-X quiescent period, while
> calling in mr-Y's read side critical section, then we can avoid
> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

Look at cirrus-vga.c, there are many calls to the memory API there.
While I doubt that we have one region disabling itself (or a subregion
of itself), that's all code that would be run under the same device
lock, and so would deadlock.
pingfan liu - Sept. 4, 2012, 2:33 a.m.
On Mon, Sep 3, 2012 at 6:16 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/03/2012 01:06 PM, liu ping fan wrote:
>> On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/03/2012 10:44 AM, liu ping fan wrote:
>>>>>>
>>>>>
>>>>> If we make the refcount/lock internal to the region, we must remove the
>>>>> opaque, since the region won't protect it.
>>>>>
>>>>> Replacing the opaque with container_of(mr) doesn't help, since we can't
>>>>> refcount mr, only mr->impl.
>>>>>
>>>> I think you mean if using MemoryRegionImpl, then at this level, we had
>>>> better not touch the refcnt of container_of(mr) and should face the
>>>> mr->impl->refcnt. Right?
>>>
>>> I did not understand the second part, sorry.
>>>
>> My understanding of "Replacing the opaque with container_of(mr)
>> doesn't help, since we can't  refcount mr, only
>> mr->impl." is that although Object_ref(container_of(mr)) can help us
>> to protect it from disappearing. But apparently it is not right place
>> to do it it in memory core.   Do I catch you meaning?
>
> We cannot call container_of(mr) in the memory core, because the
> parameters to container_of() are not known.  But that is not the main issue.
>
> Something we can do is make MemoryRegionOps::object() take a mr
> parameter instead of opaque.  MemoryRegionOps::object() then uses
> container_of() to derive the object to ref.
>
> (works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan,
> this decouples Object from MemoryRegion at the cost of extra boilerplate
> in client code, but it may be worthwhile as a temporary measure while we
> gain more experience with this)
>
Exactly catch your meaning, thanks.
>>
>>>>> We could externalize the refcounting and push it into device code.  This
>>>>> means:
>>>>>
>>>>>    memory_region_init_io(&s->mem, dev)
>>>>>
>>>>>    ...
>>>>>
>>>>>    object_ref(dev)
>>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>>
>>>>>    ...
>>>>>
>>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>>    object_unref(dev)
>>>>>
>>>> I think "object_ref(dev)" just another style to push
>>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>>> it.   The caller (unplug, pci-reconfig ) of
>>>> memory_region_del_subregion() ensure the @dev is valid.
>>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>>
>>> The above code has a deadlock.  memory_region_del_subregion() may be
>>> called under the device lock (since it may be the result of mmio to the
>>> device), and if the flush uses synchronized_rcu(), it will wait forever
>>> for the read-side critical section to complete.
>>>
>> But if _del_subregion() just wait for mr-X quiescent period, while
>> calling in mr-Y's read side critical section, then we can avoid
>> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.
>
> Look at cirrus-vga.c, there are many calls to the memory API there.
> While I doubt that we have one region disabling itself (or a subregion
> of itself), that's all code that would be run under the same device
> lock, and so would deadlock.
>
Oh, yes, quiescent time will never come since reader wait for the
lock! Got it, thanks.

pingfan
>
> --
> error compiling committee.c: too many arguments to function
pingfan liu - Sept. 4, 2012, 2:34 a.m.
On Mon, Sep 3, 2012 at 6:06 PM, liu ping fan <qemulist@gmail.com> wrote:
> On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/03/2012 10:44 AM, liu ping fan wrote:
>>>>>
>>>>
>>>> If we make the refcount/lock internal to the region, we must remove the
>>>> opaque, since the region won't protect it.
>>>>
>>>> Replacing the opaque with container_of(mr) doesn't help, since we can't
>>>> refcount mr, only mr->impl.
>>>>
>>> I think you mean if using MemoryRegionImpl, then at this level, we had
>>> better not touch the refcnt of container_of(mr) and should face the
>>> mr->impl->refcnt. Right?
>>
>> I did not understand the second part, sorry.
>>
> My understanding of "Replacing the opaque with container_of(mr)
> doesn't help, since we can't  refcount mr, only
> mr->impl." is that although Object_ref(container_of(mr)) can help us
> to protect it from disappearing. But apparently it is not right place
> to do it it in memory core.   Do I catch you meaning?
>
>>>> We could externalize the refcounting and push it into device code.  This
>>>> means:
>>>>
>>>>    memory_region_init_io(&s->mem, dev)
>>>>
>>>>    ...
>>>>
>>>>    object_ref(dev)
>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>
>>>>    ...
>>>>
>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>    object_unref(dev)
>>>>
>>> I think "object_ref(dev)" just another style to push
>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>> it.   The caller (unplug, pci-reconfig ) of
>>> memory_region_del_subregion() ensure the @dev is valid.
>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>
>> The above code has a deadlock.  memory_region_del_subregion() may be
>> called under the device lock (since it may be the result of mmio to the
>> device), and if the flush uses synchronized_rcu(), it will wait forever
>> for the read-side critical section to complete.
>>
> But if _del_subregion() just wait for mr-X quiescent period, while
> calling in mr-Y's read side critical section, then we can avoid
> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.
>
>>> So I
>>> think we can save both object_ref/unref(dev) for memory system.
>>> The only problem is that whether we can implement it as synchronous or
>>> not,  is it possible that we can launch a  _del_subregion(mr-X) in
>>> mr-X's dispatcher?
>>
>> Yes.  Real cases exist.
>
> Oh, I find the sample code, then, the deadlock is unavoidable in this method.
>>
>> What alternatives remain?
>>
> I think a way out may be async+refcnt
>
If we consider the relationship of MemoryRegionImpl and device as the
one between file and file->private_data  in Linux. Then the creation
of impl will object_ref(dev) and when impl->ref=0, it will
object_unref(dev)
But this is an async model, for those client which need to know the
end of flush, we can adopt callback just like call_rcu().



> Regards,
> pingfan
>> --
>> error compiling committee.c: too many arguments to function
pingfan liu - Sept. 5, 2012, 8:19 a.m.
[...]
>>
>>>>> We could externalize the refcounting and push it into device code.  This
>>>>> means:
>>>>>
>>>>>    memory_region_init_io(&s->mem, dev)
>>>>>
>>>>>    ...
>>>>>
>>>>>    object_ref(dev)
>>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>>
>>>>>    ...
>>>>>
>>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>>    object_unref(dev)
>>>>>
>>>> I think "object_ref(dev)" just another style to push
>>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>>> it.   The caller (unplug, pci-reconfig ) of
>>>> memory_region_del_subregion() ensure the @dev is valid.
>>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>>
>>> The above code has a deadlock.  memory_region_del_subregion() may be
>>> called under the device lock (since it may be the result of mmio to the
>>> device), and if the flush uses synchronized_rcu(), it will wait forever
>>> for the read-side critical section to complete.
>>>
>> But if _del_subregion() just wait for mr-X quiescent period, while
>> calling in mr-Y's read side critical section, then we can avoid
>> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.
>>
>>>> So I
>>>> think we can save both object_ref/unref(dev) for memory system.
>>>> The only problem is that whether we can implement it as synchronous or
>>>> not,  is it possible that we can launch a  _del_subregion(mr-X) in
>>>> mr-X's dispatcher?
>>>
>>> Yes.  Real cases exist.
>>
>> Oh, I find the sample code, then, the deadlock is unavoidable in this method.
>>>
>>> What alternatives remain?
>>>
>> I think a way out may be async+refcnt
>>
> If we consider the relationship of MemoryRegionImpl and device as the
> one between file and file->private_data  in Linux. Then the creation
> of impl will object_ref(dev) and when impl->ref=0, it will
> object_unref(dev)
> But this is an async model, for those client which need to know the
> end of flush, we can adopt callback just like call_rcu().
>
>
During this thread, it seems that no matter we adopt refcnt on
MemoryRegionImpl or not, protecting MemoryRegion::opaque during
dispatch is still required. It is challenging to substitute
memory_region_init_io() 's 3rd parameter from void* to Object*, owning
to the  conflict that life-cycle management need the host of opaque,
while Object and mr need 1:1 map. So I think, I will move forward on
the way based on MemoryRegionOps::object(). Right?

Regards,
pingfan

>
>> Regards,
>> pingfan
>>> --
>>> error compiling committee.c: too many arguments to function
Avi Kivity - Sept. 5, 2012, 9:52 a.m.
On 09/05/2012 11:19 AM, liu ping fan wrote:
> [...]
>>>
>>>>>> We could externalize the refcounting and push it into device code.  This
>>>>>> means:
>>>>>>
>>>>>>    memory_region_init_io(&s->mem, dev)
>>>>>>
>>>>>>    ...
>>>>>>
>>>>>>    object_ref(dev)
>>>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>>>
>>>>>>    ...
>>>>>>
>>>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>>>    object_unref(dev)
>>>>>>
>>>>> I think "object_ref(dev)" just another style to push
>>>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>>>> it.   The caller (unplug, pci-reconfig ) of
>>>>> memory_region_del_subregion() ensure the @dev is valid.
>>>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>>>
>>>> The above code has a deadlock.  memory_region_del_subregion() may be
>>>> called under the device lock (since it may be the result of mmio to the
>>>> device), and if the flush uses synchronized_rcu(), it will wait forever
>>>> for the read-side critical section to complete.
>>>>
>>> But if _del_subregion() just wait for mr-X quiescent period, while
>>> calling in mr-Y's read side critical section, then we can avoid
>>> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.
>>>
>>>>> So I
>>>>> think we can save both object_ref/unref(dev) for memory system.
>>>>> The only problem is that whether we can implement it as synchronous or
>>>>> not,  is it possible that we can launch a  _del_subregion(mr-X) in
>>>>> mr-X's dispatcher?
>>>>
>>>> Yes.  Real cases exist.
>>>
>>> Oh, I find the sample code, then, the deadlock is unavoidable in this method.
>>>>
>>>> What alternatives remain?
>>>>
>>> I think a way out may be async+refcnt
>>>
>> If we consider the relationship of MemoryRegionImpl and device as the
>> one between file and file->private_data  in Linux. Then the creation
>> of impl will object_ref(dev) and when impl->ref=0, it will
>> object_unref(dev)
>> But this is an async model, for those client which need to know the
>> end of flush, we can adopt callback just like call_rcu().
>>
>>
> During this thread, it seems that no matter we adopt refcnt on
> MemoryRegionImpl or not, protecting MemoryRegion::opaque during
> dispatch is still required. 

Right.  So MemoryRegionImpl is pointless.

My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
corresponding unref), which has the following requirements:

- if the refcount is nonzero, MemoryRegion::opaque is safe to use
- if the refcount is nonzero, the MemoryRegion itself is stable.

Later we can change other callbacks to also use mr instead of opaque.
Usually the callback will be able to derive the device object from the
region, only special cases will require
memory_region_opaque(MemoryRegion *).  We can even drop the opaque from
memory_region_init_io() and replace it with memory_region_set_opaque(),
to be used in those special cases.

> It is challenging to substitute
> memory_region_init_io() 's 3rd parameter from void* to Object*, owning
> to the  conflict that life-cycle management need the host of opaque,
> while Object and mr need 1:1 map. So I think, I will move forward on
> the way based on MemoryRegionOps::object(). Right?

As outlined above, I now prefer MemoryRegionOps::ref/unref.

Advantages compared to MemoryRegionOps::object():
 - decoupled from the QOM framework

Disadvantages:
 - more boilerplate code in converted devices

Since we are likely to convert few devices to lockless dispatch, I
believe the tradeoff is justified.  But let's hear Jan and the others.
Jan Kiszka - Sept. 5, 2012, 10:36 a.m.
On 2012-09-05 11:52, Avi Kivity wrote:
> On 09/05/2012 11:19 AM, liu ping fan wrote:
>> [...]
>>>>
>>>>>>> We could externalize the refcounting and push it into device code.  This
>>>>>>> means:
>>>>>>>
>>>>>>>    memory_region_init_io(&s->mem, dev)
>>>>>>>
>>>>>>>    ...
>>>>>>>
>>>>>>>    object_ref(dev)
>>>>>>>    memory_region_add_subregion(..., &dev->mr)
>>>>>>>
>>>>>>>    ...
>>>>>>>
>>>>>>>    memory_region_del_subregion(..., &dev->mr)  // implied flush
>>>>>>>    object_unref(dev)
>>>>>>>
>>>>>> I think "object_ref(dev)" just another style to push
>>>>>> MemoryRegionOps::object() to device level.  And I think we can bypass
>>>>>> it.   The caller (unplug, pci-reconfig ) of
>>>>>> memory_region_del_subregion() ensure the @dev is valid.
>>>>>> If the implied flush is implemented in synchronize,  _del_subregion()
>>>>>> will guarantee no reader for @dev->mr and @dev exist any longer.
>>>>>
>>>>> The above code has a deadlock.  memory_region_del_subregion() may be
>>>>> called under the device lock (since it may be the result of mmio to the
>>>>> device), and if the flush uses synchronized_rcu(), it will wait forever
>>>>> for the read-side critical section to complete.
>>>>>
>>>> But if _del_subregion() just wait for mr-X quiescent period, while
>>>> calling in mr-Y's read side critical section, then we can avoid
>>>> deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.
>>>>
>>>>>> So I
>>>>>> think we can save both object_ref/unref(dev) for memory system.
>>>>>> The only problem is that whether we can implement it as synchronous or
>>>>>> not,  is it possible that we can launch a  _del_subregion(mr-X) in
>>>>>> mr-X's dispatcher?
>>>>>
>>>>> Yes.  Real cases exist.
>>>>
>>>> Oh, I find the sample code, then, the deadlock is unavoidable in this method.
>>>>>
>>>>> What alternatives remain?
>>>>>
>>>> I think a way out may be async+refcnt
>>>>
>>> If we consider the relationship of MemoryRegionImpl and device as the
>>> one between file and file->private_data  in Linux. Then the creation
>>> of impl will object_ref(dev) and when impl->ref=0, it will
>>> object_unref(dev)
>>> But this is an async model, for those client which need to know the
>>> end of flush, we can adopt callback just like call_rcu().
>>>
>>>
>> During this thread, it seems that no matter we adopt refcnt on
>> MemoryRegionImpl or not, protecting MemoryRegion::opaque during
>> dispatch is still required. 
> 
> Right.  So MemoryRegionImpl is pointless.
> 
> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
> corresponding unref), which has the following requirements:
> 
> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
> - if the refcount is nonzero, the MemoryRegion itself is stable.

The second point means that the memory subsystem will cache the region
state and apply changes only after leaving a handler that performed them?

> 
> Later we can change other callbacks to also use mr instead of opaque.
> Usually the callback will be able to derive the device object from the
> region, only special cases will require
> memory_region_opaque(MemoryRegion *).  We can even drop the opaque from
> memory_region_init_io() and replace it with memory_region_set_opaque(),
> to be used in those special cases.
> 
>> It is challenging to substitute
>> memory_region_init_io() 's 3rd parameter from void* to Object*, owning
>> to the  conflict that life-cycle management need the host of opaque,
>> while Object and mr need 1:1 map. So I think, I will move forward on
>> the way based on MemoryRegionOps::object(). Right?
> 
> As outlined above, I now prefer MemoryRegionOps::ref/unref.
> 
> Advantages compared to MemoryRegionOps::object():
>  - decoupled from the QOM framework
> 
> Disadvantages:
>  - more boilerplate code in converted devices
> 
> Since we are likely to convert few devices to lockless dispatch, I
> believe the tradeoff is justified.  But let's hear Jan and the others.

I still need to dig into related posts of the past days, lost track, but
the above sounds much better.

Besides the question of what to protect and how, one important thing
IMHO is that we are able to switch locking behaviour on a per region
basis. And that should also be feasible this way.

Jan
Avi Kivity - Sept. 5, 2012, 10:53 a.m.
On 09/05/2012 01:36 PM, Jan Kiszka wrote:
>> 
>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
>> corresponding unref), which has the following requirements:
>> 
>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
>> - if the refcount is nonzero, the MemoryRegion itself is stable.
> 
> The second point means that the memory subsystem will cache the region
> state and apply changes only after leaving a handler that performed them?

No.  I/O callbacks may be called after a region has been disabled.

I guess we can restrict this to converted regions, so we don't introduce
regressions.

>> As outlined above, I now prefer MemoryRegionOps::ref/unref.
>> 
>> Advantages compared to MemoryRegionOps::object():
>>  - decoupled from the QOM framework
>> 
>> Disadvantages:
>>  - more boilerplate code in converted devices
>> 
>> Since we are likely to convert few devices to lockless dispatch, I
>> believe the tradeoff is justified.  But let's hear Jan and the others.
> 
> I still need to dig into related posts of the past days, lost track, but
> the above sounds much better.
> 
> Besides the question of what to protect and how, one important thing
> IMHO is that we are able to switch locking behaviour on a per region
> basis. And that should also be feasible this way.

It was also possible with MemoryRegionOps::object() - no one said all
regions for a device have to refer to the same object (and there is a
difference between locking and reference counting, each callback could
take a different lock).  But it is more natural with MemoryRegionOps::ref().
Jan Kiszka - Sept. 5, 2012, 11:11 a.m.
On 2012-09-05 12:53, Avi Kivity wrote:
> On 09/05/2012 01:36 PM, Jan Kiszka wrote:
>>>
>>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
>>> corresponding unref), which has the following requirements:
>>>
>>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
>>> - if the refcount is nonzero, the MemoryRegion itself is stable.
>>
>> The second point means that the memory subsystem will cache the region
>> state and apply changes only after leaving a handler that performed them?
> 
> No.  I/O callbacks may be called after a region has been disabled.
> 
> I guess we can restrict this to converted regions, so we don't introduce
> regressions.

s/can/have to/. This property change will require some special care,
hopefully mostly at the memory layer. A simple scenario from recent patches:

    if (<enable>) {
        memory_region_set_address(&s->pm_io, pm_io_base);
        memory_region_set_enabled(&s->pm_io, true);
    } else {
        memory_region_set_enabled(&s->pm_io, false);
    }

We will have to ensure that set_enabled(..., true) will never cause a
dispatch using an outdated base address.

I think discussing semantics and usage patterns of the new memory API -
outside of the BQL - will be the next big topic. ;)

Jan
Avi Kivity - Sept. 5, 2012, 11:25 a.m.
On 09/05/2012 02:11 PM, Jan Kiszka wrote:
> On 2012-09-05 12:53, Avi Kivity wrote:
>> On 09/05/2012 01:36 PM, Jan Kiszka wrote:
>>>>
>>>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
>>>> corresponding unref), which has the following requirements:
>>>>
>>>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
>>>> - if the refcount is nonzero, the MemoryRegion itself is stable.
>>>
>>> The second point means that the memory subsystem will cache the region
>>> state and apply changes only after leaving a handler that performed them?
>> 
>> No.  I/O callbacks may be called after a region has been disabled.
>> 
>> I guess we can restrict this to converted regions, so we don't introduce
>> regressions.
> 
> s/can/have to/. This property change will require some special care,
> hopefully mostly at the memory layer. A simple scenario from recent patches:
> 
>     if (<enable>) {
>         memory_region_set_address(&s->pm_io, pm_io_base);
>         memory_region_set_enabled(&s->pm_io, true);
>     } else {
>         memory_region_set_enabled(&s->pm_io, false);
>     }

I am unable to avoid pointing out that this can be collapsed to

  memory_region_set_address(&s->pm_io, pm_io_base);
  memory_region_set_enabled(&s->pm_io, <enable>);

as the address is meaningless when disabled. Sorry.

> 
> We will have to ensure that set_enabled(..., true) will never cause a
> dispatch using an outdated base address.

No, this is entirely safe.  If the guest changes a region offset
concurrently with issuing mmio on it, then it must expect either the old
or new offset to be used during dispatch.  In either case, the correct
intra-region offset will be provided to the I/O callback (no volatile
MemoryRegion fields except ->readable (IIRC) are used during dispatch -
the rest are all copied into data structures used during dispatch (this
is part of what makes the whole thing so rcu friendly).

> I think discussing semantics and usage patterns of the new memory API -
> outside of the BQL - will be the next big topic. ;)

I hope it won't prove to be that complicated.
Jan Kiszka - Sept. 5, 2012, 12:02 p.m.
On 2012-09-05 13:25, Avi Kivity wrote:
> On 09/05/2012 02:11 PM, Jan Kiszka wrote:
>> On 2012-09-05 12:53, Avi Kivity wrote:
>>> On 09/05/2012 01:36 PM, Jan Kiszka wrote:
>>>>>
>>>>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
>>>>> corresponding unref), which has the following requirements:
>>>>>
>>>>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
>>>>> - if the refcount is nonzero, the MemoryRegion itself is stable.
>>>>
>>>> The second point means that the memory subsystem will cache the region
>>>> state and apply changes only after leaving a handler that performed them?
>>>
>>> No.  I/O callbacks may be called after a region has been disabled.
>>>
>>> I guess we can restrict this to converted regions, so we don't introduce
>>> regressions.
>>
>> s/can/have to/. This property change will require some special care,
>> hopefully mostly at the memory layer. A simple scenario from recent patches:
>>
>>     if (<enable>) {
>>         memory_region_set_address(&s->pm_io, pm_io_base);
>>         memory_region_set_enabled(&s->pm_io, true);
>>     } else {
>>         memory_region_set_enabled(&s->pm_io, false);
>>     }
> 
> I am unable to avoid pointing out that this can be collapsed to
> 
>   memory_region_set_address(&s->pm_io, pm_io_base);
>   memory_region_set_enabled(&s->pm_io, <enable>);
> 
> as the address is meaningless when disabled. Sorry.
> 
>>
>> We will have to ensure that set_enabled(..., true) will never cause a
>> dispatch using an outdated base address.
> 
> No, this is entirely safe.  If the guest changes a region offset
> concurrently with issuing mmio on it, then it must expect either the old
> or new offset to be used during dispatch.

You assume disable, reprogram, enable, ignoring that there could be
multiple, invalid states between disable and enable. Real hardware takes
care of the ordering.

>  In either case, the correct
> intra-region offset will be provided to the I/O callback (no volatile
> MemoryRegion fields except ->readable (IIRC) are used during dispatch -
> the rest are all copied into data structures used during dispatch (this
> is part of what makes the whole thing so rcu friendly).
> 
>> I think discussing semantics and usage patterns of the new memory API -
>> outside of the BQL - will be the next big topic. ;)
> 
> I hope it won't prove to be that complicated.
> 

Yeah, let's hope this...

Jan
Avi Kivity - Sept. 5, 2012, 12:17 p.m.
On 09/05/2012 03:02 PM, Jan Kiszka wrote:
> On 2012-09-05 13:25, Avi Kivity wrote:
>> On 09/05/2012 02:11 PM, Jan Kiszka wrote:
>>> On 2012-09-05 12:53, Avi Kivity wrote:
>>>> On 09/05/2012 01:36 PM, Jan Kiszka wrote:
>>>>>>
>>>>>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
>>>>>> corresponding unref), which has the following requirements:
>>>>>>
>>>>>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use
>>>>>> - if the refcount is nonzero, the MemoryRegion itself is stable.
>>>>>
>>>>> The second point means that the memory subsystem will cache the region
>>>>> state and apply changes only after leaving a handler that performed them?
>>>>
>>>> No.  I/O callbacks may be called after a region has been disabled.
>>>>
>>>> I guess we can restrict this to converted regions, so we don't introduce
>>>> regressions.
>>>
>>> s/can/have to/. This property change will require some special care,
>>> hopefully mostly at the memory layer. A simple scenario from recent patches:
>>>
>>>     if (<enable>) {
>>>         memory_region_set_address(&s->pm_io, pm_io_base);
>>>         memory_region_set_enabled(&s->pm_io, true);
>>>     } else {
>>>         memory_region_set_enabled(&s->pm_io, false);
>>>     }
>> 
>> I am unable to avoid pointing out that this can be collapsed to
>> 
>>   memory_region_set_address(&s->pm_io, pm_io_base);
>>   memory_region_set_enabled(&s->pm_io, <enable>);
>> 
>> as the address is meaningless when disabled. Sorry.
>> 
>>>
>>> We will have to ensure that set_enabled(..., true) will never cause a
>>> dispatch using an outdated base address.
>> 
>> No, this is entirely safe.  If the guest changes a region offset
>> concurrently with issuing mmio on it, then it must expect either the old
>> or new offset to be used during dispatch.
> 

I forgot to mention that my clever rewrite above actually breaks this -
it needs to be wrapped in a transaction, otherwise we have a
move-then-disable pattern.

> You assume disable, reprogram, enable, ignoring that there could be
> multiple, invalid states between disable and enable. Real hardware takes
> care of the ordering.

It's possible of course, but the snippet above isn't susceptible on its own.

I don't think it's solvable.  To serialize, you must hold the device
lock, but we don't want to take the device lock during dispatch.

Users can protect against them by checking for ->enabled:

void foo_io_read(...)
{
   FooState *s = container_of(mr, ...);

   qemu_mutex_lock(&s->lock);
   if (!memory_region_enabled(mr)) {
       *data = ~(uint64_t)0;
       goto out;
   }
   ...
out:
   qemu_mutex_unlock(&s->lock);
}

Non-converted users will be naturally protected since we will take the
bql on their behalf.

Patch

diff --git a/hw/qdev.c b/hw/qdev.c
index e2339a1..b09ebbf 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -510,6 +510,8 @@  void qbus_create_inplace(BusState *bus, const char *typename,
 {
     object_initialize(bus, typename);
 
+    bus->overlap = parent;
+    object_ref(OBJECT(bus->overlap));
     bus->parent = parent;
     bus->name = name ? g_strdup(name) : NULL;
     qbus_realize(bus);
diff --git a/hw/qdev.h b/hw/qdev.h
index 182cfa5..9bc5783 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -117,6 +117,7 @@  struct BusState {
     int allow_hotplug;
     bool qom_allocated;
     bool glib_allocated;
+    DeviceState *overlap;
     int max_index;
     QTAILQ_HEAD(ChildrenHead, BusChild) children;
     QLIST_ENTRY(BusState) sibling;