Patchwork [V3,08/11] qom: introduce reclaimer to release obj in async

Submitter pingfan liu
Date Sept. 11, 2012, 7:51 a.m.
Message ID <1347349912-15611-9-git-send-email-qemulist@gmail.com>
Permalink /patch/183043/
State New

Comments

pingfan liu - Sept. 11, 2012, 7:51 a.m.
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

DeviceState is protected by a refcnt from disappearing during
dispatch. But when the refcnt drops to zero, the DeviceState may
still be in use by an iohandler, timer, etc. in the main loop, so we
delay freeing it until there are no readers.

This patch aims to build this delayed reclaimer.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 include/qemu/reclaimer.h |   30 +++++++++++++++++++++++++
 main-loop.c              |    5 ++++
 qemu-tool.c              |    5 ++++
 qom/Makefile.objs        |    2 +-
 qom/reclaimer.c          |   54 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 95 insertions(+), 1 deletions(-)
 create mode 100644 include/qemu/reclaimer.h
 create mode 100644 qom/reclaimer.c
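
For readers who want the idea in isolation, here is a minimal standalone sketch of deferred reclamation, assuming plain pthreads and C11 atomics rather than QEMU's own primitives. All names (`Dev`, `dev_unref`, `defer_release`, `drain_pending`) are hypothetical and do not come from the patch. Unlike the patch, this sketch detaches the pending list under the lock and runs the release handlers after dropping it; that is a choice of the sketch, not something the thread settles.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
typedef void release_fn(void *opaque);

typedef struct Pending {
    struct Pending *next;
    void *opaque;
    release_fn *release;
} Pending;

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static Pending *pending_head;

typedef struct Dev {
    atomic_int refcnt;
    int *freed_counter;   /* lets a caller observe when release runs */
} Dev;

static Dev *dev_new(int *freed_counter)
{
    Dev *d = calloc(1, sizeof(*d));
    assert(d != NULL);
    atomic_init(&d->refcnt, 1);
    d->freed_counter = freed_counter;
    return d;
}

static void dev_release(void *opaque)
{
    Dev *d = opaque;
    (*d->freed_counter)++;
    free(d);
}

/* Queue an object for release instead of freeing it immediately. */
static void defer_release(void *opaque, release_fn *release)
{
    Pending *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->opaque = opaque;
    p->release = release;
    pthread_mutex_lock(&pending_lock);
    p->next = pending_head;
    pending_head = p;
    pthread_mutex_unlock(&pending_lock);
}

/* Dropping the last reference only schedules the free. */
static void dev_unref(Dev *d)
{
    if (atomic_fetch_sub(&d->refcnt, 1) == 1) {
        defer_release(d, dev_release);
    }
}

/* Meant to be called at the end of a main-loop iteration.
 * Returns the number of objects released. */
static int drain_pending(void)
{
    Pending *p;
    int n = 0;

    pthread_mutex_lock(&pending_lock);
    p = pending_head;
    pending_head = NULL;
    pthread_mutex_unlock(&pending_lock);

    while (p != NULL) {
        Pending *next = p->next;
        p->release(p->opaque);
        free(p);
        p = next;
        n++;
    }
    return n;
}
```

With this shape, dropping the last reference never frees memory directly; the object stays valid until the next `drain_pending()` call.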
Avi Kivity - Sept. 11, 2012, 8:32 a.m.
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> DeviceState will be protected by refcnt from disappearing during
> dispatching. But when refcnt comes down to zero, DeviceState may
> be still in use by iohandler, timer etc in main loop, we just delay
> its free untill no reader.
> 

How can this be?  We elevate the refcount while dispatching I/O.  If we
have similar problems with the timer, we need to employ a similar solution.
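
As an illustration of what "elevate the refcount while dispatching" means, the following is a minimal sketch using C11 atomics. The `Obj` type and the `dispatch` helper are hypothetical, not QEMU API, and a `finalized` flag stands in for actually freeing the object so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object; "finalized" stands in for freeing it. */
typedef struct Obj {
    atomic_int refcnt;
    bool finalized;
    int reg;
} Obj;

typedef void handler_fn(Obj *o);

static void obj_init(Obj *o)
{
    atomic_init(&o->refcnt, 1);   /* the creator's reference */
    o->finalized = false;
    o->reg = 0;
}

static void obj_ref(Obj *o)
{
    atomic_fetch_add(&o->refcnt, 1);
}

static void obj_unref(Obj *o)
{
    if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
        o->finalized = true;      /* last reference gone */
    }
}

/* Pin the object for the duration of the callback. */
static void dispatch(Obj *o, handler_fn *handler)
{
    obj_ref(o);
    handler(o);
    obj_unref(o);
}

/* A handler that drops the creator's reference mid-dispatch,
 * as a hot-unplug might, and then keeps using the object. */
static void unplug_in_handler(Obj *o)
{
    obj_unref(o);
    assert(!o->finalized);        /* still pinned by dispatch() */
    o->reg = 42;
}
```

The same pattern would apply to a timer callback: whoever invokes the callback takes a reference first and drops it afterwards.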
pingfan liu - Sept. 11, 2012, 9:32 a.m.
On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> DeviceState will be protected by refcnt from disappearing during
>> dispatching. But when refcnt comes down to zero, DeviceState may
>> be still in use by iohandler, timer etc in main loop, we just delay
>> its free untill no reader.
>>
>
> How can this be?  We elevate the refcount while dispatching I/O.  If we
> have similar problems with the timer, we need to employ a similar solution.
>
Yes. As the next step, I plan to convert iohandler, timer, etc. to use
a refcount, as the memory subsystem does. This is just a temporary
solution.

Regards,
pingfan
Avi Kivity - Sept. 11, 2012, 9:37 a.m.
On 09/11/2012 12:32 PM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> DeviceState will be protected by refcnt from disappearing during
>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>> be still in use by iohandler, timer etc in main loop, we just delay
>>> its free untill no reader.
>>>
>>
>> How can this be?  We elevate the refcount while dispatching I/O.  If we
>> have similar problems with the timer, we need to employ a similar solution.
>>
> Yes, at the next step, plan to covert iohandler, timer etc to use
> refcount as memory. Here just a temp solution.

I prefer not to ever introduce it.

What we can do is introduce a sub-region for e1000's mmio that will take
only the device lock, and let original region use the old dispatch path
(and also take the device lock).  As we thread the various subsystems
e1000 uses, we can expand the sub-region until it covers all of e1000's
functions, then fold it back into the main region.

To start with the sub-region can only include registers that call no
qemu infrastructure code: simple read/writes.
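
The sub-region proposal can be pictured with a small standalone sketch, assuming plain pthread mutexes in place of QEMU's locks. The `Dev` type, the `fast_limit` field, and `dev_write` are all hypothetical: offsets below `fast_limit` model registers that call no infrastructure code and take only the device lock, while everything else goes through a path that also takes the big lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

enum { NREGS = 64 };
typedef enum { PATH_FAST, PATH_SLOW } Path;

/* Stand-in for the big lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct Dev {
    pthread_mutex_t lock;      /* per-device lock */
    uint32_t regs[NREGS];
    uint32_t fast_limit;       /* byte offsets below this use the fast path */
    int slow_path_hits;
} Dev;

static void dev_init(Dev *d, uint32_t fast_limit)
{
    int rc = pthread_mutex_init(&d->lock, NULL);
    assert(rc == 0);
    for (int i = 0; i < NREGS; i++) {
        d->regs[i] = 0;
    }
    d->fast_limit = fast_limit;
    d->slow_path_hits = 0;
}

static Path dev_write(Dev *d, uint32_t off, uint32_t val)
{
    assert(off % 4 == 0 && off / 4 < NREGS);

    if (off < d->fast_limit) {
        /* Simple register: device lock only. */
        pthread_mutex_lock(&d->lock);
        d->regs[off / 4] = val;
        pthread_mutex_unlock(&d->lock);
        return PATH_FAST;
    }

    /* Old path: big lock first, then the device lock. */
    pthread_mutex_lock(&big_lock);
    pthread_mutex_lock(&d->lock);
    d->regs[off / 4] = val;
    d->slow_path_hits++;       /* stand-in for timer or network calls */
    pthread_mutex_unlock(&d->lock);
    pthread_mutex_unlock(&big_lock);
    return PATH_SLOW;
}
```

Expanding the sub-region then corresponds to raising `fast_limit` as more of the code behind the registers becomes safe to run without the big lock.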
pingfan liu - Sept. 13, 2012, 6:54 a.m.
On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 12:32 PM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> DeviceState will be protected by refcnt from disappearing during
>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>> its free untill no reader.
>>>>
>>>
>>> How can this be?  We elevate the refcount while dispatching I/O.  If we
>>> have similar problems with the timer, we need to employ a similar solution.
>>>
>> Yes, at the next step, plan to covert iohandler, timer etc to use
>> refcount as memory. Here just a temp solution.
>
> I prefer not to ever introduce it.
>
> What we can do is introduce a sub-region for e1000's mmio that will take
> only the device lock, and let original region use the old dispatch path
> (and also take the device lock).  As we thread the various subsystems
> e1000 uses, we can expand the sub-region until it covers all of e1000's
> functions, then fold it back into the main region.
>
Introducing a new sub-region for e1000 does not seem to help resolve
this issue. It cannot tell whether the main loop still uses it or not.
I think the key point is that the original code synchronously
eliminates all the readers of DeviceState in acpi_piix_eject_slot() via
dev->unit()/exit(), so no subsystem will access it afterwards.
But now, we can delete the DeviceState asynchronously.
Currently, we can just use e1000->unmap() to detach the device from each
subsystem (not implemented in this patch series for timer, ...) to
achieve the goal, because their readers are still protected by the
big lock. But once they run outside the big lock, we need extra effort,
as with the memory system.

Regards,
pingfan
> To start with the sub-region can only include registers that call no
> qemu infrastructure code: simple read/writes.
Avi Kivity - Sept. 13, 2012, 8:45 a.m.
On 09/13/2012 09:54 AM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>> its free untill no reader.
>>>>>
>>>>
>>>> How can this be?  We elevate the refcount while dispatching I/O.  If we
>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>
>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>> refcount as memory. Here just a temp solution.
>>
>> I prefer not to ever introduce it.
>>
>> What we can do is introduce a sub-region for e1000's mmio that will take
>> only the device lock, and let original region use the old dispatch path
>> (and also take the device lock).  As we thread the various subsystems
>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>> functions, then fold it back into the main region.
>>
> Introducing new sub-region for e1000  seems no help to resolve this
> issue. It can not tell whether main-loop still use it or not.

What is "it" here? (actually two of them).

> I think the key point is that original code SYNC eliminate all the
> readers of DeviceState at acpi_piix_eject_slot() by
> dev->unit()/exit(), so each subsystem will no access it in future.
> But now, we can delete the DeviceState async.

But deleting happens when we are guaranteed to have no I/O dispatch.

> Currently, we can just use e1000->unmap() to detach itself from each
> subsystem(Not implemented in this series patches for timer,...) to
> achieve the goal, because their readers are still under the protection
> of big lock, but when they are out of big lock, we need extra effort
> like memory system.

I see what you mean.  So you defer the deletion to a context where the
big lock is held.

But this solves nothing.  The device model accesses the network stack
and timer subsystem without the big lock held.  So you either need to
thread those two subsystems, or take the big lock in the I/O handlers.
If you do that, you can also take the big lock in the destructor.  If we
make the big lock a recursive lock, then the destructor can be invoked
in any context.

To summarize, I propose:
- drop the reclaimer
- make the bql recursive
- take the bql in the e1000 destructor
- take the bql in the e1000 I/O handlers when it accesses the timer or
network subsystems
(rest for a bit)
- thread the timer subsystem
- drop bql from around timer accesses
- thread the network subsystem
- drop bql from e1000 I/O handlers and destructor

does this work?
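
The "recursive bql" step can be sketched with a POSIX recursive mutex. This is only an illustration under assumptions: `big_lock`, `Nic`, `nic_destroy`, and `nic_io_write` are hypothetical names, and QEMU's real lock is not implemented this way in the thread. The point shown is that a destructor which takes the lock can be called both from a context that does not hold it and from an I/O handler that already does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t big_lock;
static int big_lock_depth;   /* only touched while holding big_lock */

static void big_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    assert(rc == 0);
    rc = pthread_mutex_init(&big_lock, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);
}

static void big_lock_acquire(void)
{
    pthread_mutex_lock(&big_lock);
    big_lock_depth++;
}

static void big_lock_release(void)
{
    big_lock_depth--;
    pthread_mutex_unlock(&big_lock);
}

typedef struct Nic {
    int timer_armed;
} Nic;

/* Destructor: safe to call with or without the lock already held.
 * Returns the nesting depth it observed, for illustration only. */
static int nic_destroy(Nic *n)
{
    int depth;

    big_lock_acquire();
    n->timer_armed = 0;          /* stand-in for deleting the timer */
    depth = big_lock_depth;
    big_lock_release();
    return depth;
}

/* I/O handler that touches the timer under the lock and may
 * trigger destruction while still holding it. */
static int nic_io_write(Nic *n, int unplug)
{
    int depth = 0;

    big_lock_acquire();
    n->timer_armed = 1;          /* stand-in for arming the timer */
    if (unplug) {
        depth = nic_destroy(n);  /* re-enters the lock */
    }
    big_lock_release();
    return depth;
}
```

With a non-recursive mutex, the nested `nic_destroy()` call would deadlock; with the recursive type it simply increases the nesting depth.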
pingfan liu - Sept. 13, 2012, 9:59 a.m.
On Thu, Sep 13, 2012 at 4:45 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/13/2012 09:54 AM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>
>>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>>> its free untill no reader.
>>>>>>
>>>>>
>>>>> How can this be?  We elevate the refcount while dispatching I/O.  If we
>>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>>
>>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>>> refcount as memory. Here just a temp solution.
>>>
>>> I prefer not to ever introduce it.
>>>
>>> What we can do is introduce a sub-region for e1000's mmio that will take
>>> only the device lock, and let original region use the old dispatch path
>>> (and also take the device lock).  As we thread the various subsystems
>>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>>> functions, then fold it back into the main region.
>>>
>> Introducing new sub-region for e1000  seems no help to resolve this
>> issue. It can not tell whether main-loop still use it or not.
>
> What is "it" here? (actually two of them).
>
It should be expressed as: "The sub-region's dispatcher cannot tell
whether the main loop still uses e1000 or not."

>> I think the key point is that original code SYNC eliminate all the
>> readers of DeviceState at acpi_piix_eject_slot() by
>> dev->unit()/exit(), so each subsystem will no access it in future.
>> But now, we can delete the DeviceState async.
>
> But deleting happens when we are guaranteed to have no I/O dispatch.
>
>> Currently, we can just use e1000->unmap() to detach itself from each
>> subsystem(Not implemented in this series patches for timer,...) to
>> achieve the goal, because their readers are still under the protection
>> of big lock, but when they are out of big lock, we need extra effort
>> like memory system.
>
> I see what you mean.  So you defer the deletion to a context where the
> big lock is held.
>
> But this solves nothing.  The device model accesses the network stack
> and timer subsystem without the big lock held.  So you either need to
> thread those two subsystems, or take the big lock in the I/O handlers.

Yes. For now, I intend to take the big lock around calls into the
subsystems from e1000's I/O handlers, verify the current changes,
and then thread the other subsystems as the next step.

> If you do that, you can also take the big lock in the destructor.  If we

We do not call qemu_del_timer() etc. in the destructor; instead, we
call it in qdev_unplug_complete() --> e1000::unmap(). And
e1000::unmap() is the only function that is definitely called under the
bql. By the time we reach the destructor, the DeviceState has been
completely isolated from all of the subsystems, so there is no need to
take the big lock in the destructor.

> make the big lock a recursive lock, then the destructor can be invoked
> in any context.
>
> To summarize, I propose:
> - drop the reclaimer
Agree
> - make the bql recursive
> - take the bql in the e1000 destructor
Change to e1000::unmap()
> - take the bql in the e1000 I/O handlers when it accesses the timer or
> network subsystems
Agree
> (rest for a bit)
> - thread the timer subsystem
> - drop bql from around timer accesses
> - thread the network subsystem
> - drop bql from e1000 I/O handlers and destructor
Agree
>

Thanks and regards,
pingfan

> does this work?
Avi Kivity - Sept. 13, 2012, 10:09 a.m.
On 09/13/2012 12:59 PM, liu ping fan wrote:
> On Thu, Sep 13, 2012 at 4:45 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/13/2012 09:54 AM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>>>> its free untill no reader.
>>>>>>>
>>>>>>
>>>>>> How can this be?  We elevate the refcount while dispatching I/O.  If we
>>>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>>>
>>>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>>>> refcount as memory. Here just a temp solution.
>>>>
>>>> I prefer not to ever introduce it.
>>>>
>>>> What we can do is introduce a sub-region for e1000's mmio that will take
>>>> only the device lock, and let original region use the old dispatch path
>>>> (and also take the device lock).  As we thread the various subsystems
>>>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>>>> functions, then fold it back into the main region.
>>>>
>>> Introducing new sub-region for e1000  seems no help to resolve this
>>> issue. It can not tell whether main-loop still use it or not.
>>
>> What is "it" here? (actually two of them).
>>
> Should expressed as "The sub-region's dispatcher can not tell whether
> main-loop still use e1000 or not"

The sub-region will not use any unthreaded subsystems, so it need not
care about the main loop.

At first, it would only access registers in device state.

But if we go with the plan below, we can drop it.

> 
>> If you do that, you can also take the big lock in the destructor.  If we
> 
> We do not call qemu_del_timer() etc at the destructor, instead, we
> will call it in qdev_unplug_complete() -->e1000::unmap(). And
> e1000::unmap() is the only function definitely called under bql. When
> coming to destructor, the DeviceState has been completely isolated
> from all of the subsystem. So no need to require big lock in
> destructor.

But between unmap() and the destructor, accesses can still occur (an
access that was started before unmap() was called, but was delayed and
is dispatched after it completes).  These accesses will find the timer
deleted, and so must be prepared to check if the timer is there or not.

So we have a choice, either move timer destruction to the destructor, or
add checks in the dispatch code.
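
The second option, adding checks in the dispatch code, might look like the following minimal sketch. It is single-threaded and assumes that, in real code, every access to the timer pointer is serialized by some lock. `Nic`, `Timer`, `nic_unmap`, and `nic_write` are hypothetical names, not the e1000 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Timer {
    int armed;
} Timer;

typedef struct Nic {
    Timer *timer;   /* NULL once the device has been unmapped */
    int reg;
} Nic;

static void nic_init(Nic *n)
{
    n->timer = calloc(1, sizeof(*n->timer));
    assert(n->timer != NULL);
    n->reg = 0;
}

/* Detach from the timer subsystem; late accesses may still arrive. */
static void nic_unmap(Nic *n)
{
    free(n->timer);
    n->timer = NULL;
}

/* Dispatch path: update the register, but only touch the timer
 * if it still exists. Returns true if the timer was armed. */
static bool nic_write(Nic *n, int val)
{
    n->reg = val;
    if (n->timer == NULL) {
        return false;   /* access raced with unmap: skip the timer */
    }
    n->timer->armed = 1;
    return true;
}

/* Destructor: nothing left to do for the timer if unmap already ran. */
static void nic_destroy(Nic *n)
{
    free(n->timer);     /* free(NULL) is a no-op */
    n->timer = NULL;
}
```

The first option, moving timer destruction into the destructor, would avoid the `NULL` check entirely, at the cost of the destructor having to call into the timer subsystem.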

Patch

diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
new file mode 100644
index 0000000..5143c4f
--- /dev/null
+++ b/include/qemu/reclaimer.h
@@ -0,0 +1,30 @@ 
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_RECLAIMER
+#define QEMU_RECLAIMER
+
+#include "qemu-thread.h"
+
+typedef void ReleaseHandler(void *opaque);
+typedef struct Chunk {
+    QLIST_ENTRY(Chunk) list;
+    void *opaque;
+    ReleaseHandler *release;
+} Chunk;
+
+typedef struct ChunkHead {
+    struct QemuMutex lock;
+    QLIST_HEAD(, Chunk) reclaim_list;
+} ChunkHead;
+
+extern ChunkHead qdev_reclaimer;
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release);
+void reclaimer_worker(ChunkHead *head);
+void qemu_reclaimer(void);
+#endif
diff --git a/main-loop.c b/main-loop.c
index eb3b6e6..be9d095 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -26,6 +26,7 @@ 
 #include "qemu-timer.h"
 #include "slirp/slirp.h"
 #include "main-loop.h"
+#include "qemu/reclaimer.h"
 
 #ifndef _WIN32
 
@@ -505,5 +506,9 @@  int main_loop_wait(int nonblocking)
        them.  */
     qemu_bh_poll();
 
+    /* References to a device from iohandler/bh/timer do not follow the
+     * refcount rules, so delay reclaiming until now.
+     */
+    qemu_reclaimer();
     return ret;
 }
diff --git a/qemu-tool.c b/qemu-tool.c
index 18205ba..f250c87 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@ 
 #include "main-loop.h"
 #include "qemu_socket.h"
 #include "slirp/libslirp.h"
+#include "qemu/reclaimer.h"
 
 #include <sys/time.h>
 
@@ -100,6 +101,10 @@  void qemu_mutex_unlock_iothread(void)
 {
 }
 
+void qemu_reclaimer(void)
+{
+}
+
 int use_icount;
 
 void qemu_clock_warp(QEMUClock *clock)
diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@ 
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
 qom-obj-twice-y = cpu.o
 common-obj-y = $(qom-obj-twice-y)
 user-obj-y = $(qom-obj-twice-y)
diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 0000000..b098ad7
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,54 @@ 
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu-common.h"
+#include "qemu-thread.h"
+#include "main-loop.h"
+#include "qemu-queue.h"
+#include "qemu/reclaimer.h"
+
+ChunkHead qdev_reclaimer;
+
+static void reclaimer_init(ChunkHead *head)
+{
+    qemu_mutex_init(&head->lock);
+}
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+    Chunk *r = g_malloc0(sizeof(Chunk));
+    r->opaque = opaque;
+    r->release = release;
+    qemu_mutex_lock(&head->lock);
+    QLIST_INSERT_HEAD(&head->reclaim_list, r, list);
+    qemu_mutex_unlock(&head->lock);
+}
+
+void reclaimer_worker(ChunkHead *head)
+{
+    Chunk *cur, *next;
+
+    qemu_mutex_lock(&head->lock);
+    QLIST_FOREACH_SAFE(cur, &head->reclaim_list, list, next) {
+        QLIST_REMOVE(cur, list);
+        cur->release(cur->opaque);
+        g_free(cur);
+    }
+    qemu_mutex_unlock(&head->lock);
+}
+
+void qemu_reclaimer(void)
+{
+    static int init;
+
+    if (init == 0) {
+        init = 1;
+        reclaimer_init(&qdev_reclaimer);
+    }
+    reclaimer_worker(&qdev_reclaimer);
+}