Patchwork [13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

login
register
mail settings
Submitter pingfan liu
Date Aug. 8, 2012, 6:25 a.m.
Message ID <1344407156-25562-14-git-send-email-qemulist@gmail.com>
Download mbox | patch
Permalink /patch/175866/
State New
Headers show

Comments

pingfan liu - Aug. 8, 2012, 6:25 a.m.
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

When guest confirm the removal of device, we should
--unmap from MemoryRegion view
--isolated from device tree view

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 hw/acpi_piix4.c |    4 ++--
 hw/pci.c        |   13 ++++++++++++-
 hw/pci.h        |    2 ++
 hw/qdev.c       |   28 ++++++++++++++++++++++++++++
 hw/qdev.h       |    3 ++-
 5 files changed, 46 insertions(+), 4 deletions(-)
Paolo Bonzini - Aug. 8, 2012, 9:52 a.m.
Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
> +void qdev_unplug_complete(DeviceState *dev, Error **errp)
> +{
> +    /* isolate from mem view */
> +    qdev_unmap(dev);
> +    qemu_lock_devtree();
> +    /* isolate from device tree */
> +    qdev_unset_parent(dev);
> +    qemu_unlock_devtree();
> +    object_unref(OBJECT(dev));

Rather than deferring the free, you should defer the unref.  Otherwise
the following can happen when you have "real" RCU access to the memory
map on the read-side:

    VCPU thread                    I/O thread
=====================================================================
    get MMIO request
    rcu_read_lock()
    walk memory map
                                   qdev_unmap()
                                   lock_devtree()
                                   ...
                                   unlock_devtree
                                   unref dev -> refcnt=0, free enqueued
    ref()
    rcu_read_unlock()
                                   free()
    <dangling pointer!>

If you defer the unref, you have instead

    VCPU thread                    I/O thread
=====================================================================
    get MMIO request
    rcu_read_lock()
    walk memory map
                                   qdev_unmap()
                                   lock_devtree()
                                   ...
                                   unlock_devtree
                                   unref is enqueued
    ref() -> refcnt = 2
    rcu_read_unlock()
                                   unref() -> refcnt=1
    unref() -> refcnt = 1

So this also makes patch 14 unnecessary.

Paolo

> +}
Avi Kivity - Aug. 8, 2012, 10:07 a.m.
On 08/08/2012 12:52 PM, Paolo Bonzini wrote:
> Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
>> +void qdev_unplug_complete(DeviceState *dev, Error **errp)
>> +{
>> +    /* isolate from mem view */
>> +    qdev_unmap(dev);
>> +    qemu_lock_devtree();
>> +    /* isolate from device tree */
>> +    qdev_unset_parent(dev);
>> +    qemu_unlock_devtree();
>> +    object_unref(OBJECT(dev));
> 
> Rather than deferring the free, you should defer the unref.  Otherwise
> the following can happen when you have "real" RCU access to the memory
> map on the read-side:
> 
>     VCPU thread                    I/O thread
> =====================================================================
>     get MMIO request
>     rcu_read_lock()
>     walk memory map
>                                    qdev_unmap()
>                                    lock_devtree()
>                                    ...
>                                    unlock_devtree
>                                    unref dev -> refcnt=0, free enqueued
>     ref()
>     rcu_read_unlock()
>                                    free()
>     <dangling pointer!>

unref should follow either synchronize_rcu(), or be in a call_rcu()
callback (deferring the unref).  IMO synchronize_rcu() is sufficient
here, unplug is a truly slow path, esp. on real hardware.
pingfan liu - Aug. 9, 2012, 7:28 a.m.
On Wed, Aug 8, 2012 at 5:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
>> +void qdev_unplug_complete(DeviceState *dev, Error **errp)
>> +{
>> +    /* isolate from mem view */
>> +    qdev_unmap(dev);
>> +    qemu_lock_devtree();
>> +    /* isolate from device tree */
>> +    qdev_unset_parent(dev);
>> +    qemu_unlock_devtree();
>> +    object_unref(OBJECT(dev));
>
> Rather than deferring the free, you should defer the unref.  Otherwise
> the following can happen when you have "real" RCU access to the memory
> map on the read-side:
>
>     VCPU thread                    I/O thread
> =====================================================================
>     get MMIO request
>     rcu_read_lock()
>     walk memory map
>                                    qdev_unmap()
>                                    lock_devtree()
>                                    ...
>                                    unlock_devtree
>                                    unref dev -> refcnt=0, free enqueued
>     ref()

No ref() for dev here, while we have ref to flatview+radix in my patches.
I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
inc when it is added into mem view -- that is
memory_region_add_subregion -> memory_region_get() {
if(atomic_add_and_return()) dev->ref++  }.
So not until reclaimer of mem view, the dev's ref is hold by mem view.
In a short word, rcu protect mem view, while device is protected by refcnt.

>     rcu_read_unlock()
>                                    free()
>     <dangling pointer!>
>
> If you defer the unref, you have instead
>
>     VCPU thread                    I/O thread
> =====================================================================
>     get MMIO request
>     rcu_read_lock()
>     walk memory map
>                                    qdev_unmap()
>                                    lock_devtree()
>                                    ...
>                                    unlock_devtree
>                                    unref is enqueued
>     ref() -> refcnt = 2
>     rcu_read_unlock()
>                                    unref() -> refcnt=1
>     unref() -> refcnt = 1
>
> So this also makes patch 14 unnecessary.
>
> Paolo
>
>> +}
>
>
Paolo Bonzini - Aug. 9, 2012, 8 a.m.
Il 09/08/2012 09:28, liu ping fan ha scritto:
>> >     VCPU thread                    I/O thread
>> > =====================================================================
>> >     get MMIO request
>> >     rcu_read_lock()
>> >     walk memory map
>> >                                    qdev_unmap()
>> >                                    lock_devtree()
>> >                                    ...
>> >                                    unlock_devtree
>> >                                    unref dev -> refcnt=0, free enqueued
>> >     ref()
> No ref() for dev here, while we have ref to flatview+radix in my patches.
> I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
> inc when it is added into mem view -- that is
> memory_region_add_subregion -> memory_region_get() {
> if(atomic_add_and_return()) dev->ref++  }.
> So not until reclaimer of mem view, the dev's ref is hold by mem view.
> In a short word, rcu protect mem view, while device is protected by refcnt.

But the RCU critical section should not include the whole processing of
MMIO, only the walk of the memory map.

And in general I think this is a bit too tricky... I understand not
adding refcounting to all of bottom halves, timers, etc., but if you are
using a device you should have explicit ref/unref pairs.

Paolo
pingfan liu - Aug. 10, 2012, 6:42 a.m.
On Thu, Aug 9, 2012 at 4:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 09/08/2012 09:28, liu ping fan ha scritto:
>>> >     VCPU thread                    I/O thread
>>> > =====================================================================
>>> >     get MMIO request
>>> >     rcu_read_lock()
>>> >     walk memory map
>>> >                                    qdev_unmap()
>>> >                                    lock_devtree()
>>> >                                    ...
>>> >                                    unlock_devtree
>>> >                                    unref dev -> refcnt=0, free enqueued
>>> >     ref()
>> No ref() for dev here, while we have ref to flatview+radix in my patches.
>> I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
>> inc when it is added into mem view -- that is
>> memory_region_add_subregion -> memory_region_get() {
>> if(atomic_add_and_return()) dev->ref++  }.
>> So not until reclaimer of mem view, the dev's ref is hold by mem view.
>> In a short word, rcu protect mem view, while device is protected by refcnt.
>
> But the RCU critical section should not include the whole processing of
> MMIO, only the walk of the memory map.
>
Yes, you are right.  And I think cur_map_get() can be broken into the
style "lock,  ref++, phys_page_find(); unlock".  easily.

> And in general I think this is a bit too tricky... I understand not
> adding refcounting to all of bottom halves, timers, etc., but if you are
> using a device you should have explicit ref/unref pairs.
>
Actually, there are pairs -- when dev enter mem view, inc ref; and
when it leave, dec ref.
But as Avi has pointed out, the mr->refcnt introduce complicate and no
gain. So I will discard this design

Thanks and regards,
pingfan

> Paolo
Marcelo Tosatti - Aug. 13, 2012, 6:51 p.m.
On Thu, Aug 09, 2012 at 10:00:16AM +0200, Paolo Bonzini wrote:
> Il 09/08/2012 09:28, liu ping fan ha scritto:
> >> >     VCPU thread                    I/O thread
> >> > =====================================================================
> >> >     get MMIO request
> >> >     rcu_read_lock()
> >> >     walk memory map
> >> >                                    qdev_unmap()
> >> >                                    lock_devtree()
> >> >                                    ...
> >> >                                    unlock_devtree
> >> >                                    unref dev -> refcnt=0, free enqueued
> >> >     ref()
> > No ref() for dev here, while we have ref to flatview+radix in my patches.
> > I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
> > inc when it is added into mem view -- that is
> > memory_region_add_subregion -> memory_region_get() {
> > if(atomic_add_and_return()) dev->ref++  }.
> > So not until reclaimer of mem view, the dev's ref is hold by mem view.
> > In a short word, rcu protect mem view, while device is protected by refcnt.

The idea, written on that plan, was:

- RCU protects memory maps.
- Object reference protects device in the window between 4. and 5.

The unplug/remove path should:

1) Lock memmap_lock for write (if not using RCU).
2) Remove any memmap entries (which is possible due to write lock on
memmap_lock. Alternatively wait for an RCU grace period). Device should
not be visible after that.
3) Lock dev->lock.
4) Wait until references are removed (no new references can be made
since device is not visible).
5) Remove device. 

So its a combination of both dev->lock and reference counter.

Note: a first step can be only parallel execution of MMIO lookups
(actually that is a very good first target). dev->lock above would be 
qemu_big_lock in that first stage, then _only devices which are 
performance sensitive need to be converted_.

> But the RCU critical section should not include the whole processing of
> MMIO, only the walk of the memory map.

Yes.

> And in general I think this is a bit too tricky... I understand not
> adding refcounting to all of bottom halves, timers, etc., but if you are
> using a device you should have explicit ref/unref pairs.
> 
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti - Aug. 13, 2012, 6:53 p.m.
On Fri, Aug 10, 2012 at 02:42:58PM +0800, liu ping fan wrote:
> On Thu, Aug 9, 2012 at 4:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il 09/08/2012 09:28, liu ping fan ha scritto:
> >>> >     VCPU thread                    I/O thread
> >>> > =====================================================================
> >>> >     get MMIO request
> >>> >     rcu_read_lock()
> >>> >     walk memory map
> >>> >                                    qdev_unmap()
> >>> >                                    lock_devtree()
> >>> >                                    ...
> >>> >                                    unlock_devtree
> >>> >                                    unref dev -> refcnt=0, free enqueued
> >>> >     ref()
> >> No ref() for dev here, while we have ref to flatview+radix in my patches.
> >> I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
> >> inc when it is added into mem view -- that is
> >> memory_region_add_subregion -> memory_region_get() {
> >> if(atomic_add_and_return()) dev->ref++  }.
> >> So not until reclaimer of mem view, the dev's ref is hold by mem view.
> >> In a short word, rcu protect mem view, while device is protected by refcnt.
> >
> > But the RCU critical section should not include the whole processing of
> > MMIO, only the walk of the memory map.
> >
> Yes, you are right.  And I think cur_map_get() can be broken into the
> style "lock,  ref++, phys_page_find(); unlock".  easily.
> 
> > And in general I think this is a bit too tricky... I understand not
> > adding refcounting to all of bottom halves, timers, etc., but if you are
> > using a device you should have explicit ref/unref pairs.
> >
> Actually, there are pairs -- when dev enter mem view, inc ref; and
> when it leave, dec ref.
> But as Avi has pointed out, the mr->refcnt introduce complicate and no
> gain. So I will discard this design

The plan that you refer that has been relatively well thought out, IIRC. 
Instead of designing something, i would try to understand/improve on
that.

> Thanks and regards,
> pingfan
> 
> > Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 0aace60..c209ff7 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -305,8 +305,8 @@  static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
             if (pc->no_hotplug) {
                 slot_free = false;
             } else {
-                object_unparent(OBJECT(dev));
-                qdev_free(qdev);
+                /* refcnt will be decreased */
+                qdev_unplug_complete(qdev, NULL);
             }
         }
     }
diff --git a/hw/pci.c b/hw/pci.c
index 99a4304..2095abf 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -856,12 +856,22 @@  static int pci_unregister_device(DeviceState *dev)
     if (ret)
         return ret;
 
-    pci_unregister_io_regions(pci_dev);
     pci_del_option_rom(pci_dev);
     do_pci_unregister_device(pci_dev);
     return 0;
 }
 
+static void pci_unmap_device(DeviceState *dev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
+
+    pci_unregister_io_regions(pci_dev);
+    if (pc->unmap) {
+        pc->unmap(pci_dev);
+    }
+}
+
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
                       uint8_t type, MemoryRegion *memory)
 {
@@ -2022,6 +2032,7 @@  static void pci_device_class_init(ObjectClass *klass, void *data)
     DeviceClass *k = DEVICE_CLASS(klass);
     k->init = pci_qdev_init;
     k->unplug = pci_unplug_device;
+    k->unmap = pci_unmap_device;
     k->exit = pci_unregister_device;
     k->bus_type = TYPE_PCI_BUS;
     k->props = pci_props;
diff --git a/hw/pci.h b/hw/pci.h
index 79d38fd..1c5b909 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -145,6 +145,8 @@  typedef struct PCIDeviceClass {
     DeviceClass parent_class;
 
     int (*init)(PCIDevice *dev);
+    void (*unmap)(PCIDevice *dev);
+
     PCIUnregisterFunc *exit;
     PCIConfigReadFunc *config_read;
     PCIConfigWriteFunc *config_write;
diff --git a/hw/qdev.c b/hw/qdev.c
index 17525fe..530eabe 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -104,6 +104,14 @@  void qdev_set_parent_bus(DeviceState *dev, BusState *bus)
     bus_add_child(bus, dev);
 }
 
+static void qdev_unset_parent(DeviceState *dev)
+{
+    BusState *b = dev->parent_bus;
+
+    object_unparent(OBJECT(dev));
+    bus_remove_child(b, dev);
+}
+
 /* Create a new device.  This only initializes the device state structure
    and allows properties to be set.  qdev_init should be called to
    initialize the actual device emulation.  */
@@ -194,6 +202,26 @@  void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
     dev->alias_required_for_version = required_for_version;
 }
 
+static int qdev_unmap(DeviceState *dev)
+{
+    DeviceClass *dc =  DEVICE_GET_CLASS(dev);
+    if (dc->unmap) {
+        dc->unmap(dev);
+    }
+    return 0;
+}
+
+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+    /* isolate from mem view */
+    qdev_unmap(dev);
+    qemu_lock_devtree();
+    /* isolate from device tree */
+    qdev_unset_parent(dev);
+    qemu_unlock_devtree();
+    object_unref(OBJECT(dev));
+}
+
 void qdev_unplug(DeviceState *dev, Error **errp)
 {
     DeviceClass *dc = DEVICE_GET_CLASS(dev);
diff --git a/hw/qdev.h b/hw/qdev.h
index f4683dc..705635a 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -47,7 +47,7 @@  typedef struct DeviceClass {
 
     /* callbacks */
     void (*reset)(DeviceState *dev);
-
+    void (*unmap)(DeviceState *dev);
     /* device state */
     const VMStateDescription *vmsd;
 
@@ -162,6 +162,7 @@  void qdev_init_nofail(DeviceState *dev);
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
                                  int required_for_version);
 void qdev_unplug(DeviceState *dev, Error **errp);
+void qdev_unplug_complete(DeviceState *dev, Error **errp);
 void qdev_free(DeviceState *dev);
 int qdev_simple_unplug_cb(DeviceState *dev);
 void qdev_machine_creation_done(void);