Message ID | 5646286B.2030307@suse.de |
---|---|
State | New |
Headers | show |
On 11/13/2015 07:14 PM, Andreas Färber wrote: > Am 13.10.2015 um 14:37 schrieb Daniel P. Berrange: >> From: Pavel Fedin <p.fedin@samsung.com> >> >> ARM GICv3 systems with large number of CPUs create lots of IRQ pins. Since >> every pin is represented as a property, number of these properties becomes >> very large. Every property add first makes sure there's no duplicates. >> Traversing the list becomes very slow, therefore qemu initialization takes >> significant time (several seconds for e. g. 16 CPUs). >> >> This patch replaces list with GHashTable, making lookup very fast. The only >> drawback is that object_child_foreach() and object_child_foreach_recursive() >> cannot modify their objects during traversal, since GHashTableIter does not >> have modify-safe version. However, the code seems not to modify objects via >> these functions. >> >> Signed-off-by: Daniel P. Berrange <berrange@redhat.com> >> Signed-off-by: Pavel Fedin <p.fedin@samsung.com> > > (note these seemed misordered) > > I have queued things up to 6/7 on qom-next: > https://github.com/afaerber/qemu-cpu/commits/qom-next > > This patch didn't apply and I had to hand-apply one hunk (which I > double-checked, but you never know). > > Unfortunately I run into this test failure: > > TEST: tests/device-introspect-test... (pid=4094) > /s390x/device/introspect/list: OK > /s390x/device/introspect/none: OK > /s390x/device/introspect/abstract: OK > /s390x/device/introspect/concrete: > (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > 'ri->version == ri->hash_table->version' failed > > (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > 'ri->version == ri->hash_table->version' failed > > (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion > 'ri->version == ri->hash_table->version' failed > ** > ERROR:/home/andreas/QEMU/qemu/qom/object.c:867:object_unref: assertion > failed: (obj->ref > 0) > Broken pipe > FAIL > GTester: last random seed: R02S4fa2068506971129a7ebe2323dbe03b7 > (pid=4104) > FAIL: tests/device-introspect-test > TEST: tests/qom-test... (pid=4105) > /s390x/qom/s390-ccw-virtio-2.5: OK > /s390x/qom/s390-ccw-virtio-2.4: OK > /s390x/qom/none: OK > /s390x/qom/s390-virtio: > WARNING > The s390-virtio machine (non-ccw) is deprecated. > It will be removed in 2.6. Please use s390-ccw-virtio > OK > PASS: tests/qom-test > > Are you sure you tested all targets? > Any hunch where this might stem from? > > The below patch reveals that the ref count is 0. Might be just a symptom > of the actual problem though. A simpler reproducer is s390x-softmmu/qemu-system-s390x -device sclp,help which fails with this patch and succeeds without.
Am 13.11.2015 um 22:00 schrieb Christian Borntraeger: > On 11/13/2015 07:14 PM, Andreas Färber wrote: >> Am 13.10.2015 um 14:37 schrieb Daniel P. Berrange: >>> From: Pavel Fedin <p.fedin@samsung.com> >>> >>> ARM GICv3 systems with large number of CPUs create lots of IRQ pins. Since >>> every pin is represented as a property, number of these properties becomes >>> very large. Every property add first makes sure there's no duplicates. >>> Traversing the list becomes very slow, therefore qemu initialization takes >>> significant time (several seconds for e. g. 16 CPUs). >>> >>> This patch replaces list with GHashTable, making lookup very fast. The only >>> drawback is that object_child_foreach() and object_child_foreach_recursive() >>> cannot modify their objects during traversal, since GHashTableIter does not >>> have modify-safe version. However, the code seems not to modify objects via >>> these functions. >>> >>> Signed-off-by: Daniel P. Berrange <berrange@redhat.com> >>> Signed-off-by: Pavel Fedin <p.fedin@samsung.com> >> >> (note these seemed misordered) >> >> I have queued things up to 6/7 on qom-next: >> https://github.com/afaerber/qemu-cpu/commits/qom-next >> >> This patch didn't apply and I had to hand-apply one hunk (which I >> double-checked, but you never know). >> >> Unfortunately I run into this test failure: >> >> TEST: tests/device-introspect-test... (pid=4094) >> /s390x/device/introspect/list: OK >> /s390x/device/introspect/none: OK >> /s390x/device/introspect/abstract: OK >> /s390x/device/introspect/concrete: >> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >> 'ri->version == ri->hash_table->version' failed >> >> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >> 'ri->version == ri->hash_table->version' failed >> >> (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion >> 'ri->version == ri->hash_table->version' failed >> ** >> ERROR:/home/andreas/QEMU/qemu/qom/object.c:867:object_unref: assertion >> failed: (obj->ref > 0) >> Broken pipe >> FAIL >> GTester: last random seed: R02S4fa2068506971129a7ebe2323dbe03b7 >> (pid=4104) >> FAIL: tests/device-introspect-test >> TEST: tests/qom-test... (pid=4105) >> /s390x/qom/s390-ccw-virtio-2.5: OK >> /s390x/qom/s390-ccw-virtio-2.4: OK >> /s390x/qom/none: OK >> /s390x/qom/s390-virtio: >> WARNING >> The s390-virtio machine (non-ccw) is deprecated. >> It will be removed in 2.6. Please use s390-ccw-virtio >> OK >> PASS: tests/qom-test >> >> Are you sure you tested all targets? >> Any hunch where this might stem from? >> >> The below patch reveals that the ref count is 0. Might be just a symptom >> of the actual problem though. > > A simpler reproducer is > s390x-softmmu/qemu-system-s390x -device sclp,help > which fails with this patch and succeeds without. Thanks! sclp_init() seems to violate several QOM design principles in that it uses object_new() during TypeInfo::instance_init() and uses a TYPE_... constant as property name. But nothing else stands out immediately. Regards, Andreas
Hello! > >> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > >> 'ri->version == ri->hash_table->version' failed > >> > >> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > >> 'ri->version == ri->hash_table->version' failed > >> > >> (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion > >> 'ri->version == ri->hash_table->version' failed Wow... Actually this may come from attempts to modify the tree inside iteration. > Thanks! sclp_init() seems to violate several QOM design principles in > that it uses object_new() during TypeInfo::instance_init() and uses a > TYPE_... constant as property name. But nothing else stands out immediately. I think we should refactor this and retry. If not all problems go away, then we are indeed modifying the tree during iteration, and we have to find some solution. I wonder... Could we have both list and hashtable? hashtable for searching by name and list for iteration. In this case we would not have to use glib's iterators, and would be free of problems with them. Just keep the list and hashtable in sync. Or, is there any hashtable implementation out there which would keep iterators valid during modification? OTOH, glib has a function "remove the element at iterator's position", and we could postpone addition. So, perhaps, using both containers would be an overkill, just refactor the code to adapt to the new behavior. Kind regards, Pavel Fedin Expert Engineer Samsung Electronics Research center Russia
On 11/16/2015 08:13 AM, Pavel Fedin wrote: > Hello! > >>>> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >>>> 'ri->version == ri->hash_table->version' failed >>>> >>>> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >>>> 'ri->version == ri->hash_table->version' failed >>>> >>>> (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion >>>> 'ri->version == ri->hash_table->version' failed > > Wow... Actually this may come from attempts to modify the tree inside iteration. > >> Thanks! sclp_init() seems to violate several QOM design principles in >> that it uses object_new() during TypeInfo::instance_init() and uses a >> TYPE_... constant as property name. But nothing else stands out immediately. > > I think we should refactor this and retry. If not all problems go away, then we are indeed modifying the tree during iteration, and > we have to find some solution. David, Conny, the current tree of afaerber https://github.com/afaerber/qemu-cpu/commits/qom-next has this patch: > From: Pavel Fedin <p.fedin@samsung.com> > > ARM GICv3 systems with large number of CPUs create lots of IRQ pins. Since > every pin is represented as a property, number of these properties becomes > very large. Every property add first makes sure there's no duplicates. > Traversing the list becomes very slow, therefore qemu initialization takes > significant time (several seconds for e. g. 16 CPUs). > > This patch replaces list with GHashTable, making lookup very fast. The only > drawback is that object_child_foreach() and object_child_foreach_recursive() > cannot modify their objects during traversal, since GHashTableIter does not > have modify-safe version. However, the code seems not to modify objects via > these functions. > > Signed-off-by: Daniel P. Berrange <berrange@redhat.com> > Signed-off-by: Pavel Fedin <p.fedin@samsung.com> which causes failures in make check. A simple reproducer is qemu-system-s390x -device sclp,help any idea what would be the most simple fix? Can we refactor this to create the event facility and the bus in the machine or whatever? > I wonder... Could we have both list and hashtable? hashtable for searching by name and list for iteration. In this case we would > not have to use glib's iterators, and would be free of problems with them. Just keep the list and hashtable in sync. > Or, is there any hashtable implementation out there which would keep iterators valid during modification? > OTOH, glib has a function "remove the element at iterator's position", and we could postpone addition. So, perhaps, using both > containers would be an overkill, just refactor the code to adapt to the new behavior.
On 13/11/2015 22:25, Andreas Färber wrote: > Thanks! sclp_init() seems to violate several QOM design principles in > that it uses object_new() during TypeInfo::instance_init() There's nothing wrong with that. It's wrong however to use qdev_set_parent_bus in instance_init. That should be moved to sclp_realize, before the realized property is set to true. Otherwise, sclp->event_facility outlives its parent. I'm not sure why that works only with the list and not with the hash table though. It may well be a bug in this patch. > and uses a TYPE_... constant as property name. That's just a little weird, it doesn't break anything per se. Paolo
Am 16.11.2015 um 09:16 schrieb Christian Borntraeger: > On 11/16/2015 08:13 AM, Pavel Fedin wrote: >>>>> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >>>>> 'ri->version == ri->hash_table->version' failed >>>>> >>>>> (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion >>>>> 'ri->version == ri->hash_table->version' failed >>>>> >>>>> (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion >>>>> 'ri->version == ri->hash_table->version' failed >> >> Wow... Actually this may come from attempts to modify the tree inside iteration. >> >>> Thanks! sclp_init() seems to violate several QOM design principles in >>> that it uses object_new() during TypeInfo::instance_init() and uses a >>> TYPE_... constant as property name. But nothing else stands out immediately. >> >> I think we should refactor this and retry. If not all problems go away, then we are indeed modifying the tree during iteration, and >> we have to find some solution. > > David, Conny, > > the current tree of afaerber > > https://github.com/afaerber/qemu-cpu/commits/qom-next > > has this patch: > >> From: Pavel Fedin <p.fedin@samsung.com> >> >> ARM GICv3 systems with large number of CPUs create lots of IRQ pins. Since >> every pin is represented as a property, number of these properties becomes >> very large. Every property add first makes sure there's no duplicates. >> Traversing the list becomes very slow, therefore qemu initialization takes >> significant time (several seconds for e. g. 16 CPUs). >> >> This patch replaces list with GHashTable, making lookup very fast. The only >> drawback is that object_child_foreach() and object_child_foreach_recursive() >> cannot modify their objects during traversal, since GHashTableIter does not >> have modify-safe version. However, the code seems not to modify objects via >> these functions. >> >> Signed-off-by: Daniel P. Berrange <berrange@redhat.com> >> Signed-off-by: Pavel Fedin <p.fedin@samsung.com> > > which causes failures in make check. A simple reproducer is > > qemu-system-s390x -device sclp,help > > > any idea what would be the most simple fix? > Can we refactor this to create the event facility and the bus in the > machine or whatever? I believe it is rather a very general problem with the new object_property_del_all() implementation. It iterates through properties, releasing child<> and link<> properties, which results in an unref, which at some point unparents that device, removing it in the parent's properties hashtable while the parent is iterating through it. In this case it seems to be about the bus child<> on the event facility. >> I wonder... Could we have both list and hashtable? hashtable for searching by name and list for iteration. In this case we would >> not have to use glib's iterators, and would be free of problems with them. Just keep the list and hashtable in sync. >> Or, is there any hashtable implementation out there which would keep iterators valid during modification? >> OTOH, glib has a function "remove the element at iterator's position", and we could postpone addition. So, perhaps, using both >> containers would be an overkill, just refactor the code to adapt to the new behavior. My idea, which I wanted to investigate after the weekend, is iterating through the hashtable to create a list of prop->release functions and call them only after finishing the iteration. That might not work either, so we may need to loop over the releasing to allow for released properties to disappear after prop->release(). Regards, Andreas
Am 16.11.2015 um 09:53 schrieb Paolo Bonzini: > On 13/11/2015 22:25, Andreas Färber wrote: >> Thanks! sclp_init() seems to violate several QOM design principles in >> that it uses object_new() during TypeInfo::instance_init() > > There's nothing wrong with that. It's wrong however to use > qdev_set_parent_bus in instance_init. That should be moved to > sclp_realize, before the realized property is set to true. Negative, realize is too late. Since there are two call sites for initialization, I don't see a better place for it. > Otherwise, sclp->event_facility outlives its parent. I'm not sure why > that works only with the list and not with the hash table though. It > may well be a bug in this patch. Luckily that's not the problem here. The unref should account for its lifetime. >> and uses a TYPE_... constant as property name. > > That's just a little weird, it doesn't break anything per se. It risks renaming the type causing the property to change name, which affects ABI stability. Therefore we always "inline" property names. Will adjust after I have a fix. Regards, Andreas
On 16/11/2015 10:48, Andreas Färber wrote: > > > Thanks! sclp_init() seems to violate several QOM design principles in > > > that it uses object_new() during TypeInfo::instance_init() > > > > There's nothing wrong with that. It's wrong however to use > > qdev_set_parent_bus in instance_init. That should be moved to > > sclp_realize, before the realized property is set to true. > > Negative, realize is too late. Since there are two call sites for > initialization, I don't see a better place for it. Why is realize too late to set the parent bus? Paolo
Hello! > My idea, which I wanted to investigate after the weekend, is iterating > through the hashtable to create a list of prop->release functions and > call them only after finishing the iteration. That might not work > either, so we may need to loop over the releasing to allow for released > properties to disappear after prop->release(). Hm... May be instead of actually deleting a property, while we are iterating, we should mark it as pending for deletion, and then, after iteration is done, actually remove them? This would cost one 'bool delete_pending' per property. Kind regards, Pavel Fedin Expert Engineer Samsung Electronics Research center Russia
On Fri, Nov 13, 2015 at 10:00:58PM +0100, Christian Borntraeger wrote: > On 11/13/2015 07:14 PM, Andreas Färber wrote: > > Am 13.10.2015 um 14:37 schrieb Daniel P. Berrange: > >> From: Pavel Fedin <p.fedin@samsung.com> > >> > >> ARM GICv3 systems with large number of CPUs create lots of IRQ pins. Since > >> every pin is represented as a property, number of these properties becomes > >> very large. Every property add first makes sure there's no duplicates. > >> Traversing the list becomes very slow, therefore qemu initialization takes > >> significant time (several seconds for e. g. 16 CPUs). > >> > >> This patch replaces list with GHashTable, making lookup very fast. The only > >> drawback is that object_child_foreach() and object_child_foreach_recursive() > >> cannot modify their objects during traversal, since GHashTableIter does not > >> have modify-safe version. However, the code seems not to modify objects via > >> these functions. > >> > >> Signed-off-by: Daniel P. Berrange <berrange@redhat.com> > >> Signed-off-by: Pavel Fedin <p.fedin@samsung.com> > > > > (note these seemed misordered) > > > > I have queued things up to 6/7 on qom-next: > > https://github.com/afaerber/qemu-cpu/commits/qom-next > > > > This patch didn't apply and I had to hand-apply one hunk (which I > > double-checked, but you never know). > > > > Unfortunately I run into this test failure: > > > > TEST: tests/device-introspect-test... (pid=4094) > > /s390x/device/introspect/list: OK > > /s390x/device/introspect/none: OK > > /s390x/device/introspect/abstract: OK > > /s390x/device/introspect/concrete: > > (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > > 'ri->version == ri->hash_table->version' failed > > > > (process:4102): GLib-CRITICAL **: g_hash_table_iter_next: assertion > > 'ri->version == ri->hash_table->version' failed > > > > (process:4102): GLib-CRITICAL **: iter_remove_or_steal: assertion > > 'ri->version == ri->hash_table->version' failed > > ** > > ERROR:/home/andreas/QEMU/qemu/qom/object.c:867:object_unref: assertion > > failed: (obj->ref > 0) > > Broken pipe > > FAIL > > GTester: last random seed: R02S4fa2068506971129a7ebe2323dbe03b7 > > (pid=4104) > > FAIL: tests/device-introspect-test > > TEST: tests/qom-test... (pid=4105) > > /s390x/qom/s390-ccw-virtio-2.5: OK > > /s390x/qom/s390-ccw-virtio-2.4: OK > > /s390x/qom/none: OK > > /s390x/qom/s390-virtio: > > WARNING > > The s390-virtio machine (non-ccw) is deprecated. > > It will be removed in 2.6. Please use s390-ccw-virtio > > OK > > PASS: tests/qom-test > > > > Are you sure you tested all targets? > > Any hunch where this might stem from? > > > > The below patch reveals that the ref count is 0. Might be just a symptom > > of the actual problem though. > > A simpler reproducer is > s390x-softmmu/qemu-system-s390x -device sclp,help > which fails with this patch and succeeds without. The problems arise when unref'ing the object, which is where we iterate over properties deleting them. From my investigation it seems we have a re-entrancy issue where by child objects are calling back to the parent to delete properties while iteration is taking place. The glib hash iterators are not re-entrant safe, so this causes the glib errors showing we're modifying the hash table while iteration is taking place. (qemu-system-s390x:21872): GLib-CRITICAL **: iter_remove_or_steal: assertion 'ri->version == ri->hash_table->version' failed I'm looking into how to fix this now... Regards, Daniel
diff --git a/qom/object.c b/qom/object.c index 0ac3bc1..9aa6159 100644 --- a/qom/object.c +++ b/qom/object.c @@ -864,7 +864,7 @@ void object_unref(Object *obj) if (!obj) { return; } - g_assert(obj->ref > 0); + g_assert_cmpint(obj->ref, >, 0); /* parent always holds a reference to its children */ if (atomic_fetch_dec(&obj->ref) == 1) {