diff mbox series

secondary-vga: unregister vram on unplug.

Message ID 20180720081948.23644-1-remy.noel@blade-group.com
State New
Headers show
Series secondary-vga: unregister vram on unplug. | expand

Commit Message

Remy Noel July 20, 2018, 8:19 a.m. UTC
From: "Remy Noel" <remy.noel@blade-group.com>

When removing a secondary-vga device and then adding it back (or adding
an other one), qemu aborts with:
    "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".

It is caused by the vram staying registered, preventing vga replugging.

Signed-off-by: Remy Noel <remy.noel@blade-group.com>
---
 hw/display/vga-pci.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Gerd Hoffmann Aug. 7, 2018, 1:19 p.m. UTC | #1
On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com wrote:
> From: "Remy Noel" <remy.noel@blade-group.com>
> 
> When removing a secondary-vga device and then adding it back (or adding
> an other one), qemu aborts with:
>     "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
> 
> It is caused by the vram staying registered, preventing vga replugging.

David?  Does that look ok?

This balances the

     vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL : DEVICE(obj));

call in vga_common_init().  I'm wondering whenever the manual cleanup is
actually needed in case owner is not NULL?

thanks,
  Gerd
Dr. David Alan Gilbert Aug. 7, 2018, 2:57 p.m. UTC | #2
* Gerd Hoffmann (kraxel@redhat.com) wrote:
> On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com wrote:
> > From: "Remy Noel" <remy.noel@blade-group.com>
> > 
> > When removing a secondary-vga device and then adding it back (or adding
> > an other one), qemu aborts with:
> >     "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
> > 
> > It is caused by the vram staying registered, preventing vga replugging.
> 
> David?  Does that look ok?
> 
> This balances the
> 
>      vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL : DEVICE(obj));
> 
> call in vga_common_init().  I'm wondering whenever the manual cleanup is
> actually needed in case owner is not NULL?

I can't see anyone who is calling unregister_ram or the functions it
calls as part of generic device cleanup, so I think it IS needed
to manually do it.

Which is a bit worrying since we have vastly more register's than
unregister's.

Dave

> thanks,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Maydell Aug. 7, 2018, 3:06 p.m. UTC | #3
On 7 August 2018 at 15:57, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> * Gerd Hoffmann (kraxel@redhat.com) wrote:
>> On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com wrote:
>> > From: "Remy Noel" <remy.noel@blade-group.com>
>> >
>> > When removing a secondary-vga device and then adding it back (or adding
>> > an other one), qemu aborts with:
>> >     "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
>> >
>> > It is caused by the vram staying registered, preventing vga replugging.
>>
>> David?  Does that look ok?
>>
>> This balances the
>>
>>      vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL : DEVICE(obj));
>>
>> call in vga_common_init().  I'm wondering whenever the manual cleanup is
>> actually needed in case owner is not NULL?
>
> I can't see anyone who is calling unregister_ram or the functions it
> calls as part of generic device cleanup, so I think it IS needed
> to manually do it.
>
> Which is a bit worrying since we have vastly more register's than
> unregister's.

Paolo suggested in an email last month that vmstate_unregister_ram()
should simply not exist, because it doesn't actually do anything useful:
https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg01125.html

(ie it was added in the first place because we'd ended up with
two identically named ramblocks, but that only happened because
a reference-counting bug meant we hadn't deleted the first one
properly before creating the second.)

So I think that the bug reported in this thread is similar:
the problem is not that we're not calling vmstate_unregister_ram(),
but that when the first instance of secondary-vga is removed
it is not correctly destroying the ramblock.

thanks
-- PMM
Dr. David Alan Gilbert Aug. 7, 2018, 3:09 p.m. UTC | #4
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On 7 August 2018 at 15:57, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > * Gerd Hoffmann (kraxel@redhat.com) wrote:
> >> On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com wrote:
> >> > From: "Remy Noel" <remy.noel@blade-group.com>
> >> >
> >> > When removing a secondary-vga device and then adding it back (or adding
> >> > an other one), qemu aborts with:
> >> >     "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
> >> >
> >> > It is caused by the vram staying registered, preventing vga replugging.
> >>
> >> David?  Does that look ok?
> >>
> >> This balances the
> >>
> >>      vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL : DEVICE(obj));
> >>
> >> call in vga_common_init().  I'm wondering whenever the manual cleanup is
> >> actually needed in case owner is not NULL?
> >
> > I can't see anyone who is calling unregister_ram or the functions it
> > calls as part of generic device cleanup, so I think it IS needed
> > to manually do it.
> >
> > Which is a bit worrying since we have vastly more register's than
> > unregister's.
> 
> Paolo suggested in an email last month that vmstate_unregister_ram()
> should simply not exist, because it doesn't actually do anything useful:
> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg01125.html
> 
> (ie it was added in the first place because we'd ended up with
> two identically named ramblocks, but that only happened because
> a reference-counting bug meant we hadn't deleted the first one
> properly before creating the second.)
> 
> So I think that the bug reported in this thread is similar:
> the problem is not that we're not calling vmstate_unregister_ram(),
> but that when the first instance of secondary-vga is removed
> it is not correctly destroying the ramblock.

Ah yes that makes more sense; I remember there was another similar bug
where a device screwed up and didn't delete it's RAM causing similar
problems.

Dave

> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Remy Noel Aug. 11, 2018, 7:07 p.m. UTC | #5
On 08/07/2018 05:09 PM, Dr. David Alan Gilbert wrote:

> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> On 7 August 2018 at 15:57, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>> * Gerd Hoffmann (kraxel@redhat.com) wrote:
>>>> On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com wrote:
>>>>> From: "Remy Noel" <remy.noel@blade-group.com>
>>>>>
>>>>> When removing a secondary-vga device and then adding it back (or adding
>>>>> an other one), qemu aborts with:
>>>>>      "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
>>>>>
>>>>> It is caused by the vram staying registered, preventing vga replugging.
>>>> David?  Does that look ok?
>>>>
>>>> This balances the
>>>>
>>>>       vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL : DEVICE(obj));
>>>>
>>>> call in vga_common_init().  I'm wondering whenever the manual cleanup is
>>>> actually needed in case owner is not NULL?
>>> I can't see anyone who is calling unregister_ram or the functions it
>>> calls as part of generic device cleanup, so I think it IS needed
>>> to manually do it.
>>>
>>> Which is a bit worrying since we have vastly more register's than
>>> unregister's.
>> Paolo suggested in an email last month that vmstate_unregister_ram()
>> should simply not exist, because it doesn't actually do anything useful:
>> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg01125.html
>>
>> (ie it was added in the first place because we'd ended up with
>> two identically named ramblocks, but that only happened because
>> a reference-counting bug meant we hadn't deleted the first one
>> properly before creating the second.)
>>
>> So I think that the bug reported in this thread is similar:
>> the problem is not that we're not calling vmstate_unregister_ram(),
>> but that when the first instance of secondary-vga is removed
>> it is not correctly destroying the ramblock.
> Ah yes that makes more sense; I remember there was another similar bug
> where a device screwed up and didn't delete it's RAM causing similar
> problems.
>
> Dave
Thanks for the feedback, after closer inspection, the secondary-vga 
refcount does, indeed, never reach 0.

I noticed the bug was not present in v2.12.0 and had been visible since 
93abfc88bd649de1933588bfc7175605331b3ea9 
(https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg07547.html).

This patch causes the secondary-vga object to be referenced by its 
subregions (mrs) which are themselves referenced by its mmio region 
which is referenced by the device causing a reference loop.
We should probably break this loop upon exit, however, i am not sure 
whether we should deletes the subregions or delete the mmio properly.

Thanks

Remy
Paolo Bonzini Aug. 12, 2018, 10:19 a.m. UTC | #6
On 11/08/2018 21:07, Remy NOEL wrote:
> On 08/07/2018 05:09 PM, Dr. David Alan Gilbert wrote:
> 
>> * Peter Maydell (peter.maydell@linaro.org) wrote:
>>> On 7 August 2018 at 15:57, Dr. David Alan Gilbert
>>> <dgilbert@redhat.com> wrote:
>>>> * Gerd Hoffmann (kraxel@redhat.com) wrote:
>>>>> On Fri, Jul 20, 2018 at 10:19:48AM +0200, remy.noel@blade-group.com
>>>>> wrote:
>>>>>> From: "Remy Noel" <remy.noel@blade-group.com>
>>>>>>
>>>>>> When removing a secondary-vga device and then adding it back (or
>>>>>> adding
>>>>>> an other one), qemu aborts with:
>>>>>>      "RAMBlock "0000:00:02.0/vga.vram" already registered, abort!".
>>>>>>
>>>>>> It is caused by the vram staying registered, preventing vga
>>>>>> replugging.
>>>>> David?  Does that look ok?
>>>>>
>>>>> This balances the
>>>>>
>>>>>       vmstate_register_ram(&s->vram, s->global_vmstate ?  NULL :
>>>>> DEVICE(obj));
>>>>>
>>>>> call in vga_common_init().  I'm wondering whenever the manual
>>>>> cleanup is
>>>>> actually needed in case owner is not NULL?
>>>> I can't see anyone who is calling unregister_ram or the functions it
>>>> calls as part of generic device cleanup, so I think it IS needed
>>>> to manually do it.
>>>>
>>>> Which is a bit worrying since we have vastly more register's than
>>>> unregister's.
>>> Paolo suggested in an email last month that vmstate_unregister_ram()
>>> should simply not exist, because it doesn't actually do anything useful:
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg01125.html
>>>
>>> (ie it was added in the first place because we'd ended up with
>>> two identically named ramblocks, but that only happened because
>>> a reference-counting bug meant we hadn't deleted the first one
>>> properly before creating the second.)
>>>
>>> So I think that the bug reported in this thread is similar:
>>> the problem is not that we're not calling vmstate_unregister_ram(),
>>> but that when the first instance of secondary-vga is removed
>>> it is not correctly destroying the ramblock.
>> Ah yes that makes more sense; I remember there was another similar bug
>> where a device screwed up and didn't delete it's RAM causing similar
>> problems.
>>
>> Dave
> Thanks for the feedback, after closer inspection, the secondary-vga
> refcount does, indeed, never reach 0.
> 
> I noticed the bug was not present in v2.12.0 and had been visible since
> 93abfc88bd649de1933588bfc7175605331b3ea9
> (https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg07547.html).
> 
> This patch causes the secondary-vga object to be referenced by its
> subregions (mrs) which are themselves referenced by its mmio region
> which is referenced by the device causing a reference loop.
> We should probably break this loop upon exit, however, i am not sure
> whether we should deletes the subregions or delete the mmio properly.

I'll take a look...

Paolo
Gerd Hoffmann Aug. 30, 2018, 11:28 a.m. UTC | #7
Hi,

> > Thanks for the feedback, after closer inspection, the secondary-vga
> > refcount does, indeed, never reach 0.
> > 
> > I noticed the bug was not present in v2.12.0 and had been visible since
> > 93abfc88bd649de1933588bfc7175605331b3ea9
> > (https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg07547.html).
> > 
> > This patch causes the secondary-vga object to be referenced by its
> > subregions (mrs) which are themselves referenced by its mmio region
> > which is referenced by the device causing a reference loop.
> > We should probably break this loop upon exit, however, i am not sure
> > whether we should deletes the subregions or delete the mmio properly.
> 
> I'll take a look...

Ping, any results?

I'm wondering whenever we should just revert
93abfc88bd649de1933588bfc7175605331b3ea9.

Retested hotplug with 93abfc88bd649de1933588bfc7175605331b3ea9 reverted.
Works just fine with guest kernel loaded.  Doesn't work without guest
booted (grub prompt for example), but pci hotplug requires guest
cooperation so this isn't a bug in secondary-vga ...

cheers,
  Gerd
Remy Noel Oct. 2, 2018, 11:55 a.m. UTC | #8
Hi

On 8/30/18 1:28 PM, Gerd Hoffmann wrote:

>> I'll take a look...
> Ping, any results?
>
> I'm wondering whenever we should just revert
> 93abfc88bd649de1933588bfc7175605331b3ea9.
>
> Retested hotplug with 93abfc88bd649de1933588bfc7175605331b3ea9 reverted.
> Works just fine with guest kernel loaded.  Doesn't work without guest
> booted (grub prompt for example), but pci hotplug requires guest
> cooperation so this isn't a bug in secondary-vga ...
>
> cheers,
>    Gerd

I finally got the time to get back on this issue.

Just deleting the 3 subregions from the mmio in the exit callback does 
the job and everything works fine. I'll re-submit the patch in a new thread.

Reverting 93abfc88bd649de1933588bfc7175605331b3ea9 indeed frees the 
object but the hanging memory regions seems to cause some problems upon 
reset::

   - After deleting a device and adding it back, i get a segfault when 
calling system_reset:

                 #0  0x00007f90611acab8 g_hash_table_iter_next 
(libglib-2.0.so.0)
                 #1  0x000055a631e7aa39 object_resolve_partial_path 
(qemu-system-x86_64)
                 #2  0x000055a631e7a9ea object_resolve_partial_path 
(qemu-system-x86_64)
                 #3  0x000055a631e7a9ea object_resolve_partial_path 
(qemu-system-x86_64)
                 #4  0x000055a631e7a9ea object_resolve_partial_path 
(qemu-system-x86_64)
                 #5  0x000055a631e7aafd object_resolve_path_type 
(qemu-system-x86_64)
                 #6  0x000055a631c29879 piix4_pm_find (qemu-system-x86_64)
                 #7  0x000055a631b2d4ed acpi_get_pm_info 
(qemu-system-x86_64)
                 #8  0x000055a631b35903 acpi_build (qemu-system-x86_64)
                 #9  0x000055a631b3618d acpi_build_update 
(qemu-system-x86_64)
                 #10 0x000055a631d27cb3 fw_cfg_select (qemu-system-x86_64)
                 #11 0x000055a631d2848c fw_cfg_comb_write 
(qemu-system-x86_64)
                 #12 0x000055a631a5ebd7 memory_region_write_accessor 
(qemu-system-x86_64)
                 #13 0x000055a631a5edec access_with_adjusted_size 
(qemu-system-x86_64)
                 #14 0x000055a631a61ac2 memory_region_dispatch_write 
(qemu-system-x86_64)
                 #15 0x000055a6319fcbf9 flatview_write_continue 
(qemu-system-x86_64)
                 #16 0x000055a6319fcd43 flatview_write (qemu-system-x86_64)
                 #17 0x000055a6319fd049 address_space_write 
(qemu-system-x86_64)
                 #18 0x000055a6319fd09a address_space_rw 
(qemu-system-x86_64)
                 #19 0x000055a631a7cd66 kvm_handle_io (qemu-system-x86_64)
                 #20 0x000055a631a7d499 kvm_cpu_exec (qemu-system-x86_64)
                 #21 0x000055a631a43cb7 qemu_kvm_cpu_thread_fn 
(qemu-system-x86_64)
                 #22 0x000055a631fbbadd qemu_thread_start 
(qemu-system-x86_64)
                 #23 0x00007f905c851a9d start_thread (libpthread.so.0)
                 #24 0x00007f905c781a43 __clone (libc.so.6)

   - When calling system_reset just after removing a device, we get a 
bunch of:

         GLib-CRITICAL **: 10:16:29.020: g_hash_table_iter_init: 
assertion 'hash_table != NULL' failed

         GLib-CRITICAL **: 10:16:29.023: g_hash_table_iter_next: 
assertion 'ri->version == ri->hash_table->version' failed


I did not inquiry those further though as the subregions removal did 
lead to a clean state.


Remy Noel
diff mbox series

Patch

diff --git a/hw/display/vga-pci.c b/hw/display/vga-pci.c
index e9e62eac70..1c89571e46 100644
--- a/hw/display/vga-pci.c
+++ b/hw/display/vga-pci.c
@@ -288,6 +288,7 @@  static void pci_secondary_vga_exit(PCIDevice *dev)
     VGACommonState *s = &d->vga;
 
     graphic_console_close(s->con);
+    vmstate_unregister_ram(&s->vram, DEVICE(dev));
 }
 
 static void pci_secondary_vga_init(Object *obj)