diff mbox

[RFC,qemu] exec: Destroy dispatch immediately

Message ID 20170825083123.47432-1-aik@ozlabs.ru
State New
Headers show

Commit Message

Alexey Kardashevskiy Aug. 25, 2017, 8:31 a.m. UTC
Otherwise old dispatch holds way too much memory before RCU gets
a chance to free old dispatches.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is a follow-up to the "Memory use with >100 virtio devices"
thread.


I assume this is a dirty hack (which fixes the problem though)
and I wonder what the proper solution would be. Thanks.


What happens here is that every virtio block device creates 2 address
spaces - for modern config space (called "virtio-pci-cfg-as") and
for busmaster (common pci thing, called after the device name,
in my case "virtio-blk-pci").

Each address_space_init() updates topology for _every_ address space.
Every topology update (address_space_update_topology()) creates a new
dispatch tree - AddressSpaceDispatch with nodes (1KB) and
sections (48KB) and destroys the old one.

However the dispatch destructor is postponed via RCU which does not
get a chance to execute until the machine is initialized but before
we get there, memory is not returned to the pool, and this is a lot
of memory which grows n^2.


Interestingly, mem_add() from exec.c is called twice:
as as->dispatch_listener.region_add() and
as as->dispatch_listener.region_nop() - I did not understand
the trick but it does not work if I remove the .region_nop() hook.
How does it work? :)

---
 exec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Paolo Bonzini Aug. 25, 2017, 8:53 a.m. UTC | #1
On 25/08/2017 10:31, Alexey Kardashevskiy wrote:
> Otherwise old dispatch holds way too much memory before RCU gets
> a chance to free old dispatches.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is a follow-up to the "Memory use with >100 virtio devices"
> thread.
> 
> I assume this is a dirty hack (which fixes the problem though)

This doesn't work.  AddressSpaceDispatch can be accessed outside the big
QEMU lock, which is why it is destroyed via call_rcu.

The solution is to: 1) share the FlatView structures if they refer to
the same root memory region; 2) have one AddressSpaceDispatch per
FlatView instead of one per AddressSpace, so that FlatView reference
counting takes care of clearing the AddressSpaceDispatch too.  Neither
is particularly hard.  The second requires getting rid of the
as->dispatch_listener and "hard coding" the creation of the
AddressSpaceDispatch from the FlatView in memory.c.

Paolo

> and I wonder what the proper solution would be. Thanks.
> 
> 
> What happens here is that every virtio block device creates 2 address
> spaces - for modern config space (called "virtio-pci-cfg-as") and
> for busmaster (common pci thing, called after the device name,
> in my case "virtio-blk-pci").
> 
> Each address_space_init() updates topology for _every_ address space.
> Every topology update (address_space_update_topology()) creates a new
> dispatch tree - AddressSpaceDispatch with nodes (1KB) and
> sections (48KB) and destroys the old one.
> 
> However the dispatch destructor is postponed via RCU which does not
> get a chance to execute until the machine is initialized but before
> we get there, memory is not returned to the pool, and this is a lot
> of memory which grows n^2.
> 
> 
> Interestingly, mem_add() from exec.c is called twice:
> as as->dispatch_listener.region_add() and
> as as->dispatch_listener.region_nop() - I did not understand
> the trick but it does not work if I remove the .region_nop() hook.
> How does it work? :)
> 
> ---
>  exec.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index 01ac21e3cd..ea5f3eb209 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2707,7 +2707,7 @@ static void mem_commit(MemoryListener *listener)
>  
>      atomic_rcu_set(&as->dispatch, next);
>      if (cur) {
> -        call_rcu(cur, address_space_dispatch_free, rcu);
> +        address_space_dispatch_free(cur);
>      }
>  }
>  
>
Paolo Bonzini Aug. 25, 2017, 8:55 a.m. UTC | #2
On 25/08/2017 10:31, Alexey Kardashevskiy wrote:
> 
> Interestingly, mem_add() from exec.c is called twice:
> as as->dispatch_listener.region_add() and
> as as->dispatch_listener.region_nop() - I did not understand
> the trick but it does not work if I remove the .region_nop() hook.
> How does it work? :)

Didn't note this.

The hooks are:

- region_add: a new MemoryRegionSection appeared compared to the
previous FlatView

- region_nop: a region that was in the previous FlatView stayed there

- region_del: a MemoryRegionSection disappeared compared to the previous
FlatView

Because the AddressSpaceDispatch is rebuilt from scratch, it cares about
both new (region_add) and existing (region_nop) regions.

Paolo
Peter Maydell Aug. 25, 2017, 9:22 a.m. UTC | #3
On 25 August 2017 at 09:53, Paolo Bonzini <pbonzini@redhat.com> wrote:
> The solution is to: 1) share the FlatView structures if they refer to
> the same root memory region; 2) have one AddressSpaceDispatch per
> FlatView instead of one per AddressSpace, so that FlatView reference
> counting takes care of clearing the AddressSpaceDispatch too.  Neither
> is particularly hard.

If we did this we could get rid of address_space_init_shareable(),
right? (It's a bit of a cheesy hack aimed at avoiding having duplicate
address space structures for the same root memory region, but if
the underlying code is better at not duplicating all the data
structures unless necessary then the benefit of having the
separate API goes away I think.)

thanks
-- PMM
Paolo Bonzini Aug. 25, 2017, 9:57 a.m. UTC | #4
On 25/08/2017 11:22, Peter Maydell wrote:
> On 25 August 2017 at 09:53, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> The solution is to: 1) share the FlatView structures if they refer to
>> the same root memory region; 2) have one AddressSpaceDispatch per
>> FlatView instead of one per AddressSpace, so that FlatView reference
>> counting takes care of clearing the AddressSpaceDispatch too.  Neither
>> is particularly hard.
> If we did this we could get rid of address_space_init_shareable(),
> right? (It's a bit of a cheesy hack aimed at avoiding having duplicate
> address space structures for the same root memory region, but if
> the underlying code is better at not duplicating all the data
> structures unless necessary then the benefit of having the
> separate API goes away I think.)

Yes, indeed.

Paolo
David Gibson Aug. 25, 2017, 1:19 p.m. UTC | #5
On Fri, Aug 25, 2017 at 11:57:26AM +0200, Paolo Bonzini wrote:
> On 25/08/2017 11:22, Peter Maydell wrote:
> > On 25 August 2017 at 09:53, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> The solution is to: 1) share the FlatView structures if they refer to
> >> the same root memory region; 2) have one AddressSpaceDispatch per
> >> FlatView instead of one per AddressSpace, so that FlatView reference
> >> counting takes care of clearing the AddressSpaceDispatch too.  Neither
> >> is particularly hard.
> > If we did this we could get rid of address_space_init_shareable(),
> > right? (It's a bit of a cheesy hack aimed at avoiding having duplicate
> > address space structures for the same root memory region, but if
> > the underlying code is better at not duplicating all the data
> > structures unless necessary then the benefit of having the
> > separate API goes away I think.)
> 
> Yes, indeed.

Hm.  Why do we need to construct full ASes for virtio-blk, rather than
just MRs?
Peter Maydell Aug. 25, 2017, 1:46 p.m. UTC | #6
On 25 August 2017 at 14:19, David Gibson <david@gibson.dropbear.id.au> wrote:
> Hm.  Why do we need to construct full ASes for virtio-blk, rather than
> just MRs?

It's the PCI layer that's doing it, but the overall reason is because
virtio-blk needs to make memory transactions (DMA reads and writes).
The AddressSpace is our construct for "thing you use to be able
to efficiently make memory transactions on a MemoryRegion".
(This is what the FlatView and AddressSpaceDispatch data structures
are for -- they let you efficiently go from "write to address 0x4000
in this thing" to the actual destination RAM or device read/write
pointers.)

thanks
-- PMM
Paolo Bonzini Aug. 25, 2017, 2:04 p.m. UTC | #7
On 25/08/2017 15:19, David Gibson wrote:
> On Fri, Aug 25, 2017 at 11:57:26AM +0200, Paolo Bonzini wrote:
>> On 25/08/2017 11:22, Peter Maydell wrote:
>>> On 25 August 2017 at 09:53, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> The solution is to: 1) share the FlatView structures if they refer to
>>>> the same root memory region; 2) have one AddressSpaceDispatch per
>>>> FlatView instead of one per AddressSpace, so that FlatView reference
>>>> counting takes care of clearing the AddressSpaceDispatch too.  Neither
>>>> is particularly hard.
>>> If we did this we could get rid of address_space_init_shareable(),
>>> right? (It's a bit of a cheesy hack aimed at avoiding having duplicate
>>> address space structures for the same root memory region, but if
>>> the underlying code is better at not duplicating all the data
>>> structures unless necessary then the benefit of having the
>>> separate API goes away I think.)
>>
>> Yes, indeed.
> 
> Hm.  Why do we need to construct full ASes for virtio-blk, rather than
> just MRs?

It's an artifact of how virtio_address_space_read and
virtio_address_space_write are implemented.

Paolo
Alexey Kardashevskiy Aug. 29, 2017, 8:55 a.m. UTC | #8
On 25/08/17 18:53, Paolo Bonzini wrote:
> On 25/08/2017 10:31, Alexey Kardashevskiy wrote:
>> Otherwise old dispatch holds way too much memory before RCU gets
>> a chance to free old dispatches.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> This is a follow-up to the "Memory use with >100 virtio devices"
>> thread.
>>
>> I assume this is a dirty hack (which fixes the problem though)
> 
> This doesn't work.  AddressSpaceDispatch can be accessed outside the big
> QEMU lock, which is why it is destroyed via call_rcu.
> 
> The solution is to: 1) share the FlatView structures if they refer to
> the same root memory region; 2) have one AddressSpaceDispatch per
> FlatView instead of one per AddressSpace, so that FlatView reference
> counting takes care of clearing the AddressSpaceDispatch too.  Neither
> is particularly hard.  The second requires getting rid of the
> as->dispatch_listener and "hard coding" the creation of the
> AddressSpaceDispatch from the FlatView in memory.c.


While I am trying this approach, here is a cheap workaround -

diff --git a/vl.c b/vl.c
index 8e247cc2a2..4d95bc2a6a 100644
--- a/vl.c
+++ b/vl.c
@@ -4655,12 +4655,16 @@ int main(int argc, char **argv, char **envp)
     igd_gfx_passthru();

     /* init generic devices */
+    memory_region_transaction_begin();
+
     rom_set_order_override(FW_CFG_ORDER_OVERRIDE_DEVICE);
     if (qemu_opts_foreach(qemu_find_opts("device"),
                           device_init_func, NULL, NULL)) {
         exit(1);
     }

+    memory_region_transaction_commit();
+
     cpu_synchronize_all_post_init();

     rom_reset_order_override();



Just for self-education - how dirty is this hack?

This effectively postpones all dispatch trees allocation/disposal till MR
transation depth is 0 (which happens in this
memory_region_transaction_commit()), for 500 virtio devices it reduces the
amount of resident RAM from 45GB to 11GB with 2GB guest, good for speed too
- time to build a machine is reduced from 4:00 to 1:20.




> 
> Paolo
> 
>> and I wonder what the proper solution would be. Thanks.
>>
>>
>> What happens here is that every virtio block device creates 2 address
>> spaces - for modern config space (called "virtio-pci-cfg-as") and
>> for busmaster (common pci thing, called after the device name,
>> in my case "virtio-blk-pci").
>>
>> Each address_space_init() updates topology for _every_ address space.
>> Every topology update (address_space_update_topology()) creates a new
>> dispatch tree - AddressSpaceDispatch with nodes (1KB) and
>> sections (48KB) and destroys the old one.
>>
>> However the dispatch destructor is postponed via RCU which does not
>> get a chance to execute until the machine is initialized but before
>> we get there, memory is not returned to the pool, and this is a lot
>> of memory which grows n^2.
>>
>>
>> Interestingly, mem_add() from exec.c is called twice:
>> as as->dispatch_listener.region_add() and
>> as as->dispatch_listener.region_nop() - I did not understand
>> the trick but it does not work if I remove the .region_nop() hook.
>> How does it work? :)
>>
>> ---
>>  exec.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/exec.c b/exec.c
>> index 01ac21e3cd..ea5f3eb209 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2707,7 +2707,7 @@ static void mem_commit(MemoryListener *listener)
>>  
>>      atomic_rcu_set(&as->dispatch, next);
>>      if (cur) {
>> -        call_rcu(cur, address_space_dispatch_free, rcu);
>> +        address_space_dispatch_free(cur);
>>      }
>>  }
>>  
>>
>
diff mbox

Patch

diff --git a/exec.c b/exec.c
index 01ac21e3cd..ea5f3eb209 100644
--- a/exec.c
+++ b/exec.c
@@ -2707,7 +2707,7 @@  static void mem_commit(MemoryListener *listener)
 
     atomic_rcu_set(&as->dispatch, next);
     if (cur) {
-        call_rcu(cur, address_space_dispatch_free, rcu);
+        address_space_dispatch_free(cur);
     }
 }