Patchwork memory: Reintroduce dirty flag to optimize changes on disabled regions

login
register
mail settings
Submitter Jan Kiszka
Date Nov. 4, 2012, 8:30 a.m.
Message ID <509627A2.3040509@web.de>
Download mbox | patch
Permalink /patch/196983/
State New
Headers show

Comments

Jan Kiszka - Nov. 4, 2012, 8:30 a.m.
From: Jan Kiszka <jan.kiszka@siemens.com>

Cirrus is triggering this, e.g. during Win2k boot: Changes only on
disabled regions require no topology update when transaction depth drops
to 0 again.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 memory.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)
Avi Kivity - Nov. 4, 2012, 7:21 p.m.
On 11/04/2012 10:30 AM, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
> disabled regions require no topology update when transaction depth drops
> to 0 again.

817dcc5368988b0 (pci: give each device its own address space) mad this
much worse by multiplying the number of address spaces.  Each change is
now evaluated N+2 times, where N is the number of PCI devices.  It also
causes a corresponding expansion in memory usage.

I want to address this by caching AddressSpaceDispatch trees with the
key being the contents of the FlatView for that address space.  This
will drop the number of distinct trees to 2-4 (3 if some devices have
PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
from the cpu memory address space) but will fail if we make each address
space different (for example filtering out the device's own BARs).

If this change also improves cpu usage sufficiently, then it will be
better than your patch, which doesn't recognize changes in an enabled
region inside a disabled or hidden region.  In other words, your patch
fits the problem at hand but isn't general.  On the other hand my
approach doesn't eliminate render_memory_region(), just the exec.c stuff
and listener updates.  So we need to understand where the slowness comes
from.

>  
> @@ -1543,6 +1554,7 @@ void address_space_init(AddressSpace *as, MemoryRegion *root)
>      flatview_init(as->current_map);
>      QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
>      as->name = NULL;
> +    memory_region_update_pending = true;
>      memory_region_transaction_commit();
>      address_space_init_dispatch(as);
>  }
> @@ -1552,6 +1564,7 @@ void address_space_destroy(AddressSpace *as)
>      /* Flush out anything from MemoryListeners listening in on this */
>      memory_region_transaction_begin();
>      as->root = NULL;
> +    memory_region_update_pending = true;
>      memory_region_transaction_commit();
>      QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
>      address_space_destroy_dispatch(as);

init and destroy cannot happen to a region that is mapped (and cannot
happen within a transaction), so these two are redundant.
Jan Kiszka - Nov. 5, 2012, 6:26 a.m.
On 2012-11-04 20:21, Avi Kivity wrote:
> On 11/04/2012 10:30 AM, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
>> disabled regions require no topology update when transaction depth drops
>> to 0 again.
> 
> 817dcc5368988b0 (pci: give each device its own address space) mad this
> much worse by multiplying the number of address spaces.  Each change is
> now evaluated N+2 times, where N is the number of PCI devices.  It also
> causes a corresponding expansion in memory usage.

I know... But this regression predates your changes, is already visible
right after 02e2b95fb4.

> 
> I want to address this by caching AddressSpaceDispatch trees with the
> key being the contents of the FlatView for that address space.  This
> will drop the number of distinct trees to 2-4 (3 if some devices have
> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
> from the cpu memory address space) but will fail if we make each address
> space different (for example filtering out the device's own BARs).
> 
> If this change also improves cpu usage sufficiently, then it will be
> better than your patch, which doesn't recognize changes in an enabled
> region inside a disabled or hidden region.

True, though the question is how common such scenarios are. This one
(cirrus with win2k) is already special.

>  In other words, your patch
> fits the problem at hand but isn't general.  On the other hand my
> approach doesn't eliminate render_memory_region(), just the exec.c stuff
> and listener updates.  So we need to understand where the slowness comes
> from.

I would just like to have some even intermediate solution for 1.3. We
can still make it more perfect later on if required.

Jan
Avi Kivity - Nov. 5, 2012, 8:12 a.m.
On 11/05/2012 08:26 AM, Jan Kiszka wrote:
> On 2012-11-04 20:21, Avi Kivity wrote:
> > On 11/04/2012 10:30 AM, Jan Kiszka wrote:
> >> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>
> >> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
> >> disabled regions require no topology update when transaction depth drops
> >> to 0 again.
> > 
> > 817dcc5368988b0 (pci: give each device its own address space) mad this
> > much worse by multiplying the number of address spaces.  Each change is
> > now evaluated N+2 times, where N is the number of PCI devices.  It also
> > causes a corresponding expansion in memory usage.
>
> I know... But this regression predates your changes, is already visible
> right after 02e2b95fb4.
>
> > 
> > I want to address this by caching AddressSpaceDispatch trees with the
> > key being the contents of the FlatView for that address space.  This
> > will drop the number of distinct trees to 2-4 (3 if some devices have
> > PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
> > from the cpu memory address space) but will fail if we make each address
> > space different (for example filtering out the device's own BARs).
> > 
> > If this change also improves cpu usage sufficiently, then it will be
> > better than your patch, which doesn't recognize changes in an enabled
> > region inside a disabled or hidden region.
>
> True, though the question is how common such scenarios are. This one
> (cirrus with win2k) is already special.
>
> >  In other words, your patch
> > fits the problem at hand but isn't general.  On the other hand my
> > approach doesn't eliminate render_memory_region(), just the exec.c stuff
> > and listener updates.  So we need to understand where the slowness comes
> > from.
>
> I would just like to have some even intermediate solution for 1.3. We
> can still make it more perfect later on if required.
>

I think we should apply a v2 then, the more general optimizations will
take some time.
Jan Kiszka - Nov. 5, 2012, 8:51 a.m.
On 2012-11-05 09:12, Avi Kivity wrote:
> On 11/05/2012 08:26 AM, Jan Kiszka wrote:
>> On 2012-11-04 20:21, Avi Kivity wrote:
>>> On 11/04/2012 10:30 AM, Jan Kiszka wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
>>>> disabled regions require no topology update when transaction depth drops
>>>> to 0 again.
>>>
>>> 817dcc5368988b0 (pci: give each device its own address space) mad this
>>> much worse by multiplying the number of address spaces.  Each change is
>>> now evaluated N+2 times, where N is the number of PCI devices.  It also
>>> causes a corresponding expansion in memory usage.
>>
>> I know... But this regression predates your changes, is already visible
>> right after 02e2b95fb4.
>>
>>>
>>> I want to address this by caching AddressSpaceDispatch trees with the
>>> key being the contents of the FlatView for that address space.  This
>>> will drop the number of distinct trees to 2-4 (3 if some devices have
>>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
>>> from the cpu memory address space) but will fail if we make each address
>>> space different (for example filtering out the device's own BARs).
>>>
>>> If this change also improves cpu usage sufficiently, then it will be
>>> better than your patch, which doesn't recognize changes in an enabled
>>> region inside a disabled or hidden region.
>>
>> True, though the question is how common such scenarios are. This one
>> (cirrus with win2k) is already special.
>>
>>>  In other words, your patch
>>> fits the problem at hand but isn't general.  On the other hand my
>>> approach doesn't eliminate render_memory_region(), just the exec.c stuff
>>> and listener updates.  So we need to understand where the slowness comes
>>> from.
>>
>> I would just like to have some even intermediate solution for 1.3. We
>> can still make it more perfect later on if required.
>>
> 
> I think we should apply a v2 then, the more general optimizations will
> take some time.

OK - what should v2 do differently?

Jan
Avi Kivity - Nov. 5, 2012, 12:33 p.m.
On 11/05/2012 10:51 AM, Jan Kiszka wrote:
> On 2012-11-05 09:12, Avi Kivity wrote:
> > On 11/05/2012 08:26 AM, Jan Kiszka wrote:
> >> On 2012-11-04 20:21, Avi Kivity wrote:
> >>> On 11/04/2012 10:30 AM, Jan Kiszka wrote:
> >>>> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>>>
> >>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
> >>>> disabled regions require no topology update when transaction depth drops
> >>>> to 0 again.
> >>>
> >>> 817dcc5368988b0 (pci: give each device its own address space) mad this
> >>> much worse by multiplying the number of address spaces.  Each change is
> >>> now evaluated N+2 times, where N is the number of PCI devices.  It also
> >>> causes a corresponding expansion in memory usage.
> >>
> >> I know... But this regression predates your changes, is already visible
> >> right after 02e2b95fb4.
> >>
> >>>
> >>> I want to address this by caching AddressSpaceDispatch trees with the
> >>> key being the contents of the FlatView for that address space.  This
> >>> will drop the number of distinct trees to 2-4 (3 if some devices have
> >>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
> >>> from the cpu memory address space) but will fail if we make each address
> >>> space different (for example filtering out the device's own BARs).
> >>>
> >>> If this change also improves cpu usage sufficiently, then it will be
> >>> better than your patch, which doesn't recognize changes in an enabled
> >>> region inside a disabled or hidden region.
> >>
> >> True, though the question is how common such scenarios are. This one
> >> (cirrus with win2k) is already special.
> >>
> >>>  In other words, your patch
> >>> fits the problem at hand but isn't general.  On the other hand my
> >>> approach doesn't eliminate render_memory_region(), just the exec.c stuff
> >>> and listener updates.  So we need to understand where the slowness comes
> >>> from.
> >>
> >> I would just like to have some even intermediate solution for 1.3. We
> >> can still make it more perfect later on if required.
> >>
> > 
> > I think we should apply a v2 then, the more general optimizations will
> > take some time.
>
> OK - what should v2 do differently?
>

As I noted, init and destroy cannot cause a topology update.
Jan Kiszka - Nov. 5, 2012, 12:37 p.m.
On 2012-11-05 13:33, Avi Kivity wrote:
> On 11/05/2012 10:51 AM, Jan Kiszka wrote:
>> On 2012-11-05 09:12, Avi Kivity wrote:
>>> On 11/05/2012 08:26 AM, Jan Kiszka wrote:
>>>> On 2012-11-04 20:21, Avi Kivity wrote:
>>>>> On 11/04/2012 10:30 AM, Jan Kiszka wrote:
>>>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>
>>>>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on
>>>>>> disabled regions require no topology update when transaction depth drops
>>>>>> to 0 again.
>>>>>
>>>>> 817dcc5368988b0 (pci: give each device its own address space) mad this
>>>>> much worse by multiplying the number of address spaces.  Each change is
>>>>> now evaluated N+2 times, where N is the number of PCI devices.  It also
>>>>> causes a corresponding expansion in memory usage.
>>>>
>>>> I know... But this regression predates your changes, is already visible
>>>> right after 02e2b95fb4.
>>>>
>>>>>
>>>>> I want to address this by caching AddressSpaceDispatch trees with the
>>>>> key being the contents of the FlatView for that address space.  This
>>>>> will drop the number of distinct trees to 2-4 (3 if some devices have
>>>>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
>>>>> from the cpu memory address space) but will fail if we make each address
>>>>> space different (for example filtering out the device's own BARs).
>>>>>
>>>>> If this change also improves cpu usage sufficiently, then it will be
>>>>> better than your patch, which doesn't recognize changes in an enabled
>>>>> region inside a disabled or hidden region.
>>>>
>>>> True, though the question is how common such scenarios are. This one
>>>> (cirrus with win2k) is already special.
>>>>
>>>>>  In other words, your patch
>>>>> fits the problem at hand but isn't general.  On the other hand my
>>>>> approach doesn't eliminate render_memory_region(), just the exec.c stuff
>>>>> and listener updates.  So we need to understand where the slowness comes
>>>>> from.
>>>>
>>>> I would just like to have some even intermediate solution for 1.3. We
>>>> can still make it more perfect later on if required.
>>>>
>>>
>>> I think we should apply a v2 then, the more general optimizations will
>>> take some time.
>>
>> OK - what should v2 do differently?
>>
> 
> As I noted, init and destroy cannot cause a topology update.

Ah, right. Why are we wrapping them in transaction_begin/commit at all then?

Jan
Avi Kivity - Nov. 25, 2012, 10:18 a.m.
On 11/05/2012 02:37 PM, Jan Kiszka wrote:
>> 
>> As I noted, init and destroy cannot cause a topology update.
> 
> Ah, right. Why are we wrapping them in transaction_begin/commit at all then?
> 

We aren't.


void memory_region_destroy(MemoryRegion *mr)
{
    assert(QTAILQ_EMPTY(&mr->subregions));
    assert(memory_region_transaction_depth == 0);
Jan Kiszka - Nov. 25, 2012, 10:45 a.m.
On 2012-11-25 11:18, Avi Kivity wrote:
> On 11/05/2012 02:37 PM, Jan Kiszka wrote:
>>>
>>> As I noted, init and destroy cannot cause a topology update.
>>
>> Ah, right. Why are we wrapping them in transaction_begin/commit at all then?
>>
> 
> We aren't.
> 
> 
> void memory_region_destroy(MemoryRegion *mr)
> {
>     assert(QTAILQ_EMPTY(&mr->subregions));
>     assert(memory_region_transaction_depth == 0);
> 

We were talking about address_space_init/destroy.

Jan
Avi Kivity - Nov. 25, 2012, 12:47 p.m.
On 11/25/2012 12:45 PM, Jan Kiszka wrote:
> On 2012-11-25 11:18, Avi Kivity wrote:
>> On 11/05/2012 02:37 PM, Jan Kiszka wrote:
>>>>
>>>> As I noted, init and destroy cannot cause a topology update.
>>>
>>> Ah, right. Why are we wrapping them in transaction_begin/commit at all then?
>>>
>> 
>> We aren't.
>> 
>> 
>> void memory_region_destroy(MemoryRegion *mr)
>> {
>>     assert(QTAILQ_EMPTY(&mr->subregions));
>>     assert(memory_region_transaction_depth == 0);
>> 
> 
> We were talking about address_space_init/destroy.

This is to force a re-rendering of the address space, so that listeners
see the construction/destruction.  Simply assigning as->root wouldn't do
that.

This kind of reliance on side effects should be documented with a
comment (and forbidden to anything that is outside the implementation).
 My bad.

Patch

diff --git a/memory.c b/memory.c
index d5150f8..122a58e 100644
--- a/memory.c
+++ b/memory.c
@@ -22,7 +22,8 @@ 
 
 #include "memory-internal.h"
 
-unsigned memory_region_transaction_depth = 0;
+static unsigned memory_region_transaction_depth;
+static bool memory_region_update_pending;
 static bool global_dirty_log = false;
 
 static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners
@@ -741,7 +742,8 @@  void memory_region_transaction_commit(void)
 
     assert(memory_region_transaction_depth);
     --memory_region_transaction_depth;
-    if (!memory_region_transaction_depth) {
+    if (!memory_region_transaction_depth && memory_region_update_pending) {
+        memory_region_update_pending = false;
         MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
 
         QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
@@ -1060,6 +1062,7 @@  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 
     memory_region_transaction_begin();
     mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask);
+    memory_region_update_pending |= mr->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1097,6 +1100,7 @@  void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
     if (mr->readonly != readonly) {
         memory_region_transaction_begin();
         mr->readonly = readonly;
+        memory_region_update_pending |= mr->enabled;
         memory_region_transaction_commit();
     }
 }
@@ -1106,6 +1110,7 @@  void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable)
     if (mr->readable != readable) {
         memory_region_transaction_begin();
         mr->readable = readable;
+        memory_region_update_pending |= mr->enabled;
         memory_region_transaction_commit();
     }
 }
@@ -1248,6 +1253,7 @@  void memory_region_add_eventfd(MemoryRegion *mr,
     memmove(&mr->ioeventfds[i+1], &mr->ioeventfds[i],
             sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i));
     mr->ioeventfds[i] = mrfd;
+    memory_region_update_pending |= mr->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1280,6 +1286,7 @@  void memory_region_del_eventfd(MemoryRegion *mr,
     --mr->ioeventfd_nb;
     mr->ioeventfds = g_realloc(mr->ioeventfds,
                                   sizeof(*mr->ioeventfds)*mr->ioeventfd_nb + 1);
+    memory_region_update_pending |= mr->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1323,6 +1330,7 @@  static void memory_region_add_subregion_common(MemoryRegion *mr,
     }
     QTAILQ_INSERT_TAIL(&mr->subregions, subregion, subregions_link);
 done:
+    memory_region_update_pending |= mr->enabled && subregion->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1353,6 +1361,7 @@  void memory_region_del_subregion(MemoryRegion *mr,
     assert(subregion->parent == mr);
     subregion->parent = NULL;
     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
+    memory_region_update_pending |= mr->enabled && subregion->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1363,6 +1372,7 @@  void memory_region_set_enabled(MemoryRegion *mr, bool enabled)
     }
     memory_region_transaction_begin();
     mr->enabled = enabled;
+    memory_region_update_pending = true;
     memory_region_transaction_commit();
 }
 
@@ -1397,6 +1407,7 @@  void memory_region_set_alias_offset(MemoryRegion *mr, hwaddr offset)
 
     memory_region_transaction_begin();
     mr->alias_offset = offset;
+    memory_region_update_pending |= mr->enabled;
     memory_region_transaction_commit();
 }
 
@@ -1543,6 +1554,7 @@  void address_space_init(AddressSpace *as, MemoryRegion *root)
     flatview_init(as->current_map);
     QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
     as->name = NULL;
+    memory_region_update_pending = true;
     memory_region_transaction_commit();
     address_space_init_dispatch(as);
 }
@@ -1552,6 +1564,7 @@  void address_space_destroy(AddressSpace *as)
     /* Flush out anything from MemoryListeners listening in on this */
     memory_region_transaction_begin();
     as->root = NULL;
+    memory_region_update_pending = true;
     memory_region_transaction_commit();
     QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
     address_space_destroy_dispatch(as);