Patchwork Fix phys memory client - pass guest physical address not region offset

Submitter Alex Williamson
Date April 29, 2011, 3:15 a.m.
Message ID <20110429031437.3796.49456.stgit@s20.home>
Permalink /patch/93375/
State New

Comments

Alex Williamson - April 29, 2011, 3:15 a.m.
When we're trying to get a newly registered phys memory client updated
with the current page mappings, we end up passing the region offset
(a ram_addr_t) as the start address rather than the actual guest
physical memory address (target_phys_addr_t).  If your guest has less
than 3.5G of memory, these are coincidentally the same thing.  If
there's more, the region offset for the memory above 4G starts over
at 0, so the set_memory client will overwrite its lower memory entries.

Instead, keep track of the guest physical address as we're walking the
tables and pass that to the set_memory client.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 exec.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)
Michael S. Tsirkin - April 29, 2011, 3:06 p.m.
On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> When we're trying to get a newly registered phys memory client updated
> with the current page mappings, we end up passing the region offset
> (a ram_addr_t) as the start address rather than the actual guest
> physical memory address (target_phys_addr_t).  If your guest has less
> than 3.5G of memory, these are coincidentally the same thing.  If
> there's more, the region offset for the memory above 4G starts over
> at 0, so the set_memory client will overwrite it's lower memory entries.
> 
> Instead, keep track of the guest phsyical address as we're walking the
> tables and pass that to the set_memory client.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

Given all this, can you tell how much time it takes
to hotplug a device with, say, a 40G RAM guest?
> ---
> 
>  exec.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 4752af1..e670929 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1742,7 +1742,7 @@ static int cpu_notify_migration_log(int enable)
>  }
>  
>  static void phys_page_for_each_1(CPUPhysMemoryClient *client,
> -                                 int level, void **lp)
> +                                 int level, void **lp, target_phys_addr_t addr)
>  {
>      int i;
>  
> @@ -1751,16 +1751,18 @@ static void phys_page_for_each_1(CPUPhysMemoryClient *client,
>      }
>      if (level == 0) {
>          PhysPageDesc *pd = *lp;
> +        addr <<= L2_BITS + TARGET_PAGE_BITS;
>          for (i = 0; i < L2_SIZE; ++i) {
>              if (pd[i].phys_offset != IO_MEM_UNASSIGNED) {
> -                client->set_memory(client, pd[i].region_offset,
> +                client->set_memory(client, addr | i << TARGET_PAGE_BITS,
>                                     TARGET_PAGE_SIZE, pd[i].phys_offset);
>              }
>          }
>      } else {
>          void **pp = *lp;
>          for (i = 0; i < L2_SIZE; ++i) {
> -            phys_page_for_each_1(client, level - 1, pp + i);
> +            phys_page_for_each_1(client, level - 1, pp + i,
> +                                 (addr << L2_BITS) | i);
>          }
>      }
>  }
> @@ -1770,7 +1772,7 @@ static void phys_page_for_each(CPUPhysMemoryClient *client)
>      int i;
>      for (i = 0; i < P_L1_SIZE; ++i) {
>          phys_page_for_each_1(client, P_L1_SHIFT / L2_BITS - 1,
> -                             l1_phys_map + i);
> +                             l1_phys_map + i, i);
>      }
>  }
>
Jan Kiszka - April 29, 2011, 3:29 p.m.
On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
>> When we're trying to get a newly registered phys memory client updated
>> with the current page mappings, we end up passing the region offset
>> (a ram_addr_t) as the start address rather than the actual guest
>> physical memory address (target_phys_addr_t).  If your guest has less
>> than 3.5G of memory, these are coincidentally the same thing.  If

I think this broke even with < 3.5G, as phys_offset also encodes the
memory type while region_offset does not. So everything became RAM this
way, and no MMIO was announced.

>> there's more, the region offset for the memory above 4G starts over
>> at 0, so the set_memory client will overwrite it's lower memory entries.
>>
>> Instead, keep track of the guest phsyical address as we're walking the
>> tables and pass that to the set_memory client.
>>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Given all this, can yo tell how much time does
> it take to hotplug a device with, say, a 40G RAM guest?

Why not collect pages of identical types and report them as one chunk
once the type changes?

Jan
Michael S. Tsirkin - April 29, 2011, 3:34 p.m.
On Fri, Apr 29, 2011 at 05:29:06PM +0200, Jan Kiszka wrote:
> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> > On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> >> When we're trying to get a newly registered phys memory client updated
> >> with the current page mappings, we end up passing the region offset
> >> (a ram_addr_t) as the start address rather than the actual guest
> >> physical memory address (target_phys_addr_t).  If your guest has less
> >> than 3.5G of memory, these are coincidentally the same thing.  If
> 
> I think this broke even with < 3.5G as phys_offset also encodes the
> memory type while region_offset does not. So everything became RAMthis
> way, no MMIO was announced.
> 
> >> there's more, the region offset for the memory above 4G starts over
> >> at 0, so the set_memory client will overwrite it's lower memory entries.
> >>
> >> Instead, keep track of the guest phsyical address as we're walking the
> >> tables and pass that to the set_memory client.
> >>
> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > Given all this, can yo tell how much time does
> > it take to hotplug a device with, say, a 40G RAM guest?
> 
> Why not collect pages of identical types and report them as one chunk
> once the type changes?

Sure, but before we bother to optimize this, is this too slow?

> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
Alex Williamson - April 29, 2011, 3:38 p.m.
On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> > On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> >> When we're trying to get a newly registered phys memory client updated
> >> with the current page mappings, we end up passing the region offset
> >> (a ram_addr_t) as the start address rather than the actual guest
> >> physical memory address (target_phys_addr_t).  If your guest has less
> >> than 3.5G of memory, these are coincidentally the same thing.  If
> 
> I think this broke even with < 3.5G as phys_offset also encodes the
> memory type while region_offset does not. So everything became RAMthis
> way, no MMIO was announced.
> 
> >> there's more, the region offset for the memory above 4G starts over
> >> at 0, so the set_memory client will overwrite it's lower memory entries.
> >>
> >> Instead, keep track of the guest phsyical address as we're walking the
> >> tables and pass that to the set_memory client.
> >>
> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > Given all this, can yo tell how much time does
> > it take to hotplug a device with, say, a 40G RAM guest?
> 
> Why not collect pages of identical types and report them as one chunk
> once the type changes?

Good idea, I'll see if I can code that up.  I don't have a terribly
large system to test with, but with an 8G guest, it's surprisingly not
very noticeable.  For vfio, I intend to only have one memory client, so
adding additional devices won't have to rescan everything.  The memory
overhead of keeping the list that the memory client creates is probably
also low enough that it isn't worthwhile to tear it all down if all the
devices are removed.  Thanks,

Alex
Alex Williamson - April 29, 2011, 3:41 p.m.
On Fri, 2011-04-29 at 18:34 +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 29, 2011 at 05:29:06PM +0200, Jan Kiszka wrote:
> > On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> > > On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> > >> When we're trying to get a newly registered phys memory client updated
> > >> with the current page mappings, we end up passing the region offset
> > >> (a ram_addr_t) as the start address rather than the actual guest
> > >> physical memory address (target_phys_addr_t).  If your guest has less
> > >> than 3.5G of memory, these are coincidentally the same thing.  If
> > 
> > I think this broke even with < 3.5G as phys_offset also encodes the
> > memory type while region_offset does not. So everything became RAMthis
> > way, no MMIO was announced.
> > 
> > >> there's more, the region offset for the memory above 4G starts over
> > >> at 0, so the set_memory client will overwrite it's lower memory entries.
> > >>
> > >> Instead, keep track of the guest phsyical address as we're walking the
> > >> tables and pass that to the set_memory client.
> > >>
> > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > 
> > > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > Given all this, can yo tell how much time does
> > > it take to hotplug a device with, say, a 40G RAM guest?
> > 
> > Why not collect pages of identical types and report them as one chunk
> > once the type changes?
> 
> Sure, but before we bother to optimize this, is this too slow?

At a set_memory call per 4k page, it's probably worthwhile to factor in
some simple optimizations.  My set_memory callback was being hit 10^6
times.  Thanks,

Alex
Jan Kiszka - April 29, 2011, 3:45 p.m.
On 2011-04-29 17:38, Alex Williamson wrote:
> On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
>> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
>>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
>>>> When we're trying to get a newly registered phys memory client updated
>>>> with the current page mappings, we end up passing the region offset
>>>> (a ram_addr_t) as the start address rather than the actual guest
>>>> physical memory address (target_phys_addr_t).  If your guest has less
>>>> than 3.5G of memory, these are coincidentally the same thing.  If
>>
>> I think this broke even with < 3.5G as phys_offset also encodes the
>> memory type while region_offset does not. So everything became RAMthis
>> way, no MMIO was announced.
>>
>>>> there's more, the region offset for the memory above 4G starts over
>>>> at 0, so the set_memory client will overwrite it's lower memory entries.
>>>>
>>>> Instead, keep track of the guest phsyical address as we're walking the
>>>> tables and pass that to the set_memory client.
>>>>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>
>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>
>>> Given all this, can yo tell how much time does
>>> it take to hotplug a device with, say, a 40G RAM guest?
>>
>> Why not collect pages of identical types and report them as one chunk
>> once the type changes?
> 
> Good idea, I'll see if I can code that up.  I don't have a terribly
> large system to test with, but with an 8G guest, it's surprisingly not
> very noticeable.  For vfio, I intend to only have one memory client, so
> adding additional devices won't have to rescan everything.  The memory
> overhead of keeping the list that the memory client creates is probably
> also low enough that it isn't worthwhile to tear it all down if all the
> devices are removed.  Thanks,

What other clients register late? Do they need to know the whole memory
layout?

This full page table walk is likely a latency killer, as it happens under
the global lock. Ugly.

Jan
Alex Williamson - April 29, 2011, 3:55 p.m.
On Fri, 2011-04-29 at 17:45 +0200, Jan Kiszka wrote:
> On 2011-04-29 17:38, Alex Williamson wrote:
> > On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
> >> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> >>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> >>>> When we're trying to get a newly registered phys memory client updated
> >>>> with the current page mappings, we end up passing the region offset
> >>>> (a ram_addr_t) as the start address rather than the actual guest
> >>>> physical memory address (target_phys_addr_t).  If your guest has less
> >>>> than 3.5G of memory, these are coincidentally the same thing.  If
> >>
> >> I think this broke even with < 3.5G as phys_offset also encodes the
> >> memory type while region_offset does not. So everything became RAMthis
> >> way, no MMIO was announced.
> >>
> >>>> there's more, the region offset for the memory above 4G starts over
> >>>> at 0, so the set_memory client will overwrite it's lower memory entries.
> >>>>
> >>>> Instead, keep track of the guest phsyical address as we're walking the
> >>>> tables and pass that to the set_memory client.
> >>>>
> >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>>
> >>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >>>
> >>> Given all this, can yo tell how much time does
> >>> it take to hotplug a device with, say, a 40G RAM guest?
> >>
> >> Why not collect pages of identical types and report them as one chunk
> >> once the type changes?
> > 
> > Good idea, I'll see if I can code that up.  I don't have a terribly
> > large system to test with, but with an 8G guest, it's surprisingly not
> > very noticeable.  For vfio, I intend to only have one memory client, so
> > adding additional devices won't have to rescan everything.  The memory
> > overhead of keeping the list that the memory client creates is probably
> > also low enough that it isn't worthwhile to tear it all down if all the
> > devices are removed.  Thanks,
> 
> What other clients register late? Do the need to know to whole memory
> layout?
> 
> This full page table walk is likely a latency killer as it happens under
> global lock. Ugly.

vhost and kvm are the only current users.  kvm registers its client
early enough that there's no memory registered, so it doesn't really need
this replay through the page table walk.  I'm not sure how vhost works
currently.  I'm also looking at using this for vfio to register pages
for the iommu.

Alex
Jan Kiszka - April 29, 2011, 4:07 p.m.
On 2011-04-29 17:55, Alex Williamson wrote:
> On Fri, 2011-04-29 at 17:45 +0200, Jan Kiszka wrote:
>> On 2011-04-29 17:38, Alex Williamson wrote:
>>> On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
>>>> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
>>>>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
>>>>>> When we're trying to get a newly registered phys memory client updated
>>>>>> with the current page mappings, we end up passing the region offset
>>>>>> (a ram_addr_t) as the start address rather than the actual guest
>>>>>> physical memory address (target_phys_addr_t).  If your guest has less
>>>>>> than 3.5G of memory, these are coincidentally the same thing.  If
>>>>
>>>> I think this broke even with < 3.5G as phys_offset also encodes the
>>>> memory type while region_offset does not. So everything became RAMthis
>>>> way, no MMIO was announced.
>>>>
>>>>>> there's more, the region offset for the memory above 4G starts over
>>>>>> at 0, so the set_memory client will overwrite it's lower memory entries.
>>>>>>
>>>>>> Instead, keep track of the guest phsyical address as we're walking the
>>>>>> tables and pass that to the set_memory client.
>>>>>>
>>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>>
>>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> Given all this, can yo tell how much time does
>>>>> it take to hotplug a device with, say, a 40G RAM guest?
>>>>
>>>> Why not collect pages of identical types and report them as one chunk
>>>> once the type changes?
>>>
>>> Good idea, I'll see if I can code that up.  I don't have a terribly
>>> large system to test with, but with an 8G guest, it's surprisingly not
>>> very noticeable.  For vfio, I intend to only have one memory client, so
>>> adding additional devices won't have to rescan everything.  The memory
>>> overhead of keeping the list that the memory client creates is probably
>>> also low enough that it isn't worthwhile to tear it all down if all the
>>> devices are removed.  Thanks,
>>
>> What other clients register late? Do the need to know to whole memory
>> layout?
>>
>> This full page table walk is likely a latency killer as it happens under
>> global lock. Ugly.
> 
> vhost and kvm are the only current users.  kvm registers it's client
> early enough that there's no memory registered, so doesn't really need
> this replay through the page table walk.  I'm not sure how vhost works
> currently.  I'm also looking at using this for vfio to register pages
> for the iommu.

Hmm, it looks like vhost is basically recreating the condensed, slotted
memory layout from the per-page reports now. A bit inefficient,
especially as this happens per vhost device, no? And if vfio preferred
a slotted format as well, you would end up copying vhost logic.

That sounds to me like the qemu core should start tracking slots and
report slot changes, not memory region registrations.

Jan
Alex Williamson - April 29, 2011, 4:20 p.m.
On Fri, 2011-04-29 at 18:07 +0200, Jan Kiszka wrote:
> On 2011-04-29 17:55, Alex Williamson wrote:
> > On Fri, 2011-04-29 at 17:45 +0200, Jan Kiszka wrote:
> >> On 2011-04-29 17:38, Alex Williamson wrote:
> >>> On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
> >>>> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> >>>>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> >>>>>> When we're trying to get a newly registered phys memory client updated
> >>>>>> with the current page mappings, we end up passing the region offset
> >>>>>> (a ram_addr_t) as the start address rather than the actual guest
> >>>>>> physical memory address (target_phys_addr_t).  If your guest has less
> >>>>>> than 3.5G of memory, these are coincidentally the same thing.  If
> >>>>
> >>>> I think this broke even with < 3.5G as phys_offset also encodes the
> >>>> memory type while region_offset does not. So everything became RAMthis
> >>>> way, no MMIO was announced.
> >>>>
> >>>>>> there's more, the region offset for the memory above 4G starts over
> >>>>>> at 0, so the set_memory client will overwrite it's lower memory entries.
> >>>>>>
> >>>>>> Instead, keep track of the guest phsyical address as we're walking the
> >>>>>> tables and pass that to the set_memory client.
> >>>>>>
> >>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>>>>
> >>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>
> >>>>> Given all this, can yo tell how much time does
> >>>>> it take to hotplug a device with, say, a 40G RAM guest?
> >>>>
> >>>> Why not collect pages of identical types and report them as one chunk
> >>>> once the type changes?
> >>>
> >>> Good idea, I'll see if I can code that up.  I don't have a terribly
> >>> large system to test with, but with an 8G guest, it's surprisingly not
> >>> very noticeable.  For vfio, I intend to only have one memory client, so
> >>> adding additional devices won't have to rescan everything.  The memory
> >>> overhead of keeping the list that the memory client creates is probably
> >>> also low enough that it isn't worthwhile to tear it all down if all the
> >>> devices are removed.  Thanks,
> >>
> >> What other clients register late? Do the need to know to whole memory
> >> layout?
> >>
> >> This full page table walk is likely a latency killer as it happens under
> >> global lock. Ugly.
> > 
> > vhost and kvm are the only current users.  kvm registers it's client
> > early enough that there's no memory registered, so doesn't really need
> > this replay through the page table walk.  I'm not sure how vhost works
> > currently.  I'm also looking at using this for vfio to register pages
> > for the iommu.
> 
> Hmm, it looks like vhost is basically recreating the condensed, slotted
> memory layout from the per-page reports now. A bit inefficient,
> specifically as this happens per vhost device, no? And if vfio preferred
> a slotted format as well, you would end up copying vhost logic.
> 
> That sounds to me like the qemu core should start tracking slots and
> report slot changes, not memory region registrations.

I was thinking the same thing, but I think Michael is concerned that we'll
each need slightly different lists.  This is also where kvm is mapping
to a fixed array of slots, which is known to blow up with too many
assigned devices.  That needs to be fixed on both the kernel and qemu side.
Runtime overhead of the phys memory client is pretty minimal; it's just
the startup that thrashes set_memory.

Alex
Jan Kiszka - April 29, 2011, 4:31 p.m.
On 2011-04-29 18:20, Alex Williamson wrote:
> On Fri, 2011-04-29 at 18:07 +0200, Jan Kiszka wrote:
>> On 2011-04-29 17:55, Alex Williamson wrote:
>>> On Fri, 2011-04-29 at 17:45 +0200, Jan Kiszka wrote:
>>>> On 2011-04-29 17:38, Alex Williamson wrote:
>>>>> On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
>>>>>> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
>>>>>>>> When we're trying to get a newly registered phys memory client updated
>>>>>>>> with the current page mappings, we end up passing the region offset
>>>>>>>> (a ram_addr_t) as the start address rather than the actual guest
>>>>>>>> physical memory address (target_phys_addr_t).  If your guest has less
>>>>>>>> than 3.5G of memory, these are coincidentally the same thing.  If
>>>>>>
>>>>>> I think this broke even with < 3.5G as phys_offset also encodes the
>>>>>> memory type while region_offset does not. So everything became RAMthis
>>>>>> way, no MMIO was announced.
>>>>>>
>>>>>>>> there's more, the region offset for the memory above 4G starts over
>>>>>>>> at 0, so the set_memory client will overwrite it's lower memory entries.
>>>>>>>>
>>>>>>>> Instead, keep track of the guest phsyical address as we're walking the
>>>>>>>> tables and pass that to the set_memory client.
>>>>>>>>
>>>>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>>>>
>>>>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>
>>>>>>> Given all this, can yo tell how much time does
>>>>>>> it take to hotplug a device with, say, a 40G RAM guest?
>>>>>>
>>>>>> Why not collect pages of identical types and report them as one chunk
>>>>>> once the type changes?
>>>>>
>>>>> Good idea, I'll see if I can code that up.  I don't have a terribly
>>>>> large system to test with, but with an 8G guest, it's surprisingly not
>>>>> very noticeable.  For vfio, I intend to only have one memory client, so
>>>>> adding additional devices won't have to rescan everything.  The memory
>>>>> overhead of keeping the list that the memory client creates is probably
>>>>> also low enough that it isn't worthwhile to tear it all down if all the
>>>>> devices are removed.  Thanks,
>>>>
>>>> What other clients register late? Do the need to know to whole memory
>>>> layout?
>>>>
>>>> This full page table walk is likely a latency killer as it happens under
>>>> global lock. Ugly.
>>>
>>> vhost and kvm are the only current users.  kvm registers it's client
>>> early enough that there's no memory registered, so doesn't really need
>>> this replay through the page table walk.  I'm not sure how vhost works
>>> currently.  I'm also looking at using this for vfio to register pages
>>> for the iommu.
>>
>> Hmm, it looks like vhost is basically recreating the condensed, slotted
>> memory layout from the per-page reports now. A bit inefficient,
>> specifically as this happens per vhost device, no? And if vfio preferred
>> a slotted format as well, you would end up copying vhost logic.
>>
>> That sounds to me like the qemu core should start tracking slots and
>> report slot changes, not memory region registrations.
> 
> I was thinking the same thing, but I think Michael is concerned if we'll
> each need slightly different lists.  This is also where kvm is mapping
> to a fixed array of slots, which is know to blow-up with too many
> assigned devices.  Needs to be fixed on both kernel and qemu side.
> Runtime overhead of the phys memory client is pretty minimal, it's just
> the startup that thrashes set_memory.

I'm not just concerned about the runtime overhead. This is code
duplication. Even if the formats of the lists differ, their structure
should not: one entry per contiguous memory region, with some lists
tracking sparsely based on their interests.

I'm sure the core could be taught to help the clients create and
maintain such lists. We already have two types of users in tree, you
are about to create another one, and Xen should have some need for it as
well.

Jan
Michael S. Tsirkin - May 1, 2011, 10:29 a.m.
On Fri, Apr 29, 2011 at 06:31:03PM +0200, Jan Kiszka wrote:
> On 2011-04-29 18:20, Alex Williamson wrote:
> > On Fri, 2011-04-29 at 18:07 +0200, Jan Kiszka wrote:
> >> On 2011-04-29 17:55, Alex Williamson wrote:
> >>> On Fri, 2011-04-29 at 17:45 +0200, Jan Kiszka wrote:
> >>>> On 2011-04-29 17:38, Alex Williamson wrote:
> >>>>> On Fri, 2011-04-29 at 17:29 +0200, Jan Kiszka wrote:
> >>>>>> On 2011-04-29 17:06, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Apr 28, 2011 at 09:15:23PM -0600, Alex Williamson wrote:
> >>>>>>>> When we're trying to get a newly registered phys memory client updated
> >>>>>>>> with the current page mappings, we end up passing the region offset
> >>>>>>>> (a ram_addr_t) as the start address rather than the actual guest
> >>>>>>>> physical memory address (target_phys_addr_t).  If your guest has less
> >>>>>>>> than 3.5G of memory, these are coincidentally the same thing.  If
> >>>>>>
> >>>>>> I think this broke even with < 3.5G as phys_offset also encodes the
> >>>>>> memory type while region_offset does not. So everything became RAMthis
> >>>>>> way, no MMIO was announced.
> >>>>>>
> >>>>>>>> there's more, the region offset for the memory above 4G starts over
> >>>>>>>> at 0, so the set_memory client will overwrite it's lower memory entries.
> >>>>>>>>
> >>>>>>>> Instead, keep track of the guest phsyical address as we're walking the
> >>>>>>>> tables and pass that to the set_memory client.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>>>>>>
> >>>>>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>
> >>>>>>> Given all this, can yo tell how much time does
> >>>>>>> it take to hotplug a device with, say, a 40G RAM guest?
> >>>>>>
> >>>>>> Why not collect pages of identical types and report them as one chunk
> >>>>>> once the type changes?
> >>>>>
> >>>>> Good idea, I'll see if I can code that up.  I don't have a terribly
> >>>>> large system to test with, but with an 8G guest, it's surprisingly not
> >>>>> very noticeable.  For vfio, I intend to only have one memory client, so
> >>>>> adding additional devices won't have to rescan everything.  The memory
> >>>>> overhead of keeping the list that the memory client creates is probably
> >>>>> also low enough that it isn't worthwhile to tear it all down if all the
> >>>>> devices are removed.  Thanks,
> >>>>
> >>>> What other clients register late? Do the need to know to whole memory
> >>>> layout?
> >>>>
> >>>> This full page table walk is likely a latency killer as it happens under
> >>>> global lock. Ugly.
> >>>
> >>> vhost and kvm are the only current users.  kvm registers it's client
> >>> early enough that there's no memory registered, so doesn't really need
> >>> this replay through the page table walk.  I'm not sure how vhost works
> >>> currently.  I'm also looking at using this for vfio to register pages
> >>> for the iommu.
> >>
> >> Hmm, it looks like vhost is basically recreating the condensed, slotted
> >> memory layout from the per-page reports now. A bit inefficient,
> >> specifically as this happens per vhost device, no? And if vfio preferred
> >> a slotted format as well, you would end up copying vhost logic.
> >>
> >> That sounds to me like the qemu core should start tracking slots and
> >> report slot changes, not memory region registrations.
> > 
> > I was thinking the same thing, but I think Michael is concerned if we'll
> > each need slightly different lists.  This is also where kvm is mapping
> > to a fixed array of slots, which is know to blow-up with too many
> > assigned devices.  Needs to be fixed on both kernel and qemu side.
> > Runtime overhead of the phys memory client is pretty minimal, it's just
> > the startup that thrashes set_memory.
> 
> I'm not just concerned about the runtime overhead. This is code
> duplication. Even if the format of the lists differ, their structure
> should not: one entry per continuous memory region, and some lists may
> track sparsely based on their interests.
> 
> I'm sure the core could be taught to help the clients creating and
> maintaining such lists. We already have two types of users in tree, you
> are about to create another one, and Xen should have some need for it as
> well.
> 
> Jan

Absolutely. There should be some common code to deal with
slots.

> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
Markus Armbruster - May 3, 2011, 1:15 p.m.
Alex Williamson <alex.williamson@redhat.com> writes:

> When we're trying to get a newly registered phys memory client updated
> with the current page mappings, we end up passing the region offset
> (a ram_addr_t) as the start address rather than the actual guest
> physical memory address (target_phys_addr_t).  If your guest has less
> than 3.5G of memory, these are coincidentally the same thing.  If
> there's more, the region offset for the memory above 4G starts over
> at 0, so the set_memory client will overwrite it's lower memory entries.
>
> Instead, keep track of the guest phsyical address as we're walking the
> tables and pass that to the set_memory client.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
>  exec.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 4752af1..e670929 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1742,7 +1742,7 @@ static int cpu_notify_migration_log(int enable)
>  }
>  
>  static void phys_page_for_each_1(CPUPhysMemoryClient *client,
> -                                 int level, void **lp)
> +                                 int level, void **lp, target_phys_addr_t addr)
>  {
>      int i;
>  

Aren't you abusing target_phys_addr_t here?  It's not a physical
address, it needs to be shifted left to become one.  By how much depends
on level.  Please take pity on future maintainers and spell this out in
a comment.

Perhaps you can code it in a way that makes the parameter an address.
Probably no need for a comment then.

> @@ -1751,16 +1751,18 @@ static void phys_page_for_each_1(CPUPhysMemoryClient *client,
>      }
>      if (level == 0) {
>          PhysPageDesc *pd = *lp;
> +        addr <<= L2_BITS + TARGET_PAGE_BITS;
>          for (i = 0; i < L2_SIZE; ++i) {
>              if (pd[i].phys_offset != IO_MEM_UNASSIGNED) {
> -                client->set_memory(client, pd[i].region_offset,
> +                client->set_memory(client, addr | i << TARGET_PAGE_BITS,
>                                     TARGET_PAGE_SIZE, pd[i].phys_offset);
>              }
>          }
>      } else {
>          void **pp = *lp;
>          for (i = 0; i < L2_SIZE; ++i) {
> -            phys_page_for_each_1(client, level - 1, pp + i);
> +            phys_page_for_each_1(client, level - 1, pp + i,
> +                                 (addr << L2_BITS) | i);
>          }
>      }
>  }
> @@ -1770,7 +1772,7 @@ static void phys_page_for_each(CPUPhysMemoryClient *client)
>      int i;
>      for (i = 0; i < P_L1_SIZE; ++i) {
>          phys_page_for_each_1(client, P_L1_SHIFT / L2_BITS - 1,
> -                             l1_phys_map + i);
> +                             l1_phys_map + i, i);
>      }
>  }
>  

Fix makes sense to me, after some head scratching.  A comment explaining
the phys map data structure would be helpful.  l1_phys_map[] has a
comment, but it's devoid of detail.
Alex Williamson - May 3, 2011, 2:20 p.m.
On Tue, 2011-05-03 at 15:15 +0200, Markus Armbruster wrote:
> Alex Williamson <alex.williamson@redhat.com> writes:
> 
> > When we're trying to get a newly registered phys memory client updated
> > with the current page mappings, we end up passing the region offset
> > (a ram_addr_t) as the start address rather than the actual guest
> > physical memory address (target_phys_addr_t).  If your guest has less
> > than 3.5G of memory, these are coincidentally the same thing.  If
> > there's more, the region offset for the memory above 4G starts over
> > at 0, so the set_memory client will overwrite its lower memory entries.
> >
> > Instead, keep track of the guest physical address as we're walking the
> > tables and pass that to the set_memory client.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> >  exec.c |   10 ++++++----
> >  1 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/exec.c b/exec.c
> > index 4752af1..e670929 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -1742,7 +1742,7 @@ static int cpu_notify_migration_log(int enable)
> >  }
> >  
> >  static void phys_page_for_each_1(CPUPhysMemoryClient *client,
> > -                                 int level, void **lp)
> > +                                 int level, void **lp, target_phys_addr_t addr)
> >  {
> >      int i;
> >  
> 
> Aren't you abusing target_phys_addr_t here?  It's not a physical
> address; it needs to be shifted left to become one.  By how much depends
> on level.  Please take pity on future maintainers and spell this out in
> a comment.
> 
> Perhaps you can code it in a way that makes the parameter an address.
> Probably no need for a comment then.

Right, it's not a target_phys_addr_t when passed into the function, but
it becomes one as we descend, so it still seemed the appropriate data
type.  I rather like how the shifting folds into the recursion of the
function; I think it removes a bit of the ugliness of figuring out how
many levels there are, where I am, and how many multiples of *_BITS to
shift.  I'll add a comment and hope that helps.

> > @@ -1751,16 +1751,18 @@ static void phys_page_for_each_1(CPUPhysMemoryClient *client,
> >      }
> >      if (level == 0) {
> >          PhysPageDesc *pd = *lp;
> > +        addr <<= L2_BITS + TARGET_PAGE_BITS;
> >          for (i = 0; i < L2_SIZE; ++i) {
> >              if (pd[i].phys_offset != IO_MEM_UNASSIGNED) {
> > -                client->set_memory(client, pd[i].region_offset,
> > +                client->set_memory(client, addr | i << TARGET_PAGE_BITS,
> >                                     TARGET_PAGE_SIZE, pd[i].phys_offset);
> >              }
> >          }
> >      } else {
> >          void **pp = *lp;
> >          for (i = 0; i < L2_SIZE; ++i) {
> > -            phys_page_for_each_1(client, level - 1, pp + i);
> > +            phys_page_for_each_1(client, level - 1, pp + i,
> > +                                 (addr << L2_BITS) | i);
> >          }
> >      }
> >  }
> > @@ -1770,7 +1772,7 @@ static void phys_page_for_each(CPUPhysMemoryClient *client)
> >      int i;
> >      for (i = 0; i < P_L1_SIZE; ++i) {
> >          phys_page_for_each_1(client, P_L1_SHIFT / L2_BITS - 1,
> > -                             l1_phys_map + i);
> > +                             l1_phys_map + i, i);
> >      }
> >  }
> >  
> 
> Fix makes sense to me, after some head scratching.  A comment explaining
> the phys map data structure would be helpful.  l1_phys_map[] has a
> comment, but it's devoid of detail.

I'll see what I can do, though I'm pretty sure I'm not at the top of the
list for describing the existence and format of these tables.  Thanks,

Alex

Patch

diff --git a/exec.c b/exec.c
index 4752af1..e670929 100644
--- a/exec.c
+++ b/exec.c
@@ -1742,7 +1742,7 @@  static int cpu_notify_migration_log(int enable)
 }
 
 static void phys_page_for_each_1(CPUPhysMemoryClient *client,
-                                 int level, void **lp)
+                                 int level, void **lp, target_phys_addr_t addr)
 {
     int i;
 
@@ -1751,16 +1751,18 @@  static void phys_page_for_each_1(CPUPhysMemoryClient *client,
     }
     if (level == 0) {
         PhysPageDesc *pd = *lp;
+        addr <<= L2_BITS + TARGET_PAGE_BITS;
         for (i = 0; i < L2_SIZE; ++i) {
             if (pd[i].phys_offset != IO_MEM_UNASSIGNED) {
-                client->set_memory(client, pd[i].region_offset,
+                client->set_memory(client, addr | i << TARGET_PAGE_BITS,
                                    TARGET_PAGE_SIZE, pd[i].phys_offset);
             }
         }
     } else {
         void **pp = *lp;
         for (i = 0; i < L2_SIZE; ++i) {
-            phys_page_for_each_1(client, level - 1, pp + i);
+            phys_page_for_each_1(client, level - 1, pp + i,
+                                 (addr << L2_BITS) | i);
         }
     }
 }
@@ -1770,7 +1772,7 @@  static void phys_page_for_each(CPUPhysMemoryClient *client)
     int i;
     for (i = 0; i < P_L1_SIZE; ++i) {
         phys_page_for_each_1(client, P_L1_SHIFT / L2_BITS - 1,
-                             l1_phys_map + i);
+                             l1_phys_map + i, i);
     }
 }