Patchwork [RFC] Exporting Guest RAM information for NUMA binding

Submitter Bharata B Rao
Date Oct. 29, 2011, 6:45 p.m.
Message ID <20111029184502.GH11038@in.ibm.com>
Permalink /patch/122550/
State New

Comments

Bharata B Rao - Oct. 29, 2011, 6:45 p.m.
Hi,

As guests become NUMA aware, it becomes important for the guests to
have correct NUMA policies when they run on NUMA aware hosts.
Currently limited support for NUMA binding is available via libvirt
where it is possible to apply a NUMA policy to the guest as a whole.
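However multinode guests would benefit if guest memory belonging to
different guest nodes are mapped appropriately to different host NUMA nodes.

To achieve this we would need QEMU to expose information about
guest RAM ranges (Guest Physical Address - GPA) and their host virtual
address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
tool like libvirt would be able to divide the guest RAM as per the guest NUMA
node geometry and bind guest memory nodes to corresponding host memory nodes
using HVA. This needs both QEMU (and libvirt) changes as well as changes
in the kernel.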
Alexander Graf - Oct. 29, 2011, 7:57 p.m.
On 29.10.2011, at 20:45, Bharata B Rao wrote:

> Hi,
> 
> As guests become NUMA aware, it becomes important for the guests to
> have correct NUMA policies when they run on NUMA aware hosts.
> Currently limited support for NUMA binding is available via libvirt
> where it is possible to apply a NUMA policy to the guest as a whole.
> However multinode guests would benefit if guest memory belonging to
> different guest nodes are mapped appropriately to different host NUMA nodes.
> 
> To achieve this we would need QEMU to expose information about
> guest RAM ranges (Guest Physical Address - GPA) and their host virtual
> address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
> tool like libvirt would be able to divide the guest RAM as per the guest NUMA
> node geometry and bind guest memory nodes to corresponding host memory nodes
> using HVA. This needs both QEMU (and libvirt) changes as well as changes
> in the kernel.

Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout.

Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in.

Imagine QEMU could tell the kernel that different parts of its virtual memory address space are supposed to be fast on different host vcpu threads. Then the kernel has all the information it needs. It could even potentially migrate memory towards a thread, whenever the scheduler determines that it's better to run a thread somewhere else.

That said, I don't disagree with your approach per se. It just sounds way too static to me and tries to overcome shortcomings we have in the Linux mm system by replacing it with hardcoded pinning logic in user space.


Alex
Vaidyanathan Srinivasan - Oct. 30, 2011, 9:32 a.m.
* Alexander Graf <agraf@suse.de> [2011-10-29 21:57:38]:

> 
> On 29.10.2011, at 20:45, Bharata B Rao wrote:
> 
> > Hi,
> > 
> > As guests become NUMA aware, it becomes important for the guests to
> > have correct NUMA policies when they run on NUMA aware hosts.
> > Currently limited support for NUMA binding is available via libvirt
> > where it is possible to apply a NUMA policy to the guest as a whole.
> > However multinode guests would benefit if guest memory belonging to
> > different guest nodes are mapped appropriately to different host NUMA nodes.
> > 
> > To achieve this we would need QEMU to expose information about
> > guest RAM ranges (Guest Physical Address - GPA) and their host virtual
> > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
> > tool like libvirt would be able to divide the guest RAM as per the guest NUMA
> > node geometry and bind guest memory nodes to corresponding host memory nodes
> > using HVA. This needs both QEMU (and libvirt) changes as well as changes
> > in the kernel.
> 
> Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that know how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout.

Yes, the motivation is to get libvirt to manage memory and numa
related policies more effectively just like we do vcpu pinning and
allocations.  We would like libvirt to know about the host numa
configurations and map it to guest memory layout to minimize
cross-node references within guest.

> Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in.

Kernel is the one implementing the policy.  Kernel cannot know guest
memory layout or expectations for that VM.  Kernel today sees guest as
a single process that obeys numactl bindings.  What this patch is
trying to do is to make the policy recommendations more fine-grained and
effective for a multi-node guest.

Qemu knows the layouts and can tell the kernel or issue the mbind()
calls to set up the numa affinity; however, qemu's assumptions could
change if libvirt enforces policies on it using cgroups and cpusets.
Hence in the proposed approach we would allow libvirt to be the policy
owner, get the required information from qemu and set the policy by
informing the kernel, just like we do for vcpus today.

> Imagine QEMU could tell the kernel that different parts of its virtual memory address space are supposed to be fast on different host vcpu threads. Then the kernel has all the information it needs. It could even potentially migrate memory towards a thread, whenever the scheduler determines that it's better to run a thread somewhere else.

Migrating memory near to vcpu or scheduling vcpus closer to the memory
node is a good approach as proposed by Andrea Arcangeli as autonuma.
That could be one of the policies that libvirt can choose for a given
scenario.

> That said, I don't disagree with your approach per se. It just sounds way too static to me and tries to overcome shortcomings we have in the Linux mm system by replacing it with hardcoded pinning logic in user space.
 
Thanks for the review, I agree that fine control on memory and cpu
pinning needs to be used carefully to get the desired positive effect.
This proposal is a good first step to handle multi-node guest
effectively compared to the default policies that are available today.

--Vaidy
Chris Wright - Nov. 8, 2011, 5:33 p.m.
* Alexander Graf (agraf@suse.de) wrote:
> On 29.10.2011, at 20:45, Bharata B Rao wrote:
> > As guests become NUMA aware, it becomes important for the guests to
> > have correct NUMA policies when they run on NUMA aware hosts.
> > Currently limited support for NUMA binding is available via libvirt
> > where it is possible to apply a NUMA policy to the guest as a whole.
> > However multinode guests would benefit if guest memory belonging to
> > different guest nodes are mapped appropriately to different host NUMA nodes.
> > 
> > To achieve this we would need QEMU to expose information about
> > guest RAM ranges (Guest Physical Address - GPA) and their host virtual
> > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
> > tool like libvirt would be able to divide the guest RAM as per the guest NUMA
> > node geometry and bind guest memory nodes to corresponding host memory nodes
> > using HVA. This needs both QEMU (and libvirt) changes as well as changes
> > in the kernel.
> 
> Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that know how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout.
> 
> Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in.

I think that both Peter and Andrea are looking at this.  Before we commit
an API to QEMU that has a different semantic than a possible new kernel
interface (that perhaps QEMU could use directly to inform kernel of the
binding/relationship between vcpu thread and its memory at VM startup)
it would be useful to see what these guys are working on...

thanks,
-chris
Bharata B Rao - Nov. 21, 2011, 3:18 p.m.
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote:
> * Alexander Graf (agraf@suse.de) wrote:
> > On 29.10.2011, at 20:45, Bharata B Rao wrote:
> > > As guests become NUMA aware, it becomes important for the guests to
> > > have correct NUMA policies when they run on NUMA aware hosts.
> > > Currently limited support for NUMA binding is available via libvirt
> > > where it is possible to apply a NUMA policy to the guest as a whole.
> > > However multinode guests would benefit if guest memory belonging to
> > > different guest nodes are mapped appropriately to different host NUMA nodes.
> > > 
> > > To achieve this we would need QEMU to expose information about
> > > guest RAM ranges (Guest Physical Address - GPA) and their host virtual
> > > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
> > > tool like libvirt would be able to divide the guest RAM as per the guest NUMA
> > > node geometry and bind guest memory nodes to corresponding host memory nodes
> > > using HVA. This needs both QEMU (and libvirt) changes as well as changes
> > > in the kernel.
> > 
> > Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that know how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout.
> > 
> > Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in.
> 
> I think that both Peter and Andrea are looking at this.  Before we commit
> an API to QEMU that has a different semantic than a possible new kernel
> interface (that perhaps QEMU could use directly to inform kernel of the
> binding/relationship between vcpu thread and it's memory at VM startuup)
> it would be useful to see what these guys are working on...

I looked at Peter's recent work in this area.
(https://lkml.org/lkml/2011/11/17/204)

It introduces two interfaces:

1. ms_tbind() to bind a thread to a memsched(*) group
2. ms_mbind() to bind a memory region to memsched group

I assume the 2nd interface could be used by QEMU to create
memsched groups for each of guest NUMA node memory regions.

In the past, Anthony has said that NUMA binding should be done from outside
of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
Though that was in a different context, maybe we should re-look at that
and see if QEMU still sticks to that. I know it's a bit early, but if needed
we should ask Peter to consider extending ms_mbind() to take a tid parameter
too instead of working on current task by default.

(*) memsched: An abstraction for representing coupling of threads with virtual
address ranges. Threads and virtual address ranges of a memsched group are
guaranteed (?) to be located on the same node.

Regards,
Bharata.
Peter Zijlstra - Nov. 21, 2011, 3:25 p.m.
On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:

> I looked at Peter's recent work in this area.
> (https://lkml.org/lkml/2011/11/17/204)
> 
> It introduces two interfaces:
> 
> 1. ms_tbind() to bind a thread to a memsched(*) group
> 2. ms_mbind() to bind a memory region to memsched group
> 
> I assume the 2nd interface could be used by QEMU to create
> memsched groups for each of guest NUMA node memory regions.

No, you would need both, you'll need to group vcpu threads _and_ some
vaddress space together.

I understood QEMU currently uses a single big anonymous mmap() to
allocate the guest memory. Given that, you could either use multiple
mmap()s or carve up the big alloc into virtual nodes by assigning
different parts to different ms groups.

Example: suppose you want to create a 2 node guest with 8 vcpus, create
2 ms groups, each with 4 vcpu threads and assign half the total guest
mmap to either.
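
The two-node example above can be sketched as a plan of group
assignments. This is only an illustration: ms_tbind()/ms_mbind() are
proposed interfaces and their signatures are not given in this thread,
so the sketch makes no syscalls and only computes which vcpu threads
and which slice of the guest mmap would go to each ms group. The
function name, the (offset, length) representation and the 4 GiB guest
size are assumptions made for the example.

```python
def plan_ms_groups(total_bytes, nr_vcpus, nr_groups, page_size=4096):
    """Split a guest RAM mapping and its vcpus evenly across nr_groups.

    Returns one dict per group with the vcpu indexes and the
    (offset, length) slice of the mapping that would be bound to it.
    """
    if nr_vcpus % nr_groups:
        raise ValueError("vcpus do not divide evenly across groups")
    if total_bytes % (nr_groups * page_size):
        raise ValueError("mapping does not split into page-aligned slices")

    vcpus_per_group = nr_vcpus // nr_groups
    slice_len = total_bytes // nr_groups
    groups = []
    for g in range(nr_groups):
        first = g * vcpus_per_group
        groups.append({
            "group": g,
            "vcpus": list(range(first, first + vcpus_per_group)),
            "offset": g * slice_len,   # offset into the single big mmap
            "length": slice_len,
        })
    return groups


# 2-node guest, 8 vcpus, 4 GiB of guest RAM (size is an assumption).
for grp in plan_ms_groups(4 << 30, nr_vcpus=8, nr_groups=2):
    print(grp)
```

Each entry would then translate into one memory-binding call for the
slice and one thread-binding call per listed vcpu, whatever the final
interface looks like.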

> In the past, Anthony has said that NUMA binding should be done from outside
> of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)

If you want to expose a sense of virtual NUMA to your guest you really
have no choice there. The only thing you can do externally is run whole
VMs inside one particular node.

> Though that was in a different context, may be we should re-look at that
> and see if QEMU still sticks to that. I know its a bit early, but if needed
> we should ask Peter to consider extending ms_mbind() to take a tid parameter
> too instead of working on current task by default.

Uh, what for? ms_mbind() works on the current process, not task.

> (*) memsched: An abstraction for representing coupling of threads with virtual
> address ranges. Threads and virtual address ranges of a memsched group are
> guaranteed (?) to be located on the same node.

Yeah, more or less so. We could relax that slightly to allow tasks to
run away from the node for very short periods of time, but basically
that's the provided guarantee.
Bharata B Rao - Nov. 21, 2011, 4 p.m.
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:
> 
> > I looked at Peter's recent work in this area.
> > (https://lkml.org/lkml/2011/11/17/204)
> > 
> > It introduces two interfaces:
> > 
> > 1. ms_tbind() to bind a thread to a memsched(*) group
> > 2. ms_mbind() to bind a memory region to memsched group
> > 
> > I assume the 2nd interface could be used by QEMU to create
> > memsched groups for each of guest NUMA node memory regions.
> 
> No, you would need both, you'll need to group vcpu threads _and_ some
> vaddress space together.
> 
> I understood QEMU currently uses a single big anonymous mmap() to
> allocate the guest memory, using this you could either use multiple or
> carve up the big alloc into virtual nodes by assigning different parts
> to different ms groups.
> 
> Example: suppose you want to create a 2 node guest with 8 vcpus, create
> 2 ms groups, each with 4 vcpu threads and assign half the total guest
> mmap to either.
> 
> > In the past, Anthony has said that NUMA binding should be done from outside
> > of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
> 
> If you want to expose a sense of virtual NUMA to your guest you really
> have no choice there. The only thing you can do externally is run whole
> VMs inside one particular node.
> 
> > Though that was in a different context, may be we should re-look at that
> > and see if QEMU still sticks to that. I know its a bit early, but if needed
> > we should ask Peter to consider extending ms_mbind() to take a tid parameter
> > too instead of working on current task by default.
> 
> Uh, what for? ms_mbind() works on the current process, not task.

In the original post of this mail thread, I proposed a way to export
guest RAM ranges (Guest Physical Address-GPA) and their corresponding
host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
The idea was to use these GPA to HVA mappings from tools like libvirt to bind
specific parts of the guest RAM to different host nodes. This needed an
extension to existing mbind() to allow binding memory of a process(QEMU) from a
different process(libvirt). This was needed since we wanted to do all this from
libvirt.
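
As a rough illustration of what an external tool would do with such an
export, the sketch below translates per-node GPA intervals into HVA
intervals. The tuple layout (gpa, hva, size), the function name and
the sample addresses are assumptions for the example, not the format
of any actual QEMU monitor command.

```python
def node_hva_ranges(ram_ranges, node_gpa_ranges):
    """Translate each guest node's GPA intervals into HVA intervals.

    ram_ranges:      list of (gpa, hva, size) tuples (hypothetical export).
    node_gpa_ranges: {node_id: [(gpa_start, size), ...]}
    Returns {node_id: [(hva_start, size), ...]}.
    """
    result = {}
    for node, intervals in node_gpa_ranges.items():
        hva_list = []
        for n_start, n_size in intervals:
            n_end = n_start + n_size
            for gpa, hva, size in ram_ranges:
                # Intersect the node interval with this RAM range.
                lo = max(n_start, gpa)
                hi = min(n_end, gpa + size)
                if lo < hi:
                    hva_list.append((hva + (lo - gpa), hi - lo))
        result[node] = hva_list
    return result


GIB = 1 << 30
# One 2 GiB RAM range mapped at an arbitrary, made-up host address.
ram = [(0x0, 0x7F0000000000, 2 * GIB)]
nodes = {0: [(0, GIB)], 1: [(GIB, GIB)]}
for node, ranges in node_hva_ranges(ram, nodes).items():
    print(node, [(hex(start), hex(size)) for start, size in ranges])
```

The resulting HVA intervals are what a binding call would take as its
address and length arguments.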

Hence I was coming from that background when I asked for extending
ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
binding should all be done from outside of QEMU, it is needed, otherwise
what you have should be sufficient.

Regards,
Bharata.
Peter Zijlstra - Nov. 21, 2011, 5:03 p.m.
On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
> 
> In the original post of this mail thread, I proposed a way to export
> guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
> host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
> The idea was to use this GPA to HVA mappings from tools like libvirt to bind
> specific parts of the guest RAM to different host nodes. This needed an
> extension to existing mbind() to allow binding memory of a process(QEMU) from a
> different process(libvirt). This was needed since we wanted to do all this from
> libvirt.
> 
> Hence I was coming from that background when I asked for extending
> ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
> binding should all be done from outside of QEMU, it is needed, otherwise
> what you have should be sufficient. 

That's just retarded, and no you won't get such extentions. Poking at
another process's virtual address space is just daft. Esp. if there's no
actual reason for it.

Furthermore, it would make libvirt a required part of qemu, and since I
don't think I've ever used libvirt that's another reason to object, I
don't need that stinking mess.
Avi Kivity - Nov. 21, 2011, 6:03 p.m.
On 11/21/2011 05:25 PM, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:
>
> > I looked at Peter's recent work in this area.
> > (https://lkml.org/lkml/2011/11/17/204)
> > 
> > It introduces two interfaces:
> > 
> > 1. ms_tbind() to bind a thread to a memsched(*) group
> > 2. ms_mbind() to bind a memory region to memsched group
> > 
> > I assume the 2nd interface could be used by QEMU to create
> > memsched groups for each of guest NUMA node memory regions.
>
> No, you would need both, you'll need to group vcpu threads _and_ some
> vaddress space together.
>
> I understood QEMU currently uses a single big anonymous mmap() to
> allocate the guest memory, using this you could either use multiple or
> carve up the big alloc into virtual nodes by assigning different parts
> to different ms groups.
>
> Example: suppose you want to create a 2 node guest with 8 vcpus, create
> 2 ms groups, each with 4 vcpu threads and assign half the total guest
> mmap to either.
>

Does ms_mbind() require that the vmas in its area be completely
contained in the region, or does it split vmas on demand?  I suggest the
latter to avoid exposing implementation details.
Peter Zijlstra - Nov. 21, 2011, 7:31 p.m.
On Mon, 2011-11-21 at 20:03 +0200, Avi Kivity wrote:
> 
> Does ms_mbind() require that its vmas in its area be completely
> contained in the region, or does it split vmas on demand?  I suggest the
> latter to avoid exposing implementation details. 

As implemented (which is still rather incomplete) it does the split on
demand like all other memory interfaces.
Chris Wright - Nov. 21, 2011, 10:50 p.m.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
> > 
> > In the original post of this mail thread, I proposed a way to export
> > guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
> > host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
> > The idea was to use this GPA to HVA mappings from tools like libvirt to bind
> > specific parts of the guest RAM to different host nodes. This needed an
> > extension to existing mbind() to allow binding memory of a process(QEMU) from a
> > different process(libvirt). This was needed since we wanted to do all this from
> > libvirt.
> > 
> > Hence I was coming from that background when I asked for extending
> > ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
> > binding should all be done from outside of QEMU, it is needed, otherwise
> > what you have should be sufficient. 
> 
> That's just retarded, and no you won't get such extentions. Poking at
> another process's virtual address space is just daft. Esp. if there's no
> actual reason for it.

Need to separate the binding vs the policy mgmt.  The policy mgmt could
still be done outside, whereas the binding could still be done from w/in
QEMU.  A simple monitor interface to rebalance vcpu memory allocations
to different nodes could very well schedule vcpu thread work in QEMU.

So, I agree, even if there is some external policy mgmt, it could still
easily work w/ QEMU to use Peter's proposed interface.

thanks,
-chris
Anthony Liguori - Nov. 22, 2011, 1:51 a.m.
On 11/21/2011 11:03 AM, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
>>
>> In the original post of this mail thread, I proposed a way to export
>> guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
>> host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
>> The idea was to use this GPA to HVA mappings from tools like libvirt to bind
>> specific parts of the guest RAM to different host nodes. This needed an
>> extension to existing mbind() to allow binding memory of a process(QEMU) from a
>> different process(libvirt). This was needed since we wanted to do all this from
>> libvirt.
>>
>> Hence I was coming from that background when I asked for extending
>> ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
>> binding should all be done from outside of QEMU, it is needed, otherwise
>> what you have should be sufficient.
>
> That's just retarded, and no you won't get such extentions. Poking at
> another process's virtual address space is just daft. Esp. if there's no
> actual reason for it.

Yes, that would be a terrible interface.

Fundamentally, the entity that should be deciding what memory should be present
and where it should be located is the kernel.  I'm fundamentally opposed to trying
to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.

From what I can tell about ms_mbind(), it just uses process knowledge to bind
specific areas of memory to a memsched group and lets the kernel decide what to
do with that knowledge.  This is exactly the type of interface that QEMU should
be using.

QEMU should tell the kernel enough information such that the kernel can make 
good decisions.  QEMU should not be the one making the decisions.

It looks like ms_mbind() takes a flags argument which I assume is the same flags 
as mbind().  The current implementation ignores flags and just uses MPOL_BIND.

I would hope that the flags argument would only be treated as advisory by the 
kernel.

Regards,

Anthony Liguori

>
> Furthermore, it would make libvirt a required part of qemu, and since I
> don't think I've ever use libvirt that's another reason to object, I
> don't need that stinking mess.
>
Anthony Liguori - Nov. 22, 2011, 1:57 a.m.
On 11/21/2011 04:50 PM, Chris Wright wrote:
> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
>> On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
>>>
>>> In the original post of this mail thread, I proposed a way to export
>>> guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
>>> host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
>>> The idea was to use this GPA to HVA mappings from tools like libvirt to bind
>>> specific parts of the guest RAM to different host nodes. This needed an
>>> extension to existing mbind() to allow binding memory of a process(QEMU) from a
>>> different process(libvirt). This was needed since we wanted to do all this from
>>> libvirt.
>>>
>>> Hence I was coming from that background when I asked for extending
>>> ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
>>> binding should all be done from outside of QEMU, it is needed, otherwise
>>> what you have should be sufficient.
>>
>> That's just retarded, and no you won't get such extentions. Poking at
>> another process's virtual address space is just daft. Esp. if there's no
>> actual reason for it.
>
> Need to separate the binding vs the policy mgmt.  The policy mgmt could
> still be done outside, whereas the binding could still be done from w/in
> QEMU.  A simple monitor interface to rebalance vcpu memory allcoations
> to different nodes could very well schedule vcpu thread work in QEMU.

I really would prefer to avoid having such an interface.  It's a shot gun that 
will only result in many poor feet being maimed.  I can't tell you the number of 
times I've encountered people using CPU pinning when they have absolutely no 
business doing CPU pinning.

If we really believe such an interface should exist, then the interface should 
really be from the kernel.  Once we have memgroups, there's no reason to involve 
QEMU at all.  QEMU can define the memgroups based on the NUMA nodes and then 
it's up to the kernel as to whether it exposes controls to explicitly bind 
memgroups within a process or not.

Regards,

Anthony Liguori

> So, I agree, even if there is some external policy mgmt, it could still
> easily work w/ QEMU to use Peter's proposed interface.
>
> thanks,
> -chris
>
Andrea Arcangeli - Nov. 23, 2011, 3:03 p.m.
Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> Fundamentally, the entity that should be deciding what memory should be present 
> and where it should located is the kernel.  I'm fundamentally opposed to trying 
> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
> 
>  From what I can tell about ms_mbind(), it just uses process knowledge to bind 
> specific areas of memory to a memsched group and let's the kernel decide what to 
> do with that knowledge.  This is exactly the type of interface that QEMU should 
> be using.
> 
> QEMU should tell the kernel enough information such that the kernel can make 
> good decisions.  QEMU should not be the one making the decisions.

True, QEMU won't have to decide where the memory and vcpus should be
located (but hey it wouldn't need to decide that even if you use
cpusets, you can use relative mbind with cpusets, the admin or a
cpuset job scheduler could decide) but it's still QEMU making the
decision of what memory and which vcpus threads to
ms_mbind/ms_tbind. Think how you're going to create the input of those
syscalls...

If it wasn't qemu to decide that, qemu wouldn't be required to scan
the whole host physical numa (cpu/memory) topology in order to create
the "input" arguments of "ms_mbind/ms_tbind". And when you migrate the
VM to another host, the whole vtopology may be counter-productive
because the kernel isn't automatically detecting the numa affinity
between threads and the guest vtopology will stick to whatever numa
_physical_ topology that was seen on the first node where the VM was
created.

I doubt that the assumption that all cloud nodes will have the same
physical numa topology is reasonable.

Furthermore to get the same benefits that qemu gets on host by using
ms_mbind/ms_tbind, every single guest application should be modified
to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
hard bindings which is what we try to avoid).

I think it's unreasonable to expect all applications to use
ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or
wrappers, few apps will be modified for sys_ms_tbind/mbind.

You can always have the supercomputer case with just one app that is
optimized and a single VM spanning over the whole host, but in that
scenario hard bindings would work perfectly too.

In my view the trouble of the numa hard bindings is not the fact
they're hard and qemu has to also decide the location (in fact it
doesn't need to decide the location if you use cpusets and relative
mbinds). The bigger problem is the fact either the admin or the app
developer has to explicitly scan the numa physical topology (both cpus
and memory) and tell the kernel how much memory to bind to each
thread. ms_mbind/ms_tbind only partially solve that problem. They're
similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
don't need an admin or a cpuset-job-scheduler (or a perl script) to
redistribute the hardware resources.

Now dealing with bindings isn't big deal for qemu, in fact this API is
pretty much ideal for qemu, but it won't make life substantially
easier compared to hard bindings. Simply the management code
that is now done with a perl script will have to be moved in the
kernel. It looks an incremental improvement compared to the relative
mbind+cpuset, but I'm unsure if it's the best we could aim for and
what we really need in virt considering we deal with VM migration too.

The real long term design to me is not to add more syscalls, and
initially handling the case of a process/VM spanning not more than one
node in thread number and amount of memory. That's not too hard and in
fact I've benchmarks for the scheduler already showing it to work
pretty well (it's creating a too strict affinity but it can be relaxed
to be more useful). Then later add some mechanism (simplest is the
page fault at low frequency) to create a
guest_vcpu_thread<->host_memory affinity and have a paravirtualized
interface that tells the guest scheduler to group CPUs.

If the guest scheduler runs free and is allowed to move threads
randomly without any paravirtualized interface that controls the CPU
thread migration in the guest scheduler, the thread<->memory affinity
on host will be hopeless. But a paravirtualized interface to make
a guest thread stick to vcpu0/1/2/3 and not go into vcpu4/5/6/7
will allow creating a more meaningful guest_thread<->physical_ram
affinity on host through KVM page faults. And then this will work also
with VM migration and without having to create a vtopology in guest.

And for apps running in guest no paravirt will be needed of course.

The reason paravirt would be needed for qemu-kvm with a full automatic
thread<->memory affinity is that the vcpu threads are magic. What runs
in the vcpu thread are guest threads. And those can move through the
guest CPU scheduler from vcpu0 to vcpu7. If that happens and we've 4
physical cpu for each physical node, any affinity we measure in the
host will be meaningless. Normal threads using NPTL won't behave like
that. Maybe some other thread library could have a "scheduler" inside
that would make it behave like a vcpu thread (it's one thread really
with several threads inside) but those existed mostly to simulate
multiple threads in a single thread so they don't matter. And in this
respect sys_tbind also requires the tid to have meaningful memory
affinity. sys_tbind/mbind gets away with it by creating a vtopology in
the guest, so the guest scheduler would then follow the vtopology (but
vtopology breaks across VM migration and to really be followed well
with sys_mbind/tbind it'd require all apps to be modified).

Grouping guest threads to stick into some vcpu sounds immensely
simpler than changing the whole guest vtopology at runtime that would
involve changing memory layout too.

NOTE: the paravirt cpu grouping interface would also handle the case
of 3 guests of 2.5G on a 8G host (4G per node). One of the three
guests will have memory spanning over two nodes, and the guest
vtopology created by sys_mbind/tbind can't handle it. While paravirt
cpu grouping and automatic thread<->memory affinity on host will
handle it, like it will handle VM migration across nodes with
different physical topology. The problem is to create a
thread<->memory affinity we'll have to issue some page fault in KVM in
the background. How harmful that is I don't know at this point. So the
full automatic thread<->memory affinity is a bit of a vapourware
concept at this point (process<->memory affinity seems to work already
though).
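
The arithmetic behind the 3 x 2.5G case can be checked with a toy
placement sketch. This is not how the kernel allocates memory; it is a
made-up greedy allocator (function name and policy are assumptions)
that only shows why the third guest cannot fit inside a single 4G node
once the first two are placed.

```python
def place_guests(node_mib, guest_mib):
    """Greedily place each guest on the node with the most free memory,
    spilling onto further nodes only when it does not fit on one.

    Returns one {node: MiB taken} dict per guest.
    """
    free = list(node_mib)
    placements = []
    for need in guest_mib:
        placed = {}
        # Visit nodes from most free to least free.
        for node in sorted(range(len(free)), key=lambda n: -free[n]):
            if need == 0:
                break
            take = min(need, free[node])
            if take:
                placed[node] = take
                free[node] -= take
                need -= take
        if need:
            raise ValueError("host out of memory")
        placements.append(placed)
    return placements


# Two 4G nodes, three 2.5G guests (sizes in MiB).
for i, placed in enumerate(place_guests([4096, 4096], [2560] * 3)):
    print("guest", i, placed)
```

The first two guests each land on their own node, leaving 1.5G free on
each, so the third guest has to take memory from both nodes.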

But Peter's migration code was driven by page faults already (not
included in the patch he posted) and the other patch that exists
called migrate-on-fault also depended on page faults. So I am
optimistic we could have a thread<->memory affinity working too in the
longer term. The plan would be to run them at low frequency and only
if we can't fit a process into one node (in terms of both number of
threads and memory). If the process fits in one node, we wouldn't even
need any page fault and the information in the pagetables will be
enough to do a best decision. The downside is it is significantly more
difficult to implement the thread<->memory affinity. And that's why
I'm focusing initially on the simpler case of considering only the
process<->memory affinity. That's fairly easy.

So for the time being this incremental improvement may be justified,
it moves the logic from a perl script to the kernel but I'm just
skeptical it provides a big advantage compared to the numa bindings we
already have in the kernel, especially if in the long term we can get
rid of a vtopology completely.

The vtopology in the guest may seem appealing, it solves the problem
when you use bindings everywhere (be they hard bindings, or cpuset
relative bindings, or the dynamic sys_mbind/tbind). But there is not
much hope of altering the vtopology at runtime, so when a guest must be
split across two nodes (3 VMs of 2.5G RAM running on an 8G host with 2
4G nodes) or through VM migration across different cloud nodes, I
think the vtopology is trouble and would be best if it's avoided. The
memory side of the vtopology is absolute trouble if it doesn't match
the host physical topology exactly.
Alexander Graf - Nov. 23, 2011, 6:34 p.m.
On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> Hi!
>
> On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
>> Fundamentally, the entity that should be deciding what memory should be present
>> and where it should located is the kernel.  I'm fundamentally opposed to trying
>> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
>>
>>   From what I can tell about ms_mbind(), it just uses process knowledge to bind
>> specific areas of memory to a memsched group and let's the kernel decide what to
>> do with that knowledge.  This is exactly the type of interface that QEMU should
>> be using.
>>
>> QEMU should tell the kernel enough information such that the kernel can make
>> good decisions.  QEMU should not be the one making the decisions.
> True, QEMU won't have to decide where the memory and vcpus should be
> located (but hey it wouldn't need to decide that even if you use
> cpusets, you can use relative mbind with cpusets, the admin or a
> cpuset job scheduler could decide) but it's still QEMU making the
> decision of what memory and which vcpus threads to
> ms_mbind/ms_tbind. Think how you're going to create the input of those
> syscalls...
>
> If it wasn't qemu to decide that, qemu wouldn't be required to scan
> the whole host physical numa (cpu/memory) topology in order to create
> the "input" arguments of "ms_mbind/ms_tbind". And when you migrate the
> VM to another host, the whole vtopology may be counter-productive
> because the kernel isn't automatically detecting the numa affinity
> between threads and the guest vtopology will stick to whatever numa
> _physical_ topology that was seen on the first node where the VM was
> created.
>
> I doubt that the assumption that all cloud nodes will have the same
> physical numa topology is reasonable.
>
> Furthermore to get the same benefits that qemu gets on host by using
> ms_mbind/ms_tbind, every single guest application should be modified
> to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
> hard bindings which is what we try to avoid).
>
> I think it's unreasonable to expect all applications to use
> ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or
> wrappers, few apps will be modified for sys_ms_tbind/mbind.
>
> You can always have the supercomputer case with just one app that is
> optimized and a single VM spanning over the whole host, but in that
> scenarios hard bindings would work perfectly too.
>
> In my view the trouble of the numa hard bindings is not the fact
> they're hard and qemu has to also decide the location (in fact it
> doesn't need to decide the location if you use cpusets and relative
> mbinds). The bigger problem is the fact either the admin or the app
> developer has to explicitly scan the numa physical topology (both cpus
> and memory) and tell the kernel how much memory to bind to each
> thread. ms_mbind/ms_tbind only partially solve that problem. They're
> similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> don't need an admin or a cpuset-job-scheduler (or a perl script) to
> redistribute the hardware resources.

Well yeah, of course the guest needs to see some topology. I don't see 
why we'd have to actually scan the host for this though. All we need to 
tell the kernel is "this memory region is close to that thread".

So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to 
tell the kernel that this GB of RAM actually is close to that vCPU thread.

Of course the admin still needs to decide how to split up memory. That's 
the deal with emulating real hardware. You get the interfaces hardware 
gets :). However, if you follow a reasonable default strategy such as 
numa splitting your RAM into equal chunks between guest vCPUs you're 
probably close enough to optimal usage models. Or at least you could 
have a close enough approximation of how this mapping could work for the 
_guest_ regardless of the host and when you migrate it somewhere else it 
should also work reasonably well.
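
As a rough sketch of that equal-split default (illustration only: the
helper name equal_split_numa_args and the whole-GB granularity are
assumptions, not something QEMU provides), the -numa options could be
generated like this:

```python
def equal_split_numa_args(total_mem_gb, n_vcpus, n_nodes):
    """Build '-numa node,mem=...,cpus=...' options for an equal split.

    Hypothetical helper: RAM and vCPUs are divided evenly across the
    guest nodes, independent of whatever the host looks like.
    """
    if n_nodes <= 0 or n_vcpus % n_nodes or total_mem_gb % n_nodes:
        raise ValueError("RAM and vCPUs must divide evenly across nodes")
    mem_per_node = total_mem_gb // n_nodes
    cpus_per_node = n_vcpus // n_nodes
    args = []
    for node in range(n_nodes):
        first = node * cpus_per_node
        last = first + cpus_per_node - 1
        # A single vCPU is written as "0", a range as "0-3".
        cpus = str(first) if first == last else f"{first}-{last}"
        args.append(f"-numa node,mem={mem_per_node}G,cpus={cpus}")
    return args


print(" ".join(equal_split_numa_args(2, 2, 2)))
```

For a 2G guest with 2 vCPUs and 2 nodes this produces the same
"-numa node,mem=1G,cpus=0" form as above, plus a matching entry for
vCPU 1.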


> Now dealing with bindings isn't big deal for qemu, in fact this API is
> pretty much ideal for qemu, but it won't make life substantially
> easier than if compared to hard bindings. Simply the management code
> that is now done with a perl script will have to be moved in the
> kernel. It looks an incremental improvement compared to the relative
> mbind+cpuset, but I'm unsure if it's the best we could aim for and
> what we really need in virt considering we deal with VM migration too.
>
> The real long term design to me is not to add more syscalls, and
> initially handling the case of a process/VM spanning not more than one
> node in thread number and amount of memory. That's not too hard and in
> fact I've benchmarks for the scheduler already showing it to work
> pretty well (it's creating a too strict affinity but it can be relaxed
> to be more useful). Then later add some mechanism (simplest is the
> page fault at low frequency) to create a
> guest_vcpu_thread<->host_memory affinity and have a paravirtualized
> interface that tells the guest scheduler to group CPUs.
>
> If the guest scheduler runs free and is allowed to move threads
> randomly without any paravirtualized interface that controls the CPU
> thread migration in the guest scheduler, the thread<->memory affinity
> on host will be hopeless. But with a paravirtualized interface to make
> a guest thread stick to vcpu0/1/2/3 and not going into vcpu4/5/6/7,
> will allow to create a more meaningful guest_thread<->physical_ram
> affinity on host through KVM page faults. And then this will work also
> with VM migration and without having to create a vtopology in guest.

So you want to basically dynamically create NUMA topologies from the 
runtime behavior of the guest? What if it changes over time?

> And for apps running in guest no paravirt will be needed of course.
>
> The reason paravirt would be needed for qemu-kvm with a full automatic
> thread<->memory affinity is that the vcpu threads are magic. What runs
> in the vcpu thread are guest threads. And those can move through the
> guest CPU scheduler from vcpu0 to vcpu7. If that happens and we've 4
> physical cpu for each physical node, any affinity we measure in the
> host will be meaningless. Normal threads using NPTL won't behave like
> that. Maybe some other thread library could have a "scheduler" inside
> that would make it behave like a vcpu thread (it's one thread really
> with several threads inside) but those existed mostly to simulate
> multiple threads in a single thread so they don't matter. And in this
> respect sys_tbind also requires the tid to have meaningful memory
> affinity. sys_tbind/mbind gets away with it by creating a vtopology in
> the guest, so the guest scheduler would then follow the vtopology (but
> vtopology breaks across VM migration and to really be followed well
> with sys_mbind/tbind it'd require all apps to be modified).
>
> grouping guest threads to stick into some vcpu sounds immensely
> simpler than changing the whole guest vtopology at runtime that would
> involve changing memory layout too.
>
> NOTE: the paravirt cpu grouping interface would also handle the case
> of 3 guests of 2.5G on an 8G host (4G per node). One of the three
> guests will have memory spanning over two nodes, and the guest
> vtopology created by sys_mbind/tbind can't handle it. While paravirt
> cpu grouping and automatic thread<->memory affinity on host will
> handle it, like it will handle VM migration across nodes with
> different physical topology. The problem is to create a
> thread<->memory affinity we'll have to issue some page fault in KVM in
> the background. How harmful that is I don't know at this point. So the
> full automatic thread<->memory affinity is a bit of a vapourware
> concept at this point (process<->memory affinity seems to work already
> though).
>
> But Peter's migration code was driven by page faults already (not
> included in the patch he posted) and the other patch that exists
> called migrate-on-fault also depended on page faults. So I am
> optimistic we could have a thread<->memory affinity working too in the
> longer term. The plan would be to run them at low frequency and only
> if we can't fit a process into one node (in terms of both number of
> threads and memory). If the process fits in one node, we wouldn't even
> need any page fault and the information in the pagetables will be
> enough to do a best decision. The downside is it significantly more
> difficult to implement the thread<->memory affinity. And that's why
> I'm focusing initially on the simpler case of considering only the
> process<->memory affinity. That's fairly easy.
>
> So for the time being this incremental improvement may be justified,
> it moves the logic from a perl script to the kernel but I'm just
> skeptical it provides a big advantage compared to the numa bindings we
> already have in the kernel, especially if in the long term we can get
> rid of a vtopology completely.

I actually like the idea of just telling the kernel how close memory 
will be to a thread. Sure, you can handle this basically by shoving your 
scheduler into user space, but isn't managing processes what a kernel is 
supposed to do in the first place?

You can always argue for a microkernel, but having a scheduler in user 
space (perl script) and another one in the kernel doesn't sound very 
appealing to me. If you want to go full-on user space, sure, I can see 
why :).

Either way, your approach sounds to be very much in the concept phase, 
while this is more something that can actually be tested and benchmarked 
against today. So yes, I want the interim solution - just in case your 
plan doesn't work out :). Oh, and then there's the non-PV guests too...


Alex
Andrea Arcangeli - Nov. 23, 2011, 8:19 p.m.
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to 
> tell the kernel that this GB of RAM actually is close to that vCPU thread.
> Of course the admin still needs to decide how to split up memory. That's 
> the deal with emulating real hardware. You get the interfaces hardware 
> gets :). However, if you follow a reasonable default strategy such as 

The problem is how do you decide the parameter "-numa
node,mem=1G,cpus=0".

Real hardware exists when the VM starts. But then the VM can be
migrated. Or the VM may have to be split in the middle of two nodes
regardless of the -numa node,mem=1G,cpus=0-1 to avoid swapping so
there may be two 512M nodes with 1 cpu each instead of 1 NUMA node
with 1G and 2 cpus.

Especially by relaxing the hard bindings and using ms_mbind/tbind, the
vtopology you create won't match real hardware because you don't know
the real hardware that you will get.

> numa splitting your RAM into equal chunks between guest vCPUs you're 
> probably close enough to optimal usage models. Or at least you could 
> have a close enough approximation of how this mapping could work for the 
> _guest_ regardless of the host and when you migrate it somewhere else it 
> should also work reasonably well.

If you enforce these assumptions and the admin has still again to
choose the "-numa node,mem=1G" parameters after checking the physical
numa topology and make sure the vtopology can match the real physical
topology and that the guest runs on "real hardware", it's not very
different from using hard bindings, hard bindings enforces the "real
hardware" so there's no way it can go wrong. I mean you still need
some NUMA topology knowledge outside of QEMU to be sure you get "real
hardware" out of the vtopology.

Ok cpusets would restrict the availability of idle cpus, so there's a
slight improvement in maximizing idle CPU usage (it's better to run
50% slower than not to run at all), but that could be achieved also by
a relax of the cpuset semantics (if that's not already available).

> So you want to basically dynamically create NUMA topologies from the 
> runtime behavior of the guest? What if it changes over time?

Yes, just I wouldn't call it NUMA topologies or it looks like a
vtopology, and the vtopology is something fixed at boot time, the sort
of thing created by using a command line like "-numa
node,mem=1G,cpus=0".

I wouldn't try to give the guest any "memory" topology, just the vcpus
are magic threads that don't behave like normal threads in memory
affinity terms. So they need a paravirtualization layer to be dealt
with.

The fact that vcpu0 accessed N pages right now doesn't mean there's a
real affinity between vcpu0 and those N pages if the guest scheduler
is free to migrate anything anywhere. The guest thread running in the
vcpu0 may be migrated to the vcpu7 which may belong to a different
physical node. So if we want to automatically detect thread<->memory
affinity between vcpus and guest memory, we also need to group the
guest threads in certain vcpus and prevent those cpu migrations. The
thread in the guest would better stick to vcpu0/1/2/3 (instead of
migration to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with the same
memory which fits in one node. That can only be told dynamically from
KVM to the guest OS scheduler as we may migrate virtual machines or we
may move the memory.
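
A toy illustration of why the grouping matters (a sketch under
assumptions: the (vcpu, node) fault samples and the helper name
vcpu_node_affinity are hypothetical, nothing like this is exported by
KVM):

```python
from collections import Counter


def vcpu_node_affinity(samples):
    """Estimate per-vcpu memory affinity from sampled page faults.

    samples: iterable of (vcpu, node) pairs, one per sampled fault.
    Returns {vcpu: (dominant_node, fraction_of_samples_on_that_node)}.
    """
    per_vcpu = {}
    for vcpu, node in samples:
        per_vcpu.setdefault(vcpu, Counter())[node] += 1
    result = {}
    for vcpu, counts in per_vcpu.items():
        node, hits = counts.most_common(1)[0]
        result[vcpu] = (node, hits / sum(counts.values()))
    return result


# Guest threads grouped: vcpu0 only ever touches memory on node 0.
grouped = [(0, 0)] * 8
# Guest scheduler running free: whatever runs on vcpu0 hits both nodes.
free_running = [(0, 0), (0, 1)] * 4

print(vcpu_node_affinity(grouped)[0])          # sharp affinity
print(vcpu_node_affinity(free_running)[0][1])  # no better than a coin toss
```

With grouping the dominant node carries all of the samples; without it
the measurement on the host says nothing useful about where the memory
should go.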

Take the example of 3 VMs of 2.5G RAM each on an 8G system with 2 nodes
(4G per node). Suppose one of the two VMs that have all the 2.5G
allocated in a single node quits. Then the VM that was split across
the two nodes will be "memory-migrated" to fit in one node. So far so
good, but then KVM should tell the guest OS scheduler that it should
stop grouping vcpus and all vcpus are equal and all guest threads can
be migrated to any vcpu.

I don't see a way to do those things with a vtopology fixed at boot.
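
To make the 2.5G example concrete, here is a toy placement model (a
sketch only: the greedy first-fit policy and the place() helper are
assumptions for illustration, not how the host kernel allocates
memory). Sizes are in MB.

```python
def place(vms_mb, node_free_mb):
    """Greedy placement of VMs onto NUMA nodes.

    Each VM goes whole onto the first node with enough free memory;
    if no single node can hold it, it is spilled across nodes in order.
    Returns {vm_name: {node_index: mb_on_that_node}}.
    """
    free = list(node_free_mb)
    layout = {}
    for vm, need in vms_mb.items():
        fit = next((n for n, f in enumerate(free) if f >= need), None)
        if fit is not None:
            free[fit] -= need
            layout[vm] = {fit: need}
            continue
        parts = {}
        for n in range(len(free)):
            take = min(free[n], need)
            if take:
                parts[n] = take
                free[n] -= take
                need -= take
        if need:
            raise MemoryError(f"{vm} does not fit on this host")
        layout[vm] = parts
    return layout


nodes = [4096, 4096]  # 2 nodes, 4G each
three = place({"vm1": 2560, "vm2": 2560, "vm3": 2560}, nodes)
print(three["vm3"])   # the third VM ends up split across both nodes

# One of the single-node VMs quits; re-placing models the memory migration.
two = place({"vm1": 2560, "vm3": 2560}, nodes)
print(two["vm3"])     # now it fits in one node
```

The layout of the third VM changes while it is running, which is
exactly the kind of change a boot-time vtopology cannot follow.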

> I actually like the idea of just telling the kernel how close memory 
> will be to a thread. Sure, you can handle this basically by shoving your 
> scheduler into user space, but isn't managing processes what a kernel is 
> supposed to do in the first place?

Assume you're not in virt and you just want to tell thread A uses
memory range A and thread B uses memory range B. If the memory range A
fits in one node you're ok. But if "memory A" now spans over two nodes
(maybe to avoid swapping), you're still screwed and you won't give
enough information to the kernel on the real runtime affinity that
"thread A" has on the memory. Now if statistically the accesses to
"memory A" are all equal, it won't make a difference but if you end up
using half of "memory A" 99% of the time, it will not work as well.

This is especially a problem for KVM because statistically the
accesses to "memory A" given to vcpu0 won't be equal. 50% of it may
not be used at all and just have pagecache sitting there, or even free
memory, so we can do better if "memory A" is split across two nodes to
avoid swapping, if we detect the vcpu<->memory affinity dynamically.

> You can always argue for a microkernel, but having a scheduler in user 
> space (perl script) and another one in the kernel doesn't sound very 
> appealing to me. If you want to go full-on user space, sure, I can see 
> why :).
>
> Either way, your approach sounds to be very much in the concept phase, 
> while this is more something that can actually be tested and benchmarked 

The thread<->memory affinity is in the concept phase, but the
process<->memory affinity already runs and in benchmarks it already performs
almost as well as hard bindings. It has the cost of a knumad daemon
scanning the memory in the background but that's cheap, not even
comparable to something like KSM. It's comparable to khugepaged
overhead, which is orders of magnitude lower and considering those are
big systems with many CPUs I don't think it's a big deal.

Once process<->memory affinity works well if we go into the
thread<->memory affinity, we'll have to tweak knumad to trigger page
faults to give us per-thread information on the memory affinity.

Also I'm only working on anonymous memory right now, maybe it should
be extended to other types of memory and handle the case of the memory
being shared by entities running in different nodes and not touch it
in that case, while if the pagecache is used by just one thread (or
process initially) it could still migrate it. For readonly shared
memory duplicating it per-node is the way to go but I'm not going into
that direction as it's not useful for virt. It remains a possibility
for the future.

> against today. So yes, I want the interim solution - just in case your 
> plan doesn't work out :). Oh, and then there's the non-PV guests too...

Actually to me it looks like the code misses the memory affinity and
migration, so I'm not sure how much you can run benchmarks on it
yet. It seems to tweak the scheduler though.

I don't mean ms_tbind/mbind are a bad idea, they allow moving the
migration invoked by a perl script into the kernel, but I'm not
satisfied with the trouble that creating a vtopology still gives us
(vtopology only makes sense if it matches the "real hardware" and as
said above we don't always have real hardware, and if you enforce real
hardware you're pretty close to using hard bindings, except it will be
the kernel doing the migration of cpus and memory instead of those
being invoked by userland, and it also allows to maximize usage of the
idle CPUs).

But even after you create a vtopology in guest and you make sure it
won't split across nodes so that the vtopology runs on the "real hardware"
it has been created for, it won't help much if all userland apps in
guest aren't also modified to use ms_mbind/ms_tbind which I don't see
happening any time soon. You could still run knumad in guest, to take
advantage of the vtopology without having to modify guest apps
though. But if knumad would run in host (if we can solve the
thread<->memory affinity) there would be no need of vtopology in the
guest in the first place.

I'm positive and I've proof of concept that knumad works for
process<->memory affinity but considering the automigration code isn't
complete (it doesn't even yet migrate THP without splitting them) I'm
not yet delving into the complications of the thread affinity.
Dipankar Sarma - Nov. 30, 2011, 4:22 p.m.
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> >Hi!
> >
> >
> >In my view the trouble of the numa hard bindings is not the fact
> >they're hard and qemu has to also decide the location (in fact it
> >doesn't need to decide the location if you use cpusets and relative
> >mbinds). The bigger problem is the fact either the admin or the app
> >developer has to explicitly scan the numa physical topology (both cpus
> >and memory) and tell the kernel how much memory to bind to each
> >thread. ms_mbind/ms_tbind only partially solve that problem. They're
> >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> >don't need an admin or a cpuset-job-scheduler (or a perl script) to
> >redistribute the hardware resources.
> 
> Well yeah, of course the guest needs to see some topology. I don't
> see why we'd have to actually scan the host for this though. All we
> need to tell the kernel is "this memory region is close to that
> thread".
> 
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able
> to tell the kernel that this GB of RAM actually is close to that
> vCPU thread.
> 
> Of course the admin still needs to decide how to split up memory.
> That's the deal with emulating real hardware. You get the interfaces
> hardware gets :). However, if you follow a reasonable default
> strategy such as numa splitting your RAM into equal chunks between
> guest vCPUs you're probably close enough to optimal usage models. Or
> at least you could have a close enough approximation of how this
> mapping could work for the _guest_ regardless of the host and when
> you migrate it somewhere else it should also work reasonably well.

Allowing specification of the numa nodes to qemu, allowing
qemu to create cpu+mem grouping (without binding) and letting
the kernel decide how to manage them seems like a reasonable incremental 
step between no guest/host NUMA awareness and automatic NUMA 
configuration in the host kernel. It would suffice for the current 
needs we see.

Besides migration, we also have use cases where we may want to
have large multi-node VMs that are static (like LPARs), having the guest 
aware of the topology there is helpful. 

Also, if the topology changes at all due to migration or host kernel decisions,
we can make use of something like VPHN (virtual processor home node)
capability on Power systems to have guest kernel update its topology
knowledge. You can refer to that in
arch/powerpc/mm/numa.c. Otherwise, as long as the host kernel
maintains mappings requested by ms_tbind()/ms_mbind(), we can
create the guest topology correctly and optimize for NUMA. This
would work for us.

Thanks
Dipankar
Peter Zijlstra - Nov. 30, 2011, 4:25 p.m.
On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote:
> 
> Also, if at all topology changes due to migration or host kernel decisions,
> we can make use of something like VPHN (virtual processor home node)
> capability on Power systems to have guest kernel update its topology
> knowledge. You can refer to that in
> arch/powerpc/mm/numa.c. 

I think that fail^Wfeature of PPC is terminally broken. You simply
cannot change the topology after the fact.
Chris Wright - Nov. 30, 2011, 4:33 p.m.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote:
> > 
> > Also, if at all topology changes due to migration or host kernel decisions,
> > we can make use of something like VPHN (virtual processor home node)
> > capability on Power systems to have guest kernel update its topology
> > knowledge. You can refer to that in
> > arch/powerpc/mm/numa.c. 
> 
> I think that fail^Wfeature of PPC is terminally broken. You simply
> cannot change the topology after the fact. 

Agreed, there's too many things that consult topology once and never
look back.
Andrea Arcangeli - Nov. 30, 2011, 5:41 p.m.
On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even on the case of 1 guest that fits in one node, you're not going to
max out the full bandwidth of all memory channels with this.

All qemu can do with ms_mbind/tbind is to create a vtopology that
matches the hardware topology. It has these limits:

1) requires all userland applications to be modified to scan either
   the physical topology if run on host, or the vtopology if run on
   guest to get the full benefit.

2) breaks across live migration if host physical topology changes

3) 1 small guest on an idle numa system that fits in one numa node
   won't tell enough information to the host kernel

4) if used outside of qemu and one thread allocates more memory than
   what fits in one node it won't tell enough info to the host kernel.

About 3): if you've just one guest that fits in one node, the vcpus
should probably be spread across all the nodes and behave like
MPOL_INTERLEAVE: if the guest CPU scheduler migrates guest processes in
reverse, the global memory bandwidth will still be fully used even if
they will both access remote memory. I've just seen benchmarks where
no pinning runs more than _twice_ as fast as pinning with just 1
guest and only 10 vcpu threads, probably because of that.

About 4): even if the thread scans the numa topology it won't be able
to tell enough info to the kernel to know which parts of the
memory may be used more or less (ok it may be possible to call mbind
and vary it at runtime but it adds even more complexity left to the
programmer).

If the vcpu is free to go in any node, and we've an automatic
vcpu<->memory affinity, then the memory will follow the vcpu. And the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Trouble 1/2/3/4 applies to the hard bindings as well, not just to
mbind/tbind.

In short it's an incremental step that moves some logic to the kernel
but I don't see it solving all situations optimally and it shares a
lot of the limits of the hard bindings.
Dipankar Sarma - Dec. 1, 2011, 5:25 p.m.
On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> > create the guest topology correctly and optimize for NUMA. This
> > would work for us.
> 
> Even on the case of 1 guest that fits in one node, you're not going to
> max out the full bandwidth of all memory channels with this.
> 
> qemu all can do with ms_mbind/tbind is to create a vtopology that
> matches the hardware topology. It has these limits:
> 
> 1) requires all userland applications to be modified to scan either
>    the physical topology if run on host, or the vtopology if run on
>    guest to get the full benefit.

Not sure why you would need that. qemu can reflect the
topology based on -numa specifications and the corresponding
ms_tbind/mbind in FDT (in the case of Power, I guess ACPI
tables for x86) and guest kernel would detect this virtualized
topology. So there is no need for two types of topologies afaics.
It will all be reflected in /sys/devices/system/node in the guest.
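
For what it's worth, a guest-side check of what ended up there could
look like the sketch below (the helper names parse_cpulist and
read_node_cpus are made up for illustration; only the nodeN/cpulist
layout comes from sysfs):

```python
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_node_cpus(root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} from the nodeN/cpulist files under root."""
    nodes = {}
    for name in os.listdir(root):
        m = re.fullmatch(r"node(\d+)", name)
        if m:
            with open(os.path.join(root, name, "cpulist")) as f:
                nodes[int(m.group(1))] = parse_cpulist(f.read())
    return nodes
```

Calling read_node_cpus() inside the guest shows the virtualized
grouping of vcpus per node as the guest kernel detected it.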

> 
> 2) breaks across live migration if host physical topology changes

That is indeed an issue. Either VM placement software needs to
be really smart to migrate VMs that fit well or, more likely,
we will have to find a way to make guest kernels aware of
topology changes. But the latter has impact on userspace
as well for applications that might have optimized for NUMA.

> 3) 1 small guest on a idle numa system that fits in one numa node will
>    tell not enough information to the host kernel
> 
> 4) if used outside of qemu and one threads allocates more memory than
>    what fits in one node it won't tell enough info to the host kernel.
> 
> About 3): if you've just one guest that fits in one node, each vcpu
> should be spread across all the nodes probably, and behave like
> MADV_INTERLEAVE if the guest CPU scheduler migrate guests processes in
> reverse, the global memory bandwidth will still be used full even if
> they will both access remote memory. I've just seen benchmarks where
> no pinning runs more than _twice_ as fast than pinning with just 1
> guest and only 10 vcpu threads, probably because of that.

I agree. Specifying NUMA topology for guest can result in
sub-optimal performance in some cases, it is a tradeoff.


> In short it's an incremental step that moves some logic to the kernel
> but I don't see it solving all situations optimally and it shares a
> lot of the limits of the hard bindings.

Agreed.

Thanks
Dipankar
Andrea Arcangeli - Dec. 1, 2011, 5:36 p.m.
On Thu, Dec 01, 2011 at 10:55:20PM +0530, Dipankar Sarma wrote:
> On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote:
> > On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> > > create the guest topology correctly and optimize for NUMA. This
> > > would work for us.
> > 
> > Even on the case of 1 guest that fits in one node, you're not going to
> > max out the full bandwidth of all memory channels with this.
> > 
> > qemu all can do with ms_mbind/tbind is to create a vtopology that
> > matches the hardware topology. It has these limits:
> > 
> > 1) requires all userland applications to be modified to scan either
> >    the physical topology if run on host, or the vtopology if run on
> >    guest to get the full benefit.
> 
> Not sure why you would need that. qemu can reflect the
> topology based on -numa specifications and the corresponding
> ms_tbind/mbind in FDT (in the case of Power, I guess ACPI
> tables for x86) and guest kernel would detect this virtualized
> topology. So there is no need for two types of topologies afaics.
> It will all be reflected in /sys/devices/system/node in the guest.

The point is: what does a vtopology give you if you don't modify all apps
running in the guest to use it? A vtopology on guest helps exactly like
the topology on host -> very little unless you modify qemu on host to
use ms_tbind/mbind.

> > 2) breaks across live migration if host physical topology changes
> 
> That is indeed an issue. Either VM placement software needs to
> be really smart to migrate VMs that fit well or, more likely,
> we will have to find a way to make guest kernels aware of
> topology changes. But the latter has impact on userspace
> as well for applications that might have optimized for NUMA.

Making the guest kernel aware of "memory" topology changes is going to
be a whole mess. Or at least harder than memory hotplug.

> I agree. Specifying NUMA topology for guest can result in
> sub-optimal performance in some cases, it is a tradeoff.

I see it more as a limit of this solution, one it shares with
the hard bindings, than as a tradeoff.

> Agreed.

Yep I just wanted to make clear the limits remain with this solution.

I'll try to teach knumad to detect thread<->memory affinity too with
some logic, we'll see how well that can work.
Peter Zijlstra - Dec. 1, 2011, 5:40 p.m.
On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:
> Hi!
> 
> On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> > Fundamentally, the entity that should be deciding what memory should be present 
> > and where it should located is the kernel.  I'm fundamentally opposed to trying 
> > to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
> > 
> >  From what I can tell about ms_mbind(), it just uses process knowledge to bind 
> > specific areas of memory to a memsched group and let's the kernel decide what to 
> > do with that knowledge.  This is exactly the type of interface that QEMU should 
> > be using.
> > 
> > QEMU should tell the kernel enough information such that the kernel can make 
> > good decisions.  QEMU should not be the one making the decisions.
> 
> True, QEMU won't have to decide where the memory and vcpus should be
> located (but hey it wouldn't need to decide that even if you use
> cpusets, you can use relative mbind with cpusets, the admin or a
> cpuset job scheduler could decide) but it's still QEMU making the
> decision of what memory and which vcpus threads to
> ms_mbind/ms_tbind. Think how you're going to create the input of those
> syscalls...
> 
> If it wasn't qemu to decide that, qemu wouldn't be required to scan
> the whole host physical numa (cpu/memory) topology in order to create
> the "input" arguments of "ms_mbind/ms_tbind".

That's a plain falsehood, you don't need to scan host physical topology
in order to create useful ms_[mt]bind arguments. You can use physical
topology to optimize for particular hardware, but it's not a strict
requirement.

>  And when you migrate the
> VM to another host, the whole vtopology may be counter-productive
> because the kernel isn't automatically detecting the numa affinity
> between threads and the guest vtopology will stick to whatever numa
> _physical_ topology that was seen on the first node where the VM was
> created.

This doesn't make any sense at all.

> I doubt that the assumption that all cloud nodes will have the same
> physical numa topology is reasonable.

So what? If you want to be very careful you can make sure your vnodes are
small enough that they fit any physical node in your cloud (god I f*king
hate that word).

If you're slightly less careful, things will still work, you might get
less max parallelism, but typically (from what I understood) these VM
hosting thingies are overloaded so you never get your max cpu anyway, so
who cares.

Thing is, whatever you set up will always work; it might not be
optimal, but the one guarantee, that [threads,vrange] will stay on the same
node, will be kept true no matter where you run it.

Also, migration between non-identical hosts is always 'tricky'. You're
always stuck with some minimally supported subset or average case thing.
Really, why do you think NUMA would be any different.

> Furthermore to get the same benefits that qemu gets on host by using
> ms_mbind/ms_tbind, every single guest application should be modified
> to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
> hard bindings which is what we try to avoid).

No! ms_[tm]bind() is just part of the solution, the other part is what
to do for simple programs, and like I wrote in my email earlier, and
what we talked about in Prague, is that for normal simple proglets we
simply pick a numa node and stick to it. Much like:

 http://home.arcor.de/efocht/sched/

Except we could actually migrate the whole thing if needed. Basically
you give each task its own 1 vnode and assign all threads to it.

Only big programs that need to span multiple nodes need to be modified
to get best advantage of numa. But that has always been true.

> In my view the trouble of the numa hard bindings is not the fact
> they're hard and qemu has to also decide the location (in fact it
> doesn't need to decide the location if you use cpusets and relative
> mbinds). The bigger problem is the fact either the admin or the app
> developer has to explicitly scan the numa physical topology (both cpus
> and memory) and tell the kernel how much memory to bind to each
> thread. ms_mbind/ms_tbind only partially solve that problem. They're
> similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> don't need an admin or a cpuset-job-scheduler (or a perl script) to
> redistribute the hardware resources.

You're full of crap Andrea. 

Yes you need some clue as to your actual topology, but that's life, you
can't get SMP for free either, you need to have some clue.

Just like with regular SMP where you need to be aware of data sharing,
NUMA just makes it worse. If your app decomposes well enough to create a
vnode per thread, that's excellent, if you want to scale your app to fit
your machine that's fine too, heck, every multi-threaded app out there
worth using already queries machine topology one way or another, its not
a big deal.

But cpusets and relative_nodes doesn't work, you still get your memory
splattered all over whatever nodes you allow and the scheduler will
still move your task around based purely on cpu-load. 0-win.

Not needing a (userspace) job-scheduler is a win, because that avoids
having everybody talk to this job-scheduler, and there's multiple
job-schedulers out there, two can't properly co-exist, etc. Also, the
kernel is the right place to do this.

[ this btw is true for all muddle-ware solutions, try and fit two
applications together that are written against different but similar
purpose muddle-wares and shit will come apart quickly ]

> Now dealing with bindings isn't big deal for qemu, in fact this API is
> pretty much ideal for qemu, but it won't make life substantially
> easier than if compared to hard bindings. Simply the management code
> that is now done with a perl script will have to be moved in the
> kernel. It looks an incremental improvement compared to the relative
> mbind+cpuset, but I'm unsure if it's the best we could aim for and
> what we really need in virt considering we deal with VM migration too.

No virt is crap, it needs to die, its horrid, and any solution aimed
squarely at virt only is shit and not worth considering, that simple.

If you want to help solve the NUMA issue, forget about virt and solve it
for the non-virt case.

> The real long term design to me is not to add more syscalls, and
> initially handling the case of a process/VM spanning not more than one
> node in thread number and amount of memory. That's not too hard an in
> fact I've benchmarks for the scheduler already showing it to work
> pretty well (it's creating a too strict affinity but it can be relaxed
> to be more useful). Then later add some mechanism (simplest is the
> page fault at low frequency) to create a
> guest_vcpu_thread<->host_memory affinity and have a parvirtualized
> interface that tells the guest scheduler to group CPUs.

I bet you believe a compiler can solve all
parallelization/concurrency problems for you as well. Happy pipe
dreaming for you. While you're at it, I've heard this
transactional-memory crap will solve all our locking problems.

Concurrency is hard, applications need to know wtf they're doing if
they want to gain any efficiency by it.

> If the guest scheduler runs free and is allowed to move threads
> randomly without any paravirtualized interface that controls the CPU
> thread migration in the guest scheduler, the thread<->memory affinity
> on host will be hopeless. But with a parvirtualized interface to make
> a guest thread stick to vcpu0/1/2/3 and not going into vcpu4/5/6/7,
> will allow to create a more meaningful guest_thread<->physical_ram
> affinity on host through KVM page faults. And then this will work also
> with VM migration and without having to create a vtopology in guest.

As a maintainer of the scheduler I can, with a fair degree of certainty,
say you'll never get such paravirt scheduler hooks.

Also, as much as I dislike the whole virt stuff, the whole premise of
virt is to 'emulate' real hardware. Real hardware does NUMA, therefore
its not weird to also do vNUMA. And yes NUMA sucks eggs, and in fact not
all hardware platforms expose it, have a look at s390 for example. They
stuck in huge caches and pretend it doesn't exist. But for those that
do, there's a performance gain to play by its rules.

Furthermore, I've been told there is a great interest in running
!paravirt kernels, so much so in fact that hardware emulation seems more
important than paravirt solutions.

Also, I really don't see how trying to establish thread:page relations
is in any way virt related, why couldn't you do this in a host kernel? 

From what I gather what you propose is to periodically unmap all user
memory (or map it !r !w !x, which is effectively the same) and take the
fault. This fault will establish a thread:page relation. One can just
use that or involve some history as well. Once you have this thread:page
relation set you want to group them on the same node.
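
For illustration only, the fault-driven sampling described above can be
mimicked in userspace with mprotect() and a SIGSEGV handler. This is a
sketch under stated assumptions, not the kernel mechanism being discussed:
the trk_* names, the TRK_MAX_PAGES limit and the one-owner-per-page table
are all hypothetical, and only the last thread to fault on a page is
remembered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define TRK_MAX_PAGES 64              /* hypothetical limit for this sketch */

static unsigned char *trk_base;       /* start of the tracked region */
static size_t trk_pages;              /* number of tracked pages */
static size_t trk_page_size;
/* Thread id of the last thread that faulted on each page, 0 if untouched. */
static volatile sig_atomic_t trk_owner[TRK_MAX_PAGES];

static void trk_on_fault(int sig, siginfo_t *si, void *ctx)
{
    int saved_errno = errno;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)trk_base;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + trk_pages * trk_page_size) {
        /* Not our region: restore the default action so the retried
         * access terminates the process as a normal crash would. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    idx = (addr - base) / trk_page_size;
    /* Record the thread:page relation, then make the page accessible
     * again so the faulting instruction succeeds when it is retried. */
    trk_owner[idx] = (sig_atomic_t)syscall(SYS_gettid);
    mprotect(trk_base + idx * trk_page_size, trk_page_size,
             PROT_READ | PROT_WRITE);
    errno = saved_errno;
}

/* Map an anonymous region of npages pages and install the handler. */
static int trk_init(size_t npages)
{
    struct sigaction sa;
    long ps = sysconf(_SC_PAGESIZE);
    void *p;

    if (ps <= 0 || npages == 0 || npages > TRK_MAX_PAGES)
        return -1;
    trk_page_size = (size_t)ps;
    trk_pages = npages;
    p = mmap(NULL, npages * trk_page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    trk_base = p;
    sa.sa_sigaction = trk_on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Start a sampling period: forget old relations and revoke all access. */
static int trk_arm(void)
{
    size_t i;

    for (i = 0; i < trk_pages; i++)
        trk_owner[i] = 0;
    return mprotect(trk_base, trk_pages * trk_page_size, PROT_NONE);
}

/* Number of pages touched since the last trk_arm(). */
static size_t trk_count(void)
{
    size_t i, n = 0;

    for (i = 0; i < trk_pages; i++)
        if (trk_owner[i] != 0)
            n++;
    return n;
}
```

Each trk_arm() call costs one fault per page touched afterwards, which is
the overhead referred to below; a second access to the same page within a
period is not seen at all.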

There's various problems with that, firstly of course the overhead,
storing this thread:page relation set requires quite a lot of memory.
Secondly I'm not quite sure I see how that works for threads that share
a working set. Suppose you have 4 threads and 2 working sets, how do you
make sure to keep the 2 groups together. I don't think that's evident
from the simple thread:page relation data [*]. Thirdly I immensely
dislike all these background scanner things, they make it very hard to
account time to those who actually use it.

[ * I can only make that work if you're willing to store something like
O(nr_pages * nr_threads) amount of data to correlate stuff, and that's
not even the time needed to process it and make something useful out of
it ]
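
To put a rough number on that footnote, here is a back-of-the-envelope
sketch. The 5 GiB of RAM matches the "-m 5g" guest from the original RFC;
the 4 KiB page size, the 32 threads and the per-entry widths are
illustrative assumptions, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Number of pages backing ram_bytes of memory. */
static uint64_t nr_pages(uint64_t ram_bytes, uint64_t page_bytes)
{
    return ram_bytes / page_bytes;
}

/* Bytes needed for one entry of bits_per_entry bits per (page, thread)
 * pair, rounded up to a whole byte. */
static uint64_t relation_bytes(uint64_t pages, uint64_t threads,
                               uint64_t bits_per_entry)
{
    return (pages * threads * bits_per_entry + 7) / 8;
}
```

With 5 GiB of RAM and 4 KiB pages there are 1,310,720 pages. For 32
threads, a 1-bit "touched" flag per pair costs 5 MiB and an 8-bit counter
per pair costs 40 MiB, before any of the processing time is counted.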

> sys_tbind/mbind gets away with it by creating a vtopology in
> the guest, so the guest scheduler would then follow the vtopology (but
> vtopology breaks across VM migration and to really be followed well
> with sys_mbind/tbind it'd require all apps to be modified).

Again, vtopology doesn't break with VM migration. It's perfectly possible
to create a vnode with 8 threads on hardware with only 2 cpus per node.

Your threads all get to share those 2 cpus, so it's not ideal, but it
does work.

> grouping guest threads to stick into some vcpu sounds immensely
> simpler than changing the whole guest vtopology at runtime that would
> involve changing memory layout too.

vtopology is stable for the entire duration of the guest. Some weird
people at IBM think its doable to change the topology at runtime, but
I'd argue they're wrong.

> NOTE: the paravirt cpu grouping interface would also handle the case
> of 3 guests of 2.5G on a 8G guest (4G per node). One of the three
> guests will have memory spanning over two nodes, and the guest
> vtopology created by sys_mbind/tbind can't handle it.

You could of course have created 3 guests with 2 nodes of 1.25G each.
You can always do stupid things, sys_[mt]bind doesn't pretend brains
aren't required.

<snip more stuff>
Dipankar Sarma - Dec. 1, 2011, 5:49 p.m.
On Thu, Dec 01, 2011 at 06:36:23PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 01, 2011 at 10:55:20PM +0530, Dipankar Sarma wrote:
> > On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote:
> > > On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> > > > create the guest topology correctly and optimize for NUMA. This
> > > > would work for us.
> > > 
> > > Even on the case of 1 guest that fits in one node, you're not going to
> > > max out the full bandwidth of all memory channels with this.
> > > 
> > > qemu all can do with ms_mbind/tbind is to create a vtopology that
> > > matches the hardware topology. It has these limits:
> > > 
> > > 1) requires all userland applications to be modified to scan either
> > >    the physical topology if run on host, or the vtopology if run on
> > >    guest to get the full benefit.
> > 
> > Not sure why you would need that. qemu can reflect the
> > topology based on -numa specifications and the corresponding
> > ms_tbind/mbind in FDT (in the case of Power, I guess ACPI
> > tables for x86) and guest kernel would detect this virtualized
> > topology. So there is no need for two types of topologies afaics.
> > It will all be reflected in /sys/devices/system/node in the guest.
> 
> The point is: what a vtopology gives you? If you don't modify all apps
> running in the guest to use it? vtopology on guest, helps exactly like
> the topology on host -> very little unless you modify qemu on host to
> use ms_tbind/mbind.

Sure, ms_tbind/mbind will be needed in qemu. For the rest, NUMA aware apps 
already use topology while running on physical systems and they wouldn't need 
modification for this kind of virtualized topology.

Thanks
Dipankar
Marcelo Tosatti - Dec. 22, 2011, 11:01 a.m.
On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:
> > Hi!
> > 
> > On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> > > Fundamentally, the entity that should be deciding what memory should be present 
> > > and where it should located is the kernel.  I'm fundamentally opposed to trying 
> > > to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
> > > 
> > >  From what I can tell about ms_mbind(), it just uses process knowledge to bind 
> > > specific areas of memory to a memsched group and let's the kernel decide what to 
> > > do with that knowledge.  This is exactly the type of interface that QEMU should 
> > > be using.
> > > 
> > > QEMU should tell the kernel enough information such that the kernel can make 
> > > good decisions.  QEMU should not be the one making the decisions.
> > 
> > True, QEMU won't have to decide where the memory and vcpus should be
> > located (but hey it wouldn't need to decide that even if you use
> > cpusets, you can use relative mbind with cpusets, the admin or a
> > cpuset job scheduler could decide) but it's still QEMU making the
> > decision of what memory and which vcpus threads to
> > ms_mbind/ms_tbind. Think how you're going to create the input of those
> > syscalls...
> > 
> > If it wasn't qemu to decide that, qemu wouldn't be required to scan
> > the whole host physical numa (cpu/memory) topology in order to create
> > the "input" arguments of "ms_mbind/ms_tbind".
> 
> That's a plain falsehood, you don't need to scan host physcal topology
> in order to create useful ms_[mt]bind arguments. You can use physical
> topology to optimize for particular hardware, but its not a strict
> requirement.
> 
> >  And when you migrate the
> > VM to another host, the whole vtopology may be counter-productive
> > because the kernel isn't automatically detecting the numa affinity
> > between threads and the guest vtopology will stick to whatever numa
> > _physical_ topology that was seen on the first node where the VM was
> > created.
> 
> This doesn't make any sense at all.
> 
> > I doubt that the assumption that all cloud nodes will have the same
> > physical numa topology is reasonable.
> 
> So what? If you want to be very careful you can make sure you vnodes are
> small enough they fit any any physical node in your cloud (god I f*king
> hate that word).
> 
> If you're slightly less careful, things will still work, you might get
> less max parallelism, but typically (from what I understood) these VM
> hosting thingies are overloaded so you never get your max cpu anyway, so
> who cares.
> 
> Things is, whatever you set-up it will always work, it might not be
> optimal, but the one guarantee: [threads,vrange] will stay on the same
> node will be kept true, no matter where you run it.
> 
> Also, migration between non-identical hosts is always 'tricky'. You're
> always stuck with some minimally supported subset or average case thing.
> Really, why do you think NUMA would be any different.
> 
> > Furthermore to get the same benefits that qemu gets on host by using
> > ms_mbind/ms_tbind, every single guest application should be modified
> > to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
> > hard bindings which is what we try to avoid).
> 
> No! ms_[tm]bind() is just part of the solution, the other part is what
> to do for simple programs, and like I wrote in my email earlier, and
> what we talked about in Prague, is that for normal simple proglets we
> simply pick a numa node and stick to it. Much like:
> 
>  http://home.arcor.de/efocht/sched/
> 
> Except we could actually migrate the whole thing if needed. Basically
> you give each task its own 1 vnode and assign all threads to it.
> 
> Only big programs that need to span multiple nodes need to be modified
> to get best advantage of numa. But that has always been true.
> 
> > In my view the trouble of the numa hard bindings is not the fact
> > they're hard and qemu has to also decide the location (in fact it
> > doesn't need to decide the location if you use cpusets and relative
> > mbinds). The bigger problem is the fact either the admin or the app
> > developer has to explicitly scan the numa physical topology (both cpus
> > and memory) and tell the kernel how much memory to bind to each
> > thread. ms_mbind/ms_tbind only partially solve that problem. They're
> > similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> > don't need an admin or a cpuset-job-scheduler (or a perl script) to
> > redistribute the hardware resources.
> 
> You're full of crap Andrea. 
> 
> Yes you need some clue as to your actual topology, but that's life, you
> can't get SMP for free either, you need to have some clue.
> 
> Just like with regular SMP where you need to be aware of data sharing,
> NUMA just makes it worse. If your app decomposes well enough to create a
> vnode per thread, that's excellent, if you want to scale your app to fit
> your machine that's fine too, heck, every multi-threaded app out there
> worth using already queries machine topology one way or another, its not
> a big deal.
> 
> But cpusets and relative_nodes doesn't work, you still get your memory
> splattered all over whatever nodes you allow and the scheduler will
> still move your task around based purely on cpu-load. 0-win.
> 
> Not needing a (userspace) job-scheduler is a win, because that avoids
> having everybody talk to this job-scheduler, and there's multiple
> job-schedulers out there, two can't properly co-exist, etc. Also, the
> kernel is the right place to do this.
> 
> [ this btw is true for all muddle-ware solutions, try and fit two
> applications together that are written against different but similar
> purpose muddle-wares and shit will come apart quickly ]
> 
> > Now dealing with bindings isn't big deal for qemu, in fact this API is
> > pretty much ideal for qemu, but it won't make life substantially
> > easier than if compared to hard bindings. Simply the management code
> > that is now done with a perl script will have to be moved in the
> > kernel. It looks an incremental improvement compared to the relative
> > mbind+cpuset, but I'm unsure if it's the best we could aim for and
> > what we really need in virt considering we deal with VM migration too.
> 
> No virt is crap, it needs to die, its horrid, and any solution aimed
> squarely at virt only is shit and not worth considering, that simple.

Removing this phrase from context (feel free to object on that basis
to the following inquiry), what are your concerns with virtualization
itself? Is it only that having an unknowable operating system under
your feet is uncomfortable, or is there something else? Because virt
is green, it saves silicon.
Marcelo Tosatti - Dec. 22, 2011, 11:24 a.m.
On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:

<snip>

> From what I gather what you propose is to periodically unmap all user
> memory (or map it !r !w !x, which is effectively the same) and take the
> fault. This fault will establish a thread:page relation. One can just
> use that or involve some history as well. Once you have this thread:page
> relation set you want to group them on the same node.
> 
> There's various problems with that, firstly of course the overhead,
> storing this thread:page relation set requires quite a lot of memory.
> Secondly I'm not quite sure I see how that works for threads that share
> a working set. Suppose you have 4 threads and 2 working sets, how do you
> make sure to keep the 2 groups together. I don't think that's evident
> from the simple thread:page relation data [*]. Thirdly I immensely
> dislike all these background scanner things, they make it very hard to
> account time to those who actually use it.

Picture yourself as the administrator of a virtualized host, with
a given workload of guests doing their tasks. All it takes is to
understand from a high level what the algorithms of ksm (collapsing of
equal content-pages into same physical RAM) and khugepaged (collapsing
of 4k pages in 2MB pages, good for TLB) are doing (and that should be
documented), and infer from that what is happening. The same is valid
for the guy who is writing management tools and exposing the statistics
to the system administrator.
Anthony Liguori - Dec. 22, 2011, 5:13 p.m.
On 12/22/2011 05:01 AM, Marcelo Tosatti wrote:
> On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
>> No virt is crap, it needs to die, its horrid, and any solution aimed
>> squarely at virt only is shit and not worth considering, that simple.
>
> Removing this phrase from context (feel free to object on that basis
> to the following inquiry), what are your concerns with virtualization
> itself? Is it the fact that having an unknownable operating system under
> your feet uncomfortable only, or is there something else? Because virt
> is green, it saves silicon.

Oh man, if you say virt solves global warming, I think I'm going to have to jump 
off a bridge to end the madness...

Regards,

Anthony Liguori

>
>
Marcelo Tosatti - Dec. 22, 2011, 5:55 p.m.
On Thu, Dec 22, 2011 at 11:13:15AM -0600, Anthony Liguori wrote:
> On 12/22/2011 05:01 AM, Marcelo Tosatti wrote:
> >On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> >>No virt is crap, it needs to die, its horrid, and any solution aimed
> >>squarely at virt only is shit and not worth considering, that simple.
> >
> >Removing this phrase from context (feel free to object on that basis
> >to the following inquiry), what are your concerns with virtualization
> >itself? Is it the fact that having an unknownable operating system under
> >your feet uncomfortable only, or is there something else? Because virt
> >is green, it saves silicon.
> 
> Oh man, if you say virt solves global warming, I think I'm going to
> have to jump off a bridge to end the madness...

I said it is green (saves energy) and it saves silicon (therefore saves
fuel?). The rest of the conclusions are your own.
Peter Zijlstra - Dec. 22, 2011, 7:04 p.m.
On Thu, 2011-12-22 at 09:01 -0200, Marcelo Tosatti wrote:
> 
> > No virt is crap, it needs to die, its horrid, and any solution aimed
> > squarely at virt only is shit and not worth considering, that simple.
> 
> Removing this phrase from context (feel free to object on that basis
> to the following inquiry), what are your concerns with virtualization
> itself? Is it the fact that having an unknownable operating system under
> your feet uncomfortable only, or is there something else? Because virt
> is green, it saves silicon.

No, you're going the wrong way around that argument. Resource control
would save the planet in that case. That's an entirely separate concept
from virtualization. Look how much cgroup crap you still need on top of
the whole virt thing.

Virt deals with running legacy OSs, mostly because you're in a bind and
for a host of reasons can't get this super critical application you
really must have running on your new and improved platform.

So you emulate hardware to run the old os, to run the old app or
somesuch nonsense.

Virt really is mostly a technical solution to a mostly non-technical
problem.

There's of course the debug angle, but I've never really found it
reliable enough to use in that capacity, give me real hardware with a
serial port any day of the week.

Also, it just really offends me, we work really hard to make stuff go as
fast as possible and then you stick a gigantic emulation layer in
between and complain that shit is slow again. Don't do that!!

Patch

different guest nodes are mapped appropriately to different host NUMA nodes.

To achieve this we would need QEMU to expose information about
guest RAM ranges (Guest Physical Address - GPA) and their host virtual
address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
tool like libvirt would be able to divide the guest RAM as per the guest NUMA
node geometry and bind guest memory nodes to corresponding host memory nodes
using HVA. This needs both QEMU (and libvirt) changes as well as changes
in the kernel.

- System calls that set NUMA memory policies (like mbind) currently work
  for the current (or the calling) process. These syscalls need to be
  extended so that a process like libvirt is able to set NUMA memory
  policies for QEMU process's memory ranges.
- This RFC is actually about the proposed change in QEMU to export
  GPA and HVA via QEMU monitor.
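
For reference, this is roughly what binding a range looks like today from
inside the process that owns the memory, which is the limitation the first
item describes. A minimal sketch, assuming Linux and the raw mbind(2)
system call; the helper names are hypothetical, a single unsigned long is
used as the node mask (so only low-numbered nodes are covered), and addr
must be page-aligned.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <linux/mempolicy.h>   /* MPOL_BIND */
#include <sys/syscall.h>       /* SYS_mbind */
#include <unistd.h>            /* syscall() */

/* Bit mask with only the bit for the given host node set. */
static unsigned long node_to_mask(unsigned int node)
{
    return 1UL << node;
}

/* Bind [addr, addr + len) of the CALLING process to one host node.
 * Returns 0 on success, -1 on error. The top bit of the mask is left
 * unused in this sketch, so such nodes are rejected up front. */
static long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
    unsigned long mask;

    if (node >= sizeof(unsigned long) * 8 - 1)
        return -1;
    mask = node_to_mask(node);
    return syscall(SYS_mbind, addr, (unsigned long)len,
                   (unsigned long)MPOL_BIND, &mask,
                   (unsigned long)(sizeof(mask) * 8), 0UL);
}
```

There is no argument naming a target process, so a tool like libvirt
cannot use this call on QEMU's address ranges without the kernel extension
mentioned above.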
  
The patch against QEMU present towards the end of this note is an attempt
to achieve this. This patch adds a new monitor command "info ram".
"info ram" prints out GPA and HVA for different sections of guest RAM.

For a guest booted with options "-smp sockets=2,cores=4,threads=2
-numa node,nodeid=0,cpus=0-15 -numa node,nodeid=1,cpus=16-31 -cpu core2duo
-m 5g", the exported data looks like this:

******************
(qemu) info ram
GPA: 0-9ffff RAM: 0-9ffff HVA: 0x7efe7fe00000-0x7efe7fe9ffff
GPA: cc000-effff RAM: cc000-effff HVA: 0x7efe7fecc000-0x7efe7feeffff
GPA: 100000-dfffffff RAM: 100000-dfffffff HVA: 0x7efe7ff00000-0x7eff5fdfffff
GPA: fc000000-fc7fffff RAM: 140040000-14083ffff HVA: 0x7efe7f400000-0x7efe7fbfffff
GPA: 100000000-15fffffff RAM: e0000000-13fffffff HVA: 0x7eff5fe00000-0x7effbfdfffff
******************

I will remove the ram_addr (prefixed with RAM:) from the above. It is
here only to validate the regions and to compare with the "info mtree"
output (shown below).

******************
(qemu) info mtree
memory
0000000000000000-7ffffffffffffffe (prio 0): system
  0000000000000000-00000000dfffffff (prio 0): alias ram-below-4g @pc.ram 0000000000000000-00000000dfffffff
  00000000000a0000-00000000000bffff (prio 1): alias smram-region @pci 00000000000a0000-00000000000bffff
  00000000000c0000-00000000000c3fff (prio 1): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
  00000000000c4000-00000000000c7fff (prio 1): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
  00000000000c8000-00000000000cbfff (prio 1): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
  00000000000cc000-00000000000cffff (prio 1): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff
  00000000000d0000-00000000000d3fff (prio 1): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff
  00000000000d4000-00000000000d7fff (prio 1): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff
  00000000000d8000-00000000000dbfff (prio 1): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff
  00000000000dc000-00000000000dffff (prio 1): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff
  00000000000e0000-00000000000e3fff (prio 1): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff
  00000000000e4000-00000000000e7fff (prio 1): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
  00000000000e8000-00000000000ebfff (prio 1): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
  00000000000ec000-00000000000effff (prio 1): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
  00000000000f0000-00000000000fffff (prio 1): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
  00000000e0000000-00000000ffffffff (prio 0): alias pci-hole @pci 00000000e0000000-00000000ffffffff
  00000000fee00000-00000000feefffff (prio 0): apic
  0000000100000000-000000015fffffff (prio 0): alias ram-above-4g @pc.ram 00000000e0000000-000000013fffffff
  4000000000000000-7fffffffffffffff (prio 0): alias pci-hole64 @pci 4000000000000000-7fffffffffffffff
pc.ram
0000000000000000-000000013fffffff (prio 0): pc.ram
******************
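
An external tool would have to parse the "info ram" lines shown earlier.
A minimal sketch, assuming the exact line format printed by this version
of the patch (including the RAM: field that is slated for removal); the
struct and function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One guest RAM section; all ranges are inclusive. */
struct ram_region {
    uint64_t gpa_start, gpa_end;   /* guest physical address range */
    uint64_t hva_start, hva_end;   /* host virtual address range */
};

/* Parse one "GPA: a-b RAM: c-d HVA: e-f" line.
 * Returns 0 on success, -1 if the line does not match. */
static int parse_info_ram_line(const char *line, struct ram_region *r)
{
    uint64_t ram_start, ram_end;   /* ram_addr fields, parsed and dropped */
    int n;

    /* %x conversions accept an optional 0x prefix, so the same
     * conversion handles both the GPA and the HVA fields. */
    n = sscanf(line,
               "GPA: %" SCNx64 "-%" SCNx64
               " RAM: %" SCNx64 "-%" SCNx64
               " HVA: %" SCNx64 "-%" SCNx64,
               &r->gpa_start, &r->gpa_end,
               &ram_start, &ram_end,
               &r->hva_start, &r->hva_end);
    return n == 6 ? 0 : -1;
}
```

If the RAM: field is dropped as planned, the format string would need to
change accordingly.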


The current patch just exports the information and expects
external tools to make use of it for binding. But we do understand that
memory ranges can change and external tools should be able to respond to
this. This is the current thinking on how to handle this:

- Whenever the address range changes, send an async notification to
  libvirt (using QMP perhaps?)
- libvirt will note the change and re-read the current guest RAM mapping
  info and re-bind the regions as appropriate.
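
The re-bind step comes down to translating each guest node's GPA range
into the HVA ranges that back it. A minimal sketch of that translation for
one RAM section follows; the struct, the function name and the node
boundaries used in the example are assumptions for illustration, since the
patch itself does not export which GPA range belongs to which guest node.

```c
#include <stdint.h>

/* One RAM section as reported by "info ram" (GPA range inclusive). */
struct ram_region {
    uint64_t gpa_start;
    uint64_t gpa_end;
    uint64_t hva_start;
};

/* Intersect a RAM section with a guest node's GPA range
 * [node_lo, node_hi] (inclusive). On overlap, store the HVA start and
 * byte length of the overlapping part and return 1; otherwise return 0. */
static int node_hva_slice(const struct ram_region *r,
                          uint64_t node_lo, uint64_t node_hi,
                          uint64_t *hva, uint64_t *len)
{
    uint64_t lo = r->gpa_start > node_lo ? r->gpa_start : node_lo;
    uint64_t hi = r->gpa_end < node_hi ? r->gpa_end : node_hi;

    if (lo > hi)
        return 0;
    /* HVA and GPA advance together within one section. */
    *hva = r->hva_start + (lo - r->gpa_start);
    *len = hi - lo + 1;
    return 1;
}
```

For the section "GPA: 100000-dfffffff ... HVA: 0x7efe7ff00000-..." and a
hypothetical guest node 0 covering GPA 0-9fffffff, this yields HVA
0x7efe7ff00000 with length 0x9ff00000. The rest of that section would then
belong to the next node.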

I haven't fully figured out this part (QEMU to libvirt notification part)
yet and any pointers or suggestions here will be useful.

Also a question:

- In what ways can the guest memory layout change? Is the change driven
  by external agents like libvirt (memory hot add), or can things change
  transparently within QEMU? If it's only the former, then we more or less
  know when to do the rebinding.

The patch follows:

---
Export guest RAM address via QEMU monitor.

NUMA aware QEMU guests running on NUMA systems can benefit from binding
guest RAM to the appropriate host NUMA node memory. Allow admin tools like
libvirt to achieve this by exporting guest RAM information via
QEMU monitor.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---

 memory.c  |   33 +++++++++++++++++++++++++++++++++
 memory.h  |    2 ++
 monitor.c |   12 ++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)


diff --git a/memory.c b/memory.c
index dc5e35d..3ae10e5 100644
--- a/memory.c
+++ b/memory.c
@@ -1402,3 +1402,36 @@  void mtree_info(fprintf_function mon_printf, void *f)
         mtree_print_mr(mon_printf, f, address_space_io.root, 0, 0, &ml_head);
     }
 }
+
+#if !defined(CONFIG_USER_ONLY)
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    FlatRange *fr;
+
+    FOR_EACH_FLAT_RANGE(fr, &address_space_memory.current_map) {
+        AddrRange ar = fr->addr;
+        ram_addr_t ram;
+        uint8_t *hva;
+
+        ram = cpu_get_physical_page_desc(ar.start);
+
+        /* Only show RAM area */
+        if ((ram & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            continue;
+        }
+        ram &= TARGET_PAGE_MASK;
+        hva = qemu_get_ram_ptr(ram);
+        mon_printf(f, "GPA: %llx-%llx" " RAM: "
+            RAM_ADDR_FMT "-" RAM_ADDR_FMT " HVA: %p-%p\n",
+            (unsigned long long)ar.start,
+            (unsigned long long)(ar.start+ar.size-1),
+            ram, (ram_addr_t)(ram+ar.size-1),
+            hva, hva+ar.size-1);
+    }
+}
+#else
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    mon_printf(f, "Not supported\n");
+}
+#endif
diff --git a/memory.h b/memory.h
index d5b47da..b5fb5e0 100644
--- a/memory.h
+++ b/memory.h
@@ -503,6 +503,8 @@  void memory_region_transaction_commit(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void ram_info_print(fprintf_function mon_printf, void *f);
+
 #endif
 
 #endif
diff --git a/monitor.c b/monitor.c
index ffda0fe..3b1a7f3 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2738,6 +2738,11 @@  int monitor_get_fd(Monitor *mon, const char *fdname)
     return -1;
 }
 
+static void do_info_ram(Monitor *mon)
+{
+    ram_info_print((fprintf_function)monitor_printf, mon);
+}
+
 static const mon_cmd_t mon_cmds[] = {
 #include "hmp-commands.h"
     { NULL, NULL, },
@@ -3050,6 +3055,13 @@  static const mon_cmd_t info_cmds[] = {
         .mhandler.info = do_trace_print_events,
     },
     {
+        .name       = "ram",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show RAM information",
+        .mhandler.info = do_info_ram,
+    },
+    {
         .name       = NULL,
     },
 };