| Message ID | 20111029184502.GH11038@in.ibm.com |
|---|---|
| State | New |
On 29.10.2011, at 20:45, Bharata B Rao wrote: > Hi, > > As guests become NUMA aware, it becomes important for the guests to > have correct NUMA policies when they run on NUMA aware hosts. > Currently limited support for NUMA binding is available via libvirt, > where it is possible to apply a NUMA policy to the guest as a whole. > However multinode guests would benefit if guest memory belonging to > different guest nodes is mapped appropriately to different host NUMA nodes. > > To achieve this we would need QEMU to expose information about > guest RAM ranges (Guest Physical Address - GPA) and their host virtual > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external > tool like libvirt would be able to divide the guest RAM as per the guest NUMA > node geometry and bind guest memory nodes to corresponding host memory nodes > using HVA. This needs both QEMU (and libvirt) changes as well as changes > in the kernel. Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. Imagine QEMU could tell the kernel that different parts of its virtual memory address space are supposed to be fast on different host vcpu threads. Then the kernel has all the information it needs. It could even potentially migrate memory towards a thread whenever the scheduler determines that it's better to run a thread somewhere else. That said, I don't disagree with your approach per se. It just sounds way too static to me and tries to overcome shortcomings we have in the Linux mm system by replacing them with hardcoded pinning logic in user space. Alex
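For context: the binding primitive the proposal builds on is the existing mbind(2) syscall, which only works on the calling process's own address space; that restriction is exactly why the thread keeps returning to exporting GPA-to-HVA mappings so an external tool could do the equivalent. Below is a minimal sketch of binding one guest node's RAM range from inside QEMU, assuming the (hva, len, host_node) triple comes from the guest NUMA geometry; bind_guest_node() is a hypothetical helper, not actual QEMU code.

```c
/*
 * Bind one guest node's RAM range [hva, hva+len) to a host NUMA node
 * with the existing mbind(2). This only works from within the QEMU
 * process itself, which is the limitation the thread is discussing;
 * bind_guest_node() is a hypothetical helper and assumes host_node < 64.
 */
#include <numaif.h>    /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stdio.h>

static int bind_guest_node(void *hva, unsigned long len, int host_node)
{
    unsigned long nodemask = 1UL << host_node;  /* single-node mask */

    /* MPOL_MF_MOVE also migrates pages that were already faulted in. */
    if (mbind(hva, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, MPOL_MF_MOVE) < 0) {
        perror("mbind");
        return -1;
    }
    return 0;
}
```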
* Alexander Graf <agraf@suse.de> [2011-10-29 21:57:38]: > > On 29.10.2011, at 20:45, Bharata B Rao wrote: > > > Hi, > > > > As guests become NUMA aware, it becomes important for the guests to > > have correct NUMA policies when they run on NUMA aware hosts. > > Currently limited support for NUMA binding is available via libvirt, > > where it is possible to apply a NUMA policy to the guest as a whole. > > However multinode guests would benefit if guest memory belonging to > > different guest nodes is mapped appropriately to different host NUMA nodes. > > > > To achieve this we would need QEMU to expose information about > > guest RAM ranges (Guest Physical Address - GPA) and their host virtual > > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external > > tool like libvirt would be able to divide the guest RAM as per the guest NUMA > > node geometry and bind guest memory nodes to corresponding host memory nodes > > using HVA. This needs both QEMU (and libvirt) changes as well as changes > > in the kernel. > > Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. Yes, the motivation is to get libvirt to manage memory and numa related policies more effectively, just like we do vcpu pinning and allocations. We would like libvirt to know about the host numa configuration and map it to the guest memory layout to minimize cross-node references within the guest. > Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. The kernel is the one implementing the policy. The kernel cannot know the guest memory layout or the expectations for that VM. The kernel today sees the guest as a single process that obeys numactl bindings. What this patch is trying to do is to make the policy recommendations more fine-grained and effective for a multi-node guest. Qemu knows the layouts and can tell the kernel or issue the mbind() calls to set up the numa affinity; however, qemu's assumptions could change if libvirt enforces policies on it using cgroups and cpusets. Hence in the proposed approach we would allow libvirt to be the policy owner, get the required information from qemu and set the policy by informing the kernel, just like we do for vcpus today. > Imagine QEMU could tell the kernel that different parts of its virtual memory address space are supposed to be fast on different host vcpu threads. Then the kernel has all the information it needs. It could even potentially migrate memory towards a thread whenever the scheduler determines that it's better to run a thread somewhere else. Migrating memory nearer to the vcpu or scheduling vcpus closer to the memory node is a good approach, as proposed by Andrea Arcangeli's autonuma. That could be one of the policies that libvirt can choose for a given scenario. > That said, I don't disagree with your approach per se. It just sounds way too static to me and tries to overcome shortcomings we have in the Linux mm system by replacing them with hardcoded pinning logic in user space. Thanks for the review. I agree that fine-grained control of memory and cpu pinning needs to be used carefully to get the desired positive effect. This proposal is a good first step to handle multi-node guests effectively compared to the default policies that are available today. --Vaidy
* Alexander Graf (agraf@suse.de) wrote: > On 29.10.2011, at 20:45, Bharata B Rao wrote: > > As guests become NUMA aware, it becomes important for the guests to > > have correct NUMA policies when they run on NUMA aware hosts. > > Currently limited support for NUMA binding is available via libvirt, > > where it is possible to apply a NUMA policy to the guest as a whole. > > However multinode guests would benefit if guest memory belonging to > > different guest nodes is mapped appropriately to different host NUMA nodes. > > > > To achieve this we would need QEMU to expose information about > > guest RAM ranges (Guest Physical Address - GPA) and their host virtual > > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external > > tool like libvirt would be able to divide the guest RAM as per the guest NUMA > > node geometry and bind guest memory nodes to corresponding host memory nodes > > using HVA. This needs both QEMU (and libvirt) changes as well as changes > > in the kernel. > > Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. > > Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. I think that both Peter and Andrea are looking at this. Before we commit an API to QEMU that has different semantics than a possible new kernel interface (that perhaps QEMU could use directly to inform the kernel of the binding/relationship between a vcpu thread and its memory at VM startup) it would be useful to see what these guys are working on... thanks, -chris
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote: > * Alexander Graf (agraf@suse.de) wrote: > > On 29.10.2011, at 20:45, Bharata B Rao wrote: > > > As guests become NUMA aware, it becomes important for the guests to > > > have correct NUMA policies when they run on NUMA aware hosts. > > > Currently limited support for NUMA binding is available via libvirt, > > > where it is possible to apply a NUMA policy to the guest as a whole. > > > However multinode guests would benefit if guest memory belonging to > > > different guest nodes is mapped appropriately to different host NUMA nodes. > > > > > > To achieve this we would need QEMU to expose information about > > > guest RAM ranges (Guest Physical Address - GPA) and their host virtual > > > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external > > > tool like libvirt would be able to divide the guest RAM as per the guest NUMA > > > node geometry and bind guest memory nodes to corresponding host memory nodes > > > using HVA. This needs both QEMU (and libvirt) changes as well as changes > > > in the kernel. > > > > Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. > > > > Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. > > I think that both Peter and Andrea are looking at this. Before we commit > an API to QEMU that has different semantics than a possible new kernel > interface (that perhaps QEMU could use directly to inform the kernel of the > binding/relationship between a vcpu thread and its memory at VM startup) > it would be useful to see what these guys are working on... I looked at Peter's recent work in this area. (https://lkml.org/lkml/2011/11/17/204) It introduces two interfaces: 1. ms_tbind() to bind a thread to a memsched(*) group 2. ms_mbind() to bind a memory region to a memsched group I assume the 2nd interface could be used by QEMU to create memsched groups for each of the guest NUMA node memory regions. In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) Though that was in a different context, maybe we should revisit that and see if QEMU still sticks to that position. I know it's a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on the current task by default. (*) memsched: An abstraction for representing the coupling of threads with virtual address ranges. Threads and virtual address ranges of a memsched group are guaranteed (?) to be located on the same node. Regards, Bharata.
On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: > I looked at Peter's recent work in this area. > (https://lkml.org/lkml/2011/11/17/204) > > It introduces two interfaces: > > 1. ms_tbind() to bind a thread to a memsched(*) group > 2. ms_mbind() to bind a memory region to memsched group > > I assume the 2nd interface could be used by QEMU to create > memsched groups for each of guest NUMA node memory regions. No, you would need both, you'll need to group vcpu threads _and_ some vaddress space together. I understood QEMU currently uses a single big anonymous mmap() to allocate the guest memory, using this you could either use multiple or carve up the big alloc into virtual nodes by assigning different parts to different ms groups. Example: suppose you want to create a 2 node guest with 8 vcpus, create 2 ms groups, each with 4 vcpu threads and assign half the total guest mmap to either. > In the past, Anthony has said that NUMA binding should be done from outside > of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) If you want to expose a sense of virtual NUMA to your guest you really have no choice there. The only thing you can do externally is run whole VMs inside one particular node. > Though that was in a different context, may be we should re-look at that > and see if QEMU still sticks to that. I know its a bit early, but if needed > we should ask Peter to consider extending ms_mbind() to take a tid parameter > too instead of working on current task by default. Uh, what for? ms_mbind() works on the current process, not task. > (*) memsched: An abstraction for representing coupling of threads with virtual > address ranges. Threads and virtual address ranges of a memsched group are > guaranteed (?) to be located on the same node. Yeah, more or less so. We could relax that slightly to allow tasks to run away from the node for very short periods of time, but basically that's the provided guarantee.
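To make Peter's example concrete, here is a hedged sketch of that carving. ms_mbind()/ms_tbind() exist only as an RFC, so the wrapper signatures, the syscall numbers and the integer group ids below are all assumptions, not a real kernel ABI:

```c
/*
 * Sketch of Peter's example: a 2-vnode, 8-vcpu guest carved out of one
 * big anonymous mmap(). ms_mbind()/ms_tbind() are RFC-only
 * (https://lkml.org/lkml/2011/11/17/204): the syscall numbers,
 * signatures and group-id scheme here are assumptions, so this
 * compiles but returns ENOSYS on any current kernel.
 */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define VNODES     2
#define VCPUS      8                        /* 4 vcpu threads per vnode */
#define GUEST_RAM  (2UL << 30)              /* 2G total, 1G per vnode */

/* Assumed wrappers; no syscall numbers are allocated upstream. */
static long ms_mbind(int group, void *addr, unsigned long len)
{
    return syscall(-1L /* __NR_ms_mbind */, group, addr, len);
}

static long ms_tbind(int group)             /* binds the calling thread */
{
    return syscall(-1L /* __NR_ms_tbind */, group);
}

static void setup_vnodes(void)
{
    char *ram = mmap(NULL, GUEST_RAM, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Carve the single allocation into two virtual nodes. */
    for (int g = 0; g < VNODES; g++)
        ms_mbind(g, ram + g * (GUEST_RAM / VNODES), GUEST_RAM / VNODES);

    /* Each vcpu thread then joins its group from its own context:
     * vcpu threads 0-3 call ms_tbind(0), threads 4-7 call ms_tbind(1). */
}
```

The kernel is then free to place each (4 vcpus, 1G) group on whatever physical node it likes, and to migrate each group as a unit when the scheduler moves its threads.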
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote: > On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: > > > I looked at Peter's recent work in this area. > > (https://lkml.org/lkml/2011/11/17/204) > > > > It introduces two interfaces: > > > > 1. ms_tbind() to bind a thread to a memsched(*) group > > 2. ms_mbind() to bind a memory region to a memsched group > > > > I assume the 2nd interface could be used by QEMU to create > > memsched groups for each of the guest NUMA node memory regions. > > No, you would need both, you'll need to group vcpu threads _and_ some > vaddress space together. > > I understood QEMU currently uses a single big anonymous mmap() to > allocate the guest memory, using this you could either use multiple or > carve up the big alloc into virtual nodes by assigning different parts > to different ms groups. > > Example: suppose you want to create a 2 node guest with 8 vcpus, create > 2 ms groups, each with 4 vcpu threads and assign half the total guest > mmap to either. > > > In the past, Anthony has said that NUMA binding should be done from outside > > of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) > > If you want to expose a sense of virtual NUMA to your guest you really > have no choice there. The only thing you can do externally is run whole > VMs inside one particular node. > > > Though that was in a different context, maybe we should revisit that > > and see if QEMU still sticks to that position. I know it's a bit early, but if needed > > we should ask Peter to consider extending ms_mbind() to take a tid parameter > > too instead of working on the current task by default. > > Uh, what for? ms_mbind() works on the current process, not task. In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use these GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to the existing mbind() to allow binding memory of one process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, it is needed; otherwise what you have should be sufficient. Regards, Bharata.
On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: > > In the original post of this mail thread, I proposed a way to export > guest RAM ranges (Guest Physical Address - GPA) and their corresponding > host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). > The idea was to use these GPA to HVA mappings from tools like libvirt to bind > specific parts of the guest RAM to different host nodes. This needed an > extension to the existing mbind() to allow binding memory of one process (QEMU) from a > different process (libvirt). This was needed since we wanted to do all this from > libvirt. > > Hence I was coming from that background when I asked for extending > ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA > binding should all be done from outside of QEMU, it is needed; otherwise > what you have should be sufficient. That's just retarded, and no you won't get such extensions. Poking at another process's virtual address space is just daft. Esp. if there's no actual reason for it. Furthermore, it would make libvirt a required part of qemu, and since I don't think I've ever used libvirt that's another reason to object, I don't need that stinking mess.
On 11/21/2011 05:25 PM, Peter Zijlstra wrote: > On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: > > > I looked at Peter's recent work in this area. > > (https://lkml.org/lkml/2011/11/17/204) > > > > It introduces two interfaces: > > > > 1. ms_tbind() to bind a thread to a memsched(*) group > > 2. ms_mbind() to bind a memory region to a memsched group > > > > I assume the 2nd interface could be used by QEMU to create > > memsched groups for each of the guest NUMA node memory regions. > > No, you would need both, you'll need to group vcpu threads _and_ some > vaddress space together. > > I understood QEMU currently uses a single big anonymous mmap() to > allocate the guest memory, using this you could either use multiple or > carve up the big alloc into virtual nodes by assigning different parts > to different ms groups. > > Example: suppose you want to create a 2 node guest with 8 vcpus, create > 2 ms groups, each with 4 vcpu threads and assign half the total guest > mmap to either. > Does ms_mbind() require that the vmas in its area be completely contained in the region, or does it split vmas on demand? I suggest the latter to avoid exposing implementation details.
On Mon, 2011-11-21 at 20:03 +0200, Avi Kivity wrote: > > Does ms_mbind() require that its vmas in its area be completely > contained in the region, or does it split vmas on demand? I suggest the > latter to avoid exposing implementation details. as implemented (which is still rather incomplete) it does the split on demand like all other memory interfaces.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: > > > > In the original post of this mail thread, I proposed a way to export > > guest RAM ranges (Guest Physical Address - GPA) and their corresponding > > host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). > > The idea was to use these GPA to HVA mappings from tools like libvirt to bind > > specific parts of the guest RAM to different host nodes. This needed an > > extension to the existing mbind() to allow binding memory of one process (QEMU) from a > > different process (libvirt). This was needed since we wanted to do all this from > > libvirt. > > > > Hence I was coming from that background when I asked for extending > > ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA > > binding should all be done from outside of QEMU, it is needed; otherwise > > what you have should be sufficient. > > That's just retarded, and no you won't get such extensions. Poking at > another process's virtual address space is just daft. Esp. if there's no > actual reason for it. Need to separate the binding vs the policy mgmt. The policy mgmt could still be done outside, whereas the binding could still be done from w/in QEMU. A simple monitor interface to rebalance vcpu memory allocations to different nodes could very well schedule vcpu thread work in QEMU. So, I agree, even if there is some external policy mgmt, it could still easily work w/ QEMU to use Peter's proposed interface. thanks, -chris
On 11/21/2011 11:03 AM, Peter Zijlstra wrote: > On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: >> >> In the original post of this mail thread, I proposed a way to export >> guest RAM ranges (Guest Physical Address - GPA) and their corresponding >> host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). >> The idea was to use these GPA to HVA mappings from tools like libvirt to bind >> specific parts of the guest RAM to different host nodes. This needed an >> extension to the existing mbind() to allow binding memory of one process (QEMU) from a >> different process (libvirt). This was needed since we wanted to do all this from >> libvirt. >> >> Hence I was coming from that background when I asked for extending >> ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA >> binding should all be done from outside of QEMU, it is needed; otherwise >> what you have should be sufficient. > > That's just retarded, and no you won't get such extensions. Poking at > another process's virtual address space is just daft. Esp. if there's no > actual reason for it. Yes, that would be a terrible interface. Fundamentally, the entity that should be deciding what memory should be present and where it should be located is the kernel. I'm fundamentally opposed to trying to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU. From what I can tell about ms_mbind(), it just uses process knowledge to bind specific areas of memory to a memsched group and lets the kernel decide what to do with that knowledge. This is exactly the type of interface that QEMU should be using. QEMU should tell the kernel enough information such that the kernel can make good decisions. QEMU should not be the one making the decisions. It looks like ms_mbind() takes a flags argument which I assume takes the same flags as mbind(). The current implementation ignores flags and just uses MPOL_BIND. I would hope that the flags argument would only be treated as advisory by the kernel. Regards, Anthony Liguori > > Furthermore, it would make libvirt a required part of qemu, and since I > don't think I've ever used libvirt that's another reason to object, I > don't need that stinking mess. >
On 11/21/2011 04:50 PM, Chris Wright wrote: > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: >> On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: >>> >>> In the original post of this mail thread, I proposed a way to export >>> guest RAM ranges (Guest Physical Address - GPA) and their corresponding >>> host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). >>> The idea was to use these GPA to HVA mappings from tools like libvirt to bind >>> specific parts of the guest RAM to different host nodes. This needed an >>> extension to the existing mbind() to allow binding memory of one process (QEMU) from a >>> different process (libvirt). This was needed since we wanted to do all this from >>> libvirt. >>> >>> Hence I was coming from that background when I asked for extending >>> ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA >>> binding should all be done from outside of QEMU, it is needed; otherwise >>> what you have should be sufficient. >> >> That's just retarded, and no you won't get such extensions. Poking at >> another process's virtual address space is just daft. Esp. if there's no >> actual reason for it. > > Need to separate the binding vs the policy mgmt. The policy mgmt could > still be done outside, whereas the binding could still be done from w/in > QEMU. A simple monitor interface to rebalance vcpu memory allocations > to different nodes could very well schedule vcpu thread work in QEMU. I really would prefer to avoid having such an interface. It's a shotgun that will only result in many poor feet being maimed. I can't tell you the number of times I've encountered people using CPU pinning when they have absolutely no business doing CPU pinning. If we really believe such an interface should exist, then the interface should really be from the kernel. Once we have memgroups, there's no reason to involve QEMU at all. QEMU can define the memgroups based on the NUMA nodes and then it's up to the kernel as to whether it exposes controls to explicitly bind memgroups within a process or not. Regards, Anthony Liguori > So, I agree, even if there is some external policy mgmt, it could still > easily work w/ QEMU to use Peter's proposed interface. > > thanks, > -chris >
Hi! On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote: > Fundamentally, the entity that should be deciding what memory should be present > and where it should be located is the kernel. I'm fundamentally opposed to trying > to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU. > > From what I can tell about ms_mbind(), it just uses process knowledge to bind > specific areas of memory to a memsched group and lets the kernel decide what to > do with that knowledge. This is exactly the type of interface that QEMU should > be using. > > QEMU should tell the kernel enough information such that the kernel can make > good decisions. QEMU should not be the one making the decisions. True, QEMU won't have to decide where the memory and vcpus should be located (but hey it wouldn't need to decide that even if you use cpusets, you can use relative mbind with cpusets, the admin or a cpuset job scheduler could decide) but it's still QEMU making the decision of what memory and which vcpu threads to ms_mbind/ms_tbind. Think how you're going to create the input of those syscalls... If it wasn't qemu to decide that, qemu wouldn't be required to scan the whole host physical numa (cpu/memory) topology in order to create the "input" arguments of "ms_mbind/ms_tbind". And when you migrate the VM to another host, the whole vtopology may be counter-productive because the kernel isn't automatically detecting the numa affinity between threads and the guest vtopology will stick to whatever numa _physical_ topology was seen on the first node where the VM was created. I doubt that the assumption that all cloud nodes will have the same physical numa topology is reasonable. Furthermore, to get the same benefits that qemu gets on the host by using ms_mbind/ms_tbind, every single guest application should be modified to scan the guest vtopology (a minimal example of such a scan follows this message) and call ms_mbind/ms_tbind too (or use the hard bindings, which is what we try to avoid). I think it's unreasonable to expect all applications to use ms_mbind/ms_tbind in the guest; at best guest apps will use cpusets or wrappers, and few apps will be modified for sys_ms_tbind/mbind. You can always have the supercomputer case with just one app that is optimized and a single VM spanning the whole host, but in that scenario hard bindings would work perfectly too. In my view the trouble with the numa hard bindings is not the fact that they're hard and qemu has to also decide the location (in fact it doesn't need to decide the location if you use cpusets and relative mbinds). The bigger problem is the fact that either the admin or the app developer has to explicitly scan the numa physical topology (both cpus and memory) and tell the kernel how much memory to bind to each thread. ms_mbind/ms_tbind only partially solve that problem. They're similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't need an admin or a cpuset-job-scheduler (or a perl script) to redistribute the hardware resources. Now dealing with bindings isn't a big deal for qemu, in fact this API is pretty much ideal for qemu, but it won't make life substantially easier compared to hard bindings. Simply the management code that is now done with a perl script will have to be moved into the kernel. It looks like an incremental improvement compared to the relative mbind+cpuset, but I'm unsure if it's the best we could aim for and what we really need in virt considering we deal with VM migration too.
The real long term design to me is not to add more syscalls, but initially to handle the case of a process/VM spanning not more than one node in thread count and amount of memory. That's not too hard and in fact I've benchmarks for the scheduler already showing it to work pretty well (it's creating a too strict affinity but it can be relaxed to be more useful). Then later add some mechanism (simplest is the page fault at low frequency) to create a guest_vcpu_thread<->host_memory affinity and have a paravirtualized interface that tells the guest scheduler to group CPUs. If the guest scheduler runs free and is allowed to move threads randomly without any paravirtualized interface that controls the CPU thread migration in the guest scheduler, the thread<->memory affinity on the host will be hopeless. But a paravirtualized interface to make a guest thread stick to vcpu0/1/2/3 and not go into vcpu4/5/6/7 will allow creating a more meaningful guest_thread<->physical_ram affinity on the host through KVM page faults. And then this will work also with VM migration and without having to create a vtopology in the guest. And for apps running in the guest no paravirt will be needed of course. The reason paravirt would be needed for qemu-kvm with a fully automatic thread<->memory affinity is that the vcpu threads are magic. What runs in the vcpu thread are guest threads. And those can move through the guest CPU scheduler from vcpu0 to vcpu7. If that happens and we've got 4 physical cpus for each physical node, any affinity we measure in the host will be meaningless. Normal threads using NPTL won't behave like that. Maybe some other thread library could have a "scheduler" inside that would make it behave like a vcpu thread (it's one thread really with several threads inside) but those existed mostly to simulate multiple threads in a single thread so they don't matter. And in this respect sys_tbind also requires the tid to have meaningful memory affinity. sys_tbind/mbind gets away with it by creating a vtopology in the guest, so the guest scheduler would then follow the vtopology (but the vtopology breaks across VM migration and to really be followed well with sys_mbind/tbind it'd require all apps to be modified). Grouping guest threads to stick to some vcpus sounds immensely simpler than changing the whole guest vtopology at runtime, which would involve changing the memory layout too. NOTE: the paravirt cpu grouping interface would also handle the case of 3 guests of 2.5G on an 8G host (4G per node). One of the three guests will have memory spanning over two nodes, and the guest vtopology created by sys_mbind/tbind can't handle it. While paravirt cpu grouping and automatic thread<->memory affinity on the host will handle it, like it will handle VM migration across nodes with different physical topology. The problem is that to create a thread<->memory affinity we'll have to issue some page faults in KVM in the background. How harmful that is I don't know at this point. So the fully automatic thread<->memory affinity is a bit of a vapourware concept at this point (process<->memory affinity seems to work already though). But Peter's migration code was driven by page faults already (not included in the patch he posted) and the other patch that exists, called migrate-on-fault, also depended on page faults. So I am optimistic we could have a thread<->memory affinity working too in the longer term. The plan would be to run them at low frequency and only if we can't fit a process into one node (in terms of both number of threads and memory).
If the process fits in one node, we wouldn't even need any page fault and the information in the pagetables will be enough to make the best decision. The downside is that it is significantly more difficult to implement the thread<->memory affinity. And that's why I'm focusing initially on the simpler case of considering only the process<->memory affinity. That's fairly easy. So for the time being this incremental improvement may be justified: it moves the logic from a perl script to the kernel, but I'm just skeptical it provides a big advantage compared to the numa bindings we already have in the kernel, especially if in the long term we can get rid of a vtopology completely. The vtopology in the guest may seem appealing, as it solves the problem when you use bindings everywhere (be they hard bindings, or cpuset relative bindings, or the dynamic sys_mbind/tbind). But there is not much hope to alter the vtopology at runtime, so when a guest must be split across two nodes (3 VMs of 2.5G RAM running on an 8G host with two 4G nodes) or moved through VM migration across different cloud nodes, I think the vtopology is trouble and would best be avoided. The memory side of the vtopology is absolute trouble if it doesn't match the host physical topology exactly.
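To illustrate what Andrea means by applications having to "scan the vtopology": the discovery side already exists in libnuma and reads the same /sys interface whether the topology it sees is physical (on the host) or virtual (inside a guest). A minimal sketch using the real libnuma API, error handling omitted; link with -lnuma:

```c
/*
 * Enumerate the NUMA topology visible to this process via libnuma,
 * which reads /sys/devices/system/node: the same code sees the
 * physical topology on the host and the vtopology inside a guest.
 * Real API; sketch only, no error handling.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;                          /* no NUMA support visible */

    for (int node = 0; node <= numa_max_node(); node++) {
        long long free_b;
        long long size = numa_node_size64(node, &free_b);
        struct bitmask *cpus = numa_allocate_cpumask();

        numa_node_to_cpus(node, cpus);
        printf("node %d: %lld bytes (%lld free), cpus:", node, size, free_b);
        for (unsigned long cpu = 0; cpu < cpus->size; cpu++)
            if (numa_bitmask_isbitset(cpus, cpu))
                printf(" %lu", cpu);
        printf("\n");
        numa_free_cpumask(cpus);
    }
    return 0;
}
```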
On 11/23/2011 04:03 PM, Andrea Arcangeli wrote: > Hi! > > On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote: >> Fundamentally, the entity that should be deciding what memory should be present >> and where it should located is the kernel. I'm fundamentally opposed to trying >> to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU. >> >> From what I can tell about ms_mbind(), it just uses process knowledge to bind >> specific areas of memory to a memsched group and let's the kernel decide what to >> do with that knowledge. This is exactly the type of interface that QEMU should >> be using. >> >> QEMU should tell the kernel enough information such that the kernel can make >> good decisions. QEMU should not be the one making the decisions. > True, QEMU won't have to decide where the memory and vcpus should be > located (but hey it wouldn't need to decide that even if you use > cpusets, you can use relative mbind with cpusets, the admin or a > cpuset job scheduler could decide) but it's still QEMU making the > decision of what memory and which vcpus threads to > ms_mbind/ms_tbind. Think how you're going to create the input of those > syscalls... > > If it wasn't qemu to decide that, qemu wouldn't be required to scan > the whole host physical numa (cpu/memory) topology in order to create > the "input" arguments of "ms_mbind/ms_tbind". And when you migrate the > VM to another host, the whole vtopology may be counter-productive > because the kernel isn't automatically detecting the numa affinity > between threads and the guest vtopology will stick to whatever numa > _physical_ topology that was seen on the first node where the VM was > created. > > I doubt that the assumption that all cloud nodes will have the same > physical numa topology is reasonable. > > Furthermore to get the same benefits that qemu gets on host by using > ms_mbind/ms_tbind, every single guest application should be modified > to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the > hard bindings which is what we try to avoid). > > I think it's unreasonable to expect all applications to use > ms_mbind/ms_tbind in the guest, at best guest apps will use cpusets or > wrappers, few apps will be modified for sys_ms_tbind/mbind. > > You can always have the supercomputer case with just one app that is > optimized and a single VM spanning over the whole host, but in that > scenarios hard bindings would work perfectly too. > > In my view the trouble of the numa hard bindings is not the fact > they're hard and qemu has to also decide the location (in fact it > doesn't need to decide the location if you use cpusets and relative > mbinds). The bigger problem is the fact either the admin or the app > developer has to explicitly scan the numa physical topology (both cpus > and memory) and tell the kernel how much memory to bind to each > thread. ms_mbind/ms_tbind only partially solve that problem. They're > similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you > don't need an admin or a cpuset-job-scheduler (or a perl script) to > redistribute the hardware resources. Well yeah, of course the guest needs to see some topology. I don't see why we'd have to actually scan the host for this though. All we need to tell the kernel is "this memory region is close to that thread". So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to tell the kernel that this GB of RAM actually is close to that vCPU thread. 
Of course the admin still needs to decide how to split up memory. That's the deal with emulating real hardware. You get the interfaces hardware gets :). However, if you follow a reasonable default strategy such as NUMA-splitting your RAM into equal chunks between guest vCPUs you're probably close enough to optimal usage models. Or at least you could have a close enough approximation of how this mapping could work for the _guest_ regardless of the host, and when you migrate it somewhere else it should also work reasonably well. > Now dealing with bindings isn't a big deal for qemu, in fact this API is > pretty much ideal for qemu, but it won't make life substantially > easier compared to hard bindings. Simply the management code > that is now done with a perl script will have to be moved into the > kernel. It looks like an incremental improvement compared to the relative > mbind+cpuset, but I'm unsure if it's the best we could aim for and > what we really need in virt considering we deal with VM migration too. > > The real long term design to me is not to add more syscalls, but > initially to handle the case of a process/VM spanning not more than one > node in thread count and amount of memory. That's not too hard and in > fact I've benchmarks for the scheduler already showing it to work > pretty well (it's creating a too strict affinity but it can be relaxed > to be more useful). Then later add some mechanism (simplest is the > page fault at low frequency) to create a > guest_vcpu_thread<->host_memory affinity and have a paravirtualized > interface that tells the guest scheduler to group CPUs. > > If the guest scheduler runs free and is allowed to move threads > randomly without any paravirtualized interface that controls the CPU > thread migration in the guest scheduler, the thread<->memory affinity > on the host will be hopeless. But a paravirtualized interface to make > a guest thread stick to vcpu0/1/2/3 and not go into vcpu4/5/6/7 > will allow creating a more meaningful guest_thread<->physical_ram > affinity on the host through KVM page faults. And then this will work also > with VM migration and without having to create a vtopology in the guest. So you want to basically dynamically create NUMA topologies from the runtime behavior of the guest? What if it changes over time? > And for apps running in the guest no paravirt will be needed of course. > > The reason paravirt would be needed for qemu-kvm with a fully automatic > thread<->memory affinity is that the vcpu threads are magic. What runs > in the vcpu thread are guest threads. And those can move through the > guest CPU scheduler from vcpu0 to vcpu7. If that happens and we've got 4 > physical cpus for each physical node, any affinity we measure in the > host will be meaningless. Normal threads using NPTL won't behave like > that. Maybe some other thread library could have a "scheduler" inside > that would make it behave like a vcpu thread (it's one thread really > with several threads inside) but those existed mostly to simulate > multiple threads in a single thread so they don't matter. And in this > respect sys_tbind also requires the tid to have meaningful memory > affinity. sys_tbind/mbind gets away with it by creating a vtopology in > the guest, so the guest scheduler would then follow the vtopology (but > the vtopology breaks across VM migration and to really be followed well > with sys_mbind/tbind it'd require all apps to be modified).
> > Grouping guest threads to stick to some vcpus sounds immensely > simpler than changing the whole guest vtopology at runtime, which would > involve changing the memory layout too. > > NOTE: the paravirt cpu grouping interface would also handle the case > of 3 guests of 2.5G on an 8G host (4G per node). One of the three > guests will have memory spanning over two nodes, and the guest > vtopology created by sys_mbind/tbind can't handle it. While paravirt > cpu grouping and automatic thread<->memory affinity on the host will > handle it, like it will handle VM migration across nodes with > different physical topology. The problem is that to create a > thread<->memory affinity we'll have to issue some page faults in KVM in > the background. How harmful that is I don't know at this point. So the > fully automatic thread<->memory affinity is a bit of a vapourware > concept at this point (process<->memory affinity seems to work already > though). > > But Peter's migration code was driven by page faults already (not > included in the patch he posted) and the other patch that exists, > called migrate-on-fault, also depended on page faults. So I am > optimistic we could have a thread<->memory affinity working too in the > longer term. The plan would be to run them at low frequency and only > if we can't fit a process into one node (in terms of both number of > threads and memory). If the process fits in one node, we wouldn't even > need any page fault and the information in the pagetables will be > enough to make the best decision. The downside is that it is significantly more > difficult to implement the thread<->memory affinity. And that's why > I'm focusing initially on the simpler case of considering only the > process<->memory affinity. That's fairly easy. > > So for the time being this incremental improvement may be justified: > it moves the logic from a perl script to the kernel, but I'm just > skeptical it provides a big advantage compared to the numa bindings we > already have in the kernel, especially if in the long term we can get > rid of a vtopology completely. I actually like the idea of just telling the kernel how close memory will be to a thread. Sure, you can handle this basically by shoving your scheduler into user space, but isn't managing processes what a kernel is supposed to do in the first place? You can always argue for a microkernel, but having a scheduler in user space (perl script) and another one in the kernel doesn't sound very appealing to me. If you want to go full-on user space, sure, I can see why :). Either way, your approach sounds like it is still very much in the concept phase, while this is more something that can actually be tested and benchmarked against today. So yes, I want the interim solution - just in case your plan doesn't work out :). Oh, and then there are the non-PV guests too... Alex
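Connecting Alex's "-numa node,mem=1G,cpus=0" example to Peter's interface: at startup QEMU would only have to walk its parsed -numa table and hand each (memory chunk, vcpu set) pair to one ms group. A speculative sketch reusing the assumed ms_mbind()/ms_tbind() wrappers from the earlier example; node_mem[], node_cpumask[] and nb_numa_nodes are stand-ins loosely modeled on QEMU's -numa bookkeeping, not its actual code:

```c
/*
 * Speculative glue between QEMU's -numa bookkeeping and the RFC ms_*
 * interface, reusing the assumed ms_mbind()/ms_tbind() wrappers from
 * the earlier sketch. node_mem[], node_cpumask[] and nb_numa_nodes are
 * stand-ins for whatever the -numa option parser produced; none of
 * this is real QEMU code.
 */
extern unsigned long node_mem[];       /* bytes of RAM per guest node */
extern unsigned long node_cpumask[];   /* vcpu bitmap per guest node */
extern int nb_numa_nodes;

static void numa_to_ms_groups(char *guest_ram)
{
    unsigned long offset = 0;

    for (int g = 0; g < nb_numa_nodes; g++) {
        /* One ms group per guest node: its RAM chunk... */
        ms_mbind(g, guest_ram + offset, node_mem[g]);
        offset += node_mem[g];
        /* ...and each vcpu thread whose bit is set in node_cpumask[g]
         * calls ms_tbind(g) from its own context once created. */
    }
}
```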
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote: > So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to > tell the kernel that this GB of RAM actually is close to that vCPU thread. > Of course the admin still needs to decide how to split up memory. That's > the deal with emulating real hardware. You get the interfaces hardware > gets :). However, if you follow a reasonable default strategy such as The problem is how you decide the parameter "-numa node,mem=1G,cpus=0". Real hardware exists when the VM starts. But then the VM can be migrated. Or the VM may have to be split in the middle of two nodes regardless of the "-numa node,mem=1G,cpus=0-1" to avoid swapping, so there may be two 512M nodes with 1 cpu each instead of 1 NUMA node with 1G and 2 cpus. Especially by relaxing the hard bindings and using ms_mbind/tbind, the vtopology you create won't match real hardware because you don't know the real hardware that you will get. > NUMA-splitting your RAM into equal chunks between guest vCPUs you're > probably close enough to optimal usage models. Or at least you could > have a close enough approximation of how this mapping could work for the > _guest_ regardless of the host and when you migrate it somewhere else it > should also work reasonably well. If you enforce these assumptions and the admin still has to choose the "-numa node,mem=1G" parameters after checking the physical numa topology, and make sure the vtopology can match the real physical topology and that the guest runs on "real hardware", it's not very different from using hard bindings; hard bindings enforce the "real hardware" so there's no way it can go wrong. I mean you still need some NUMA topology knowledge outside of QEMU to be sure you get "real hardware" out of the vtopology. Ok, cpusets would restrict the availability of idle cpus, so there's a slight improvement in maximizing idle CPU usage (it's better to run 50% slower than not to run at all), but that could be achieved also by relaxing the cpuset semantics (if that's not already available). > So you want to basically dynamically create NUMA topologies from the > runtime behavior of the guest? What if it changes over time? Yes, but I wouldn't call them NUMA topologies or it looks like a vtopology, and a vtopology is something fixed at boot time, the sort of thing created by using a command line like "-numa node,mem=1G,cpu=0". I wouldn't try to give the guest any "memory" topology; it's just that the vcpus are magic threads that don't behave like normal threads in memory affinity terms. So they need a paravirtualization layer to be dealt with. The fact that vcpu0 accessed 10 pages right now doesn't mean there's a real affinity between vcpu0 and those N pages if the guest scheduler is free to migrate anything anywhere. The guest thread running on vcpu0 may be migrated to vcpu7, which may belong to a different physical node. So if we want to automatically detect thread<->memory affinity between vcpus and guest memory, we also need to group the guest threads on certain vcpus and prevent those cpu migrations. The thread in the guest would better stick to vcpu0/1/2/3 (instead of migrating to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with the same memory which fits in one node. That can only be told dynamically from KVM to the guest OS scheduler, as we may migrate virtual machines or we may move the memory. Take the example of 3 VMs of 2.5G RAM each on an 8G system with 2 nodes (4G per node).
Suppose one of the two VMs that has all its 2.5G allocated in a single node quits. Then the VM that was split across the two nodes will be "memory-migrated" to fit in one node. So far so good, but then KVM should tell the guest OS scheduler that it should stop grouping vcpus, that all vcpus are equal and that all guest threads can be migrated to any vcpu. I don't see a way to do those things with a vtopology fixed at boot. > I actually like the idea of just telling the kernel how close memory > will be to a thread. Sure, you can handle this basically by shoving your > scheduler into user space, but isn't managing processes what a kernel is > supposed to do in the first place? Assume you're not in virt and you just want to tell the kernel that thread A uses memory range A and thread B uses memory range B. If memory range A fits in one node you're ok. But if "memory A" now spans over two nodes (maybe to avoid swapping), you're still screwed and you won't give enough information to the kernel on the real runtime affinity that "thread A" has on the memory. Now if statistically the accesses to "memory A" are all equal, it won't make a difference, but if you end up using half of "memory A" 99% of the time, it will not work as well. This is especially a problem for KVM because statistically the accesses to "memory A" given to vcpu0 won't be equal. 50% of it may not be used at all and just have pagecache sitting there, or even free memory, so we can do better if "memory A" is split across two nodes to avoid swapping, if we detect the vcpu<->memory affinity dynamically. > You can always argue for a microkernel, but having a scheduler in user > space (perl script) and another one in the kernel doesn't sound very > appealing to me. If you want to go full-on user space, sure, I can see > why :). > > Either way, your approach sounds like it is still very much in the concept phase, > while this is more something that can actually be tested and benchmarked The thread<->memory affinity is in the concept phase, but the process<->memory affinity already runs and in benchmarks it already performs almost as well as hard bindings. It has the cost of a knumad daemon scanning the memory in the background, but that's cheap, not even comparable to something like KSM. It's comparable to the khugepaged overhead, which is orders of magnitude lower, and considering those are big systems with many CPUs I don't think it's a big deal. Once process<->memory affinity works well, if we go into thread<->memory affinity we'll have to tweak knumad to trigger page faults to give us per-thread information on the memory affinity. Also I'm only working on anonymous memory right now; maybe it should be extended to other types of memory, handling the case of the memory being shared by entities running on different nodes by not touching it in that case, while if the pagecache is used by just one thread (or process initially) it could still migrate it. For readonly shared memory, duplicating it per-node is the way to go, but I'm not going in that direction as it's not useful for virt. It remains a possibility for the future. > against today. So yes, I want the interim solution - just in case your > plan doesn't work out :). Oh, and then there are the non-PV guests too... Actually to me it looks like the code misses the memory affinity and migration, so I'm not sure how much you can run benchmarks on it yet. It seems to tweak the scheduler though.
I don't mean ms_tbind/mbind are a bad idea: they allow moving the migration now invoked by a perl script into the kernel, but I'm not satisfied with the trouble that creating a vtopology still gives us (a vtopology only makes sense if it matches the "real hardware", and as said above we don't always have real hardware, and if you enforce real hardware you're pretty close to using hard bindings, except it will be the kernel doing the migration of cpus and memory instead of those being invoked by userland, and it also allows maximizing usage of the idle CPUs). But even after you create a vtopology in the guest and you make sure it won't split across nodes so that the vtopology runs on the "real hardware" it has been created for, it won't help much if all userland apps in the guest aren't also modified to use ms_mbind/ms_tbind, which I don't see happening any time soon. You could still run knumad in the guest, to take advantage of the vtopology without having to modify guest apps though. But if knumad would run on the host (if we can solve the thread<->memory affinity) there would be no need for a vtopology in the guest in the first place. I'm positive and I've a proof of concept that knumad works for process<->memory affinity, but considering the automigration code isn't complete (it doesn't even yet migrate THP without splitting them) I'm not yet delving into the complications of the thread affinity.
On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote: > On 11/23/2011 04:03 PM, Andrea Arcangeli wrote: > >Hi! > > > > > >In my view the trouble with the numa hard bindings is not the fact > >that they're hard and qemu has to also decide the location (in fact it > >doesn't need to decide the location if you use cpusets and relative > >mbinds). The bigger problem is the fact that either the admin or the app > >developer has to explicitly scan the numa physical topology (both cpus > >and memory) and tell the kernel how much memory to bind to each > >thread. ms_mbind/ms_tbind only partially solve that problem. They're > >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you > >don't need an admin or a cpuset-job-scheduler (or a perl script) to > >redistribute the hardware resources. > > Well yeah, of course the guest needs to see some topology. I don't > see why we'd have to actually scan the host for this though. All we > need to tell the kernel is "this memory region is close to that > thread". > > So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able > to tell the kernel that this GB of RAM actually is close to that > vCPU thread. > > Of course the admin still needs to decide how to split up memory. > That's the deal with emulating real hardware. You get the interfaces > hardware gets :). However, if you follow a reasonable default > strategy such as NUMA-splitting your RAM into equal chunks between > guest vCPUs you're probably close enough to optimal usage models. Or > at least you could have a close enough approximation of how this > mapping could work for the _guest_ regardless of the host and when > you migrate it somewhere else it should also work reasonably well. Allowing specification of the numa nodes to qemu, allowing qemu to create cpu+mem groupings (without binding) and letting the kernel decide how to manage them seems like a reasonable incremental step between no guest/host NUMA awareness and automatic NUMA configuration in the host kernel. It would suffice for the current needs we see. Besides migration, we also have use cases where we may want to have large multi-node VMs that are static (like LPARs); having the guest aware of the topology is helpful there. Also, if the topology changes at all due to migration or host kernel decisions, we can make use of something like the VPHN (virtual processor home node) capability on Power systems to have the guest kernel update its topology knowledge. You can refer to that in arch/powerpc/mm/numa.c. Otherwise, as long as the host kernel maintains the mappings requested by ms_tbind()/ms_mbind(), we can create the guest topology correctly and optimize for NUMA. This would work for us. Thanks Dipankar
On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote: > > Also, if at all topology changes due to migration or host kernel decisions, > we can make use of something like VPHN (virtual processor home node) > capability on Power systems to have guest kernel update its topology > knowledge. You can refer to that in > arch/powerpc/mm/numa.c. I think that fail^Wfeature of PPC is terminally broken. You simply cannot change the topology after the fact.
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote: > On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote: > > > > Also, if the topology changes at all due to migration or host kernel decisions, > > we can make use of something like the VPHN (virtual processor home node) > > capability on Power systems to have the guest kernel update its topology > > knowledge. You can refer to that in > > arch/powerpc/mm/numa.c. > > I think that fail^Wfeature of PPC is terminally broken. You simply > cannot change the topology after the fact. Agreed, there are too many things that consult topology once and never look back.
On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote: > create the guest topology correctly and optimize for NUMA. This > would work for us. Even in the case of 1 guest that fits in one node, you're not going to max out the full bandwidth of all memory channels with this. All qemu can do with ms_mbind/tbind is create a vtopology that matches the hardware topology. It has these limits: 1) it requires all userland applications to be modified to scan either the physical topology if run on the host, or the vtopology if run in the guest, to get the full benefit. 2) it breaks across live migration if the host physical topology changes 3) 1 small guest on an idle numa system that fits in one numa node won't tell enough information to the host kernel 4) if used outside of qemu and one thread allocates more memory than what fits in one node, it won't tell enough info to the host kernel. About 3): if you've just one guest that fits in one node, each vcpu should probably be spread across all the nodes and behave like MPOL_INTERLEAVE. Even if the guest CPU scheduler migrates guest processes in reverse, the global memory bandwidth will still be fully used even if they will both access remote memory. I've just seen benchmarks where no pinning runs more than _twice_ as fast as pinning with just 1 guest and only 10 vcpu threads, probably because of that. About 4): even if the thread scans the numa topology it won't be able to tell enough info to the kernel for it to know which parts of the memory may be used more or less (ok, it may be possible to call mbind and vary it at runtime, but it adds even more complexity for the programmer). If the vcpu is free to go to any node, and we've an automatic vcpu<->memory affinity, then the memory will follow the vcpu. And the scheduler domains should already optimize for maxing out the full memory bandwidth of all channels. Problems 1/2/3/4 apply to the hard bindings as well, not just to ms_mbind/tbind. In short it's an incremental step that moves some logic to the kernel, but I don't see it solving all situations optimally and it shares a lot of the limits of the hard bindings.
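Andrea's point 3) in code form: for a small guest it can pay to interleave the guest RAM across all nodes so every memory channel is used, which needs nothing beyond today's mbind(2) with MPOL_INTERLEAVE. A minimal sketch using the real libnuma API (a real caller would check numa_available() first); link with -lnuma:

```c
/*
 * Interleave guest RAM across every node the process may allocate on,
 * the MPOL_INTERLEAVE behaviour Andrea describes for a small guest on
 * an otherwise idle NUMA host. Real API; sketch only, no error checks.
 */
#include <numa.h>      /* numa_all_nodes_ptr, struct bitmask */
#include <numaif.h>    /* mbind(), MPOL_INTERLEAVE */

static long interleave_guest_ram(void *hva, unsigned long len)
{
    struct bitmask *nodes = numa_all_nodes_ptr;

    /* Spread [hva, hva+len) round-robin over all nodes with memory. */
    return mbind(hva, len, MPOL_INTERLEAVE,
                 nodes->maskp, nodes->size + 1, 0);
}
```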
On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote: > On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote: > > create the guest topology correctly and optimize for NUMA. This > > would work for us. > > Even in the case of 1 guest that fits in one node, you're not going to > max out the full bandwidth of all memory channels with this. > > All qemu can do with ms_mbind/tbind is create a vtopology that > matches the hardware topology. It has these limits: > > 1) it requires all userland applications to be modified to scan either > the physical topology if run on the host, or the vtopology if run in the > guest, to get the full benefit. Not sure why you would need that. qemu can reflect the topology based on -numa specifications and the corresponding ms_tbind/mbind in the FDT (in the case of Power; I guess ACPI tables for x86) and the guest kernel would detect this virtualized topology. So there is no need for two types of topologies afaics. It will all be reflected in /sys/devices/system/node in the guest. > > 2) it breaks across live migration if the host physical topology changes That is indeed an issue. Either VM placement software needs to be really smart to migrate VMs where they fit well or, more likely, we will have to find a way to make guest kernels aware of topology changes. But the latter has an impact on userspace as well for applications that might have optimized for NUMA. > 3) 1 small guest on an idle numa system that fits in one numa node won't > tell enough information to the host kernel > > 4) if used outside of qemu and one thread allocates more memory than > what fits in one node, it won't tell enough info to the host kernel. > > About 3): if you've just one guest that fits in one node, each vcpu > should probably be spread across all the nodes and behave like > MPOL_INTERLEAVE. Even if the guest CPU scheduler migrates guest processes in > reverse, the global memory bandwidth will still be fully used even if > they will both access remote memory. I've just seen benchmarks where > no pinning runs more than _twice_ as fast as pinning with just 1 > guest and only 10 vcpu threads, probably because of that. I agree. Specifying a NUMA topology for the guest can result in sub-optimal performance in some cases; it is a tradeoff. > In short it's an incremental step that moves some logic to the kernel, > but I don't see it solving all situations optimally and it shares a > lot of the limits of the hard bindings. Agreed. Thanks Dipankar
On Thu, Dec 01, 2011 at 10:55:20PM +0530, Dipankar Sarma wrote:
> On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote:
> > On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> > > create the guest topology correctly and optimize for NUMA. This
> > > would work for us.
> >
> > Even in the case of 1 guest that fits in one node, you're not going
> > to max out the full bandwidth of all memory channels with this.
> >
> > All qemu can do with ms_mbind/tbind is to create a vtopology that
> > matches the hardware topology. It has these limits:
> >
> > 1) it requires all userland applications to be modified to scan
> >    either the physical topology if run on host, or the vtopology if
> >    run on guest, to get the full benefit.
>
> Not sure why you would need that. qemu can reflect the topology based
> on the -numa specifications and the corresponding ms_tbind/mbind in
> the FDT (in the case of Power; I guess ACPI tables for x86), and the
> guest kernel would detect this virtualized topology. So there is no
> need for two types of topologies afaics. It will all be reflected in
> /sys/devices/system/node in the guest.

The point is: what does a vtopology give you if you don't modify all
the apps running in the guest to use it? A vtopology in the guest helps
exactly like the topology on the host -> very little, unless you modify
qemu on the host to use ms_tbind/mbind.

> > 2) it breaks across live migration if the host physical topology
> >    changes
>
> That is indeed an issue. Either the VM placement software needs to be
> really smart and migrate VMs that fit well or, more likely, we will
> have to find a way to make guest kernels aware of topology changes.
> But the latter has an impact on userspace as well, for applications
> that might have optimized for NUMA.

Making the guest kernel aware of "memory" topology changes is going to
be a whole mess. Or at least harder than memory hotplug.

> I agree. Specifying a NUMA topology for the guest can result in
> sub-optimal performance in some cases; it is a tradeoff.

I see it more as a limit of this solution, one it shares with the hard
bindings, than as a tradeoff.

> Agreed.

Yep, I just wanted to make clear that the limits remain with this
solution. I'll try to teach knumad to detect thread<->memory affinity
too with some logic; we'll see how well that can work.
On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:
> Hi!
>
> On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> > Fundamentally, the entity that should be deciding what memory should
> > be present and where it should be located is the kernel. I'm
> > fundamentally opposed to trying to make QEMU override the
> > scheduler/mm by using cpu or memory pinning in QEMU.
> >
> > From what I can tell about ms_mbind(), it just uses process knowledge
> > to bind specific areas of memory to a memsched group and lets the
> > kernel decide what to do with that knowledge. This is exactly the
> > type of interface that QEMU should be using.
> >
> > QEMU should tell the kernel enough information such that the kernel
> > can make good decisions. QEMU should not be the one making the
> > decisions.
>
> True, QEMU won't have to decide where the memory and vcpus should be
> located (but hey, it wouldn't need to decide that even if you used
> cpusets; you can use relative mbind with cpusets, and the admin or a
> cpuset job scheduler could decide), but it's still QEMU making the
> decision of which memory and which vcpu threads to
> ms_mbind/ms_tbind. Think how you're going to create the input of those
> syscalls...
>
> If it wasn't qemu deciding that, qemu wouldn't be required to scan the
> whole host physical NUMA (cpu/memory) topology in order to create the
> "input" arguments of "ms_mbind/ms_tbind".

That's a plain falsehood: you don't need to scan the host physical
topology in order to create useful ms_[mt]bind arguments. You can use
the physical topology to optimize for particular hardware, but it's not
a strict requirement.

> And when you migrate the VM to another host, the whole vtopology may
> be counter-productive because the kernel isn't automatically detecting
> the numa affinity between threads, and the guest vtopology will stick
> to whatever numa _physical_ topology was seen on the first node where
> the VM was created.

This doesn't make any sense at all.

> I doubt that the assumption that all cloud nodes will have the same
> physical numa topology is reasonable.

So what? If you want to be very careful you can make sure your vnodes
are small enough that they fit on any physical node in your cloud (god
I f*king hate that word).

If you're slightly less careful, things will still work; you might get
less max parallelism, but typically (from what I understood) these VM
hosting thingies are overloaded, so you never get your max cpu anyway,
so who cares.

Thing is, whatever you set up, it will always work. It might not be
optimal, but the one guarantee, that [threads,vrange] will stay on the
same node, will be kept true no matter where you run it.

Also, migration between non-identical hosts is always 'tricky'. You're
always stuck with some minimally supported subset or average-case
thing. Really, why do you think NUMA would be any different?

> Furthermore, to get the same benefits that qemu gets on the host by
> using ms_mbind/ms_tbind, every single guest application should be
> modified to scan the guest vtopology and call ms_mbind/ms_tbind too
> (or use the hard bindings, which is what we try to avoid).

No! ms_[tm]bind() is just part of the solution; the other part is what
to do for simple programs, and like I wrote in my email earlier, and
what we talked about in Prague, for normal simple proglets we simply
pick a numa node and stick to it. Much like:

  http://home.arcor.de/efocht/sched/

Except we could actually migrate the whole thing if needed. Basically
you give each task its own vnode and assign all its threads to it.

Only big programs that need to span multiple nodes need to be modified
to get the best advantage of numa. But that has always been true.

> In my view the trouble with the numa hard bindings is not the fact
> they're hard and qemu has to also decide the location (in fact it
> doesn't need to decide the location if you use cpusets and relative
> mbinds). The bigger problem is that either the admin or the app
> developer has to explicitly scan the numa physical topology (both cpus
> and memory) and tell the kernel how much memory to bind to each
> thread. ms_mbind/ms_tbind only partially solve that problem. They're
> similar to mbind MPOL_F_RELATIVE_NODES with cpusets, except you don't
> need an admin or a cpuset-job-scheduler (or a perl script) to
> redistribute the hardware resources.

You're full of crap, Andrea.

Yes, you need some clue as to your actual topology, but that's life;
you can't get SMP for free either, you need to have some clue.

Just like with regular SMP, where you need to be aware of data sharing,
NUMA just makes it worse. If your app decomposes well enough to create
a vnode per thread, that's excellent; if you want to scale your app to
fit your machine, that's fine too. Heck, every multi-threaded app out
there worth using already queries machine topology one way or another;
it's not a big deal.

But cpusets and relative_nodes don't work: you still get your memory
splattered all over whatever nodes you allow, and the scheduler will
still move your task around based purely on cpu load. 0-win.

Not needing a (userspace) job-scheduler is a win, because that avoids
having everybody talk to this job-scheduler, and there are multiple
job-schedulers out there, two can't properly co-exist, etc. Also, the
kernel is the right place to do this.

[ this btw is true for all muddle-ware solutions: try and fit two
  applications together that are written against different but
  similar-purpose muddle-wares and shit will come apart quickly ]

> Now dealing with bindings isn't a big deal for qemu; in fact this API
> is pretty much ideal for qemu, but it won't make life substantially
> easier than hard bindings would. Simply the management code that is
> now done with a perl script will have to be moved into the kernel. It
> looks an incremental improvement compared to relative mbind+cpuset,
> but I'm unsure if it's the best we could aim for and what we really
> need in virt, considering we deal with VM migration too.

No. Virt is crap, it needs to die, it's horrid, and any solution aimed
squarely at virt only is shit and not worth considering, that simple.

If you want to help solve the NUMA issue, forget about virt and solve
it for the non-virt case.

> The real long-term design to me is not to add more syscalls, and
> initially to handle the case of a process/VM spanning not more than
> one node in thread count and amount of memory. That's not too hard,
> and in fact I have benchmarks for the scheduler already showing it to
> work pretty well (it creates a too-strict affinity, but it can be
> relaxed to be more useful). Then later add some mechanism (simplest is
> the page fault at low frequency) to create a
> guest_vcpu_thread<->host_memory affinity, and have a paravirtualized
> interface that tells the guest scheduler to group CPUs.

I bet you believe a compiler can solve all parallelization/concurrency
problems for you as well. Happy pipe dreaming for you. While you're at
it, I've heard this transactional-memory crap will solve all our
locking problems.

Concurrency is hard; applications need to know wtf they're doing if
they want to gain any efficiency by it.

> If the guest scheduler runs free and is allowed to move threads
> randomly without any paravirtualized interface that controls CPU
> thread migration in the guest scheduler, the thread<->memory affinity
> on the host will be hopeless. But with a paravirtualized interface to
> make a guest thread stick to vcpu0/1/2/3 and not go onto vcpu4/5/6/7,
> it will be possible to create a more meaningful
> guest_thread<->physical_ram affinity on the host through KVM page
> faults. And then this will also work with VM migration and without
> having to create a vtopology in the guest.

As a maintainer of the scheduler I can, with a fair degree of
certainty, say you'll never get such paravirt scheduler hooks.

Also, as much as I dislike the whole virt stuff, the whole premise of
virt is to 'emulate' real hardware. Real hardware does NUMA, therefore
it's not weird to also do vNUMA.

And yes, NUMA sucks eggs, and in fact not all hardware platforms expose
it; have a look at s390 for example. They stuck in huge caches and
pretend it doesn't exist. But for those that do, there's a performance
gain to play by its rules.

Furthermore, I've been told there is great interest in running
!paravirt kernels, so much so in fact that hardware emulation seems
more important than paravirt solutions.

Also, I really don't see how trying to establish thread:page relations
is in any way virt-related; why couldn't you do this in a host kernel?

From what I gather, what you propose is to periodically unmap all user
memory (or map it !r !w !x, which is effectively the same) and take the
fault. This fault establishes a thread:page relation. One can just use
that, or involve some history as well. Once you have this thread:page
relation set, you want to group them on the same node.

There are various problems with that. Firstly, of course, the overhead:
storing this thread:page relation set requires quite a lot of memory.
Secondly, I'm not quite sure I see how that works for threads that
share a working set. Suppose you have 4 threads and 2 working sets; how
do you make sure to keep the 2 groups together? I don't think that's
evident from the simple thread:page relation data [*]. Thirdly, I
immensely dislike all these background scanner things; they make it
very hard to account time to those who actually use it.

[ * I can only make that work if you're willing to store something like
    O(nr_pages * nr_threads) amount of data to correlate stuff, and
    that's not even the time needed to process it and make something
    useful out of it ]

> sys_tbind/mbind gets away with it by creating a vtopology in the
> guest, so the guest scheduler would then follow the vtopology (but
> vtopology breaks across VM migration, and to really be followed well
> with sys_mbind/tbind it'd require all apps to be modified).

Again, vtopology doesn't break with VM migration. It's perfectly
possible to create a vnode with 8 threads on hardware with only 2 cpus
per node. Your threads all get to share those 2 cpus, so it's not
ideal, but it does work.

> grouping guest threads to stick to some vcpus sounds immensely simpler
> than changing the whole guest vtopology at runtime, which would
> involve changing the memory layout too.

vtopology is stable for the entire duration of the guest. Some weird
people at IBM think it's doable to change the topology at runtime, but
I'd argue they're wrong.

> NOTE: the paravirt cpu grouping interface would also handle the case
> of 3 guests of 2.5G on an 8G host (4G per node). One of the three
> guests will have memory spanning over two nodes, and the guest
> vtopology created by sys_mbind/tbind can't handle it.

You could of course have created 3 guests with 2 nodes of 1.25G each.
You can always do stupid things; sys_[mt]bind doesn't pretend brains
aren't required.

<snip more stuff>
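For readers following along: ms_mbind()/ms_tbind() were only proposed
in this thread and never merged, so there are no real syscall numbers
or glibc wrappers. A sketch of the usage pattern under discussion, with
the syscall numbers and exact signatures assumed purely for
illustration:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: these syscalls were proposed, not merged. Numbers and
 * signatures below are assumptions, made only to show the pattern. */
#define __NR_ms_mbind  400   /* made-up number */
#define __NR_ms_tbind  401   /* made-up number */

static long ms_mbind(void *addr, unsigned long len, int vnode)
{
    return syscall(__NR_ms_mbind, addr, len, vnode);
}

static long ms_tbind(pid_t tid, int vnode)
{
    return syscall(__NR_ms_tbind, tid, vnode);
}

/* What a qemu-like process would do per guest node: tie the node's RAM
 * range and its vcpu threads to the same virtual node. The kernel, not
 * userspace, picks (and may later change) the physical node backing
 * the vnode, keeping the [threads, vrange] group together. */
static void bind_guest_node(void *node_ram, unsigned long size,
                            const pid_t *vcpu_tids, int nvcpus, int vnode)
{
    ms_mbind(node_ram, size, vnode);
    for (int i = 0; i < nvcpus; i++) {
        ms_tbind(vcpu_tids[i], vnode);
    }
}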
On Thu, Dec 01, 2011 at 06:36:23PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 01, 2011 at 10:55:20PM +0530, Dipankar Sarma wrote:

<snip>

> > Not sure why you would need that. qemu can reflect the topology
> > based on the -numa specifications and the corresponding
> > ms_tbind/mbind in the FDT (in the case of Power; I guess ACPI tables
> > for x86), and the guest kernel would detect this virtualized
> > topology. So there is no need for two types of topologies afaics. It
> > will all be reflected in /sys/devices/system/node in the guest.
>
> The point is: what does a vtopology give you if you don't modify all
> the apps running in the guest to use it? A vtopology in the guest
> helps exactly like the topology on the host -> very little, unless you
> modify qemu on the host to use ms_tbind/mbind.

Sure, ms_tbind/mbind will be needed in qemu. For the rest, NUMA-aware
apps already use the topology while running on physical systems, and
they wouldn't need modification for this kind of virtualized topology.

Thanks
Dipankar
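To illustrate that last point: the topology probing a NUMA-aware app
already does works unchanged against a vtopology, since libnuma simply
reads /sys/devices/system/node, virtualized or not. A minimal sketch
using the stock libnuma API (link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Inside a guest this reports the vtopology built from the
     * virtualized ACPI/FDT tables; on bare metal, the real one.
     * Either way, the application code is identical. */
    int maxnode = numa_max_node();
    for (int node = 0; node <= maxnode; node++) {
        long long free_bytes;
        long long size = numa_node_size64(node, &free_bytes);
        if (size < 0) {
            continue;   /* node not present */
        }
        printf("node %d: %lld MB total, %lld MB free\n",
               node, size >> 20, free_bytes >> 20);
    }
    return 0;
}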
On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:

<snip>

> No. Virt is crap, it needs to die, it's horrid, and any solution aimed
> squarely at virt only is shit and not worth considering, that simple.

Removing this phrase from context (feel free to object to the following
inquiry on that basis), what are your concerns with virtualization
itself? Is it only that having an unknowable operating system under
your feet is uncomfortable, or is there something else? Because virt is
green, it saves silicon.
On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:

<snip>

> From what I gather, what you propose is to periodically unmap all user
> memory (or map it !r !w !x, which is effectively the same) and take
> the fault. This fault establishes a thread:page relation. One can just
> use that, or involve some history as well. Once you have this
> thread:page relation set, you want to group them on the same node.
>
> There are various problems with that. Firstly, of course, the
> overhead: storing this thread:page relation set requires quite a lot
> of memory. Secondly, I'm not quite sure I see how that works for
> threads that share a working set. Suppose you have 4 threads and 2
> working sets; how do you make sure to keep the 2 groups together? I
> don't think that's evident from the simple thread:page relation data
> [*]. Thirdly, I immensely dislike all these background scanner things;
> they make it very hard to account time to those who actually use it.

Picture yourself as the administrator of a virtualized host, with a
given workload of guests doing their tasks. All it takes is to
understand, from a high level, what the algorithms of ksm (collapsing
of equal-content pages into the same physical RAM) and khugepaged
(collapsing of 4k pages into 2MB pages, good for the TLB) are doing
(and that should be documented), and infer from that what is happening.
The same goes for the guy who is writing management tools and exposing
the statistics to the system administrator.
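The fault-driven sampling described above can be mimicked from
userspace to see the mechanism in action. A toy sketch only (this is
not knumad, and mprotect() is not formally async-signal-safe, though it
works on Linux in practice): revoke access to a region, then record and
re-enable each page as it faults.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION (1UL << 20)

static char *region;
static long pagesz;

/* Fault handler: one fault = one thread:page sample. A real scanner
 * would feed this into a per-thread histogram; here we only re-enable
 * the page so the faulting access can complete and retry. */
static void on_fault(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)uc;
    uintptr_t addr = (uintptr_t)si->si_addr & ~((uintptr_t)pagesz - 1);
    mprotect((void *)addr, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* "Unmap" the working set, i.e. map it !r !w !x, then take faults. */
    mprotect(region, REGION, PROT_NONE);
    region[0] = 1;               /* sampled, then the write succeeds */
    region[5 * pagesz] = 1;      /* another sample */
    return 0;
}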
On 12/22/2011 05:01 AM, Marcelo Tosatti wrote:
> On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
>> No. Virt is crap, it needs to die, it's horrid, and any solution
>> aimed squarely at virt only is shit and not worth considering, that
>> simple.
>
> Removing this phrase from context (feel free to object to the
> following inquiry on that basis), what are your concerns with
> virtualization itself? Is it only that having an unknowable operating
> system under your feet is uncomfortable, or is there something else?
> Because virt is green, it saves silicon.

Oh man, if you say virt solves global warming, I think I'm going to
have to jump off a bridge to end the madness...

Regards,

Anthony Liguori
On Thu, Dec 22, 2011 at 11:13:15AM -0600, Anthony Liguori wrote:
> On 12/22/2011 05:01 AM, Marcelo Tosatti wrote:
> > On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> > > No. Virt is crap, it needs to die, it's horrid, and any solution
> > > aimed squarely at virt only is shit and not worth considering,
> > > that simple.
> >
> > Removing this phrase from context (feel free to object to the
> > following inquiry on that basis), what are your concerns with
> > virtualization itself? Is it only that having an unknowable
> > operating system under your feet is uncomfortable, or is there
> > something else? Because virt is green, it saves silicon.
>
> Oh man, if you say virt solves global warming, I think I'm going to
> have to jump off a bridge to end the madness...

I said it is green (it saves energy) and that it saves silicon
(therefore fuel?). The rest of the conclusions are your own.
On Thu, 2011-12-22 at 09:01 -0200, Marcelo Tosatti wrote:
> > No. Virt is crap, it needs to die, it's horrid, and any solution
> > aimed squarely at virt only is shit and not worth considering, that
> > simple.
>
> Removing this phrase from context (feel free to object to the
> following inquiry on that basis), what are your concerns with
> virtualization itself? Is it only that having an unknowable operating
> system under your feet is uncomfortable, or is there something else?
> Because virt is green, it saves silicon.

No, you're going the wrong way around that argument. Resource control
would save the planet in that case; that's an entirely separate concept
from virtualization. Look how much cgroup crap you still need on top of
the whole virt thing.

Virt deals with running legacy OSes, mostly because you're in a bind
and for a host of reasons can't get this super-critical application you
really must have running on your new and improved platform. So you
emulate hardware to run the old OS, to run the old app, or some such
nonsense. Virt really is mostly a technical solution to a mostly
non-technical problem.

There's of course the debug angle, but I've never really found it
reliable enough to use in that capacity; give me real hardware with a
serial port any day of the week.

Also, it just really offends me: we work really hard to make stuff go
as fast as possible, and then you stick a gigantic emulation layer in
between and complain that shit is slow again. Don't do that!!
different guest nodes are mapped appropriately to different host NUMA
nodes.

To achieve this we would need QEMU to expose information about guest
RAM ranges (Guest Physical Address - GPA) and their host virtual
address mappings (Host Virtual Address - HVA). Using the GPA and HVA,
any external tool like libvirt would be able to divide the guest RAM as
per the guest NUMA node geometry and bind guest memory nodes to the
corresponding host memory nodes using the HVA. This needs both QEMU
(and libvirt) changes as well as changes in the kernel.

- System calls that set NUMA memory policies (like mbind) currently
  work only for the current (i.e. the calling) process. These syscalls
  need to be extended so that a process like libvirt is able to set
  NUMA memory policies for another process's (QEMU's) memory ranges.

- This RFC is actually about the proposed change in QEMU to export GPA
  and HVA via the QEMU monitor. The patch against QEMU present towards
  the end of this note is an attempt to achieve this. It adds a new
  monitor command "info ram", which prints out the GPA and HVA for the
  different sections of guest RAM.

For a guest booted with the options "-smp sockets=2,cores=4,threads=2
-numa node,nodeid=0,cpus=0-15 -numa node,nodeid=1,cpus=16-31 -cpu
core2duo -m 5g", the exported data looks like this:

******************
(qemu) info ram
GPA: 0-9ffff RAM: 0-9ffff HVA: 0x7efe7fe00000-0x7efe7fe9ffff
GPA: cc000-effff RAM: cc000-effff HVA: 0x7efe7fecc000-0x7efe7feeffff
GPA: 100000-dfffffff RAM: 100000-dfffffff HVA: 0x7efe7ff00000-0x7eff5fdfffff
GPA: fc000000-fc7fffff RAM: 140040000-14083ffff HVA: 0x7efe7f400000-0x7efe7fbfffff
GPA: 100000000-15fffffff RAM: e0000000-13fffffff HVA: 0x7eff5fe00000-0x7effbfdfffff
******************

I will remove the ram_addr (prefixed with RAM:) from the above. I have
it here just to validate the regions and to compare with the "info
mtree" output (shown below).
******************
(qemu) info mtree
memory 0000000000000000-7ffffffffffffffe (prio 0): system
  0000000000000000-00000000dfffffff (prio 0): alias ram-below-4g @pc.ram 0000000000000000-00000000dfffffff
  00000000000a0000-00000000000bffff (prio 1): alias smram-region @pci 00000000000a0000-00000000000bffff
  00000000000c0000-00000000000c3fff (prio 1): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
  00000000000c4000-00000000000c7fff (prio 1): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
  00000000000c8000-00000000000cbfff (prio 1): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
  00000000000cc000-00000000000cffff (prio 1): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff
  00000000000d0000-00000000000d3fff (prio 1): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff
  00000000000d4000-00000000000d7fff (prio 1): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff
  00000000000d8000-00000000000dbfff (prio 1): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff
  00000000000dc000-00000000000dffff (prio 1): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff
  00000000000e0000-00000000000e3fff (prio 1): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff
  00000000000e4000-00000000000e7fff (prio 1): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
  00000000000e8000-00000000000ebfff (prio 1): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
  00000000000ec000-00000000000effff (prio 1): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
  00000000000f0000-00000000000fffff (prio 1): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
  00000000e0000000-00000000ffffffff (prio 0): alias pci-hole @pci 00000000e0000000-00000000ffffffff
  00000000fee00000-00000000feefffff (prio 0): apic
  0000000100000000-000000015fffffff (prio 0): alias ram-above-4g @pc.ram 00000000e0000000-000000013fffffff
  4000000000000000-7fffffffffffffff (prio 0): alias pci-hole64 @pci 4000000000000000-7fffffffffffffff
pc.ram 0000000000000000-000000013fffffff (prio 0): pc.ram
******************

The current patch just exports the information and expects external
tools to make use of it for binding. But we do understand that memory
ranges can change and that the external tool should be able to respond
to this. This is the current thinking on how to handle it:

- Whenever the address range changes, send an async notification to
  libvirt (using QMP perhaps?)
- libvirt will note the change, re-read the current guest RAM mapping
  information and re-bind the regions as appropriate.

I haven't fully figured out this part (the QEMU to libvirt
notification) yet, and any pointers or suggestions here will be useful.

Also a question:

- In what ways can the guest memory layout change? Is the change driven
  by external agents like libvirt (memory hot add), or can things
  change transparently within QEMU? If it's only the former, then we
  kind of know when to do the rebinding.

The patch follows:

---

Export guest RAM address via QEMU monitor.

NUMA-aware QEMU guests running on NUMA systems can benefit from binding
guest RAM to the appropriate host NUMA node memory. Allow admin tools
like libvirt to achieve this by exporting guest RAM information via the
QEMU monitor.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 memory.c  |   33 +++++++++++++++++++++++++++++++++
 memory.h  |    2 ++
 monitor.c |   12 ++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index dc5e35d..3ae10e5 100644
--- a/memory.c
+++ b/memory.c
@@ -1402,3 +1402,36 @@ void mtree_info(fprintf_function mon_printf, void *f)
     mtree_print_mr(mon_printf, f, address_space_io.root, 0, 0, &ml_head);
     }
 }
+
+#if !defined(CONFIG_USER_ONLY)
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    FlatRange *fr;
+
+    FOR_EACH_FLAT_RANGE(fr, &address_space_memory.current_map) {
+        AddrRange ar = fr->addr;
+        ram_addr_t ram;
+        uint8_t *hva;
+
+        ram = cpu_get_physical_page_desc(ar.start);
+
+        /* Only show RAM area */
+        if ((ram & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            continue;
+        }
+        ram &= TARGET_PAGE_MASK;
+        hva = qemu_get_ram_ptr(ram);
+        mon_printf(f, "GPA: %llx-%llx" " RAM: "
+            RAM_ADDR_FMT "-" RAM_ADDR_FMT " HVA: %p-%p\n",
+            (unsigned long long)ar.start,
+            (unsigned long long)(ar.start + ar.size - 1),
+            ram, (ram_addr_t)(ram + ar.size - 1),
+            hva, hva + ar.size - 1);
+    }
+}
+#else
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    mon_printf(f, "Not supported\n");
+}
+#endif
diff --git a/memory.h b/memory.h
index d5b47da..b5fb5e0 100644
--- a/memory.h
+++ b/memory.h
@@ -503,6 +503,8 @@ void memory_region_transaction_commit(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void ram_info_print(fprintf_function mon_printf, void *f);
+
 #endif
 
 #endif
diff --git a/monitor.c b/monitor.c
index ffda0fe..3b1a7f3 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2738,6 +2738,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname)
     return -1;
 }
 
+static void do_info_ram(Monitor *mon)
+{
+    ram_info_print((fprintf_function)monitor_printf, mon);
+}
+
 static const mon_cmd_t mon_cmds[] = {
 #include "hmp-commands.h"
     { NULL, NULL, },
@@ -3050,6 +3055,13 @@ static const mon_cmd_t info_cmds[] = {
         .mhandler.info = do_trace_print_events,
     },
     {
+        .name       = "ram",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show RAM information",
+        .mhandler.info = do_info_ram,
+    },
+    {
         .name       = NULL,
     },
 };
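For completeness, here is roughly what a consumer of "info ram" would
do with one of the printed HVA ranges, using today's mbind(). The
struct and helper names are hypothetical; note that this only works
when run inside the QEMU process itself, since mbind() acts on the
caller's own address space, which is exactly the syscall limitation the
RFC calls out above (link with -lnuma):

#include <numaif.h>   /* mbind, MPOL_BIND */
#include <stdint.h>

/* Hypothetical: one GPA->HVA range as printed by "info ram", e.g. the
 * first line of the sample output above:
 *   GPA: 0-9ffff  HVA: 0x7efe7fe00000-0x7efe7fe9ffff
 * In a real tool these values would be parsed from the monitor. */
struct ram_range {
    unsigned long long hva_start;
    unsigned long long hva_end;
};

/* Bind one guest-node slice of RAM to a single host node. With today's
 * kernel this must be called from within the QEMU process; allowing
 * libvirt to do it from outside is the proposed syscall extension. */
static int bind_range_to_host_node(const struct ram_range *r, int host_node)
{
    unsigned long nodemask = 1UL << host_node;
    unsigned long len = r->hva_end - r->hva_start + 1;

    return mbind((void *)(uintptr_t)r->hva_start, len, MPOL_BIND,
                 &nodemask, sizeof(nodemask) * 8, 0);
}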