From patchwork Sat Oct 29 18:45:02 2011
X-Patchwork-Submitter: Bharata B Rao
X-Patchwork-Id: 122550
Date: Sun, 30 Oct 2011 00:15:02 +0530
From: Bharata B Rao
To: qemu-devel@nongnu.org
Cc: Vaidyanathan S, dipankar@in.ibm.com
Reply-To: bharata@linux.vnet.ibm.com
Message-ID: <20111029184502.GH11038@in.ibm.com>
Subject: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

Hi,

As guests become NUMA aware, it becomes important for guests to have correct NUMA policies when they run on NUMA-aware hosts. Currently, limited support for NUMA binding is available via libvirt, where it is possible to apply a NUMA policy to the guest as a whole. However, multinode guests would benefit if guest memory belonging to different guest nodes were mapped appropriately to different host NUMA nodes.

To achieve this, we would need QEMU to expose information about guest RAM ranges (Guest Physical Address, GPA) and their host virtual address mappings (Host Virtual Address, HVA). Using the GPA and HVA, an external tool like libvirt would be able to divide the guest RAM as per the guest NUMA node geometry and bind guest memory nodes to the corresponding host memory nodes using the HVA. This needs changes in QEMU (and libvirt) as well as in the kernel.

- System calls that set NUMA memory policies (like mbind) currently work only for the current (i.e. the calling) process. These syscalls need to be extended so that a process like libvirt is able to set NUMA memory policies for the QEMU process's memory ranges.
- This RFC is about the proposed change in QEMU to export the GPA and HVA via the QEMU monitor. The patch against QEMU towards the end of this note is an attempt to achieve this. It adds a new monitor command "info ram", which prints the GPA and HVA for the different sections of guest RAM. For a guest booted with the options "-smp sockets=2,cores=4,threads=2 -numa node,nodeid=0,cpus=0-15 -numa node,nodeid=1,cpus=16-31 -cpu core2duo -m 5g", the exported data looks like this:

******************
(qemu) info ram
GPA: 0-9ffff RAM: 0-9ffff HVA: 0x7efe7fe00000-0x7efe7fe9ffff
GPA: cc000-effff RAM: cc000-effff HVA: 0x7efe7fecc000-0x7efe7feeffff
GPA: 100000-dfffffff RAM: 100000-dfffffff HVA: 0x7efe7ff00000-0x7eff5fdfffff
GPA: fc000000-fc7fffff RAM: 140040000-14083ffff HVA: 0x7efe7f400000-0x7efe7fbfffff
GPA: 100000000-15fffffff RAM: e0000000-13fffffff HVA: 0x7eff5fe00000-0x7effbfdfffff
******************

I will remove the ram_addr (the field prefixed with "RAM:") from the above; it is included here only to validate the regions and to compare with the "info mtree" output shown below.
******************
(qemu) info mtree
memory
0000000000000000-7ffffffffffffffe (prio 0): system
 0000000000000000-00000000dfffffff (prio 0): alias ram-below-4g @pc.ram 0000000000000000-00000000dfffffff
 00000000000a0000-00000000000bffff (prio 1): alias smram-region @pci 00000000000a0000-00000000000bffff
 00000000000c0000-00000000000c3fff (prio 1): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
 00000000000c4000-00000000000c7fff (prio 1): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
 00000000000c8000-00000000000cbfff (prio 1): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
 00000000000cc000-00000000000cffff (prio 1): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff
 00000000000d0000-00000000000d3fff (prio 1): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff
 00000000000d4000-00000000000d7fff (prio 1): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff
 00000000000d8000-00000000000dbfff (prio 1): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff
 00000000000dc000-00000000000dffff (prio 1): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff
 00000000000e0000-00000000000e3fff (prio 1): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff
 00000000000e4000-00000000000e7fff (prio 1): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
 00000000000e8000-00000000000ebfff (prio 1): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
 00000000000ec000-00000000000effff (prio 1): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
 00000000000f0000-00000000000fffff (prio 1): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
 00000000e0000000-00000000ffffffff (prio 0): alias pci-hole @pci 00000000e0000000-00000000ffffffff
 00000000fee00000-00000000feefffff (prio 0): apic
 0000000100000000-000000015fffffff (prio 0): alias ram-above-4g @pc.ram 00000000e0000000-000000013fffffff
 4000000000000000-7fffffffffffffff (prio 0): alias pci-hole64 @pci 4000000000000000-7fffffffffffffff
pc.ram
0000000000000000-000000013fffffff (prio 0): pc.ram
******************

The current patch just exports the information and expects external tools to make use of it for binding. But we do understand that memory ranges can change, and an external tool should be able to respond to that. This is the current thinking on how to handle it:

- Whenever the address range changes, send an async notification to libvirt (using QMP, perhaps?).
- libvirt will note the change, re-read the current guest RAM mapping information and re-bind the regions as appropriate.

I haven't fully figured out the QEMU-to-libvirt notification part yet, and any pointers or suggestions here will be useful.

Also a question:

- In what ways can the guest memory layout change? Is the change driven by external agents like libvirt (memory hot add), or can things change transparently within QEMU? If it is only the former, then we more or less know when to do the rebinding.

The patch follows:
---
Export guest RAM address via the QEMU monitor.

NUMA-aware QEMU guests running on NUMA systems can benefit from binding guest RAM to appropriate host NUMA node memory. Allow admin tools like libvirt to achieve this by exporting guest RAM information via the QEMU monitor.
Signed-off-by: Bharata B Rao
---
 memory.c  |   33 +++++++++++++++++++++++++++++++++
 memory.h  |    2 ++
 monitor.c |   12 ++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index dc5e35d..3ae10e5 100644
--- a/memory.c
+++ b/memory.c
@@ -1402,3 +1402,36 @@ void mtree_info(fprintf_function mon_printf, void *f)
         mtree_print_mr(mon_printf, f, address_space_io.root, 0, 0, &ml_head);
     }
 }
+
+#if !defined(CONFIG_USER_ONLY)
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    FlatRange *fr;
+
+    FOR_EACH_FLAT_RANGE(fr, &address_space_memory.current_map) {
+        AddrRange ar = fr->addr;
+        ram_addr_t ram;
+        uint8_t *hva;
+
+        ram = cpu_get_physical_page_desc(ar.start);
+
+        /* Only show RAM area */
+        if ((ram & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            continue;
+        }
+        ram &= TARGET_PAGE_MASK;
+        hva = qemu_get_ram_ptr(ram);
+        mon_printf(f, "GPA: %llx-%llx" " RAM: "
+                   RAM_ADDR_FMT "-" RAM_ADDR_FMT " HVA: %p-%p\n",
+                   (unsigned long long)ar.start,
+                   (unsigned long long)(ar.start + ar.size - 1),
+                   ram, (ram_addr_t)(ram + ar.size - 1),
+                   hva, hva + ar.size - 1);
+    }
+}
+#else
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    mon_printf(f, "Not supported\n");
+}
+#endif
diff --git a/memory.h b/memory.h
index d5b47da..b5fb5e0 100644
--- a/memory.h
+++ b/memory.h
@@ -503,6 +503,8 @@ void memory_region_transaction_commit(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void ram_info_print(fprintf_function mon_printf, void *f);
+
 #endif
 #endif
diff --git a/monitor.c b/monitor.c
index ffda0fe..3b1a7f3 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2738,6 +2738,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname)
     return -1;
 }
 
+static void do_info_ram(Monitor *mon)
+{
+    ram_info_print((fprintf_function)monitor_printf, mon);
+}
+
 static const mon_cmd_t mon_cmds[] = {
 #include "hmp-commands.h"
     { NULL, NULL, },
@@ -3050,6 +3055,13 @@ static const mon_cmd_t info_cmds[] = {
         .mhandler.info = do_trace_print_events,
     },
     {
+        .name       = "ram",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show RAM information",
+        .mhandler.info = do_info_ram,
+    },
+    {
         .name       = NULL,
     },
 };