From patchwork Wed Mar 17 14:59:51 2010
Subject: [Qemu-devel] [PATCH QEMU] Transparent Hugepage Support #3
From: Andrea Arcangeli
To: qemu-devel@nongnu.org
Cc: Avi Kivity, Paul Brook
Date: Wed, 17 Mar 2010 15:59:51 +0100
Message-ID: <20100317145950.GA5752@random.random>
X-Patchwork-Id: 47947

From: Andrea Arcangeli

This allows proper alignment, so that NPT/EPT can take advantage of
the Linux host backing the guest memory with hugepages. It also
ensures that, when KVM isn't used, the first 2M of guest physical
memory are backed by a large TLB entry. To complete it, the patch also
notifies the kernel via madvise that this memory is an important
candidate for hugepage backing (needed for both KVM and QEMU), so that
hugepages can be used in embedded systems without any memory waste,
and so that in the future khugepaged can prioritize collapsing
hugepages in the madvised regions.
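As a quick illustration of the allocate-then-advise pattern described
above, here is a minimal standalone sketch (not part of the patch;
alloc_thp_hinted and HPAGE_SIZE are illustrative names, assuming 2M
hugepages as in the patch below):

#define _GNU_SOURCE          /* expose madvise() on glibc */
#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2 * 1024 * 1024)

static void *alloc_thp_hinted(size_t size)
{
    void *p = NULL;

    /* 2M-align the buffer so the host can map it with hugepages */
    if (posix_memalign(&p, HPAGE_SIZE, size) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* advisory only: the kernel is free to keep using 4k pages */
    madvise(p, size, MADV_HUGEPAGE);
#endif
    return p;
}

The hint costs nothing when transparent hugepage support is absent
(the #ifdef compiles it out), which is the same guard the patch uses.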
Ideally the max hugepage size provided by the kernel's transparent
hugepage support should be exported through some sysfs file, but there
is no reason to expect an x86_64 host to have hugepages larger than
2M, or to expect larger sizes to be supported by the kernel
transparent hugepage code, in the short and medium term. So we can
defer the invention of a fixed kernel API until that happens; by then
we'll surely have a better clue about the best way to provide that
information to userland, and adapting qemu to use it will be a
few-line change, so there's no hurry to do it right now. Meanwhile the
approach below remains optimal, and there is no risk of memory waste,
as virtual memory is practically zero cost on 64bit archs.

NOTE: if the callers of qemu_ram_alloc change significantly, we may
later need to pass a second parameter to qemu_ram_alloc telling it the
first guest physical address that corresponds to the memory block
being allocated. I'd defer this change for later too, as it may never
be needed.

I verified this is more than enough to get the max benefit from the
kernel-side feature:

    $ cat /sys/kernel/debug/kvm/largepages
    301

Signed-off-by: Andrea Arcangeli

diff --git a/exec.c b/exec.c
index 14767b7..ab33f6b 100644
--- a/exec.c
+++ b/exec.c
@@ -2745,6 +2745,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+#if defined(__linux__) && defined(__x86_64__)
+/*
+ * Align on the max transparent hugepage size so that
+ * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+ * take advantage of hugepages with NPT/EPT or to
+ * ensure the first 2M of the guest physical ram will
+ * be mapped by the same hugetlb for QEMU (it is worth
+ * it even without NPT/EPT).
+ */
+#define PREFERRED_RAM_ALIGN (2*1024*1024)
+#endif
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
@@ -2768,11 +2780,19 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                PROT_EXEC|PROT_READ|PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
+#ifdef PREFERRED_RAM_ALIGN
+        if (size >= PREFERRED_RAM_ALIGN)
+            new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
+        else
+#endif
+            new_block->host = qemu_vmalloc(size);
 #endif
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;
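
For reviewers who want to see the alignment condition from the comment
in isolation: KVM can back a guest hugepage with a host hugepage only
when the guest frame number and host frame number agree modulo the
number of base pages per hugepage, which is exactly what 2M-aligning
the host buffer guarantees. A standalone sketch of that check follows
(hugepage_mapping_possible and the example addresses are illustrative,
assuming 4k base pages and 2M hugepages; it is not KVM's actual code):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define HPAGE_SIZE (2 * 1024 * 1024)
/* number of 4k pages per 2M hugepage, minus one: the low-bit mask */
#define HPAGE_PFN_MASK ((HPAGE_SIZE >> PAGE_SHIFT) - 1)

/* gfn: guest page frame number, pfn: host page frame number */
static int hugepage_mapping_possible(uint64_t gfn, uint64_t pfn)
{
    /* both frames must sit at the same offset within a 2M region */
    return ((gfn ^ pfn) & HPAGE_PFN_MASK) == 0;
}

int main(void)
{
    /* guest RAM at gpa 0 backed by a 2M-aligned host buffer: OK */
    printf("%d\n", hugepage_mapping_possible(0x0,
                       0x7f0000200000ULL >> PAGE_SHIFT));
    /* host buffer off by 4k: low bits differ, no hugepage mapping */
    printf("%d\n", hugepage_mapping_possible(0x0,
                       0x7f0000201000ULL >> PAGE_SHIFT));
    return 0;
}

With qemu_memalign ensuring the host buffer starts on a 2M boundary,
every 2M-aligned guest physical chunk satisfies this check, which is
why the patch needs alignment only at allocation time and no per-gfn
logic afterwards.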