From patchwork Wed Jan  4 03:03:55 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Isaku Yamahata
X-Patchwork-Id: 134189
Date: Wed, 4 Jan 2012 12:03:55 +0900
From: Isaku Yamahata
To: Andrea Arcangeli
Message-ID: <20120104030355.GL19274@valinux.co.jp>
References: <20111229134920.GH19274@valinux.co.jp>
 <4EFC70BA.1080808@redhat.com>
 <20111229141802.GI19274@valinux.co.jp>
 <4EFC7AB8.807@redhat.com>
 <20111229144943.GJ19274@valinux.co.jp>
 <4EFC7F4F.9010202@redhat.com>
 <20111229155328.GK19274@valinux.co.jp>
 <4EFC8EAD.80306@redhat.com>
 <4EFC8EE9.9030802@redhat.com>
 <20120102170551.GF4172@redhat.com>
In-Reply-To: <20120102170551.GF4172@redhat.com>
User-Agent: Mutt/1.5.19 (2009-01-05)
Cc: t.hirofuchi@aist.go.jp, satoshi.itoh@aist.go.jp, Avi Kivity,
 kvm@vger.kernel.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char
 device for postcopy

On Mon, Jan 02, 2012 at 06:05:51PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> > On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > > The NFS client has exactly the same issue, if you mount it with the intr
> > > option.  In fact you could use the NFS client as a trivial umem/cuse
> > > prototype.
> >
> > Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
>
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE. So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
>
> But all you need is the ability to map single pages in the address
> space. The only special requirement is that a new vma must not be
> created during the map operation. It'd be very similar to
> remap_file_pages for MAP_SHARED, it also was created to avoid having
> to create new vmas on a large MAP_SHARED mapping and no other reason
> at all.
> In our case we deal with a large MAP_ANONYMOUS mapping and we
> must alter the pte without creating new vmas but the problem is very
> similar to remap_file_pages.
>
> Qemu in the dst node can do:
>
>     mmap(MAP_ANONYMOUS....)
>     fault_area_prepare(start, end, signalnr)
>
> prepare_fault_area will map the range with the magic pte.
>
> Then when the signalnr fires, you do:
>
>     send(givemepageX)
>     recv(&tmpaddr_aligned, PAGE_SIZE,...);
>     fault_area_map(final_dest_aligned, tmpaddr_aligned, size)
>
> map_fault_area will check the pgprot of the two vmas mapping
> final_dest_aligned and tmpaddr_aligned have the same vma->vm_pgprot
> and various other vma bits, and if all ok, it'll just copy the pte
> from tmpaddr_aligned, to final_dest_aligned and it'll update the
> page->index. It can fail if the page is shared to avoid dealing with
> the non-linearity of the page mapped in multiple vmas.
>
> You basically need a bypass to avoid altering the pgprot of the vma,
> and enter into the pte a "magic" thing that fires signal handlers
> if accessed, without having to create new vmas. gup/gup_fast and stuff
> should just always fallback into handle_mm_fault when encountering such a
> thing, so returning failure as if gup_fast was run on a address beyond
> the end of the i_size in the MAP_SHARED case.

Yes, it's quite doable in user space (qemu) with a kernel enhancement.
And it would be easy to convert the separate daemon process into a
thread in qemu.

I think it should be done outside of the qemu process, for a few reasons.
(I'm just repeating the discussion from the KVM Forum because no one
remembers it.)

- ptrace (and its variants)
  Some people want to inspect guest RAM from the host (with qemu either
  stopped or live). For example, the crash utility could be enhanced to
  attach to the qemu process and debug the guest kernel.

- core dump
  The qemu process may dump core. For postmortem analysis, people want
  to inspect guest RAM. Again, the crash utility could be enhanced to
  read the core file and analyze the guest kernel.
  When the core is created, the qemu process is already dead.

These cases rule out handling the fault inside the qemu process.

> THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
> KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
> /dev/zero.

Oh great. So they seem to work in general with anonymous pages in a
non-anonymous VMA. Is that right?
If so, THP/KSM would also work with mmap(MAP_PRIVATE, /dev/umem...),
wouldn't they?

> The device driver provides an advantage in being self contained but I
> doubt it's simpler. I suppose after migration is complete you'll still
> switch the vma back to regular anonymous vma so leading to the same
> result?

Yes, that was my original intention.
The page is anonymous, but the vma isn't. I was concerned that KSM/THP
would not work with such pages.
If they do work, it isn't necessary to switch the VMA to anonymous.

> The patch 2/2 is small and self contained so it's quite attractive, I
> didn't see patch 1/2, was it posted?

Posted. It's short and trivial: it just does EXPORT_SYMBOL_GPL of
mem_cgroup_cache_charge and shmem_zero_setup.
I include it here for convenience.

From e8bfda16a845eef4381872a331c6f0f200c3f7d7 Mon Sep 17 00:00:00 2001
From: Isaku Yamahata
Date: Thu, 11 Aug 2011 20:05:28 +0900
Subject: [PATCH 1/2] export necessary symbols

Signed-off-by: Isaku Yamahata
---
 mm/memcontrol.c |    1 +
 mm/shmem.c      |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..85530fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,6 +2807,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
 
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..d137a37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2546,6 +2546,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(shmem_zero_setup);
 
 /**
  * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.
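
As a footnote to the PROT_NONE discussion quoted above, here is a minimal
userland sketch of that scheme in C. It is only an illustration under stated
assumptions, not code from qemu or from the umem patches:
demand_area_create(), demand_area_faults() and fetch_page() are hypothetical
names, and fetch_page() just fills the page with a pattern where a real
destination node would do the send()/recv() round trip to the source. Note
that every single-page mprotect() can split the vma, which is exactly the
scaling problem raised in the quoted text. Calling mprotect() from a signal
handler works on Linux in practice but is not guaranteed by POSIX.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *area_base;            /* start of the demand-filled area */
static size_t area_len;                     /* length in bytes, page aligned */
static size_t page_size;
static volatile sig_atomic_t faults_served; /* pages filled by the handler */

/* Hypothetical stand-in for "send(givemepageX); recv(...)":
 * fills the page with a byte derived from its index. */
static void fetch_page(unsigned char *dst, size_t index)
{
	memset(dst, (int)(index + 1), page_size);
}

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr;
	uintptr_t base = (uintptr_t)area_base;
	unsigned char *page;

	(void)sig;
	(void)ctx;

	if (addr < base || addr >= base + area_len) {
		/* Not our area: restore the default action and return, so the
		 * retried access kills the process as usual. */
		signal(SIGSEGV, SIG_DFL);
		return;
	}

	page = (unsigned char *)(addr & ~(uintptr_t)(page_size - 1));
	/* Changing one page of the mapping may split the vma. */
	if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
		_exit(1);
	fetch_page(page, (size_t)(page - area_base) / page_size);
	faults_served++;
	/* Returning from the handler retries the faulting access. */
}

/* Map npages of inaccessible anonymous memory and install the handler.
 * Returns the base address, or NULL on failure. */
unsigned char *demand_area_create(size_t npages)
{
	struct sigaction sa;
	void *p;

	assert(npages > 0);
	page_size = (size_t)sysconf(_SC_PAGESIZE);
	area_len = npages * page_size;

	p = mmap(NULL, area_len, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	area_base = p;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0) {
		munmap(p, area_len);
		area_base = NULL;
		return NULL;
	}
	return area_base;
}

/* Number of pages that have been filled on demand so far. */
int demand_area_faults(void)
{
	return (int)faults_served;
}
```

The first access to any byte in the returned area faults once for its page;
after the handler returns, the access is retried and sees the filled page.
Later accesses to the same page do not fault again.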