From patchwork Wed Jul 2 16:50:16 2014
X-Patchwork-Submitter: Andrea Arcangeli
X-Patchwork-Id: 366429
From: Andrea Arcangeli
To: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Robert Love, Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi,
	Andrew Jones, KOSAKI Motohiro, Michel Lespinasse, Andrea Arcangeli,
	Taras Glek, Juan Quintela, Hugh Dickins, Isaku Yamahata, Mel Gorman,
	Android Kernel Team, "Dr. David Alan Gilbert", "Huangpeng (Peter)",
	Anthony Liguori, Mike Hommey, Keith Packard, Wenchao Xia,
	Minchan Kim, Dmitry Adamushko, Johannes Weiner, Paolo Bonzini,
	Andrew Morton
Date: Wed, 2 Jul 2014 18:50:16 +0200
Message-Id: <1404319816-30229-11-git-send-email-aarcange@redhat.com>
In-Reply-To: <1404319816-30229-1-git-send-email-aarcange@redhat.com>
References: <1404319816-30229-1-git-send-email-aarcange@redhat.com>
Subject: [Qemu-devel] [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in
	handle_userfault()

This optimizes the userfault handler to repeat the fault without
returning to userland if it's a page fault, and it teaches it to
handle FOLL_NOWAIT if it's a nonblocking gup invocation from KVM.
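For context, a nonblocking gup invocation of the kind KVM issues looks
roughly like the sketch below. It is modeled on KVM's
get_user_page_nowait() helper but is not a verbatim copy: the function
name is invented for illustration and the exact __get_user_pages()
signature should be double-checked against this tree.

static int nowait_gup_sketch(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long addr, struct page **pagep)
{
	/*
	 * FOLL_NOWAIT: never sleep waiting for the page, and never
	 * drop mmap_sem behind the caller's back, even when the
	 * fault handler returns VM_FAULT_RETRY.
	 */
	unsigned int flags = FOLL_TOUCH | FOLL_GET | FOLL_NOWAIT;
	long npages;

	down_read(&mm->mmap_sem);
	npages = __get_user_pages(tsk, mm, addr, 1, flags,
				  pagep, NULL, NULL);
	/*
	 * mmap_sem is still held here on every path, so anything
	 * read under it before the call cannot have become stale.
	 */
	up_read(&mm->mmap_sem);
	return npages > 0 ? 0 : -EFAULT;
}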
The FOLL_NOWAIT part is actually more than an optimization, because
if FOLL_NOWAIT is set the gup caller assumes the mmap_sem cannot be
released (and it can assume that the structures protected by it,
potentially read earlier, have not become stale). The locking rules
required to comply with the FAULT_FLAG_KILLABLE,
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT flags look quite
convoluted (and are not well documented, aside from a "Caution"
comment in __lock_page_or_retry), so this is not a trivial change,
and in turn it is kept incremental at the end of the patchset.

Signed-off-by: Andrea Arcangeli
---
 fs/userfaultfd.c            | 68 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/userfaultfd.h |  6 ++--
 mm/huge_memory.c            |  8 +++---
 mm/memory.c                 |  4 +--
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index deed8cb..b8b0fb7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -155,12 +155,29 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	kref_put(&ctx->kref, userfaultfd_free);
 }
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
+	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -188,10 +205,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current) || ctx->released) {
+			/*
+			 * If we have to fail because the task was
+			 * killed or the file was released, simulate
+			 * VM_FAULT_SIGBUS or just return to userland
+			 * through VM_FAULT_RETRY if we came from a
+			 * page fault.
+			 */
+			ret = VM_FAULT_SIGBUS;
+			if (fatal_signal_pending(current) &&
+			    (flags & FAULT_FLAG_KILLABLE)) {
+				/*
+				 * If FAULT_FLAG_KILLABLE is set and
+				 * there's a fatal signal pending, we
+				 * can return VM_FAULT_RETRY
+				 * regardless of whether
+				 * FAULT_FLAG_ALLOW_RETRY is set, as
+				 * long as we release the mmap_sem.
+				 * The page fault will then return
+				 * straight to userland to handle the
+				 * fatal signal.
+				 */
+				up_read(&mm->mmap_sem);
+				ret = VM_FAULT_RETRY;
+			}
+			break;
+		}
+		if (!uwq.pending) {
+			ret = 0;
+			if (flags & FAULT_FLAG_ALLOW_RETRY) {
+				ret = VM_FAULT_RETRY;
+				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+					up_read(&mm->mmap_sem);
+			}
 			break;
-		if (!uwq.pending)
+		}
+		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+		     flags) ==
+		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+			ret = VM_FAULT_RETRY;
+			/*
+			 * The mmap_sem must not be released if
+			 * FAULT_FLAG_RETRY_NOWAIT is set, even though
+			 * we return VM_FAULT_RETRY (FOLL_NOWAIT case).
+ */ break; + } spin_unlock(&ctx->fault_wqh.lock); up_read(&mm->mmap_sem); @@ -211,7 +271,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address) */ userfaultfd_ctx_put(ctx); - return 0; + return ret; } static int userfaultfd_release(struct inode *inode, struct file *file) diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h index 8200a71..b7caef5 100644 --- a/include/linux/userfaultfd.h +++ b/include/linux/userfaultfd.h @@ -26,11 +26,13 @@ #ifdef CONFIG_USERFAULTFD -int handle_userfault(struct vm_area_struct *vma, unsigned long address); +int handle_userfault(struct vm_area_struct *vma, unsigned long address, + unsigned int flags); #else /* CONFIG_USERFAULTFD */ -static int handle_userfault(struct vm_area_struct *vma, unsigned long address) +static int handle_userfault(struct vm_area_struct *vma, unsigned long address, + unsigned int flags) { return VM_FAULT_SIGBUS; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d6efd80..e1a74a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -714,7 +714,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot) static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd, - struct page *page) + struct page *page, unsigned int flags) { pgtable_t pgtable; spinlock_t *ptl; @@ -753,7 +753,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, mem_cgroup_uncharge_page(page); put_page(page); pte_free(mm, pgtable); - ret = handle_userfault(vma, haddr); + ret = handle_userfault(vma, haddr, flags); VM_BUG_ON(ret & VM_FAULT_FALLBACK); return ret; } @@ -835,7 +835,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, if (pmd_none(*pmd)) { if (vma->vm_flags & VM_USERFAULT) { spin_unlock(ptl); - ret = handle_userfault(vma, haddr); + ret = handle_userfault(vma, haddr, flags); VM_BUG_ON(ret & VM_FAULT_FALLBACK); } else { set_huge_zero_page(pgtable, mm, vma, @@ -863,7 +863,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, count_vm_event(THP_FAULT_FALLBACK); return VM_FAULT_FALLBACK; } - return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page); + return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags); } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, diff --git a/mm/memory.c b/mm/memory.c index a6a04ed..44506e9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2645,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Deliver the page fault to userland, check inside PT lock */ if (vma->vm_flags & VM_USERFAULT) { pte_unmap_unlock(page_table, ptl); - return handle_userfault(vma, address); + return handle_userfault(vma, address, flags); } goto setpte; } @@ -2679,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, pte_unmap_unlock(page_table, ptl); mem_cgroup_uncharge_page(page); page_cache_release(page); - return handle_userfault(vma, address); + return handle_userfault(vma, address, flags); } inc_mm_counter_fast(mm, MM_ANONPAGES);
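As a review aid (not part of the patch), here is the return/locking
policy of the new handle_userfault() condensed into one standalone
function. The helper name and the boolean parameters are invented for
illustration; "keep" and "release" refer to the caller's read-held
mmap_sem.

static int userfault_policy_sketch(struct mm_struct *mm, unsigned int flags,
				   bool fatal_signal, bool ctx_released,
				   bool still_pending)
{
	if (fatal_signal || ctx_released) {
		if (fatal_signal && (flags & FAULT_FLAG_KILLABLE)) {
			/*
			 * release: the page fault returns straight
			 * to userland, which handles the signal
			 */
			up_read(&mm->mmap_sem);
			return VM_FAULT_RETRY;
		}
		return VM_FAULT_SIGBUS;	/* keep */
	}
	if (!still_pending) {
		/* the userfault was resolved while we waited */
		if (!(flags & FAULT_FLAG_ALLOW_RETRY))
			return 0;	/* keep */
		if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
			up_read(&mm->mmap_sem);	/* release */
		return VM_FAULT_RETRY;	/* keep only in the NOWAIT case */
	}
	if ((flags & (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) ==
	    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT))
		return VM_FAULT_RETRY;	/* keep: FOLL_NOWAIT gup gives up */
	return -1;	/* placeholder: the real loop sleeps and retries */
}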