[net-next,4/5] tcp: implement mmap() for zero copy receive

Message ID 20180416173339.6310-5-edumazet@google.com
State Accepted
Delegated to: David Miller
Series
  • tcp: add zero copy receive

Commit Message

Eric Dumazet April 16, 2018, 5:33 p.m.
On some networks, with a well-chosen MSS/MTU and a suitable architecture,
TCP payload can be made to fit exactly into 4KB pages.

Implement the mmap() system call so that applications can avoid
copying data, without resorting to complex splice() games.

Note that a successful mmap(X bytes) on a TCP socket consumes X bytes,
as if recvmsg() had been done. (tp->copied_seq += X)

Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read-only.

If tcp_mmap() finds data that does not fill a whole page, or a stretch of
urgent data, -EINVAL is returned and no bytes are consumed.

The application must fall back to recvmsg() to read the problematic sequence.

mmap() won't block, regardless of whether the socket is in blocking or
non-blocking mode. If not enough bytes are in the receive queue,
mmap() returns -EAGAIN, or -EIO if the socket is in a state
where no more bytes can be added to the receive queue.

An application might use SO_RCVLOWAT, poll() and/or ioctl(FIONREAD)
to use mmap() efficiently.

On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.
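As a sketch of how these pieces fit together (the function names below are illustrative only; the reference program is the tcp_mmap utility added in the following patch), a receiver might combine FIONREAD, mmap() and a recv() fallback like this:

```c
/* Hedged sketch of a receiver using mmap() for zero copy, falling back
 * to recv() when the queued data is not page-aligned. Names here are
 * illustrative; the reference program is tcp_mmap in the next patch. */
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#define PAGE_SZ 4096UL

/* Round FIONREAD's answer down to whole pages: only that much
 * can be consumed via mmap(). */
size_t mmapable_bytes(int avail)
{
	return avail > 0 ? ((size_t)avail / PAGE_SZ) * PAGE_SZ : 0;
}

void receive_loop(int fd)
{
	char fallback[PAGE_SZ];

	for (;;) {
		int avail = 0;

		if (ioctl(fd, FIONREAD, &avail) < 0)
			break;
		size_t len = mmapable_bytes(avail);
		void *p = len ? mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0)
			      : MAP_FAILED;
		if (p != MAP_FAILED) {
			/* mmap() succeeded: len bytes are now consumed,
			 * as if recvmsg() had been done. */
			/* ... use the zero-copied pages here ... */
			munmap(p, len);
		} else {
			/* Unaligned or urgent data (-EINVAL), or not enough
			 * queued bytes (-EAGAIN): fall back to copying. */
			ssize_t n = recv(fd, fallback, sizeof(fallback), 0);
			if (n <= 0)
				break;
		}
	}
}
```

A real application would also use poll() or SO_RCVLOWAT to avoid busy-looping on FIONREAD.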

Tested:

mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
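The MTU figure follows from simple arithmetic (the helper below is purely illustrative): a 4096-byte payload, a 40-byte IPv6 header, and 32 bytes of TCP header, i.e. the 20-byte base TCP header plus 12 bytes of timestamp options.

```c
/* Illustrative check of the MTU arithmetic used in this test: the MTU
 * that makes the TCP payload exactly one 4KB page over IPv6, assuming
 * TCP timestamps are enabled (20-byte base TCP header plus 12 bytes
 * of options = the 32 bytes quoted above). */
unsigned int mtu_for_page_payload(unsigned int payload)
{
	return payload + 40	/* IPv6 header */
		       + 20	/* base TCP header */
		       + 12;	/* TCP timestamp option, padded */
}
```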

Without mmap() (tcp_mmap -s)

received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
  cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
  cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
  cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
  cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches

With mmap() on receiver (tcp_mmap -s -z)

received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
  cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
  cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
  cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
  cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h   |   2 +
 net/ipv4/af_inet.c  |   2 +-
 net/ipv4/tcp.c      | 113 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c |   2 +-
 4 files changed, 117 insertions(+), 2 deletions(-)

Comments

Eric Dumazet April 19, 2018, 11:15 p.m. | #1
On 04/16/2018 10:33 AM, Eric Dumazet wrote:
> Some networks can make sure TCP payload can exactly fit 4KB pages,
> with well chosen MSS/MTU and architectures.
> 
> Implement mmap() system call so that applications can avoid
> copying data without complex splice() games.
> 
> Note that a successful mmap( X bytes) on TCP socket is consuming
> bytes, as if recvmsg() has been done. (tp->copied += X)
> 

Oh well, I should have run this code with LOCKDEP enabled :/

[  974.320412] ======================================================
[  974.326631] WARNING: possible circular locking dependency detected
[  974.332816] 4.16.0-dbx-DEV #40 Not tainted
[  974.336927] ------------------------------------------------------
[  974.343107] b78299096/15790 is trying to acquire lock:
[  974.348246] 000000006074c9cf (sk_lock-AF_INET6){+.+.}, at: tcp_mmap+0x7c/0x550
[  974.355505] 
               but task is already holding lock:
[  974.361366] 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[  974.368801] 
               which lock already depends on the new lock.

[  974.377010] 
               the existing dependency chain (in reverse order) is:
[  974.384501] 
               -> #1 (&mm->mmap_sem){++++}:
[  974.389911]        __might_fault+0x68/0x90
[  974.394025]        _copy_from_user+0x23/0xa0
[  974.398311]        sock_setsockopt+0x4a2/0xac0
[  974.402761]        __sys_setsockopt+0xd9/0xf0
[  974.407118]        SyS_setsockopt+0xe/0x20
[  974.411242]        do_syscall_64+0x6e/0x1a0
[  974.415431]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.421011] 
               -> #0 (sk_lock-AF_INET6){+.+.}:
[  974.426690]        lock_acquire+0x95/0x1e0
[  974.430813]        lock_sock_nested+0x71/0xa0
[  974.435196]        tcp_mmap+0x7c/0x550
[  974.438940]        sock_mmap+0x23/0x30
[  974.442695]        mmap_region+0x3a4/0x5d0
[  974.446808]        do_mmap+0x313/0x530
[  974.450571]        vm_mmap_pgoff+0xc7/0x100
[  974.454769]        ksys_mmap_pgoff+0x1d5/0x260
[  974.459247]        SyS_mmap+0x1b/0x30
[  974.462936]        do_syscall_64+0x6e/0x1a0
[  974.467114]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.472678] 
               other info that might help us debug this:

[  974.480677]  Possible unsafe locking scenario:

[  974.486600]        CPU0                    CPU1
[  974.491152]        ----                    ----
[  974.495684]   lock(&mm->mmap_sem);
[  974.499089]                                lock(sk_lock-AF_INET6);
[  974.505285]                                lock(&mm->mmap_sem);
[  974.511211]   lock(sk_lock-AF_INET6);
[  974.514885] 
                *** DEADLOCK ***

[  974.520825] 1 lock held by b78299096/15790:
[  974.525018]  #0: 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[  974.532852] 
               stack backtrace:
[  974.537224] CPU: 25 PID: 15790 Comm: b78299096 Not tainted 4.16.0-dbx-DEV #40
[  974.544371] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.40.0 06/22/2016
[  974.551333] Call Trace:
[  974.553792]  dump_stack+0x70/0xa5
[  974.557111]  print_circular_bug.isra.39+0x1d8/0x1e6
[  974.561982]  __lock_acquire+0x1284/0x1340
[  974.565992]  ? tcp_mmap+0x7c/0x550
[  974.569419]  lock_acquire+0x95/0x1e0
[  974.573011]  ? lock_acquire+0x95/0x1e0
[  974.576767]  ? tcp_mmap+0x7c/0x550
[  974.580167]  lock_sock_nested+0x71/0xa0
[  974.584023]  ? tcp_mmap+0x7c/0x550
[  974.587437]  tcp_mmap+0x7c/0x550
[  974.590677]  sock_mmap+0x23/0x30
[  974.593909]  mmap_region+0x3a4/0x5d0
[  974.597506]  do_mmap+0x313/0x530
[  974.600749]  vm_mmap_pgoff+0xc7/0x100
[  974.604414]  ksys_mmap_pgoff+0x1d5/0x260
[  974.608341]  ? fd_install+0x25/0x30
[  974.611849]  ? trace_hardirqs_on_caller+0xef/0x180
[  974.616641]  SyS_mmap+0x1b/0x30
[  974.619804]  do_syscall_64+0x6e/0x1a0
[  974.623462]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  974.628549] RIP: 0033:0x433749
[  974.631600] RSP: 002b:00007ffd29fdb438 EFLAGS: 00000216 ORIG_RAX: 0000000000000009
[  974.639197] RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000433749
[  974.646323] RDX: 0000000000000008 RSI: 0000000000004000 RDI: 0000000020ab7000
[  974.653463] RBP: 00007ffd29fdb460 R08: 0000000000000003 R09: 0000000000000000
[  974.660603] R10: 0000000000000012 R11: 0000000000000216 R12: 0000000000401670
[  974.667737] R13: 0000000000401700 R14: 0000000000000000 R15: 0000000000000000


I am not sure we can keep the mmap() API, since we probably need to lock the socket first,
then grab the VM semaphore.
Eric Dumazet April 20, 2018, 1:01 a.m. | #2
On 04/19/2018 04:15 PM, Eric Dumazet wrote:

> I am not sure we can keep the mmap() API, since we probably need to lock the socket first,
> then grab the VM semaphore.
> 

We can keep the nice mmap() interface, granted we can add one hook like in the following patch.

David, do you think such a patch would be acceptable to lkml and the mm/fs maintainers?

An alternative would be implementing an ioctl() or getsockopt() operation,
but it seems less natural...

Thanks !

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92efaf1f89775f7b017477617dd983c10e0dc4d2..016c711ac33e226b4285ee5bd688e14661dc0879 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1714,6 +1714,7 @@ struct file_operations {
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
+       void (*mmap_hook) (struct file *, bool);
        unsigned long mmap_supported_flags;
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
diff --git a/mm/util.c b/mm/util.c
index 1fc4fa7576f762bbbf341f056ca6d0be803a423f..b546c59a6169c4dfa9011c61e86da4d03496aa4d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -350,11 +350,20 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
        ret = security_mmap_file(file, prot, flag);
        if (!ret) {
-               if (down_write_killable(&mm->mmap_sem))
+               void (*mmap_hook)(struct file *, bool) = file ? file->f_op->mmap_hook : NULL;
+
+               if (mmap_hook)
+                       mmap_hook(file, true);
+               if (down_write_killable(&mm->mmap_sem)) {
+                       if (mmap_hook)
+                               mmap_hook(file, false);
                        return -EINTR;
+               }
                ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
                                    &populate, &uf);
                up_write(&mm->mmap_sem);
+               if (mmap_hook)
+                       mmap_hook(file, false);
                userfaultfd_unmap_complete(mm, &uf);
                if (populate)
                        mm_populate(ret, populate);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4022073b0aeea9d07af0fa825b640a00512908a3..79b05d6d41643e8c309dfb8bd9597dc8b00fb0e1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1756,8 +1756,6 @@ int tcp_mmap(struct file *file, struct socket *sock,
        /* TODO: Maybe the following is not needed if pages are COW */
        vma->vm_flags &= ~VM_MAYWRITE;
 
-       lock_sock(sk);
-
        ret = -ENOTCONN;
        if (sk->sk_state == TCP_LISTEN)
                goto out;
@@ -1833,7 +1831,6 @@ int tcp_mmap(struct file *file, struct socket *sock,
 
        ret = 0;
 out:
-       release_sock(sk);
        kvfree(pages_array);
        return ret;
 }
diff --git a/net/socket.c b/net/socket.c
index f10f1d947c78c193b49379b0ec641d81367fb4cf..bcabae3c37d765e5c0548a14fc93c19258972b48 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,6 +131,16 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
                                struct pipe_inode_info *pipe, size_t len,
                                unsigned int flags);
 
+static void sock_mmap_hook(struct file *file, bool enter)
+{
+       struct socket *sock = file->private_data;
+       struct sock *sk = sock->sk;
+
+       if (enter)
+               lock_sock(sk);
+       else
+               release_sock(sk);
+}
 /*
  *     Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
  *     in the operation structures but are done directly via the socketcall() multiplexor.
@@ -147,6 +157,7 @@ static const struct file_operations socket_file_ops = {
        .compat_ioctl = compat_sock_ioctl,
 #endif
        .mmap =         sock_mmap,
+       .mmap_hook =    sock_mmap_hook,
        .release =      sock_close,
        .fasync =       sock_fasync,
        .sendpage =     sock_sendpage,
David Miller April 20, 2018, 1:17 a.m. | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 Apr 2018 18:01:32 -0700

> David, do you think such a patch would be acceptable to lkml and the mm/fs
> maintainers?

You will have to ask them directly I think :)
Jonathan Corbet April 20, 2018, 3:19 p.m. | #4
On Thu, 19 Apr 2018 18:01:32 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We can keep the nice mmap() interface, granted we can add one hook like in the following patch.
> 
> David, do you think such a patch would be acceptable to lkml and the mm/fs maintainers?
> 
> An alternative would be implementing an ioctl() or getsockopt() operation,
> but it seems less natural...

So I have little standing here, but what the heck, not letting that bother
me has earned me a living for the last 20 years or so...:)

I think you should consider switching over to an interface where you
mmap() the region once, and use ioctl() to move the data into that region,
for a couple of reasons beyond the locking issues you've already found:

 - The "mmap() consumes data" semantics are a bit ... strange, IMO.
   That's not what mmap() normally does.  People expect ioctl() to do
   magic things, instead.

 - I would expect it to be a tiny bit faster, since you wouldn't be doing
   the VMA setup and teardown each time.

Thanks,

jon
Eric Dumazet April 20, 2018, 3:39 p.m. | #5
On 04/20/2018 08:19 AM, Jonathan Corbet wrote:
> On Thu, 19 Apr 2018 18:01:32 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> We can keep the nice mmap() interface, granted we can add one hook like in the following patch.
>>
>> David, do you think such a patch would be acceptable to lkml and the mm/fs maintainers?
>>
>> An alternative would be implementing an ioctl() or getsockopt() operation,
>> but it seems less natural...
> 

Hi Jonathan

> So I have little standing here, but what the heck, not letting that bother
> me has earned me a living for the last 20 years or so...:)
> 
> I think you should consider switching over to an interface where you
> mmap() the region once, and use ioctl() to move the data into that region,
> for a couple of reasons beyond the locking issues you've already found:
> 
>  - The "mmap() consumes data" semantics are a bit ... strange, IMO.
>    That's not what mmap() normally does.  People expect ioctl() to do
>    magic things, instead.

Well, the thing is that most of our use cases won't reuse the same mmap() area.

The RPC layer will provide all RPCs, with their associated pages, to RPC consumers.

RPC consumers will decide whether to keep these pages or consume them.

So having to mmap() plus another syscall to consume XXX bytes from the receive queue is not
going to save cpu cycles :/

Having the ability to call mmap() multiple times for the same TCP payload is not
going to be of any use in real applications. This is why I only support 'offset 0'
for the last mmap() parameter.

> 
>  - I would expect it to be a tiny bit faster, since you wouldn't be doing
>    the VMA setup and teardown each time.

Maybe for the degenerate case we could reuse the same region over and over.

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0ee85c47c185afcb8e1017d59e02313cb5df78ec..833154e3df173ea41aa16dd1ec739a175c679c5c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -404,6 +404,8 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_data_ready(struct sock *sk);
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f5c562aaef3522519bcf1ae37782a7e14e278723..3ebf599cebaea4926decc1aad7274b12ec7e1566 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -994,7 +994,7 @@  const struct proto_ops inet_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = inet_recvmsg,
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.splice_read	   = tcp_splice_read,
 	.read_sock	   = tcp_read_sock,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c768d306b65714bb8740c60110c43042508af6b7..438fbca96cd3100d722e1bd8bcc6f49624495a21 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,6 +1726,119 @@  int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
+/* When user wants to mmap X pages, we first need to perform the mapping
+ * before freeing any skbs in receive queue, otherwise user would be unable
+ * to fallback to standard recvmsg(). This happens if some data in the
+ * requested block is not exactly fitting in a page.
+ *
+ * We only support order-0 pages for the moment.
+ * mmap() on TCP is very strict, there is no point
+ * trying to accommodate with pathological layouts.
+ */
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	unsigned int nr_pages = size >> PAGE_SHIFT;
+	struct page **pages_array = NULL;
+	u32 seq, len, offset, nr = 0;
+	struct sock *sk = sock->sk;
+	const skb_frag_t *frags;
+	struct tcp_sock *tp;
+	struct sk_buff *skb;
+	int ret;
+
+	if (vma->vm_pgoff || !nr_pages)
+		return -EINVAL;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+	/* TODO: Maybe the following is not needed if pages are COW */
+	vma->vm_flags &= ~VM_MAYWRITE;
+
+	lock_sock(sk);
+
+	ret = -ENOTCONN;
+	if (sk->sk_state == TCP_LISTEN)
+		goto out;
+
+	sock_rps_record_flow(sk);
+
+	if (tcp_inq(sk) < size) {
+		ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN;
+		goto out;
+	}
+	tp = tcp_sk(sk);
+	seq = tp->copied_seq;
+	/* Abort if urgent data is in the area */
+	if (unlikely(tp->urg_data)) {
+		u32 urg_offset = tp->urg_seq - seq;
+
+		ret = -EINVAL;
+		if (urg_offset < size)
+			goto out;
+	}
+	ret = -ENOMEM;
+	pages_array = kvmalloc_array(nr_pages, sizeof(struct page *),
+				     GFP_KERNEL);
+	if (!pages_array)
+		goto out;
+	skb = tcp_recv_skb(sk, seq, &offset);
+	ret = -EINVAL;
+skb_start:
+	/* We do not support anything not in page frags */
+	offset -= skb_headlen(skb);
+	if ((int)offset < 0)
+		goto out;
+	if (skb_has_frag_list(skb))
+		goto out;
+	len = skb->data_len - offset;
+	frags = skb_shinfo(skb)->frags;
+	while (offset) {
+		if (frags->size > offset)
+			goto out;
+		offset -= frags->size;
+		frags++;
+	}
+	while (nr < nr_pages) {
+		if (len) {
+			if (len < PAGE_SIZE)
+				goto out;
+			if (frags->size != PAGE_SIZE || frags->page_offset)
+				goto out;
+			pages_array[nr++] = skb_frag_page(frags);
+			frags++;
+			len -= PAGE_SIZE;
+			seq += PAGE_SIZE;
+			continue;
+		}
+		skb = skb->next;
+		offset = seq - TCP_SKB_CB(skb)->seq;
+		goto skb_start;
+	}
+	/* OK, we have a full set of pages ready to be inserted into vma */
+	for (nr = 0; nr < nr_pages; nr++) {
+		ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT),
+				     pages_array[nr]);
+		if (ret)
+			goto out;
+	}
+	/* operation is complete, we can 'consume' all skbs */
+	tp->copied_seq = seq;
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	tcp_recv_skb(sk, seq, &offset);
+	tcp_cleanup_rbuf(sk, size);
+
+	ret = 0;
+out:
+	release_sock(sk);
+	kvfree(pages_array);
+	return ret;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e70d59fb26e16ace1eb484d23964946092a2cd57..2c694912df2e77b414de5cc2aa43e2ec59286836 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -579,7 +579,7 @@  const struct proto_ops inet6_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = inet_recvmsg,		/* ok		*/
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.sendmsg_locked    = tcp_sendmsg_locked,
 	.sendpage_locked   = tcp_sendpage_locked,