
tcp: splice as many packets as possible at once

Message ID: 4967DF10.2010107@cosmosbay.com
State: Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 9, 2009, 11:34 p.m. UTC
Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote:
>> Only the (!timeo) can be above. Other conditions must be checked after
>> the release/lock.
> 
> Yes that's what Evgeniy explained too. I smelled something like this
> but did not know.
> 
> Care to redo the whole patch, since you already have all code parts
> at hand as well as some fragments of commit messages ? You can even
> add my Tested-by if you want. Finally it was nice that Dave asked
> for this explanation because it drove our nose to the fishy parts ;-)

Sure, here it is :


David, do you think we still must call __tcp_splice_read() only once
in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

With the following patch, a splice() call is limited to 16 frames, typically
16*1460 = 23360 bytes. Removing the test, as Willy did in his patch,
could return the exact length requested by the user (limited to 16 pages),
giving nice blocks if feeding a file on disk...

Thank you

From: Willy Tarreau <w@1wt.eu>

[PATCH] tcp: splice as many packets as possible at once

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Jan. 13, 2009, 5:45 a.m. UTC | #1
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

Eric, I'll get to this thread as soon as I can, perhaps tomorrow.  I
want to get all of the build fallout and bug fixes for 2.6.29-rcX
sorted before everyone heads off to LCA in the next week or so :-)
David Miller Jan. 14, 2009, 12:05 a.m. UTC | #2
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

You seem to be working that out in another thread :-)

> [PATCH] tcp: splice as many packets as possible at once
> 
> As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
> optimal. It processes at most one segment per call.
> This results in low performance and very high overhead due to syscall rate
> when splicing from interfaces which do not support LRO.
> 
> Willy provided a patch inside tcp_splice_read(), but a better fix
> is to let tcp_read_sock() process as many segments as possible, so
> that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
> often.
> 
> With this change, splice() behaves like tcp_recvmsg(), being able
> to consume many skbs in one system call. With typical 1460 bytes
> of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
> 16*1460 = 23360 bytes.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I've applied this, thanks!

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..1233835 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -522,8 +522,12 @@  static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 				unsigned int offset, size_t len)
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
+	int ret;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
+	if (ret > 0)
+		rd_desc->count -= ret;
+	return ret;
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -531,6 +535,7 @@  static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 	/* Store TCP splice context information in read_descriptor_t. */
 	read_descriptor_t rd_desc = {
 		.arg.data = tss,
+		.count	  = tss->len,
 	};
 
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
@@ -611,11 +616,13 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		tss.len -= ret;
 		spliced += ret;
 
+		if (!timeo)
+			break;
 		release_sock(sk);
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}
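
The effect of the byte budget in rd_desc->count can be illustrated outside the kernel. The following is a minimal userspace sketch, not kernel code. It assumes that the tcp_read_sock() loop stops once the descriptor's count reaches zero, which is why a zero-initialized count ended the loop after one segment, and that each segment occupies one of 16 pipe slots. The names read_desc, read_sock_model, PIPE_SLOTS and the patched flag are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Assumption: a pipe holds 16 buffers, and each segment uses one of them. */
#define PIPE_SLOTS 16

/* Hypothetical stand-in for read_descriptor_t: only the byte budget. */
struct read_desc {
	size_t count;
};

/*
 * Model of one __tcp_splice_read() call over a queue of segments.
 *
 * len     - bytes requested by the caller (tss->len in the patch)
 * patched - 0: count stays zero and the actor is limited by len
 *           1: count starts at len and shrinks with each segment
 * segs    - payload size of each queued segment
 * nsegs   - number of queued segments
 *
 * Returns the number of bytes moved into the modelled pipe.
 */
static size_t read_sock_model(size_t len, int patched,
			      const size_t *segs, size_t nsegs)
{
	struct read_desc desc = { .count = patched ? len : 0 };
	int free_slots = PIPE_SLOTS;
	size_t copied = 0;
	size_t i;

	for (i = 0; i < nsegs && free_slots > 0; i++) {
		/* The actor may take at most this many bytes. */
		size_t limit = patched ? desc.count : len;
		size_t used = segs[i] < limit ? segs[i] : limit;

		if (used == 0)
			break;

		free_slots--;
		copied += used;
		if (patched)
			desc.count -= used;

		/* Segment only partially consumed: stop here. */
		if (used < segs[i])
			break;
		/* Budget exhausted (always true when not patched). */
		if (!desc.count)
			break;
	}
	return copied;
}
```

With twenty queued segments of 1460 bytes and len = 65536, the model returns 1460 when patched is 0 and 16*1460 = 23360 when patched is 1, matching the figures in the commit message. With len = 4000 and patched set, it returns exactly 4000, because the third segment is only partially consumed.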