diff mbox

tcp: allow splice() to build full TSO packets

Message ID 1333481821.18626.322.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 3, 2012, 7:37 p.m. UTC
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com>
---
 net/ipv4/tcp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller April 3, 2012, 9:21 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 21:37:01 +0200

> vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
> time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)
> 
> The call to tcp_push() at the end of do_tcp_sendpages() forces an
> immediate xmit when pipe is not already filled, and tso_fragment() try
> to split these skb to MSS multiples.
> 
> 4096 bytes are usually split in a skb with 2 MSS, and a remaining
> sub-mss skb (assuming MTU=1500)

Interesting.

But why doesn't TCP_NAGLE_CORK save us?  That gets passed down into
the push pending frames logic when MSG_MORE is specified.

As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
deferral logic should do the right thing here.

Obviously you see different behavior, but why?

Also, by eliding the tcp_push() call you are introducing other side
effects:

1) we won't do the tcp_mark_push logic

2) we don't set the URG seq

I think #2 can never happen in the vmsplice/splice path, but #1 might
matter.

That's why I want to concentrate on why the tcp_push() path doesn't
behave properly when MSG_MORE is set.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 3, 2012, 9:31 p.m. UTC | #2
On Tue, 2012-04-03 at 17:21 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 21:37:01 +0200
> 
> > vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
> > time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)
> > 
> > The call to tcp_push() at the end of do_tcp_sendpages() forces an
> > immediate xmit when pipe is not already filled, and tso_fragment() try
> > to split these skb to MSS multiples.
> > 
> > 4096 bytes are usually split in a skb with 2 MSS, and a remaining
> > sub-mss skb (assuming MTU=1500)
> 
> Interesting.
> 
> But why doesn't TCP_NAGLE_CORK save us?  That gets passed down into
> the push pending frames logic when MSG_MORE is specified.
> 
> As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
> deferral logic should do the right thing here.
> 
> Obviously you see different behavior, but why?
> 
> Also, by eliding the tcp_push() call you are introducing other side
> effects:
> 
> 1) we won't do the tcp_mark_push logic
> 
> 2) we don't set the URG seq
> 
> I think #2 can never happen in the vmsplice/splice path, but #1 might
> matter.
> 
> That's why I want to concentrate on why the tcp_push() path doesn't
> behave properly when MSG_MORE is set.

It behaves properly I think, but in the tcp_sendmsg() perspective only.

The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
probably copy/pasted) but the thing is tcp_sendmsg() is called once per
sendmsg() call (and the push logic is OK at the end of it), while a
single splice() system call can call do_tcp_sendpages() 16 times (or
even more if pipe buffer was extended by fcntl(F_SETPIPE_SZ))

Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
tried this today and found needed surgery was complex). Also this would
lock socket for a long period and could add latencies because of backlog
processing.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 3, 2012, 9:36 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 23:31:29 +0200

> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
> probably copy/pasted) but the thing is tcp_sendmsg() is called once per
> sendmsg() call (and the push logic is OK at the end of it), while a
> single splice() system call can call do_tcp_sendpages() 16 times (or
> even more if pipe buffer was extended by fcntl(F_SETPIPE_SZ))

Ok, so this means that in essence the tcp_mark_push should also only
be done in the final sendpage call.

And since I'm wholly convinced that the URG stuff is a complete
"don't care" for this path, I'm convinced your patch is the right
thing to do.

Applied to 'net' and queued up for -stable, thanks Eric.

> Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
> tried this today and found needed surgery was complex). Also this would
> lock socket for a long period and could add latencies because of backlog
> processing.

I don't think this is a good idea.  Maybe we can do some level of
batching at some point, but it would need to have a limit.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 3, 2012, 10:50 p.m. UTC | #4
From: David Miller <davem@davemloft.net>
Date: Tue, 03 Apr 2012 17:36:14 -0400 (EDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 23:31:29 +0200
> 
>> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
>> probably copy/pasted) but the thing is tcp_sendmsg() is called once per
>> sendmsg() call (and the push logic is OK at the end of it), while a
>> single splice() system call can call do_tcp_sendpages() 16 times (or
>> even more if pipe buffer was extended by fcntl(F_SETPIPE_SZ))
> 
> Ok, so this means that in essence the tcp_mark_push should also only
> be done in the final sendpage call.
> 
> And since I'm wholly convinced that the URG stuff is a complete
> "don't care" for this path, I'm convinced your patch is the right
> thing to do.
> 
> Applied to 'net' and queued up for -stable, thanks Eric.

Eric, sorry to be a pain, but to clear my conscience can you tell me
if the sendfile() path do the right thing with your change too?

I'm concerned about returning back into userspace without at least
one tcp_push() at the end.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 4, 2012, 6:15 a.m. UTC | #5
On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:

> Eric, sorry to be a pain, but to clear my conscience can you tell me
> if the sendfile() path do the right thing with your change too?
> 
> I'm concerned about returning back into userspace without at least
> one tcp_push() at the end.

Absolutely, this patch also handles the sendfile() path.

do_sendfile() ->
 do_splice_direct() ->
  splice_direct_to_actor() -> 
   ...
    splice_from_pipe() ->
     splice_from_pipe_feed() ->
      pipe_to_sendpage()  more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
       sock_sendpage() if (more) flags |= MSG_MORE;
        kernel_sendpage()
         inet_sendpage()
          tcp_sendpage()

So the push is perfomed at the end (last page doesnt have MSG_MORE flag)

Here is what we got _before_ the patch using "netperf -t TCP_SENDFILE --
-m 16K"


07:50:14.097137 IP A > B: Flags [S], seq 2635999430, win 14600, options [mss 1460,sackOK,TS val 56315934 ecr 0,nop,wscale 7], length 0
07:50:14.107211 IP B > A: Flags [S.], seq 2438979637, ack 2635999431, win 14480, options [mss 1460,sackOK,TS val 56315935 ecr 56315934,nop,wscale 7], length 0
07:50:14.117318 IP A > B: Flags [.], ack 1, win 115, options [nop,nop,TS val 56315936 ecr 56315935], length 0
07:50:14.117534 IP A > B: Flags [.], seq 1:2897, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 2896
07:50:14.117543 IP A > B: Flags [.], seq 2897:4345, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 1448
07:50:14.127612 IP B > A: Flags [.], ack 2897, win 136, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.127618 IP B > A: Flags [.], ack 4345, win 159, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.137698 IP A > B: Flags [.], seq 4345:8689, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 4344
07:50:14.137731 IP A > B: Flags [.], seq 8689:11585, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 2896
07:50:14.147906 IP B > A: Flags [.], ack 8689, win 181, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.147910 IP B > A: Flags [.], ack 11585, win 204, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.158012 IP A > B: Flags [.], seq 11585:17377, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 5792
07:50:14.158022 IP A > B: Flags [P.], seq 17377:18825, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 1448
07:50:14.158171 IP A > B: Flags [.], seq 18825:21721, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 2896
07:50:14.168361 IP B > A: Flags [.], ack 17377, win 227, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168366 IP B > A: Flags [.], ack 18825, win 249, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168424 IP B > A: Flags [.], ack 21721, win 272, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.178460 IP A > B: Flags [.], seq 21721:28961, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 7240
07:50:14.178466 IP A > B: Flags [.], seq 28961:34753, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 5792
07:50:14.188668 IP B > A: Flags [.], ack 28961, win 295, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.188675 IP B > A: Flags [.], ack 34753, win 317, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.198772 IP A > B: Flags [.], seq 34753:44889, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 10136
07:50:14.198778 IP A > B: Flags [P.], seq 44889:47785, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 2896
07:50:14.208990 IP B > A: Flags [.], ack 44889, win 340, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.208995 IP B > A: Flags [.], ack 47785, win 362, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.219081 IP A > B: Flags [.], seq 47785:60817, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 13032
07:50:14.219090 IP A > B: Flags [.], seq 60817:68057, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 7240
07:50:14.229357 IP B > A: Flags [.], ack 60817, win 385, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.229361 IP B > A: Flags [.], ack 68057, win 408, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.239450 IP A > B: Flags [.], seq 68057:73849, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 5792
07:50:14.239458 IP A > B: Flags [.], seq 73849:82537, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 8688
07:50:14.239492 IP A > B: Flags [.], seq 82537:86881, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 4344
07:50:14.249694 IP B > A: Flags [.], ack 73849, win 430, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249699 IP B > A: Flags [.], ack 82537, win 453, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249753 IP B > A: Flags [.], ack 86881, win 476, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.259790 IP A > B: Flags [.], seq 86881:98465, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 11584
07:50:14.259796 IP A > B: Flags [.], seq 98465:99913, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 1448
07:50:14.259826 IP A > B: Flags [.], seq 99913:108601, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 8688
07:50:14.259830 IP A > B: Flags [P.], seq 108601:112945, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 4344



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 4, 2012, 6:20 a.m. UTC | #6
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 04 Apr 2012 08:15:28 +0200

> On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:
> 
>> Eric, sorry to be a pain, but to clear my conscience can you tell me
>> if the sendfile() path do the right thing with your change too?
>> 
>> I'm concerned about returning back into userspace without at least
>> one tcp_push() at the end.
> 
> Absolutely, this patch also handles the sendfile() path.
> 
> do_sendfile() ->
>  do_splice_direct() ->
>   splice_direct_to_actor() -> 
>    ...
>     splice_from_pipe() ->
>      splice_from_pipe_feed() ->
>       pipe_to_sendpage()  more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
>        sock_sendpage() if (more) flags |= MSG_MORE;
>         kernel_sendpage()
>          inet_sendpage()
>           tcp_sendpage()
> 
> So the push is perfomed at the end (last page doesnt have MSG_MORE flag)

Perfect, thanks for double checking.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfd7edd..2ff6f45 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,7 @@  wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && !(flags & MSG_MORE))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;