Message ID | 1333481821.18626.322.camel@edumazet-glaptop
---|---
State | Accepted, archived
Delegated to: | David Miller
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 21:37:01 +0200

> vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
> time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).
>
> The call to tcp_push() at the end of do_tcp_sendpages() forces an
> immediate xmit when the pipe is not already filled, and tso_fragment()
> tries to split these skbs into MSS multiples.
>
> 4096 bytes are usually split into an skb with 2 MSS, and a remaining
> sub-MSS skb (assuming MTU=1500).

Interesting.

But why doesn't TCP_NAGLE_CORK save us? That gets passed down into
the push pending frames logic when MSG_MORE is specified.

As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
deferral logic should do the right thing here.

Obviously you see different behavior, but why?

Also, by eliding the tcp_push() call you are introducing other side
effects:

1) we won't do the tcp_mark_push logic

2) we don't set the URG seq

I think #2 can never happen in the vmsplice/splice path, but #1 might
matter.

That's why I want to concentrate on why the tcp_push() path doesn't
behave properly when MSG_MORE is set.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2012-04-03 at 17:21 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 21:37:01 +0200
>
> > vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
> > time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).
> >
> > The call to tcp_push() at the end of do_tcp_sendpages() forces an
> > immediate xmit when the pipe is not already filled, and tso_fragment()
> > tries to split these skbs into MSS multiples.
> >
> > 4096 bytes are usually split into an skb with 2 MSS, and a remaining
> > sub-MSS skb (assuming MTU=1500).
>
> Interesting.
>
> But why doesn't TCP_NAGLE_CORK save us? That gets passed down into
> the push pending frames logic when MSG_MORE is specified.
>
> As far as I can tell, the combination of TCP_NAGLE_CORK and the TSO
> deferral logic should do the right thing here.
>
> Obviously you see different behavior, but why?
>
> Also, by eliding the tcp_push() call you are introducing other side
> effects:
>
> 1) we won't do the tcp_mark_push logic
>
> 2) we don't set the URG seq
>
> I think #2 can never happen in the vmsplice/splice path, but #1 might
> matter.
>
> That's why I want to concentrate on why the tcp_push() path doesn't
> behave properly when MSG_MORE is set.

It behaves properly I think, but only from the tcp_sendmsg() perspective.

The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
probably copy/pasted), but the thing is tcp_sendmsg() is called once per
sendmsg() call (and the push logic is OK at the end of it), while a
single splice() system call can call do_tcp_sendpages() 16 times (or
even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).

Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
tried this today and found the needed surgery was complex. Also this
would lock the socket for a long period and could add latencies because
of backlog processing.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Apr 2012 23:31:29 +0200

> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
> probably copy/pasted), but the thing is tcp_sendmsg() is called once per
> sendmsg() call (and the push logic is OK at the end of it), while a
> single splice() system call can call do_tcp_sendpages() 16 times (or
> even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).

Ok, so this means that in essence the tcp_mark_push should also only
be done in the final sendpage call.

And since I'm wholly convinced that the URG stuff is a complete
"don't care" for this path, I'm convinced your patch is the right
thing to do.

Applied to 'net' and queued up for -stable, thanks Eric.

> Maybe a real fix would be to call do_tcp_sendpages() exactly once, but I
> tried this today and found the needed surgery was complex. Also this
> would lock the socket for a long period and could add latencies because
> of backlog processing.

I don't think this is a good idea. Maybe we can do some level of
batching at some point, but it would need to have a limit.
From: David Miller <davem@davemloft.net>
Date: Tue, 03 Apr 2012 17:36:14 -0400 (EDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Apr 2012 23:31:29 +0200
>
>> The code in tcp_sendmsg() and do_tcp_sendpages() is similar (actually
>> probably copy/pasted), but the thing is tcp_sendmsg() is called once per
>> sendmsg() call (and the push logic is OK at the end of it), while a
>> single splice() system call can call do_tcp_sendpages() 16 times (or
>> even more if the pipe buffer was extended by fcntl(F_SETPIPE_SZ)).
>
> Ok, so this means that in essence the tcp_mark_push should also only
> be done in the final sendpage call.
>
> And since I'm wholly convinced that the URG stuff is a complete
> "don't care" for this path, I'm convinced your patch is the right
> thing to do.
>
> Applied to 'net' and queued up for -stable, thanks Eric.

Eric, sorry to be a pain, but to clear my conscience can you tell me
if the sendfile() path does the right thing with your change too?

I'm concerned about returning back into userspace without at least
one tcp_push() at the end.
On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:
> Eric, sorry to be a pain, but to clear my conscience can you tell me
> if the sendfile() path does the right thing with your change too?
>
> I'm concerned about returning back into userspace without at least
> one tcp_push() at the end.

Absolutely, this patch also handles the sendfile() path.

do_sendfile() ->
 do_splice_direct() ->
  splice_direct_to_actor() ->
   ...
   splice_from_pipe() ->
    splice_from_pipe_feed() ->
     pipe_to_sendpage()
       more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
      sock_sendpage()
        if (more) flags |= MSG_MORE;
       kernel_sendpage()
        inet_sendpage()
         tcp_sendpage()

So the push is performed at the end (the last page doesn't have the
MSG_MORE flag).

Here is what we got _before_ the patch, using
"netperf -t TCP_SENDFILE -- -m 16K":

07:50:14.097137 IP A > B: Flags [S], seq 2635999430, win 14600, options [mss 1460,sackOK,TS val 56315934 ecr 0,nop,wscale 7], length 0
07:50:14.107211 IP B > A: Flags [S.], seq 2438979637, ack 2635999431, win 14480, options [mss 1460,sackOK,TS val 56315935 ecr 56315934,nop,wscale 7], length 0
07:50:14.117318 IP A > B: Flags [.], ack 1, win 115, options [nop,nop,TS val 56315936 ecr 56315935], length 0
07:50:14.117534 IP A > B: Flags [.], seq 1:2897, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 2896
07:50:14.117543 IP A > B: Flags [.], seq 2897:4345, ack 1, win 115, options [nop,nop,TS val 56315937 ecr 56315935], length 1448
07:50:14.127612 IP B > A: Flags [.], ack 2897, win 136, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.127618 IP B > A: Flags [.], ack 4345, win 159, options [nop,nop,TS val 56315938 ecr 56315937], length 0
07:50:14.137698 IP A > B: Flags [.], seq 4345:8689, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 4344
07:50:14.137731 IP A > B: Flags [.], seq 8689:11585, ack 1, win 115, options [nop,nop,TS val 56315939 ecr 56315938], length 2896
07:50:14.147906 IP B > A: Flags [.], ack 8689, win 181, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.147910 IP B > A: Flags [.], ack 11585, win 204, options [nop,nop,TS val 56315940 ecr 56315939], length 0
07:50:14.158012 IP A > B: Flags [.], seq 11585:17377, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 5792
07:50:14.158022 IP A > B: Flags [P.], seq 17377:18825, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 1448
07:50:14.158171 IP A > B: Flags [.], seq 18825:21721, ack 1, win 115, options [nop,nop,TS val 56315941 ecr 56315940], length 2896
07:50:14.168361 IP B > A: Flags [.], ack 17377, win 227, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168366 IP B > A: Flags [.], ack 18825, win 249, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.168424 IP B > A: Flags [.], ack 21721, win 272, options [nop,nop,TS val 56315942 ecr 56315941], length 0
07:50:14.178460 IP A > B: Flags [.], seq 21721:28961, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 7240
07:50:14.178466 IP A > B: Flags [.], seq 28961:34753, ack 1, win 115, options [nop,nop,TS val 56315943 ecr 56315942], length 5792
07:50:14.188668 IP B > A: Flags [.], ack 28961, win 295, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.188675 IP B > A: Flags [.], ack 34753, win 317, options [nop,nop,TS val 56315944 ecr 56315943], length 0
07:50:14.198772 IP A > B: Flags [.], seq 34753:44889, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 10136
07:50:14.198778 IP A > B: Flags [P.], seq 44889:47785, ack 1, win 115, options [nop,nop,TS val 56315945 ecr 56315944], length 2896
07:50:14.208990 IP B > A: Flags [.], ack 44889, win 340, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.208995 IP B > A: Flags [.], ack 47785, win 362, options [nop,nop,TS val 56315946 ecr 56315945], length 0
07:50:14.219081 IP A > B: Flags [.], seq 47785:60817, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 13032
07:50:14.219090 IP A > B: Flags [.], seq 60817:68057, ack 1, win 115, options [nop,nop,TS val 56315947 ecr 56315946], length 7240
07:50:14.229357 IP B > A: Flags [.], ack 60817, win 385, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.229361 IP B > A: Flags [.], ack 68057, win 408, options [nop,nop,TS val 56315948 ecr 56315947], length 0
07:50:14.239450 IP A > B: Flags [.], seq 68057:73849, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 5792
07:50:14.239458 IP A > B: Flags [.], seq 73849:82537, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 8688
07:50:14.239492 IP A > B: Flags [.], seq 82537:86881, ack 1, win 115, options [nop,nop,TS val 56315949 ecr 56315948], length 4344
07:50:14.249694 IP B > A: Flags [.], ack 73849, win 430, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249699 IP B > A: Flags [.], ack 82537, win 453, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.249753 IP B > A: Flags [.], ack 86881, win 476, options [nop,nop,TS val 56315950 ecr 56315949], length 0
07:50:14.259790 IP A > B: Flags [.], seq 86881:98465, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 11584
07:50:14.259796 IP A > B: Flags [.], seq 98465:99913, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 1448
07:50:14.259826 IP A > B: Flags [.], seq 99913:108601, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 8688
07:50:14.259830 IP A > B: Flags [P.], seq 108601:112945, ack 1, win 115, options [nop,nop,TS val 56315951 ecr 56315950], length 4344
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 04 Apr 2012 08:15:28 +0200

> On Tue, 2012-04-03 at 18:50 -0400, David Miller wrote:
>
>> Eric, sorry to be a pain, but to clear my conscience can you tell me
>> if the sendfile() path does the right thing with your change too?
>>
>> I'm concerned about returning back into userspace without at least
>> one tcp_push() at the end.
>
> Absolutely, this patch also handles the sendfile() path.
>
> do_sendfile() ->
>  do_splice_direct() ->
>   splice_direct_to_actor() ->
>    ...
>    splice_from_pipe() ->
>     splice_from_pipe_feed() ->
>      pipe_to_sendpage()
>        more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len
>       sock_sendpage()
>         if (more) flags |= MSG_MORE;
>        kernel_sendpage()
>         inet_sendpage()
>          tcp_sendpage()
>
> So the push is performed at the end (the last page doesn't have the
> MSG_MORE flag).

Perfect, thanks for double checking.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfd7edd..2ff6f45 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && !(flags & MSG_MORE))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when the pipe is not already filled, and tso_fragment()
tries to split these skbs into MSS multiples.

4096 bytes are usually split into an skb with 2 MSS, and a remaining
sub-MSS skb (assuming MTU=1500).

This makes slow start suboptimal because many small frames are sent to
the qdisc/driver layers instead of big ones (constrained by cwnd and
packets in flight, of course).

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.
This bit is automatically provided by splice() internals for all pages
but the last one, or for all pages if the user specified the
SPLICE_F_MORE splice() flag.

In some workloads, this can reduce the number of sent logical packets
by an order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)