Message ID | 1391865273.10160.76.camel@edumazet-glaptop2.roam.corp.google.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote:
> Here is the combined patch, could you test it ?
Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d
("tcp: autocork should not hold first packet in write queue")
in your tree.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Eric, > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > ("tcp: autocork should not hold first packet in write queue") > in your tree. confirmed: (node-62) [~/work/linux-2.6] git show a181ceb501b31b4bf8812a5c84c716cc31d82c2d | head commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d Author: Eric Dumazet <edumazet@google.com> Date: Tue Dec 17 09:58:30 2013 -0800 tcp: autocork should not hold first packet in write queue Willem noticed a TCP_RR regression caused by TCP autocorking on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be right above RTT between hosts. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: > > Here is the combined patch, could you test it ? > > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > ("tcp: autocork should not hold first packet in write queue") > in your tree. > > BTW this problem demonstrates there is room for improvement in iCSCI, using MSG_MORE to avoid sending two small segments in separate frames. [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48 [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512 [1] & [3] could be coalesced, and [2] would be avoided. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2014-02-08 at 05:50 -0800, Eric Dumazet wrote: > On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: > > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: > > > Here is the combined patch, could you test it ? > > > > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > > ("tcp: autocork should not hold first packet in write queue") > > in your tree. > > > > > > BTW this problem demonstrates there is room for improvement in iCSCI, > using MSG_MORE to avoid sending two small segments in separate frames. > > [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48 > [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 > [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512 > > [1] & [3] could be coalesced, and [2] would be avoided. > With the fix, new pcap is more explicit about this suboptimal behavior : 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Eric, > > BTW this problem demonstrates there is room for improvement in iCSCI, > > using MSG_MORE to avoid sending two small segments in separate frames. > With the fix, new pcap is more explicit about this suboptimal behavior : > 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 > 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 > 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 > 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 > 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 > 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 > 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 > 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 > 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 I get the idea. However I'm a little bit confused, when I do a 'git grep MSG_MORE' I don't see much references in the Linux kernel who use it at all. So do you have an example for me where this flags needs to be applied? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2014-02-08 at 15:19 +0100, Thomas Glanzmann wrote: > Hello Eric, > I get the idea. However I'm a little bit confused, when I do a 'git grep > MSG_MORE' I don't see much references in the Linux kernel who use it at > all. So do you have an example for me where this flags needs to be > applied? Idea would be to set this flag when calling sendmsg() of the 48 bytes of the header, and not set it on the sendmsg() of the 512 bytes of the payload. iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but it would be nice to add a new _initial_ flags parameter to iscsi_sw_tcp_xmit_segment() -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Eric, > Idea would be to set this flag when calling sendmsg() of the 48 bytes > of the header, and not set it on the sendmsg() of the 512 bytes of the > payload. I see. > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but > it would be nice to add a new _initial_ flags parameter to > iscsi_sw_tcp_xmit_segment() This is for the iscsi initiator implementation. I'm interested in iSCSI target code, but I already found it and experiemented a little bit, but I need to dig deeper if I want to prepare a patch. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2014-02-08 at 16:00 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > Idea would be to set this flag when calling sendmsg() of the 48 bytes > > of the header, and not set it on the sendmsg() of the 512 bytes of the > > payload. > > I see. > > > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but > > it would be nice to add a new _initial_ flags parameter to > > iscsi_sw_tcp_xmit_segment() > > This is for the iscsi initiator implementation. I'm interested in iSCSI > target code, but I already found it and experiemented a little bit, but > I need to dig deeper if I want to prepare a patch. Fantastic ! Let me know if you want some help. Note : We did some patches in the MSG_MORE logic for sendpage(), but in your case I do not think its related (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Eric, > Note : We did some patches in the MSG_MORE logic for sendpage(), but > in your case I do not think its related > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious thank you for the pointer. The iSCSI target code actually uses sendpage whenever it can. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 2014-02-08 at 17:57 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > Note : We did some patches in the MSG_MORE logic for sendpage(), but > > in your case I do not think its related > > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious > > thank you for the pointer. The iSCSI target code actually uses sendpage > whenever it can. Yep, but the problem (at least on your pcap), is about sending the 48 bytes headers in TCP segment of its own, then the 512 byte payload in a separate segment. I suspect the sendpage() is only used for the payload. No need for MSG_MORE here. The MSG_MORE would need to be set on the first part (48 bytes header), so that TCP stack will defer the push of the segment at the time the 512 bytes payload is added. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Eric, > Yep, but the problem (at least on your pcap), is about sending the 48 > bytes headers in TCP segment of its own, then the 512 byte payload in > a separate segment. I agree. > I suspect the sendpage() is only used for the payload. No need for > MSG_MORE here. I see. > The MSG_MORE would need to be set on the first part (48 bytes header), > so that TCP stack will defer the push of the segment at the time the 512 > bytes payload is added. The iSCSI target uses one function to send all outbound data. So in order to do it right every function that is sending data in multiple chunks need to mark it correctly. Of course someone could also do some wild guessing and saying that everything that is below 512 Bytes gets pushed out. I wonder what Nab has to say about this? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 10435b3b9d0f..3be16727f058 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk) if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_CLOSE_WAIT | TCPF_LAST_ACK)) - tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC); + tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle, + 0, GFP_ATOMIC); } /* * One tasklet per cpu tries to send more skbs. @@ -1904,7 +1905,15 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (atomic_read(&sk->sk_wmem_alloc) > limit) { set_bit(TSQ_THROTTLED, &tp->tsq_flags); - break; + /* It is possible TX completion already happened + * before we set TSQ_THROTTLED, so we must + * test again the condition. + * We abuse smp_mb__after_clear_bit() because + * there is no smp_mb__after_set_bit() yet + */ + smp_mb__after_clear_bit(); + if (atomic_read(&sk->sk_wmem_alloc) > limit) + break; } limit = mss_now;