Message ID | 1318576791.2533.99.camel@edumazet-laptop |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 14 Oct 2011 09:19:51 +0200

> Many drivers allocate big skbs to store a single TCP frame.
> (WIFI drivers, or NIC using PAGE_SIZE fragments)
>
> It's now common to get skb->truesize bigger than 4096 to store a ~1500
> bytes TCP frame.
>
> TCP sessions with large RTT and packet losses can fill their Out Of
> Order queue with such oversized skbs, and hit their sk_rcvbuf limit,
> starting a pruning of complete OFO queue, without giving chance to
> receive the missing packet(s) and moving skbs from OFO to receive queue.
>
> This patch adds skb_reduce_truesize() helper, and uses it for all skbs
> queued into OFO queue.
>
> Spending some time to perform a copy is worth the pain, since it permits
> SACK processing to have a chance to complete over the RTT barrier.
>
> This greatly improves user experience, without added cost on fast path.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

No objection from me, although I wish wireless drivers were able to size their SKBs more appropriately. I wonder how many problems that look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to this truesize issue.

I think such large truesize SKBs will cause problems even in non-loss situations, in that the receive buffer will hit its limits more quickly. I'm not sure that the receive buffer autotuning is built to handle this sort of scenario as a common occurrence.

You might want to check if this is the actual root cause of your problems. If the receive buffer autotuning doesn't expand the receive buffer enough to hold two windows' worth of these large truesize SKBs, that's the real reason why we end up pruning.

We have to decide if these kinds of SKBs are acceptable as a normal situation for MSS sized frames. And if they are, then it's probably a good idea to adjust the receive buffer autotuning code too.

Although I realize it might be difficult, getting rid of these weird SKBs in the first place would be ideal.

It would also be a good idea to put the truesize inaccuracies into perspective when selecting how to fix this. It's trying to prevent 1 byte packets not accounting for the 256 byte SKB and metadata. That kind of case with such a high ratio of wastage is important.

On the other hand, using 2048 bytes for a 1500 byte packet and claiming the truesize is 1500 + sizeof(metadata)... that might be an acceptable lie to tell :-) This is especially true if it allows an easy solution to this wireless problem.

Just some thoughts... and I wonder if the wireless thing is due to some hardware limitation or similar.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday, 14 October 2011 at 03:42 -0400, David Miller wrote:

> Just some thoughts... and I wonder if the wireless thing is due to
> some hardware limitation or similar.

This patch specifically addresses the OFO problem, trying to lower memory usage for machines handling lots of sockets (proxies, for example).

For the general case, I believe we have to tune/change tcp_win_from_space() to take into account the general tendency to get fat skbs.

sysctl_tcp_adv_win_scale is not fine-grained enough today, and the default value (2) gives too many collapses. It's also a very complex setting; I am pretty sure nobody knows how to use it.

    tcp_win_from_space(int space) -> 75% of space [ default ]

The only other choices in current kernels are to set it to 1 or -1:

    tcp_win_from_space(int space) -> 50% of space

or to -2:

    tcp_win_from_space(int space) -> 25% of space
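The percentages quoted for tcp_adv_win_scale can be reproduced with a small user-space model. This is only a sketch: the function name `win_from_space` and the explicit `adv_win_scale` parameter are illustrative, and the formula is assumed to match the kernel's tcp_win_from_space() of that era (a right shift for values <= 0, a subtraction of the shifted value otherwise).

```c
#include <assert.h>

/*
 * Hypothetical user-space model of the space -> window conversion.
 * Assumption: for scale <= 0 the window is space >> -scale,
 * for scale > 0 it is space - (space >> scale).
 */
static int win_from_space(int space, int adv_win_scale)
{
	/* keep the shift count in a range that is valid for an int */
	assert(adv_win_scale > -31 && adv_win_scale < 31);

	if (adv_win_scale <= 0)
		return space >> (-adv_win_scale);
	return space - (space >> adv_win_scale);
}
```

Under that assumption, the default middle value of tcp_rmem (87380) with the default scale of 2 gives win_from_space(87380, 2) = 65535, while a scale of -2 leaves only a quarter of the space for the window.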
On 10/14/2011 12:42 AM, David Miller wrote:

> No objection from me, although I wish wireless drivers were able to
> size their SKBs more appropriately. I wonder how many problems that
> look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
> this truesize issue.

I think the buffer bloat folks are looking at latency through transmit queues. Perhaps some of their latency is really coming from retransmissions caused by packets being dropped when socket buffers overfill, but I'm pretty sure they are clever enough to look for that.

> I think such large truesize SKBs will cause problems even in non-loss
> situations, in that the receive buffer will hit its limits more
> quickly. I'm not sure that the receive buffer autotuning is built to
> handle this sort of scenario as a common occurrence.

I believe that may be the case - at least during something like:

    netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D

which on an otherwise quiet test setup will report a non-trivial number of retransmissions - either via looking at netstat -s output, or by adding local_transport_retrans,remote_transport_retrans to an output selector for netperf (e.g. -o throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end).

(I plan on providing more data after a laptop has gone through some upgrades.)

> You might want to check if this is the actual root cause of your
> problems. If the receive buffer autotuning doesn't expand the receive
> buffer enough to hold two windows' worth of these large truesize SKBs,
> that's the real reason why we end up pruning.
>
> We have to decide if these kinds of SKBs are acceptable as a normal
> situation for MSS sized frames. And if they are, then it's probably
> a good idea to adjust the receive buffer autotuning code too.
>
> Although I realize it might be difficult, getting rid of these weird
> SKBs in the first place would be ideal.

That means a semi-arbitrary alloc/copy in drivers, even when/if the wasted space isn't going to be a problem, no? That TCP_RR test above would run "just fine" if the burst size was much smaller, but if there was an arbitrary allocate/copy it would take a service demand and thus transaction rate hit.

> It would also be a good idea to put the truesize inaccuracies into
> perspective when selecting how to fix this. It's trying to prevent
> 1 byte packets not accounting for the 256 byte SKB and metadata.
> That kind of case with such a high ratio of wastage is important.
>
> On the other hand, using 2048 bytes for a 1500 byte packet and claiming
> the truesize is 1500 + sizeof(metadata)... that might be an acceptable
> lie to tell :-) This is especially true if it allows an easy solution
> to this wireless problem.

Is the wireless problem strictly a wireless problem? Many of the drivers where Eric has been fixing the truesize accounting have been wired devices, no?

rick jones
On Friday, 14 October 2011 at 08:50 -0700, Rick Jones wrote:

> Is the wireless problem strictly a wireless problem? Many of the
> drivers where Eric has been fixing the truesize accounting have been
> wired devices no?

Yes, but the goal of such fixes is to make bugs happen with said wired devices too ;)

About WIFI, I get these TCP Collapses on two different machines, one using the drivers/net/wireless/rt2x00 driver.

Extract from drivers/net/wireless/rt2x00/rt2x00queue.h:

    /**
     * DOC: Entry frame size
     *
     * Ralink PCI devices demand the Frame size to be a multiple of 128 bytes,
     * for USB devices this restriction does not apply, but the value of
     * 2432 makes sense since it is big enough to contain the maximum fragment
     * size according to the ieee802.11 specs.
     * The aggregation size depends on support from the driver, but should
     * be something around 3840 bytes.
     */
    #define DATA_FRAME_SIZE 2432
    #define MGMT_FRAME_SIZE 256
    #define AGGREGATION_SIZE 3840

You can see why we end up using skb->truesize > 4096 buffers.

I liked doing the copybreak only if needed; I found the OFO case was responsible for the Collapses most of the time.

Now we could also do the copybreak for frames queued into the regular receive_queue, if current wmem_alloc is above 25% of rcvbuf space...
On Friday, 14 October 2011 at 10:05 +0200, Eric Dumazet wrote:

> This patch specifically addresses the OFO problem, trying to lower
> memory usage for machines handling lot of sockets (proxies for example)

Well, this needs a bit more thought, so please zap the patch.

Thanks
> I believe that may be the case - at least during something like: > > netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D > > which on an otherwise quiet test setup will report a non-trivial number > of retransmissions - either via looking at netstat -s output, or by > adding local_transport_retrans,remote_transport_retrans to an output > selector for netperf (eg -o > throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end) > > > (I plan on providing more data after a laptop has gone through some > upgrades) So, a test as above from a system running 2.6.38-11-generic to a system running 3.0.0-12-generic. On the sender we have: raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end ; netstat -s > after MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : nodelay : first burst 256 Throughput,Local Transport Retransmissions,Remote Transport Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final 76752.43,274,0,16384,98304 274 retransmissions at the sender. The "beforeafter" of that on the sender: raj@tardy:~/netperf2_trunk$ cat delta.send Ip: 766747 total packets received 12 with invalid addresses 0 forwarded 0 incoming packets discarded 766735 incoming packets delivered 734689 requests sent out 0 dropped because of missing route Icmp: 0 ICMP messages received 0 input ICMP message failed. 
ICMP input histogram: destination unreachable: 0 echo requests: 0 echo replies: 0 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 0 echo request: 0 echo replies: 0 IcmpMsg: InType0: 0 InType3: 0 InType8: 0 OutType0: 0 OutType3: 0 OutType8: 0 Tcp: 2 active connections openings 0 passive connection openings 0 failed connection attempts 0 connection resets received 0 connections established 766727 segments received 734408 segments send out 274 segments retransmited 0 bad segments received. 0 resets sent Udp: 7 packets received 0 packets to unknown port received. 0 packet receive errors 7 packets sent UdpLite: TcpExt: 0 packets pruned from receive queue because of socket buffer overrun 0 ICMP packets dropped because they were out-of-window 0 TCP sockets finished time wait in fast timer 2 delayed acks sent 0 delayed acks further delayed because of locked socket Quick ack mode was activated 0 times 170856 packets directly queued to recvmsg prequeue. 
1204 bytes directly in process context from backlog 170678 bytes directly received in process context from prequeue 592090 packet headers predicted 170626 packets header predicted and directly queued to user 1375 acknowledgments not containing data payload received 174911 predicted acknowledgments 150 times recovered from packet loss by selective acknowledgements 0 congestion windows recovered without slow start by DSACK 0 congestion windows recovered without slow start after partial ack 299 TCP data loss events TCPLostRetransmit: 9 0 timeouts after reno fast retransmit 0 timeouts after SACK recovery 253 fast retransmits 14 forward retransmits 6 retransmits in slow start 0 other TCP timeouts 1 SACK retransmits failed 0 times receiver scheduled too late for direct processing 0 packets collapsed in receive queue due to low socket buffer 0 DSACKs sent for old packets 0 DSACKs received 0 connections reset due to unexpected data 0 connections reset due to early user close 0 connections aborted due to timeout 0 times unabled to send RST due to no memory TCPDSACKIgnoredOld: 0 TCPDSACKIgnoredNoUndo: 0 TCPSackShifted: 0 TCPSackMerged: 1031 TCPSackShiftFallback: 240 TCPBacklogDrop: 0 IPReversePathFilter: 0 IpExt: InMcastPkts: 0 OutMcastPkts: 0 InBcastPkts: 1 InOctets: -1012182764 OutOctets: -1436530450 InMcastOctets: 0 OutMcastOctets: 0 InBcastOctets: 147 and then the deltas on the receiver: raj@raj-8510w:~/netperf2_trunk$ cat delta.recv Ip: 734669 total packets received 0 with invalid addresses 0 forwarded 0 incoming packets discarded 734669 incoming packets delivered 766696 requests sent out 0 dropped because of missing route Icmp: 0 ICMP messages received 0 input ICMP message failed. 
ICMP input histogram: destination unreachable: 0 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: IcmpMsg: InType3: 0 Tcp: 0 active connections openings 2 passive connection openings 0 failed connection attempts 0 connection resets received 0 connections established 734651 segments received 766695 segments send out 0 segments retransmited 0 bad segments received. 0 resets sent Udp: 1 packets received 0 packets to unknown port received. 0 packet receive errors 1 packets sent UdpLite: TcpExt: 28 packets pruned from receive queue because of socket buffer overrun 0 delayed acks sent 0 delayed acks further delayed because of locked socket 19 packets directly queued to recvmsg prequeue. 0 bytes directly in process context from backlog 667 bytes directly received in process context from prequeue 727842 packet headers predicted 9 packets header predicted and directly queued to user 161 acknowledgments not containing data payload received 229704 predicted acknowledgments 6774 packets collapsed in receive queue due to low socket buffer TCPBacklogDrop: 276 IpExt: InMcastPkts: 0 OutMcastPkts: 0 InBcastPkts: 17 OutBcastPkts: 0 InOctets: 38973144 OutOctets: 40673137 InMcastOctets: 0 OutMcastOctets: 0 InBcastOctets: 1816 OutBcastOctets: 0 this is an otherwise clean network, no errors reported by ifconfig or ethtool -S, and the packet rate was well within the limits of 1 GbE and the ProCurve 2724 switch between the two systems. From just a very quick look it looks like tcp_v[46]_rcv is called, finds that the socket is owned by the user, attempts to add to the backlog, but the path called by sk_add_backlog does not seem to make any attempts to compress things, so when the quantity of data is << the truesize it starts tossing babies out with the bathwater. rick jones -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 14 Oct 2011 15:12:04 -0700

> From just a very quick look it looks like tcp_v[46]_rcv is called,
> finds that the socket is owned by the user, attempts to add to the
> backlog, but the path called by sk_add_backlog does not seem to make
> any attempts to compress things, so when the quantity of data is <<
> the truesize it starts tossing babies out with the bathwater.

This is why I don't believe the right fix is to add bandaids all around the TCP layer.

The wastage has to be avoided at a higher level.
On Friday, 14 October 2011 at 15:12 -0700, Rick Jones wrote:

Thanks Rick

> 274 retransmissions at the sender. The "beforeafter" of that on the sender:
>
> raj@tardy:~/netperf2_trunk$ cat delta.send
> Tcp:
> 766727 segments received
> 734408 segments send out
> 274 segments retransmited

Exactly the count of frames dropped because the receiver's sk_rmem_alloc + backlog.len hit the receiver's sk_rcvbuf:

    static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
    {
    	unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

    	return qsize + skb->truesize > sk->sk_rcvbuf;
    }

    static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
    {
    	if (sk_rcvqueues_full(sk, skb))
    		return -ENOBUFS;

    	__sk_add_backlog(sk, skb);
    	sk->sk_backlog.len += skb->truesize;
    	return 0;
    }

In very old kernels, we had no limit on the backlog, so we could queue a lot of extra skbs in it and eventually consume all kernel memory (OOM).

refs:

    commit c377411f249 (net: sk_add_backlog() take rmem_alloc into account)
    commit 6b03a53a5ab7 (tcp: use limited socket backlog)
    commit 8eae939f14003 (net: add limit for socket backlog)

Now that we enforce a limit, it is better to choose a correct limit / TCP window combination so that normal traffic doesn't trigger drops at the receiver.

> and then the deltas on the receiver:
>
> raj@raj-8510w:~/netperf2_trunk$ cat delta.recv
> TcpExt:
> 28 packets pruned from receive queue because of socket buffer overrun
> 6774 packets collapsed in receive queue due to low socket buffer
> TCPBacklogDrop: 276

Yes, these two counters explain it all.

1) "6774 packets collapsed in receive queue due to low socket buffer"

We spend a _lot_ of CPU time in the "collapsing" process: taking several skbs and building a compound one (using one PAGE and trying to fill all the available bytes in it with contiguous parts).

Doing this work is of course the last desperate attempt before the much more painful:

2) TCPBacklogDrop: 276

We simply drop incoming messages because too much kernel memory is used by the socket.

> From just a very quick look it looks like tcp_v[46]_rcv is called,
> finds that the socket is owned by the user, attempts to add to the
> backlog, but the path called by sk_add_backlog does not seem to make any
> attempts to compress things, so when the quantity of data is << the
> truesize it starts tossing babies out with the bathwater.

Rick, could you redo the test, using the following on the receiver:

    echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale

If you still have collapses/retransmits, you could then try:

    echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

Thanks!
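To get a feel for how quickly the admission check quoted in this message fills up, here is a rough user-space model. It is a sketch only: `skbs_before_drop` is a hypothetical helper, `qsize` stands in for sk_backlog.len + sk_rmem_alloc, every skb is assumed to have the same truesize, and nothing drains the queue while packets arrive.

```c
#include <assert.h>

/*
 * Hypothetical model of the test done by sk_rcvqueues_full():
 * count how many equally sized skbs are accepted before the first drop.
 */
static unsigned int skbs_before_drop(unsigned int rcvbuf, unsigned int truesize)
{
	unsigned int qsize = 0;      /* models sk_backlog.len + sk_rmem_alloc */
	unsigned int accepted = 0;

	assert(truesize > 0);

	/* same comparison as the kernel check, inverted: accept while not full */
	while (qsize + truesize <= rcvbuf) {
		qsize += truesize;
		accepted++;
	}
	return accepted;
}
```

With the 98304-byte receive socket size reported in the test above, this model accepts 24 skbs of truesize 4096 but 48 skbs of truesize 2048, which is why the ratio between payload and truesize matters so much for a burst of 256 outstanding requests.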
On Friday, 14 October 2011 at 19:18 -0400, David Miller wrote:

> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 14 Oct 2011 15:12:04 -0700
>
> > From just a very quick look it looks like tcp_v[46]_rcv is called,
> > finds that the socket is owned by the user, attempts to add to the
> > backlog, but the path called by sk_add_backlog does not seem to make
> > any attempts to compress things, so when the quantity of data is <<
> > the truesize it starts tossing babies out with the bathwater.
>
> This is why I don't believe the right fix is to add bandaids all
> around the TCP layer.
>
> The wastage has to be avoided at a higher level.

We can't do that at a higher level without smart hardware (like NIU) or adding a copy.

It's a tradeoff between space and speed.

Most drivers have to allocate a large skb1 and post it to the hardware to receive a frame (the length is unknown; only the max length is known).

Some drivers have a copybreak feature, copying small incoming frames into a smaller skb2 (skb2->truesize < skb1->truesize).

This strategy does save memory for small frames, but not for 1500-byte frames.

I think the problem is in the TCP layer (and maybe in other protocols):

1) Either tune rcvbuf to allow more memory to be used for a particular TCP window, or lower the TCP window to allow fewer packets in flight for a given rcvbuf.

2) TCP COLLAPSE is already trying to reduce the memory costs of a TCP socket with many packets in the OFO queue. But fixing 1) would make these collapses never happen in the first place. People wanting high TCP bandwidth [ with, say, more than 500 in-flight packets per session ] can certainly afford having enough memory.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 15 Oct 2011 08:54:42 +0200

> I think the problem is in TCP layer (and maybe in other protocols) :
>
> 1) Either tune rcvbuf to allow more memory to be used, for a particular
> tcp window,
>
> Or lower TCP window to allow less packets in flight for a given
> rcvbuf.
>
> 2) TCP COLLAPSE already is trying to reduce memory costs of a tcp socket
> with many packets in OFO queue. But fixing 1) would make these collapses
> never happen in the first place. People wanting high TCP bandwidth
> [ with say more than 500 in-flight packets per session ] can certainly
> afford having enough memory.

So perhaps the best solution is to divorce truesize from such driver and device details? If there is one calculation, then TCP need only be concerned with one case.

Look at how confusing and useless tcp_adv_win_scale ends up being for this problem.

Therefore I'll make the mostly-serious proposal that truesize be something like "initial_real_total_data + sizeof(metadata)".

So if a device receives a 512 byte packet, it's:

    512 + sizeof(metadata)

It still provides the necessary protection that truesize is meant to provide, yet sanitizes all of the receive and send buffer overhead handling.

TCP should be absolutely, and completely, impervious to details like how buffering needs to be done for some random wireless card. Just the mere fact that using a larger buffer in a driver ruins TCP performance indicates a serious design failure.
On Sunday, 16 October 2011 at 20:53 -0400, David Miller wrote:

> So perhaps the best solution is to divorce truesize from such driver
> and device details? If there is one calculation, then TCP need only
> be concerned with one case.
>
> Look at how confusing and useless tcp_adv_win_scale ends up being for
> this problem.
>
> Therefore I'll make the mostly-serious proposal that truesize be
> something like "initial_real_total_data + sizeof(metadata)"
>
> So if a device receives a 512 byte packet, it's:
>
> 512 + sizeof(metadata)

That would probably OOM in a stress situation, with thousands of sockets.

> It still provides the necessary protection that truesize is meant to
> provide, yet sanitizes all of the receive and send buffer overhead
> handling.
>
> TCP should be absolutely, and completely, impervious to details like
> how buffering needs to be done for some random wireless card. Just
> the mere fact that using a larger buffer in a driver ruins TCP
> performance indicates a serious design failure.

I don't think it's a design failure. It's the same problem as when computing the TCP window given the rcvspace (memory we allow to be consumed for the socket) based on the MSS: if the sender uses 1-byte frames only, then the receiver hits the memory limit and performance drops.

Right now our tcp-window tuning really assumes too much: perfect MSS skbs using _exactly_ MSS + sizeof(metadata), while we already know that the real slab cost is higher:

    __roundup_pow_of_two(MSS + sizeof(struct skb_shared_info)) + SKB_DATA_ALIGN(sizeof(struct sk_buff))

and now with paged frag devices:

    PAGE_SIZE + SKB_DATA_ALIGN(sizeof(struct sk_buff))

We assume the sender behaves correctly and drivers don't use 64KB pages to store a single 72-byte frame.

I would say the first thing the TCP stack must respect is the memory limits that the admin set for it. That's what skb->truesize is for.

    # cat /proc/sys/net/ipv4/tcp_rmem
    4096 87380 4127616

In this case, we allow up to 4 Mbytes of receiver memory per session. Not 20 or 30 Mbytes...

We must translate this to a TCP window, suitable for current hardware.
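The two cost formulas in this message can be evaluated with a short user-space sketch. Everything here is illustrative: the helper names are hypothetical, the 64-byte alignment stands in for SKB_DATA_ALIGN() on a machine with 64-byte cache lines, and the structure sizes are passed in as parameters because they vary by kernel version and architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next power of two (v >= 1). */
static size_t roundup_pow2(size_t v)
{
	size_t p = 1;

	assert(v >= 1);
	while (p < v)
		p <<= 1;
	return p;
}

/* Assumed stand-in for SKB_DATA_ALIGN(): align to a 64-byte cache line. */
static size_t data_align(size_t v)
{
	return (v + 63) & ~(size_t)63;
}

/* Linear skb: data area rounded up to a power of two by the slab allocator. */
static size_t linear_cost(size_t mss, size_t shinfo_size, size_t skb_size)
{
	return roundup_pow2(mss + shinfo_size) + data_align(skb_size);
}

/* Paged-frag device: one full page per received frame. */
static size_t paged_cost(size_t page_size, size_t skb_size)
{
	return page_size + data_align(skb_size);
}
```

Assuming, purely for illustration, a 320-byte skb_shared_info and a 240-byte sk_buff, a 1460-byte MSS costs linear_cost(1460, 320, 240) = 2304 bytes, and a 4096-byte page costs paged_cost(4096, 240) = 4352 bytes, well above the MSS + metadata that the window tuning assumes.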
> Rick, could you redo the test, using following bit on receiver :
>
> echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale

raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end ; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
78527.68,289,0,16384,98304

Deltas on the receiver:

TcpExt:
    27 packets pruned from receive queue because of socket buffer overrun
    0 TCP sockets finished time wait in fast timer
    0 delayed acks sent
    0 delayed acks further delayed because of locked socket
    Quick ack mode was activated 0 times
    19 packets directly queued to recvmsg prequeue.
    0 bytes directly in process context from backlog
    670 bytes directly received in process context from prequeue
    739983 packet headers predicted
    14 packets header predicted and directly queued to user
    127 acknowledgments not containing data payload received
    235774 predicted acknowledgments
    0 other TCP timeouts
    6553 packets collapsed in receive queue due to low socket buffer
    0 DSACKs sent for old packets
    TCPBacklogDrop: 294

So, moving on to:

> If you still have collapses/retransmits, you then could try :
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end ; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
95981.83,0,0,121200,156600

No retransmissions in that one.

rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
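For readers following the tcp_adv_win_scale experiments above, here is a minimal userspace sketch of how that sysctl is documented to split socket buffer space between the advertised window and buffering overhead. It is a model written from the ip-sysctl description, not the kernel's own code; the function name win_from_space() is made up for illustration, and the [-31, 31] bound is the documented range of the sysctl.

```c
#include <assert.h>

/*
 * Userspace model (an assumption, not kernel code) of the documented
 * tcp_adv_win_scale behaviour:
 *   scale >  0: overhead is space / 2^scale, the rest is window
 *   scale <= 0: window is space / 2^(-scale), the rest is overhead
 */
static int win_from_space(int space, int scale)
{
	/* The sysctl is documented as accepting values in [-31, 31]. */
	assert(scale >= -31 && scale <= 31);

	if (scale <= 0)
		return space >> (-scale);

	return space - (space >> scale);
}
```

Under this model, a value of 1 lets half of the buffer be advertised as window, while -2 advertises only a quarter and leaves three quarters for skb overhead, which would be consistent with the second run above showing no retransmissions.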
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c1653fe..1d10edb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4426,6 +4426,25 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+/*
+ * Caller wants to reduce memory needs before queueing skb
+ * The (expensive) copy should not be done in fast path.
+ */
+static struct sk_buff *skb_reduce_truesize(struct sk_buff *skb)
+{
+	if (skb->truesize > 2 * SKB_TRUESIZE(skb->len)) {
+		struct sk_buff *nskb;
+
+		nskb = skb_copy_expand(skb, skb_headroom(skb), 0,
+				       GFP_ATOMIC | __GFP_NOWARN);
+		if (nskb) {
+			__kfree_skb(skb);
+			skb = nskb;
+		}
+	}
+	return skb;
+}
+
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcphdr *th = tcp_hdr(skb);
@@ -4553,6 +4572,11 @@ drop:
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
 		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
 
+	/* Since this skb might stay on ofo a long time, try to reduce
+	 * its truesize (if it's too big) to avoid future pruning.
+	 * Many drivers allocate large buffers even to hold tiny frames.
+	 */
+	skb = skb_reduce_truesize(skb);
 	skb_set_owner_r(skb, sk);
 
 	if (!skb_peek(&tp->out_of_order_queue)) {
Many drivers allocate big skbs to store a single TCP frame.
(WIFI drivers, or NICs using PAGE_SIZE fragments)

It is now common to get skb->truesize bigger than 4096 to store a
~1500-byte TCP frame.

TCP sessions with large RTT and packet losses can fill their Out Of
Order queue with such oversized skbs and hit their sk_rcvbuf limit.
This starts a pruning of the complete OFO queue, without giving a chance
to receive the missing packet(s) and move skbs from the OFO queue to the
receive queue.

This patch adds a skb_reduce_truesize() helper, and uses it for all skbs
queued into the OFO queue.

Spending some time to perform a copy is worth the pain, since it permits
SACK processing to have a chance to complete over the RTT barrier.

This greatly improves user experience, without added cost on the fast path.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_input.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
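To make the threshold in skb_reduce_truesize() and the sk_rcvbuf pressure described in the changelog more concrete, here is a small userspace sketch. It is only a model: the helper names are hypothetical, and MODEL_SKB_OVERHEAD is an assumed stand-in for the per-skb metadata cost that the kernel's SKB_TRUESIZE() adds (the real value depends on architecture and configuration).

```c
/*
 * Assumed per-skb metadata overhead in bytes (hypothetical value).
 * In the kernel this is the aligned size of struct sk_buff plus the
 * aligned size of struct skb_shared_info.
 */
#define MODEL_SKB_OVERHEAD 512u

/* Model of SKB_TRUESIZE(len): payload plus fixed metadata overhead. */
static unsigned int model_skb_truesize(unsigned int len)
{
	return len + MODEL_SKB_OVERHEAD;
}

/* Same comparison as the patch: copy only when the skb charges more
 * than twice what a tightly sized skb would. */
static int model_should_shrink(unsigned int truesize, unsigned int len)
{
	return truesize > 2 * model_skb_truesize(len);
}

/* Number of skbs of a given truesize that fit under a receive buffer limit. */
static unsigned int model_skbs_in_rcvbuf(unsigned int rcvbuf,
					 unsigned int truesize)
{
	return truesize ? rcvbuf / truesize : 0;
}
```

With these assumed numbers, a 1460-byte segment held in a 4096-byte fragment (truesize 4608 in the model) crosses the threshold and would be copied, while the same segment in a 2048-byte buffer would not. Against a 98304-byte receive buffer, the model fits 21 of the oversized skbs versus 49 tightly sized ones, which illustrates why the OFO queue reaches the limit so much sooner.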