diff mbox

[v2,net-next,2/2] tcp: reduce out_of_order memory use

Message ID 1332104867.3597.1.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet March 18, 2012, 9:07 p.m. UTC
With receive window sizes increasing, but the speed of light not improving
that much, the out of order queue can contain a huge number of skbs, waiting
to be moved to the receive_queue once missing packets fill the holes.

Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be 4Mbytes
in many cases.

When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(), a true
latency killer and cpu cache blower.

Attempting the coalescing each time we add a frame to the ofo queue keeps
memory use tight and in many cases avoids the tcp_collapse() pass later.

Tested on various wireless setups (b43, ath9k, ...) known to use big skb
truesizes, this patch removed the "packets collapsed in receive queue due
to low socket buffer" events I had before.

This also reduced average memory used by tcp sockets.

With help from Neal Cardwell.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
V2: rebase after tcp_data_queue_ofo() introduction.

 include/linux/snmp.h |    1 +
 net/ipv4/proc.c      |    1 +
 net/ipv4/tcp_input.c |   19 ++++++++++++++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Neal Cardwell March 19, 2012, 3:53 a.m. UTC | #1
On Sun, Mar 18, 2012 at 5:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
[...]
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
David Miller March 19, 2012, 8:57 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 18 Mar 2012 14:07:47 -0700

[...]
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

Patch

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 8ee8af4..2e68f5b 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -233,6 +233,7 @@  enum
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
+	LINUX_MIB_TCPRCVCOALESCE,			/* TCPRcvCoalesce */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 02d6107..8af0d44 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -257,6 +257,7 @@  static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPReqQFullDoCookies", LINUX_MIB_TCPREQQFULLDOCOOKIES),
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
+	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa7de12..e886e2f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4484,7 +4484,24 @@  static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		/* Packets in ofo can stay in queue a long time.
+		 * Better try to coalesce them right now
+		 * to avoid future tcp_collapse_ofo_queue(),
+		 * probably the most expensive function in tcp stack.
+		 */
+		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
+			NET_INC_STATS_BH(sock_net(sk),
+					 LINUX_MIB_TCPRCVCOALESCE);
+			BUG_ON(skb_copy_bits(skb, 0,
+					     skb_put(skb1, skb->len),
+					     skb->len));
+			TCP_SKB_CB(skb1)->end_seq = end_seq;
+			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			__kfree_skb(skb);
+			skb = NULL;
+		} else {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		}
 
 		if (!tp->rx_opt.num_sacks ||
 		    tp->selective_acks[0].end_seq != seq)