
[net-next] tcp: introduce tcp_try_coalesce

Message ID 1335201102.5205.28.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 23, 2012, 5:11 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
coalescing tcp segments provided by legacy devices (linear skbs)

We extend this idea to fragged skbs, as their truesize can be heavy.

ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.

Use this coalescing strategy for receive queue too.

This helps reduce the number of tcp collapses at minimal cost, and
reduces memory overhead and packet drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   79 ++++++++++++++++++++++++++++++++---------
 1 file changed, 62 insertions(+), 17 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Neal Cardwell April 24, 2012, 1:13 a.m. UTC | #1
On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> coalescing tcp segments provided by legacy devices (linear skbs)
>
> We extend this idea to fragged skbs, as their truesize can be heavy.
>
> ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>
> Use this coalescing strategy for receive queue too.
>
> This contributes to reduce number of tcp collapses, at minimal cost, and
> reduces memory overhead and packets drops.

The mechanics look solid, but I'm a little concerned about the
potential added overhead for the new case where tcp_try_coalesce()
does a skb_copy_bits() for in-order data that it is coalescing at the
end of the sk_receive_queue. Do you have any performance numbers for
this case to help suggest whether this added copy is a concern?

neal
Eric Dumazet April 24, 2012, 2:13 a.m. UTC | #2
On Mon, 2012-04-23 at 21:13 -0400, Neal Cardwell wrote:
> On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> > coalescing tcp segments provided by legacy devices (linear skbs)
> >
> > We extend this idea to fragged skbs, as their truesize can be heavy.
> >
> > ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
> >
> > Use this coalescing strategy for receive queue too.
> >
> > This contributes to reduce number of tcp collapses, at minimal cost, and
> > reduces memory overhead and packets drops.
> 
> The mechanics look solid, but I'm a little concerned about the
> potential added overhead for the new case where tcp_try_coalesce()
> does a skb_copy_bits() for in-order data that it is coalescing at the
> end of the sk_receive_queue. Do you have any performance numbers for
> this case to help suggest whether this added copy is a concern?
> 
> neal

This never happens on connections where performance matters: the skb
head can only contain one full mss segment.

This part is only used on wifi devices, where skb head is really fat.



Eric Dumazet April 24, 2012, 2:29 a.m. UTC | #3
On Tue, 2012-04-24 at 04:13 +0200, Eric Dumazet wrote:

> This never happens on connections where performance matters : skb head
> can only contains one full mss segment.

By the way, top end hardware uses fragged skbs.

Incidentally this gives better splice() performance (it avoids copying
the skb head to pages).



Neal Cardwell April 24, 2012, 2:39 a.m. UTC | #4
On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> coalescing tcp segments provided by legacy devices (linear skbs)
>
> We extend this idea to fragged skbs, as their truesize can be heavy.
>
> ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>
> Use this coalescing strategy for receive queue too.
>
> This contributes to reduce number of tcp collapses, at minimal cost, and
> reduces memory overhead and packets drops.

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks for the background info, Eric.

neal
David Miller April 24, 2012, 2:46 a.m. UTC | #5
From: Neal Cardwell <ncardwell@google.com>
Date: Mon, 23 Apr 2012 22:39:10 -0400

> On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
>> coalescing tcp segments provided by legacy devices (linear skbs)
>>
>> We extend this idea to fragged skbs, as their truesize can be heavy.
>>
>> ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>>
>> Use this coalescing strategy for receive queue too.
>>
>> This contributes to reduce number of tcp collapses, at minimal cost, and
>> reduces memory overhead and packets drops.
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>
> 
> Thanks for the background info, Eric.

Applied, thanks Eric.

Although I'd like to ask you to clean up tcp_try_coalesce() a bit.

It effectively returns a boolean, but you've clouded this up by
returning an int and defining it in the comment to return "> 0" or
not.

Just make it return a real bool.

I know why you did this: it makes the "eaten" code somewhat simpler in
tcp_data_queue(), but overall it's more confusing how it is now.

People look at how the tcp_try_coalesce() return value is interpreted
and say "in what cases can it return a negative value?"  We both know
it can't, but you have to read the entire function to figure that out.

And that's by definition not intuitive.

Thanks.
Eric Dumazet April 24, 2012, 2:59 a.m. UTC | #6
On Mon, 2012-04-23 at 22:46 -0400, David Miller wrote:

> Applied, thanks Eric.
> 
> Although I'd like to ask you to clean up tcp_try_coalesce() a bit.
> 
> It effectively returns a boolean, but you've clouded this up by
> returning an int and defining it in the comment to return "> 0" or
> not.
> 
> Just make it return a real bool.
> 
> I know why you did this, it makes the "eaten" code somewhat simpler in
> tcp_data_queue(), but overall it's more confusing how it is now.
> 
> People look at how the tcp_try_coalesce() return value is interpreted
> and say "in what cases can it return a negative value?"  We both know
> it can't, but you have to read the entire function to figure that out.
> 
> And that's by definition not intuitive.
> 
> Thanks.

Sure I'll do the cleanup. You guessed correctly why I did that ;)

Initially I did a "return len;" instead of "return 1;", but felt a bit
uncomfortable in case we merged a zero-length message.

Then I added the !th->fin test inside tcp_try_coalesce()

(my initial patch allowed the fin to be set for the tcp_data_queue()
case since tcp_fin() was called anyway)

Thanks



Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 37e1c5c..bd7aef5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4449,6 +4449,58 @@  static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+/**
+ * tcp_try_coalesce - try to merge skb to prior one
+ * @sk: socket
+ * @to: prior buffer
+ * @from: buffer to add in queue
+ *
+ * Before queueing skb @from after @to, try to merge them
+ * to reduce overall memory use and queue lengths, if cost is small.
+ * Packets in ofo or receive queues can stay a long time.
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns > 0 value if caller should free @from instead of queueing it
+ */
+static int tcp_try_coalesce(struct sock *sk,
+			    struct sk_buff *to,
+			    struct sk_buff *from)
+{
+	int len = from->len;
+
+	if (tcp_hdr(from)->fin)
+		return 0;
+	if (len <= skb_tailroom(to)) {
+		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
+merge:
+		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
+		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
+		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
+		return 1;
+	}
+	if (skb_headlen(from) == 0 &&
+	    !skb_has_frag_list(to) &&
+	    !skb_has_frag_list(from) &&
+	    (skb_shinfo(to)->nr_frags +
+	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
+		int delta = from->truesize - ksize(from->head) -
+			    SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+		WARN_ON_ONCE(delta < len);
+		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
+		       skb_shinfo(from)->frags,
+		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
+		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
+		skb_shinfo(from)->nr_frags = 0;
+		to->truesize += delta;
+		atomic_add(delta, &sk->sk_rmem_alloc);
+		sk_mem_charge(sk, delta);
+		to->len += len;
+		to->data_len += len;
+		goto merge;
+	}
+	return 0;
+}
+
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4487,23 +4539,11 @@  static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		/* Packets in ofo can stay in queue a long time.
-		 * Better try to coalesce them right now
-		 * to avoid future tcp_collapse_ofo_queue(),
-		 * probably the most expensive function in tcp stack.
-		 */
-		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
-			NET_INC_STATS_BH(sock_net(sk),
-					 LINUX_MIB_TCPRCVCOALESCE);
-			BUG_ON(skb_copy_bits(skb, 0,
-					     skb_put(skb1, skb->len),
-					     skb->len));
-			TCP_SKB_CB(skb1)->end_seq = end_seq;
-			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		if (tcp_try_coalesce(sk, skb1, skb) <= 0) {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		} else {
 			__kfree_skb(skb);
 			skb = NULL;
-		} else {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
 		}
 
 		if (!tp->rx_opt.num_sacks ||
@@ -4624,13 +4664,18 @@  static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (eaten <= 0) {
+			struct sk_buff *tail;
 queue_and_out:
 			if (eaten < 0 &&
 			    tcp_try_rmem_schedule(sk, skb->truesize))
 				goto drop;
 
-			skb_set_owner_r(skb, sk);
-			__skb_queue_tail(&sk->sk_receive_queue, skb);
+			tail = skb_peek_tail(&sk->sk_receive_queue);
+			eaten = tail ? tcp_try_coalesce(sk, tail, skb) : -1;
+			if (eaten <= 0) {
+				skb_set_owner_r(skb, sk);
+				__skb_queue_tail(&sk->sk_receive_queue, skb);
+			}
 		}
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if (skb->len)