Message ID:    1335201102.5205.28.camel@edumazet-glaptop
State:         Accepted, archived
Delegated to:  David Miller
On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> coalescing tcp segments provided by legacy devices (linear skbs)
>
> We extend this idea to fragged skbs, as their truesize can be heavy.
>
> ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>
> Use this coalescing strategy for receive queue too.
>
> This contributes to reduce number of tcp collapses, at minimal cost, and
> reduces memory overhead and packets drops.

The mechanics look solid, but I'm a little concerned about the potential
added overhead for the new case where tcp_try_coalesce() does a
skb_copy_bits() for in-order data that it is coalescing at the end of
the sk_receive_queue. Do you have any performance numbers for this case
to help suggest whether this added copy is a concern?

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-04-23 at 21:13 -0400, Neal Cardwell wrote:
> The mechanics look solid, but I'm a little concerned about the
> potential added overhead for the new case where tcp_try_coalesce()
> does a skb_copy_bits() for in-order data that it is coalescing at the
> end of the sk_receive_queue. Do you have any performance numbers for
> this case to help suggest whether this added copy is a concern?

This never happens on connections where performance matters: an skb head
can only contain one full mss segment.

This part is only used on wifi devices, where the skb head is really fat.
On Tue, 2012-04-24 at 04:13 +0200, Eric Dumazet wrote:
> This never happens on connections where performance matters: an skb
> head can only contain one full mss segment.

By the way, top end hardware uses fragged skbs. Incidentally this gives
better splice() performance (it avoids copying the skb head to pages).
On Mon, Apr 23, 2012 at 1:11 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> coalescing tcp segments provided by legacy devices (linear skbs)
>
> We extend this idea to fragged skbs, as their truesize can be heavy.
>
> ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>
> Use this coalescing strategy for receive queue too.
>
> This contributes to reduce number of tcp collapses, at minimal cost, and
> reduces memory overhead and packets drops.

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks for the background info, Eric.

neal
From: Neal Cardwell <ncardwell@google.com>
Date: Mon, 23 Apr 2012 22:39:10 -0400

> Acked-by: Neal Cardwell <ncardwell@google.com>
>
> Thanks for the background info, Eric.

Applied, thanks Eric.

Although I'd like to ask you to clean up tcp_try_coalesce() a bit.

It effectively returns a boolean, but you've clouded this up by
returning an int and defining it in the comment to return "> 0" or not.

Just make it return a real bool.

I know why you did this, it makes the "eaten" code somewhat simpler in
tcp_data_queue(), but overall it's more confusing how it is now.

People look at how the tcp_try_coalesce() return value is interpreted
and say "in what cases can it return a negative value?" We both know it
can't, but you have to read the entire function to figure that out.

And that's by definition not intuitive.

Thanks.
On Mon, 2012-04-23 at 22:46 -0400, David Miller wrote:
> Although I'd like to ask you to clean up tcp_try_coalesce() a bit.
>
> It effectively returns a boolean, but you've clouded this up by
> returning an int and defining it in the comment to return "> 0" or
> not.
>
> Just make it return a real bool.

Sure, I'll do the cleanup. You guessed correctly why I did that ;)

In the beginning I did a "return len;" instead of "return 1;" and felt a
bit uncomfortable in case we merged a zero length message.

Then I added the !th->fin test inside tcp_try_coalesce() (my initial
patch allowed the fin being set for the tcp_data_queue() case, since
tcp_fin() was called anyway).

Thanks
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 37e1c5c..bd7aef5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4449,6 +4449,58 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+/**
+ * tcp_try_coalesce - try to merge skb to prior one
+ * @sk: socket
+ * @to: prior buffer
+ * @from: buffer to add in queue
+ *
+ * Before queueing skb @from after @to, try to merge them
+ * to reduce overall memory use and queue lengths, if cost is small.
+ * Packets in ofo or receive queues can stay a long time.
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns > 0 value if caller should free @from instead of queueing it
+ */
+static int tcp_try_coalesce(struct sock *sk,
+			    struct sk_buff *to,
+			    struct sk_buff *from)
+{
+	int len = from->len;
+
+	if (tcp_hdr(from)->fin)
+		return 0;
+	if (len <= skb_tailroom(to)) {
+		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
+merge:
+		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
+		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
+		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
+		return 1;
+	}
+	if (skb_headlen(from) == 0 &&
+	    !skb_has_frag_list(to) &&
+	    !skb_has_frag_list(from) &&
+	    (skb_shinfo(to)->nr_frags +
+	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
+		int delta = from->truesize - ksize(from->head) -
+			    SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+		WARN_ON_ONCE(delta < len);
+		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
+		       skb_shinfo(from)->frags,
+		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
+		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
+		skb_shinfo(from)->nr_frags = 0;
+		to->truesize += delta;
+		atomic_add(delta, &sk->sk_rmem_alloc);
+		sk_mem_charge(sk, delta);
+		to->len += len;
+		to->data_len += len;
+		goto merge;
+	}
+	return 0;
+}
+
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4487,23 +4539,11 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		/* Packets in ofo can stay in queue a long time.
-		 * Better try to coalesce them right now
-		 * to avoid future tcp_collapse_ofo_queue(),
-		 * probably the most expensive function in tcp stack.
-		 */
-		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
-			NET_INC_STATS_BH(sock_net(sk),
-					 LINUX_MIB_TCPRCVCOALESCE);
-			BUG_ON(skb_copy_bits(skb, 0,
-					     skb_put(skb1, skb->len),
-					     skb->len));
-			TCP_SKB_CB(skb1)->end_seq = end_seq;
-			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		if (tcp_try_coalesce(sk, skb1, skb) <= 0) {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		} else {
 			__kfree_skb(skb);
 			skb = NULL;
-		} else {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
 		}
 
 		if (!tp->rx_opt.num_sacks ||
@@ -4624,13 +4664,18 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	}
 
 	if (eaten <= 0) {
+		struct sk_buff *tail;
+
 queue_and_out:
 		if (eaten < 0 &&
 		    tcp_try_rmem_schedule(sk, skb->truesize))
 			goto drop;
 
-		skb_set_owner_r(skb, sk);
-		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		tail = skb_peek_tail(&sk->sk_receive_queue);
+		eaten = tail ? tcp_try_coalesce(sk, tail, skb) : -1;
+		if (eaten <= 0) {
+			skb_set_owner_r(skb, sk);
+			__skb_queue_tail(&sk->sk_receive_queue, skb);
+		}
 	}
 	tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 	if (skb->len)