From patchwork Mon Apr 23 17:11:42 2012
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 154516
X-Patchwork-Delegate: davem@davemloft.net
Subject: [PATCH net-next] tcp: introduce tcp_try_coalesce
From: Eric Dumazet
To: David Miller
Cc: netdev, Neal Cardwell, Tom Herbert, Maciej Żenczykowski, Ilpo Järvinen
Date: Mon, 23 Apr 2012 19:11:42 +0200
Message-ID: <1335201102.5205.28.camel@edumazet-glaptop>
X-Mailing-List: netdev@vger.kernel.org

From: Eric Dumazet

commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
coalescing tcp segments provided by legacy devices (linear skbs).

We extend this idea to fragged skbs, as their truesize can be large:
ixgbe, for example, uses 256 + 1024 + PAGE_SIZE/2 = 3328 bytes per segment.

Use this coalescing strategy for the receive queue too.

This helps reduce the number of tcp collapses at minimal cost, and
reduces memory overhead and packet drops.
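As a rough illustration of the figures above (not part of the patch; it
assumes PAGE_SIZE = 4096, which is what makes PAGE_SIZE/2 come out to 2048),
here is a standalone sketch of the arithmetic:

/* Standalone sketch of the truesize arithmetic quoted above.
 * Assumptions (not taken from kernel headers): PAGE_SIZE is 4096, and
 * the 256 + 1024 split is the ixgbe figure quoted in the changelog.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int page_size = 4096;	/* assumed PAGE_SIZE */
	const unsigned int per_seg = 256 + 1024 + page_size / 2;

	/* 256 + 1024 + 2048 = 3328 bytes charged per received segment */
	printf("ixgbe per-segment truesize: %u bytes\n", per_seg);

	/* When a fragged skb is merged into the previous one, only the
	 * page fragment stays charged; the skb head and metadata
	 * (roughly the 256 + 1024 portion here) can be given back.
	 */
	printf("approx. bytes reclaimed per coalesced segment: %u\n",
	       per_seg - page_size / 2);
	return 0;
}

Any C compiler will build this; the point is simply that the roughly 1280
bytes of head-and-metadata overhead per segment is what coalescing reclaims.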
Signed-off-by: Eric Dumazet
Cc: Neal Cardwell
Cc: Tom Herbert
Cc: Maciej Żenczykowski
Cc: Ilpo Järvinen
Acked-by: Neal Cardwell
---
 net/ipv4/tcp_input.c | 79 ++++++++++++++++++++++++++++++++---------
 1 file changed, 62 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 37e1c5c..bd7aef5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4449,6 +4449,58 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+/**
+ * tcp_try_coalesce - try to merge skb to prior one
+ * @sk: socket
+ * @to: prior buffer
+ * @from: buffer to add in queue
+ *
+ * Before queueing skb @from after @to, try to merge them
+ * to reduce overall memory use and queue lengths, if cost is small.
+ * Packets in ofo or receive queues can stay a long time.
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns > 0 value if caller should free @from instead of queueing it
+ */
+static int tcp_try_coalesce(struct sock *sk,
+			    struct sk_buff *to,
+			    struct sk_buff *from)
+{
+	int len = from->len;
+
+	if (tcp_hdr(from)->fin)
+		return 0;
+	if (len <= skb_tailroom(to)) {
+		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
+merge:
+		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
+		TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
+		TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
+		return 1;
+	}
+	if (skb_headlen(from) == 0 &&
+	    !skb_has_frag_list(to) &&
+	    !skb_has_frag_list(from) &&
+	    (skb_shinfo(to)->nr_frags +
+	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
+		int delta = from->truesize - ksize(from->head) -
+			    SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+		WARN_ON_ONCE(delta < len);
+		memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
+		       skb_shinfo(from)->frags,
+		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
+		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
+		skb_shinfo(from)->nr_frags = 0;
+		to->truesize += delta;
+		atomic_add(delta, &sk->sk_rmem_alloc);
+		sk_mem_charge(sk, delta);
+		to->len += len;
+		to->data_len += len;
+		goto merge;
+	}
+	return 0;
+}
+
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4487,23 +4539,11 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		/* Packets in ofo can stay in queue a long time.
-		 * Better try to coalesce them right now
-		 * to avoid future tcp_collapse_ofo_queue(),
-		 * probably the most expensive function in tcp stack.
-		 */
-		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
-			NET_INC_STATS_BH(sock_net(sk),
-					 LINUX_MIB_TCPRCVCOALESCE);
-			BUG_ON(skb_copy_bits(skb, 0,
-					     skb_put(skb1, skb->len),
-					     skb->len));
-			TCP_SKB_CB(skb1)->end_seq = end_seq;
-			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		if (tcp_try_coalesce(sk, skb1, skb) <= 0) {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		} else {
 			__kfree_skb(skb);
 			skb = NULL;
-		} else {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
 		}
 
 		if (!tp->rx_opt.num_sacks ||
@@ -4624,13 +4664,18 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (eaten <= 0) {
+			struct sk_buff *tail;
 queue_and_out:
 			if (eaten < 0 &&
 			    tcp_try_rmem_schedule(sk, skb->truesize))
 				goto drop;
 
-			skb_set_owner_r(skb, sk);
-			__skb_queue_tail(&sk->sk_receive_queue, skb);
+			tail = skb_peek_tail(&sk->sk_receive_queue);
+			eaten = tail ? tcp_try_coalesce(sk, tail, skb) : -1;
+			if (eaten <= 0) {
+				skb_set_owner_r(skb, sk);
+				__skb_queue_tail(&sk->sk_receive_queue, skb);
+			}
 		}
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if (skb->len)
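For readers skimming the diff, the calling convention used at both call
sites is: tcp_try_coalesce() returns > 0 only when @from's payload has been
merged into @to, in which case the caller frees @from; a return <= 0 means
the skb still has to be charged to the socket and queued. A compressed
restatement of that contract, mirroring the receive-queue call site above
('prev' is a hypothetical local standing in for the queue tail, not a
further change to the patch):

/* Sketch of the tcp_try_coalesce() contract as exercised by the patch. */
if (prev && tcp_try_coalesce(sk, prev, skb) > 0) {
	__kfree_skb(skb);		/* payload now lives in 'prev' */
} else {
	skb_set_owner_r(skb, sk);	/* charge skb to the socket */
	__skb_queue_tail(&sk->sk_receive_queue, skb);
}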