From patchwork Thu Aug 3 20:29:44 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Willem de Bruijn
X-Patchwork-Id: 797439
X-Patchwork-Delegate: davem@davemloft.net
From: Willem de Bruijn
To: netdev@vger.kernel.org
Cc: davem@davemloft.net, linux-api@vger.kernel.org, Willem de Bruijn
Subject: [PATCH net-next v4 8/9] tcp: enable MSG_ZEROCOPY
Date: Thu, 3 Aug 2017 16:29:44 -0400
Message-Id: <20170803202945.70750-9-willemdebruijn.kernel@gmail.com>
X-Mailer: git-send-email 2.14.0.rc1.383.gd1ce394fe2-goog
In-Reply-To: <20170803202945.70750-1-willemdebruijn.kernel@gmail.com>
References: <20170803202945.70750-1-willemdebruijn.kernel@gmail.com>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

From: Willem de Bruijn

Enable MSG_ZEROCOPY support in the TCP stack. TSO and GSO are both
supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  msg_zerocopy.sh 4 tcp:

  without zerocopy
    tx=121792 (7600 MB) txc=0 zc=n
    rx=60458 (7600 MB)

  with zerocopy
    tx=286257 (17863 MB) txc=286257 zc=y
    rx=140022 (17863 MB)

  This test opens a pair of sockets over veth. One calls send with
  64KB and optionally MSG_ZEROCOPY, and the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn
---
 net/ipv4/tcp.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9dd6f4dba9b1..71b25567e787 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1165,6 +1165,7 @@ static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
 int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
 	struct sockcm_cookie sockc;
 	int flags, err, copied = 0;
@@ -1174,6 +1175,26 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	long timeo;
 
 	flags = msg->msg_flags;
+
+	if (flags & MSG_ZEROCOPY && size) {
+		if (sk->sk_state != TCP_ESTABLISHED) {
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		if (!uarg) {
+			err = -ENOBUFS;
+			goto out_err;
+		}
+
+		/* skb may be freed in main loop, keep extra ref on uarg */
+		sock_zerocopy_get(uarg);
+		if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
+			uarg->zerocopy = 0;
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect)) {
 		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
 		if (err == -EINPROGRESS && copied_syn > 0)
@@ -1297,7 +1318,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1335,6 +1356,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
+		} else {
+			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
+			if (err == -EMSGSIZE || err == -EEXIST)
+				goto new_segment;
+			if (err < 0)
+				goto do_error;
+			copy = err;
 		}
 
 		if (!copied)
@@ -1381,6 +1409,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 	}
 out_nopush:
+	sock_zerocopy_put(uarg);
 	return copied + copied_syn;
 
 do_fault:
@@ -1397,6 +1426,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
+	sock_zerocopy_put_abort(uarg);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&