From patchwork Thu Nov 12 19:01:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 1399250 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=LHOkYEsM; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9w412d2z9s0b for ; Fri, 13 Nov 2020 06:02:44 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727143AbgKLTCm (ORCPT ); Thu, 12 Nov 2020 14:02:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45194 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726852AbgKLTCk (ORCPT ); Thu, 12 Nov 2020 14:02:40 -0500 Received: from mail-pf1-x444.google.com (mail-pf1-x444.google.com [IPv6:2607:f8b0:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E8ECBC0613D1 for ; Thu, 12 Nov 2020 11:02:40 -0800 (PST) Received: by mail-pf1-x444.google.com with SMTP id g7so5430896pfc.2 for ; Thu, 12 Nov 2020 11:02:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Dc1wdFWbMj+QO91RzUeNpjPQSIwXQT+UsBWXjN8QvbI=; b=LHOkYEsMZDsZV+s1fP9FnSoFQefY9FZavbzBkoF2HVXK4PlojGGC/7ReoO0kkItHGD r9WLyZJqHDj1m6Topj9tnNplOzkwCKC+Qsfy4LXd7hx4Bdf3hg0osX0w78xQta63cmGk 
A55L2jHQBfkjYO5JSbHdoKx246poVzbeyfDw8j+8pep3xBSToq731vyaWs7uenUuqxmJ NbGuFZu4UmmA4F6UB2q3oVbj5R3tQxdPN/oZZnEp/Kb6rkJ33PodOFBm3W4mwZ2X6vav Ggvh27XOWKCb7ejO2qfhO+Xc0RyR5F4U+DHNR5CyQxSirg0+DRFtFgVr3YRE5PlNkb5c Gz8A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Dc1wdFWbMj+QO91RzUeNpjPQSIwXQT+UsBWXjN8QvbI=; b=l7+RUCapXrDW0Ff2pgRNcGxLXJaY3IIMdJKee2u2ARe04nPxdxPuspjJnfrhcPMhHM 3Gy3fwnjluWAQThoPq9PDneaug7zDeHPJ2+NUABfBfXCfAoiZOYlDDb4UKhHmtAUgM+9 UVuGIgwJEdfdzV2dbFTm9UXkteWOe/KLPDXdgkDpxjJ74nRSvtE0+zeUaKYBBp0iqetA 5013c6DwBpsnBbFwdJ8Cd3mSbbC0jZm/FBGAK1ykFOGDURhQ/pAuzLMwSVEfQQMadpSn 5J0iZqCX+Rs5Ptzb0v1/o1EGYLLJMX3zI1PQyIi5g2cfKMYRWum5uAGLY3gCq5WJh/pq QUwA== X-Gm-Message-State: AOAM533jIfelSi63Hq/fx4WV6WYkkoAyNMsBwknCEMc4tpxO8JraE+PE ta/7J2KQm5uEPU1VbAZ13gmdUfWola0= X-Google-Smtp-Source: ABdhPJzJR2wShhmB7StEGjeSf9HBfhEjPm6wGZLQyDmc4Bvyd9MZ4Yi/+TExLgj7rJlsZRfrzm/miA== X-Received: by 2002:a63:5c02:: with SMTP id q2mr699126pgb.297.1605207760483; Thu, 12 Nov 2020 11:02:40 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.02.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:02:40 -0800 (PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next 1/8] tcp: Copy straggler unaligned data for TCP Rx. zerocopy. 
Date: Thu, 12 Nov 2020 11:01:58 -0800
Message-Id: <20201112190205.633640-2-arjunroy.kdev@gmail.com>
X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog
In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
References: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Arjun Roy

When TCP receive zerocopy does not successfully map the entire
requested space, it outputs a 'hint': the number of bytes that the
caller should then read with recvmsg(). Augment zerocopy to accept a
user buffer that it tries to copy this hint into; if it is possible to
copy the entire hint, it will do so. This elides a recvmsg() call for
received traffic that isn't exactly page-aligned in size.

This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call, and one
recvmsg() call for the remaining unaligned data.

With this change, almost all of the recvmsg() calls are eliminated,
leading to a saving of about 25%-50% in the number of system calls
for RPC-style workloads.
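
As a rough illustration of the accounting described above, the following userspace sketch models how one receive call might split queued bytes into whole remapped pages and a copied straggler tail. It is not the kernel code: `rx_plan`, `plan_receive()` and `MODEL_PAGE_SIZE` are hypothetical names, 4 KiB pages are assumed, every page is assumed to remap successfully, and all bytes past the last whole page are treated as the hint (the kernel tracks the hint per skb).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical summary of one zerocopy receive call (not a kernel type). */
struct rx_plan {
	uint32_t mapped;   /* bytes delivered by remapping whole pages */
	uint32_t copied;   /* straggler bytes copied into the user copybuf */
	bool need_recvmsg; /* true if unread bytes remain for recvmsg() */
};

struct rx_plan plan_receive(uint32_t inq, uint32_t map_len,
			    uint32_t copybuf_len)
{
	struct rx_plan p;
	uint32_t avail = inq < map_len ? inq : map_len;
	uint32_t leftover;

	/* Only whole pages can be remapped (aligned_len in the patch). */
	p.mapped = avail & ~(MODEL_PAGE_SIZE - 1);
	/* Model simplification: everything still queued counts as the hint. */
	leftover = inq - p.mapped;
	/* Mirrors copylen = min(copybuf_len, recv_skip_hint) in the patch. */
	p.copied = leftover < copybuf_len ? leftover : copybuf_len;
	p.need_recvmsg = leftover > p.copied;
	return p;
}

/* Usage: a 10000-byte message with a 4 KiB copy buffer needs no recvmsg(). */
void plan_receive_example(void)
{
	struct rx_plan p = plan_receive(10000, 16384, 4096);

	assert(p.mapped == 8192 && p.copied == 1808 && !p.need_recvmsg);
}
```

In this model a follow-up recvmsg() is needed only when the copy buffer is smaller than the unaligned tail, which is the case the patch aims to make rare.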
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Reported-by: kernel test robot
---
 include/uapi/linux/tcp.h |  2 +
 net/ipv4/tcp.c           | 80 ++++++++++++++++++++++++++++++++--------
 2 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index cfcb10b75483..62db78b9c1a0 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
 	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
 	__u32 inq; /* out: amount of bytes in read queue */
 	__s32 err; /* out: socket error */
+	__u64 copybuf_address;	/* in: copybuf address (small reads) */
+	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
 };
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2bc3d7fe9e8..f86ccf221c0b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,48 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
+				   struct sk_buff *skb, u32 copylen,
+				   u32 *offset, u32 *seq)
+{
+	struct msghdr msg = {};
+	struct iovec iov;
+	int err;
+
+	err = import_single_range(READ, (void __user *)zc->copybuf_address,
+				  copylen, &iov, &msg.msg_iter);
+	if (err)
+		return err;
+	err = skb_copy_datagram_msg(skb, *offset, &msg, copylen);
+	if (err)
+		return err;
+	zc->recv_skip_hint -= copylen;
+	*offset += copylen;
+	*seq += copylen;
+	return (__s32)copylen;
+}
+
+static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc,
+					     struct sock *sk,
+					     struct sk_buff *skb,
+					     u32 *seq,
+					     s32 copybuf_len)
+{
+	u32 offset, copylen = min_t(u32, copybuf_len, zc->recv_skip_hint);
+
+	if (!copylen)
+		return 0;
+	/* skb is null if inq < PAGE_SIZE. */
+	if (skb)
+		offset = *seq - TCP_SKB_CB(skb)->seq;
+	else
+		skb = tcp_recv_skb(sk, *seq, &offset);
+
+	zc->copybuf_len = tcp_copy_straggler_data(zc, skb, copylen, &offset,
+						  seq);
+	return zc->copybuf_len < 0 ? 0 : copylen;
+}
+
 static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 					struct page **pages,
 					unsigned long pages_to_map,
@@ -1776,8 +1818,10 @@ static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
+	u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0;
 	unsigned long address = (unsigned long)zc->address;
-	u32 length = 0, seq, offset, zap_len;
+	s32 copybuf_len = zc->copybuf_len;
+	struct tcp_sock *tp = tcp_sk(sk);
 	#define PAGE_BATCH_SIZE 8
 	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
@@ -1785,10 +1829,12 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	struct sk_buff *skb = NULL;
 	unsigned long pg_idx = 0;
 	unsigned long curr_addr;
-	struct tcp_sock *tp;
-	int inq;
+	u32 seq = tp->copied_seq;
+	int inq = tcp_inq(sk);
 	int ret;
 
+	zc->copybuf_len = 0;
+
 	if (address & (PAGE_SIZE - 1) || address != zc->address)
 		return -EINVAL;
 
@@ -1797,8 +1843,6 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
-	tp = tcp_sk(sk);
-
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);
@@ -1806,17 +1850,16 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		mmap_read_unlock(current->mm);
 		return -EINVAL;
 	}
-	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
-
-	seq = tp->copied_seq;
-	inq = tcp_inq(sk);
-	zc->length = min_t(u32, zc->length, inq);
-	zap_len = zc->length & ~(PAGE_SIZE - 1);
-	if (zap_len) {
-		zap_page_range(vma, address, zap_len);
+	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
+	avail_len = min_t(u32, vma_len, inq);
+	aligned_len = avail_len & ~(PAGE_SIZE - 1);
+	if (aligned_len) {
+		zap_page_range(vma, address, aligned_len);
+		zc->length = aligned_len;
 		zc->recv_skip_hint = 0;
 	} else {
-		zc->recv_skip_hint = zc->length;
+		zc->length = avail_len;
+		zc->recv_skip_hint = avail_len;
 	}
 	ret = 0;
 	curr_addr = address;
@@ -1885,13 +1928,18 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	}
 out:
 	mmap_read_unlock(current->mm);
-	if (length) {
+	/* Try to copy straggler data. */
+	if (!ret)
+		copylen = tcp_zerocopy_handle_leftover_data(zc, sk, skb, &seq,
+							    copybuf_len);
+
+	if (length + copylen) {
 		WRITE_ONCE(tp->copied_seq, seq);
 		tcp_rcv_space_adjust(sk);
 
 		/* Clean up data we have read: This will do ACK frames. */
 		tcp_recv_skb(sk, seq, &offset);
-		tcp_cleanup_rbuf(sk, length);
+		tcp_cleanup_rbuf(sk, length + copylen);
 		ret = 0;
 		if (length == zc->length)
 			zc->recv_skip_hint = 0;

From patchwork Thu Nov 12 19:01:59 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 1399251
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=)
Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=pR4Fsl1S; dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9w80nMVz9sTL for ; Fri, 13 Nov 2020 06:02:47 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726774AbgKLTCr (ORCPT ); Thu, 12 Nov 2020 14:02:47 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45208 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id
S1727147AbgKLTCp (ORCPT ); Thu, 12 Nov 2020 14:02:45 -0500 Received: from mail-pg1-x52b.google.com (mail-pg1-x52b.google.com [IPv6:2607:f8b0:4864:20::52b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 65B79C0613D1 for ; Thu, 12 Nov 2020 11:02:45 -0800 (PST) Received: by mail-pg1-x52b.google.com with SMTP id w4so4949124pgg.13 for ; Thu, 12 Nov 2020 11:02:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=KfiOB8ZGjLkvBBtrQum1ZpB66ELEgJ/gdPcqJ2w29Go=; b=pR4Fsl1SoLnDSYE5ittiLwVN83rnMwMfmUJulGzOiiHwCyZDchn8RREnBYVntT9M+y ADnJKI4Gph7rr0112y0GSkczDDLy7HDHwaRuLG5g9nfIArtH3Tyj1+W4aQSBX1IEqdf3 ecPWHAzGwsFxrmlI2A50QoKGLcp1UgK1AIhfZb76ryJ8GBqhs+6kFUaMUkQj9WTJ8Vu0 Pe4Dz7rTaYx0mAeAFsQ4E2DZu+AERgWX7XsGWeYNsT+Y2OfhoglecRsNetXPEeKt82vb sZX2gdOetNRnfLXqI6fgaHxIwdhoRyMeeYrFjWnV+rI9oZDUzQDtvOnsb7aueYjrolD7 NTPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=KfiOB8ZGjLkvBBtrQum1ZpB66ELEgJ/gdPcqJ2w29Go=; b=bFbnZWh6/vfAHnPL73MdcYCaI3jOlGJIOgWMTsaiLHoHdG6Ykw80OdqGBEiXN26Klt DK9aYE2oPar03uOXhO8vPxCV2zQZa6R/C5zBjMyb1O3EyZEXdcAt0iFtJFoVikfN2EgB 1l1dr3Q7laG+makmz9RlmivwnUl8OPwmIX8nqnOx7sKjEN80oeNo2rfnlDG3KKHghVPV 6v9/U9BWZUNXWyBWZcKhFBwOHQc1kc+ycSIjJsrlRV1p4/EK/+vYKYqSHcemcoReg6ye 1r0GGZu4bJ9rfH+hngyPMl7oB2O1z4ya2JoY+JT5jlStWj2pWn8octuy4IUV9VbCfsHK 4uLw== X-Gm-Message-State: AOAM53368meGALVXF9AE85J08ayQsYl0Y9EonYPB+oT7kI8U/OjuJAyz Ki3oG1olmTRV4GuJvnCNxIY= X-Google-Smtp-Source: ABdhPJw5vbsLVIYPCSmKvpUUCWvzB0hEfkgHn9fkIHA2z+JjP5fIkZRf4T+9UA6OZI9boh74SLg4ow== X-Received: by 2002:a17:90b:14c:: with SMTP id em12mr683138pjb.170.1605207764705; Thu, 12 Nov 2020 11:02:44 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by 
smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.02.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:02:44 -0800 (PST)
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next 2/8] tcp: Introduce tcp_recvmsg_locked().
Date: Thu, 12 Nov 2020 11:01:59 -0800
Message-Id: <20201112190205.633640-3-arjunroy.kdev@gmail.com>
X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog
In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
References: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Arjun Roy

Refactor tcp_recvmsg() by splitting it into locked and unlocked
portions. Callers already holding the socket lock and not using
ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
This is in preparation for a short-circuit copy performed by
TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested
by the user) reads.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 68 ++++++++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f86ccf221c0b..49e33222a68b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2061,36 +2061,27 @@ static int tcp_inq_hint(struct sock *sk)
  *	Probably, code can be easily improved even more.
  */
 
-int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
-		int flags, int *addr_len)
+static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
+			      int nonblock, int flags,
+			      struct scm_timestamping_internal *tss, int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err, inq;
+	int err;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
-	struct scm_timestamping_internal tss;
-	int cmsg_flags;
-
-	if (unlikely(flags & MSG_ERRQUEUE))
-		return inet_recv_error(sk, msg, len, addr_len);
-
-	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) &&
-	    (sk->sk_state == TCP_ESTABLISHED))
-		sk_busy_loop(sk, nonblock);
-
-	lock_sock(sk);
 
 	err = -ENOTCONN;
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
-	cmsg_flags = tp->recvmsg_inq ? 1 : 0;
+	if (tp->recvmsg_inq)
+		*cmsg_flags = 1;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2270,8 +2261,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		}
 
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
-			tcp_update_recv_tstamps(skb, &tss);
-			cmsg_flags |= 2;
+			tcp_update_recv_tstamps(skb, tss);
+			*cmsg_flags |= 2;
 		}
 
 		if (used + offset < skb->len)
@@ -2297,22 +2288,9 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
-
-	release_sock(sk);
-
-	if (cmsg_flags) {
-		if (cmsg_flags & 2)
-			tcp_recv_timestamp(msg, sk, &tss);
-		if (cmsg_flags & 1) {
-			inq = tcp_inq_hint(sk);
-			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
-		}
-	}
-
 	return copied;
 
 out:
-	release_sock(sk);
 	return err;
 
 recv_urg:
@@ -2323,6 +2301,36 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	err = tcp_peek_sndq(sk, msg, len);
 	goto out;
 }
+
+int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
+		int flags, int *addr_len)
+{
+	int cmsg_flags = 0, ret, inq;
+	struct scm_timestamping_internal tss;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	if (sk_can_busy_loop(sk) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
+	    sk->sk_state == TCP_ESTABLISHED)
+		sk_busy_loop(sk, nonblock);
+
+	lock_sock(sk);
+	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
+				 &cmsg_flags);
+	release_sock(sk);
+
+	if (cmsg_flags && ret >= 0) {
+		if (cmsg_flags & 2)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (cmsg_flags & 1) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+	return ret;
+}
 EXPORT_SYMBOL(tcp_recvmsg);
 
 void tcp_set_state(struct sock *sk, int state)

From patchwork Thu Nov 12 19:02:00 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 1399253
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=)
Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com
header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=hHs3rQMo; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wL6Qd8z9sTR for ; Fri, 13 Nov 2020 06:02:58 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727176AbgKLTC5 (ORCPT ); Thu, 12 Nov 2020 14:02:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45222 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727168AbgKLTCu (ORCPT ); Thu, 12 Nov 2020 14:02:50 -0500 Received: from mail-pf1-x431.google.com (mail-pf1-x431.google.com [IPv6:2607:f8b0:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5EE82C0613D1 for ; Thu, 12 Nov 2020 11:02:50 -0800 (PST) Received: by mail-pf1-x431.google.com with SMTP id v12so5375160pfm.13 for ; Thu, 12 Nov 2020 11:02:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=MhBcpjhEqXsppr3+44nUMYba1BpmJJZSblxJp5+le2M=; b=hHs3rQMo/0DpOnBirtN9fwQtsV7O8/Vn5K6kbcJNbvTQ2OGBwSh31xCZ7e5ADCFLl/ oIkaBpOb0a/aXUTuTgzFXKpO/xwmO31nkSnYC5cxPFky38F1WPuHiqowDNoETKQQ2CzQ RYbgosTreRQKXO0kIwq9q1yBaINc8dxurSWx/rTgQLZ7FlMTW236BL5EoVfVHGHF79th WpDM2v8M94yD9wN1udxTMYm1fyIXMX6J2BtLC6jGorHSrx3FQ4I2jnrhlVmehf4kCTpQ JEeQrhHrk9AQAUV4yaiNVZwzzCQ9wdTlnChiKiMXJ4ppCpACn1xSExm+zK59ircFOBwM P4gg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=MhBcpjhEqXsppr3+44nUMYba1BpmJJZSblxJp5+le2M=; b=OGIRYKC/IT0UZEiB6qWg1rESL/ami+eVKftdhPsGoHiOMEF5v4QNC5P4h2jH31nA94 kDBJf1juaWXJqOmZxyxxw+LPb6wOSoW4361HmO7MzBydILzK4ZRZbULAlL5k6lCLxBPs ard36ypzWmL06UTTuzTGWD+BOi5Zy2W0VA/4M4GXJlOv7ZThjnQ3cCWSTJwsynxhf2mW 
yWaduuelxBcgtEBqQTU2/GokyD083weHK/X19cBsnoul4HaCfzxUJYt1f5Dqt3lFT5xf RoRKGY+PIyLQvz7wNT3QR8wholXs/6QPsbl19aiD3NB4U4frKg9spp1Xe1xIVLuSIiIb vOWA==
X-Gm-Message-State: AOAM532aP5Qw0F+996Vl49d1z8oxJD5oIUVma1BiUT8w0d0BGSwvCQQS GFzO0f/p2gFnQkEX4JES38c=
X-Google-Smtp-Source: ABdhPJxGrknstaFjATZi4QXUC2uFCzdYz47larpPKUlHSux1KjvUX/gpuNi5ZkbE+NOjdcWWBxpRig==
X-Received: by 2002:a17:90a:b88e:: with SMTP id o14mr658030pjr.226.1605207770025; Thu, 12 Nov 2020 11:02:50 -0800 (PST)
Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.02.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:02:49 -0800 (PST)
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next 3/8] tcp: Refactor skb frag fast-forward op for recv zerocopy.
Date: Thu, 12 Nov 2020 11:02:00 -0800
Message-Id: <20201112190205.633640-4-arjunroy.kdev@gmail.com>
X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog
In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
References: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Arjun Roy

Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.

skb_advance_to_frag(), given a skb and an offset into the skb,
iterates from the first frag for the skb until we're at the frag
specified by the offset. Assuming the offset provided refers to how
many bytes in the skb are already read, the returned frag points to the
next frag we may read from, while offset_frag is set to the number of
bytes from this frag that we have already read.
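
The walk described above can be sketched in userspace with a plain array of fragment sizes standing in for the skb. This is a model only: `advance_to_frag()` is a hypothetical name, it returns an index where the kernel helper returns a frag pointer, and the bounds check on `nr_frags` is an addition of the model that the kernel helper does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the fast-forward: skip 'offset_skb' already-read bytes and
 * return the index of the frag holding the next unread byte, with
 * '*offset_frag' set to the bytes of that frag already consumed.
 * Returns -1 if the offset still lies in the linear head, or (model-only
 * guard) if it runs past the last fragment.
 */
int advance_to_frag(uint32_t headlen, const uint32_t *frag_sizes,
		    size_t nr_frags, uint32_t offset_skb,
		    uint32_t *offset_frag)
{
	size_t i = 0;

	if (offset_skb < headlen)
		return -1;
	offset_skb -= headlen;

	while (offset_skb) {
		if (i == nr_frags)
			return -1;
		if (frag_sizes[i] > offset_skb) {
			/* Landed inside frag i: it is partially consumed. */
			*offset_frag = offset_skb;
			return (int)i;
		}
		offset_skb -= frag_sizes[i];
		++i;
	}
	/* Landed exactly on a frag boundary. */
	*offset_frag = 0;
	return (int)i;
}

/* Usage: 5000 bytes read from two 4096-byte frags leaves frag 1 at 904. */
void advance_to_frag_example(void)
{
	const uint32_t sizes[] = { 4096, 4096, 100 };
	uint32_t off = 0;

	assert(advance_to_frag(0, sizes, 3, 5000, &off) == 1 && off == 904);
}
```

A result with a zero `offset_frag` marks a fragment that has not been partially consumed, which is the precondition the commit message gives for attempting to map its page.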

If frag is not null and offset_frag is equal to 0, then we may be able
to map this frag's page into the process address space with
vm_insert_page(). However, if offset_frag is not equal to 0, then we
cannot do so.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 49e33222a68b..ab19d0d00db1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,28 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
+				       u32 *offset_frag)
+{
+	skb_frag_t *frag;
+
+	offset_skb -= skb_headlen(skb);
+	if ((int)offset_skb < 0 || skb_has_frag_list(skb))
+		return NULL;
+
+	frag = skb_shinfo(skb)->frags;
+	while (offset_skb) {
+		if (skb_frag_size(frag) > offset_skb) {
+			*offset_frag = offset_skb;
+			return frag;
+		}
+		offset_skb -= skb_frag_size(frag);
+		++frag;
+	}
+	*offset_frag = 0;
+	return frag;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1865,6 +1887,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			u32 offset_frag;
+
 			/* If we're here, finish the current batch. */
 			if (pg_idx) {
 				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
@@ -1885,16 +1909,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
 			zc->recv_skip_hint = skb->len - offset;
-			offset -= skb_headlen(skb);
-			if ((int)offset < 0 || skb_has_frag_list(skb))
+			frags = skb_advance_to_frag(skb, offset, &offset_frag);
+			if (!frags || offset_frag)
 				break;
-			frags = skb_shinfo(skb)->frags;
-			while (offset) {
-				if (skb_frag_size(frags) > offset)
-					goto out;
-				offset -= skb_frag_size(frags);
-				frags++;
-			}
 		}
 		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
 			int remaining = zc->recv_skip_hint;

From patchwork Thu Nov 12 19:02:01 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 1399254
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=)
Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=UdHAUUXv; dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wP3cK1z9s0b for ; Fri, 13 Nov 2020 06:03:01 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727182AbgKLTDA (ORCPT ); Thu, 12 Nov 2020 14:03:00 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45246 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727162AbgKLTC5 (ORCPT ); Thu, 12 Nov 2020 14:02:57 -0500
Received: from mail-pf1-x430.google.com (mail-pf1-x430.google.com
[IPv6:2607:f8b0:4864:20::430]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3AFD3C0613D1 for ; Thu, 12 Nov 2020 11:02:57 -0800 (PST) Received: by mail-pf1-x430.google.com with SMTP id v12so5375457pfm.13 for ; Thu, 12 Nov 2020 11:02:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ENPO+v4os7TsDF1riNbiaNyIMFiM3yB/cjtGO2Qk/DI=; b=UdHAUUXvD2IAL0B8AgO/QOVsxduTcvJjhjtrtvJVBtIdry9/4u+jYXo0Zec0MPaDoR 5bJZ7VT/M24wOJjob6rGHaSeYFxrA9qTaVvEeu53DC3b9ICjUj85OxBLSP9UUHVfQVkG knXq9sGpzVh3Gmq7tGU5pkBayabEIAA+1PT7f+T2+Roj/spqYSWOehCG5sxqkBEDkRWZ S45IYvnA6cheGc6Yg8ZWbk8332smnLyvSO2s584TRaNrnYowjcDK7Et+t2hj8TAHfwCp M2OEJwsnij4HKq+4WBBdq1s/LJDoehnsKU0aS9QdQiaqwX5my/MWggNoiex9bSVfTRbq kpUw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ENPO+v4os7TsDF1riNbiaNyIMFiM3yB/cjtGO2Qk/DI=; b=in15RN4tF11JeDV4nr7vDNBwsf/IscmNLQF0SWcyvc2s8PcBqqbrALktJuNa2TTMiM Vw/aHNXZHpZTj7t6h6NJUwdwF1CosW+gVcc9aYM36hD10f3ZTKC9dsKKSCE09nvGqXVc 0ElLwZ5dfUJqHYm2eZSW7UCR3WolNiEGYbtsXfjyArCROmCDH9jQGo2FSjM6WNj0NRyK HXf/NVsFc+wCeJNgjeJlU82oRU95Oo5wiaP7gSY9X7zh5ZD0s9HKpl0EJLk7Hy73Z8hA S1ofhK/Eq9nuFW5t+Pr4XqSdy8yhyHV33xicHaeNx2zu64IbDmCgyCTiwkuxT83o7Nv9 TZ6w== X-Gm-Message-State: AOAM531d13EzUurbaFShLExt8DHG1Tf2qM7SKiNvvAL1kC13/Up0m69/ D2PG3nZdBLyQst7rTxpbSZQ= X-Google-Smtp-Source: ABdhPJzgJUFaYb3C8gwM89KTlz+PlZh+qgKPusOydUC1MCe4lcr8Yqfg6lAXmtUEvvDbDPqUYJmM4g== X-Received: by 2002:a63:6484:: with SMTP id y126mr794209pgb.320.1605207776855; Thu, 12 Nov 2020 11:02:56 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.02.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 
12 Nov 2020 11:02:56 -0800 (PST)
From: Arjun Roy
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com
Subject: [net-next 4/8] tcp: Refactor frag-is-remappable test for recv zerocopy.
Date: Thu, 12 Nov 2020 11:02:01 -0800
Message-Id: <20201112190205.633640-5-arjunroy.kdev@gmail.com>
X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog
In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
References: <20201112190205.633640-1-arjunroy.kdev@gmail.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Arjun Roy

Refactor frag-is-remappable test for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab19d0d00db1..f3bd606a678d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1765,6 +1765,26 @@ static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
 	return frag;
 }
 
+static bool can_map_frag(const skb_frag_t *frag)
+{
+	return skb_frag_size(frag) == PAGE_SIZE && !skb_frag_off(frag);
+}
+
+static int find_next_mappable_frag(const skb_frag_t *frag,
+				   int remaining_in_skb)
+{
+	int offset = 0;
+
+	if (likely(can_map_frag(frag)))
+		return 0;
+
+	while (offset < remaining_in_skb && !can_map_frag(frag)) {
+		offset += skb_frag_size(frag);
+		++frag;
+	}
+	return offset;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1886,6 +1906,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	ret = 0;
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
+		int mappable_offset;
+
 		if (zc->recv_skip_hint < PAGE_SIZE) {
 			u32 offset_frag;
 
@@ -1913,15 +1935,11 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			if (!frags || offset_frag)
 				break;
 		}
-		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
-			int remaining = zc->recv_skip_hint;
 
-			while (remaining && (skb_frag_size(frags) != PAGE_SIZE ||
-					     skb_frag_off(frags))) {
-				remaining -= skb_frag_size(frags);
-				frags++;
-			}
-			zc->recv_skip_hint -= remaining;
+		mappable_offset = find_next_mappable_frag(frags,
+							  zc->recv_skip_hint);
+		if (mappable_offset) {
+			zc->recv_skip_hint = mappable_offset;
 			break;
 		}
 		pages[pg_idx] = skb_frag_page(frags);

From patchwork Thu Nov 12 19:02:02 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arjun Roy
X-Patchwork-Id: 1399258
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=)
Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=mMdZq0Tq; dkim-atps=neutral
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wd26Zjz9s0b for ; Fri, 13 Nov 2020 06:03:13 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727196AbgKLTDK (ORCPT ); Thu, 12 Nov 2020 14:03:10 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45256 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726759AbgKLTDA (ORCPT ); Thu, 12 Nov 2020 14:03:00 -0500
Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com
[IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8CCAFC0613D4 for ; Thu, 12 Nov 2020 11:02:59 -0800 (PST) Received: by mail-pf1-x443.google.com with SMTP id v12so5375575pfm.13 for ; Thu, 12 Nov 2020 11:02:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=fMyNeyMIQsirgjnCDGnyYpd6LU6jKA1eMXSUgYknCfA=; b=mMdZq0TqbhvbECFPkP5NrcF27u8ZaM+0BT7pGvEgWLZw+S4Ro96+qcXTgx5c8Ed9ZY MQA8IgpdmTrU3FFDiy0Uy22uXqfpCKoYpKv39CfBUEhhU00nvG0CQiCANJrbf9e/rUxC DOhg236RXtfTRKxacbanc3Srw9fdOEoNGGPLe8vQ/JpGk5iDuou29TgeJIF8e1eGUDTi 8X5Zg7Bk/X4wBr/Z9Zaccn69GWyvOz2qiKugw5uZ0QbQnsIq65r+8BRPoazn7lElKcIU kqw9+XWv3f0T9rkPsU36tJy+YNe4obxgOq4fwqA0xP1DDq79lu/SweLTWbkHgLW9Al+o ZPHw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=fMyNeyMIQsirgjnCDGnyYpd6LU6jKA1eMXSUgYknCfA=; b=Sib/S5mxLXV9XNIggA0k2CgGWkH9brsqB287cYNQa8QfHzF7/BxGb6oYlxrHemTkSL w17lyamcvPOlF5KrQ8tHYzMkwhDwjo1IyGEANcr9cICrc54OugSOJeDYGYAM2aKwx00e jZ4sCgjRghM0nSi2WMm86NFfOnIGaKXpGsEIOrJCJs1IQYmBhkbDVCr/DUXCR9Yjjctw OJCoOIjTxIkpb8NlAstFPa5pLOSLs4E6/sLt0tSxdkucDodtaDW83UzlfeEIrMaEPucQ a3A0F2yX1u5VjYGp0yqUHrLC+SE5u5QSbTqk/8X7iixHS18GiqdoAX4ivJEQ+cu4qb7S e83Q== X-Gm-Message-State: AOAM533bdme8uMvxYOpdfkNy0BJnZhHQ1jx7FPBfSk+mnxVoWJ8q227u 4LB8gextbdBFi27od46Nn4o= X-Google-Smtp-Source: ABdhPJwC2RgJ1FH0a2O+NqqX59YSlLjb4439DSuUKbP/LrSyBizn4tpwrFFZl0lZddqo5T5xTXgRyA== X-Received: by 2002:a63:fc1c:: with SMTP id j28mr720539pgi.95.1605207779125; Thu, 12 Nov 2020 11:02:59 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.02.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 
Nov 2020 11:02:58 -0800 (PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next 5/8] tcp: Fast return if inq < PAGE_SIZE for recv zerocopy. Date: Thu, 12 Nov 2020 11:02:02 -0800 Message-Id: <20201112190205.633640-6-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com> References: <20201112190205.633640-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, in which case we cannot remap pages. In this case, simply return the appropriate hint for regular copying without taking mmap_sem. Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- net/ipv4/tcp.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f3bd606a678d..38f8e03f1182 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1885,6 +1885,14 @@ static int tcp_zerocopy_receive(struct sock *sk, sock_rps_record_flow(sk); + if (inq < PAGE_SIZE) { + zc->length = 0; + zc->recv_skip_hint = inq; + if (!inq && sock_flag(sk, SOCK_DONE)) + return -EIO; + return 0; + } + mmap_read_lock(current->mm); vma = find_vma(current->mm, address); From patchwork Thu Nov 12 19:02:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 1399255 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) 
header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=UjJcQ35X; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wT6Z1Sz9s0b for ; Fri, 13 Nov 2020 06:03:05 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727189AbgKLTDE (ORCPT ); Thu, 12 Nov 2020 14:03:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45264 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727185AbgKLTDC (ORCPT ); Thu, 12 Nov 2020 14:03:02 -0500 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EEFD0C0613D6 for ; Thu, 12 Nov 2020 11:03:01 -0800 (PST) Received: by mail-pg1-x544.google.com with SMTP id f27so5000076pgl.1 for ; Thu, 12 Nov 2020 11:03:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ej/ere3U8cds1oBI8o3BGl4O7tQDAJUPvUxGNKrx630=; b=UjJcQ35XsQvFnjW6IMbXC9yfwNGVlpvK5WBA76EqsULKYbgI6t3ExTefRhTS2ska6V VhBRPW8otyVSwZh8TZSTV66LUmGZiiBEmRQJ6Ho9juJ2aou1FgxbnlNloe4GUKvUSVUL Nfa5yPhNRt/PHSkhYVs0RikkE6iXGr7yFiB8bj6TzSfUY4iP9r6qppBjnSKVri+ld62c iduLWU4hraTk62tKB4XLe8pebLf8EG2L4KTFEZGNPzKEg2NRi2rA0AF3opOxuEynHrjd jLhD/FTgM8eXLOvQq1WLI3ckNMzrM6U//3PsVZ4qTzTy0uD0YisBD2bMAunHFfM2MdCj 4q9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ej/ere3U8cds1oBI8o3BGl4O7tQDAJUPvUxGNKrx630=; b=hzUfWuHAiVk3aIhtB+Bzxfs3gfp6j3AYGrQAlI6+HCh2os+DLzDqWDsDnZV0iTnHoV 
xTCyD9P6k/JQE2muxJEHziqY5wocUtT84qFz7nr8YChACj48Y/czIbzjfHVEzP9R+f1l N8QCYxBRf3EatlKdmfqNaXPyf9xkd7Ti2nbxnS3S/c1NkQdvcYciZy/ahpXM6CAtzpgk A1SnDORRELNG/wHPXa+bqOYJcTx8j35PCHZZVMRQksukNDureH+t6rcWEwE64sbxN34G 1CkuGcOsSm/OCVhQrlZCr2SaMjU5GrWsbrMeLZF3+jcC4ULnR6loDw+4myFhXJGD380Q bvZw== X-Gm-Message-State: AOAM533F29bAccWd2B8yTaKW/KrvuQjKdQEju+vAbtsZzCE3Wp+sRAKI LR0C283ltOcrd04nrv55E56KETgdsXY= X-Google-Smtp-Source: ABdhPJzmO5935W2f3gVfx7VmIWuy0XUsNtt6O0D+k7PvGUTlNQHcqtWN9PpQ8D++F8uwANkdGc/Hfw== X-Received: by 2002:a63:db50:: with SMTP id x16mr760072pgi.205.1605207781555; Thu, 12 Nov 2020 11:03:01 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.03.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:03:01 -0800 (PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next 6/8] tcp: Introduce short-circuit small reads for recv zerocopy. Date: Thu, 12 Nov 2020 11:02:03 -0800 Message-Id: <20201112190205.633640-7-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com> References: <20201112190205.633640-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, or inq is generally small enough that it is cheaper to copy rather than remap pages. In these cases, we may want to either return early (inq=0) or attempt to use the provided copy buffer to simply copy the received data. This allows us to save both system call overhead and the latency of acquiring mmap_sem in read mode for cases where it would be useless to do so. This patch enables this behaviour by: 1. Returning quickly if inq is 0. 2. 
Attempting to perform a regular copy if a hybrid copybuffer is provided and it is large enough to absorb all available bytes. 3. Returning quickly if no such buffer was provided and fewer than PAGE_SIZE bytes are available. For small RPC ping-pong workloads, normally we would have 1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this change, we remove the recvmsg() call entirely, reducing the syscall overhead by about 33%. In testing with small (hundreds of bytes) RPC traffic, this yields a syscall reduction of about 33% and an efficiency gain of about 3-5%, where efficiency is defined as QPS/CPU utilization. Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- net/ipv4/tcp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 38f8e03f1182..ca45a875147e 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1785,6 +1785,35 @@ static int find_next_mappable_frag(const skb_frag_t *frag, return offset; } +static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, + int nonblock, int flags, + struct scm_timestamping_internal *tss, + int *cmsg_flags); +static int receive_fallback_to_copy(struct sock *sk, + struct tcp_zerocopy_receive *zc, int inq) +{ + struct scm_timestamping_internal tss_unused; + int err, cmsg_flags_unused; + struct msghdr msg = {}; + struct iovec iov; + + zc->length = 0; + zc->recv_skip_hint = 0; + + err = import_single_range(READ, (void __user *)zc->copybuf_address, + inq, &iov, &msg.msg_iter); + if (err) + return err; + + err = tcp_recvmsg_locked(sk, &msg, inq, /*nonblock=*/1, /*flags=*/0, + &tss_unused, &cmsg_flags_unused); + if (err < 0) + return err; + + zc->copybuf_len = err; + return 0; +} + static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc, struct sk_buff *skb, u32 copylen, u32 *offset, u32 *seq) @@ -1885,6 +1914,9 @@ static int tcp_zerocopy_receive(struct sock *sk, sock_rps_record_flow(sk); + if (inq && 
inq <= copybuf_len) + return receive_fallback_to_copy(sk, zc, inq); + if (inq < PAGE_SIZE) { zc->length = 0; zc->recv_skip_hint = inq; From patchwork Thu Nov 12 19:02:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 1399256 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=Xp/N/hxo; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wW0LMBz9s0b for ; Fri, 13 Nov 2020 06:03:07 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727190AbgKLTDF (ORCPT ); Thu, 12 Nov 2020 14:03:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45272 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727187AbgKLTDD (ORCPT ); Thu, 12 Nov 2020 14:03:03 -0500 Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com [IPv6:2607:f8b0:4864:20::62a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E019FC0613D4 for ; Thu, 12 Nov 2020 11:03:03 -0800 (PST) Received: by mail-pl1-x62a.google.com with SMTP id y22so3282201plr.6 for ; Thu, 12 Nov 2020 11:03:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=uBhuHxEFEMnspB71cOl9GHtE64ZEIdLY3Dvadheva4U=; 
b=Xp/N/hxoji6XQLgqhl/X/p+pEIPgAwjfyiB4NdwtQsFvVqzWdjFgTgapwuxL3s/OLe b8OPa34189evpzpj6I+K8nZgRlTQ5LaU1cqF6ysXMJ7GqpQAM1fA6Rm7pANhYfTOPUpf ht9tr2TRO8TH1zSNCEmYaH4tVtt7VClvycgTjsFx+RCQtvGB+MC2o62JzrWZQ1vkBxCa Rvj4nTI7PzFIw2HSw93Wj+MU8SJ2ckEfpZXrHD9h2cJOV3SD6kE4NDqpnuuE6bnSzEk6 j6RUKetiN/+ke0F7ezxvWKI0lNTTC4xYLNGGOcxB/CMVq6Tbf8HMHNqjWUEkY9SVXZR/ iQag== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=uBhuHxEFEMnspB71cOl9GHtE64ZEIdLY3Dvadheva4U=; b=iRwlgOTJnL9TfMxYNjexbubDD57Ri2C9q35qULbFyv9VCThkw6ukCKMQmBTVktxFD6 qmJLD5uM5gdQJQZMPqn5kouHuGzkiAIg+6Wk8Dax5ZokT3mD0lqS1zukSuVsx0nNVnhV qiAigHWfSOCg1CUnkIJxYGosxNZq/7G+zUAzD9uiEmcwhu/o4lYn1rzkepdzgZI2Te4B Eqz60T1LNQZKp7wRdn4wlrGsO7oLy+q2Hc/7q5WLfywZ4vNNd92qInpd2l5t+vglFSvr s/OO4b8VZXt+WJafIo2fhSX81zGafhyJOqfJtCQLM3wIe45mJl/W9X4m/uckKdGpiLep l+yQ== X-Gm-Message-State: AOAM5321MwIo5llqn57dxOolKK6gk+weayNyn5C0M2oJN8Gjhm/nvssH 4uqsOQqAeqQpAFuIfqY95o0= X-Google-Smtp-Source: ABdhPJws0f9F+EBNnZGx6T2aCT3y4t4QvMvRQul4f0u89g+MFyGGIp9pCTNhuGRXseraRfnDHyc15w== X-Received: by 2002:a17:902:7e47:b029:d6:c9f2:d50 with SMTP id a7-20020a1709027e47b02900d6c9f20d50mr679234pln.81.1605207783518; Thu, 12 Nov 2020 11:03:03 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.03.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:03:03 -0800 (PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next 7/8] tcp: Set zerocopy hint when data is copied Date: Thu, 12 Nov 2020 11:02:04 -0800 Message-Id: <20201112190205.633640-8-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog In-Reply-To: 
<20201112190205.633640-1-arjunroy.kdev@gmail.com> References: <20201112190205.633640-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy Set zerocopy hint, even when falling back to copy, so that the pending data can be efficiently received using zerocopy when possible. Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- net/ipv4/tcp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index ca45a875147e..c06ba63f4caf 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1785,6 +1785,43 @@ static int find_next_mappable_frag(const skb_frag_t *frag, return offset; } +static void tcp_zerocopy_set_hint_for_skb(struct sock *sk, + struct tcp_zerocopy_receive *zc, + struct sk_buff *skb, u32 offset) +{ + u32 frag_offset, partial_frag_remainder = 0; + int mappable_offset; + skb_frag_t *frag; + + /* worst case: skip to next skb. try to improve on this case below */ + zc->recv_skip_hint = skb->len - offset; + + /* Find the frag containing this offset (and how far into that frag) */ + frag = skb_advance_to_frag(skb, offset, &frag_offset); + if (!frag) + return; + + if (frag_offset) { + struct skb_shared_info *info = skb_shinfo(skb); + + /* We read part of the last frag, must recvmsg() rest of skb. */ + if (frag == &info->frags[info->nr_frags - 1]) + return; + + /* Else, we must at least read the remainder in this frag. */ + partial_frag_remainder = skb_frag_size(frag) - frag_offset; + zc->recv_skip_hint -= partial_frag_remainder; + ++frag; + } + + /* partial_frag_remainder: If part way through a frag, must read rest. + * mappable_offset: Bytes till next mappable frag, *not* counting bytes + * in partial_frag_remainder. 
+ */ + mappable_offset = find_next_mappable_frag(frag, zc->recv_skip_hint); + zc->recv_skip_hint = mappable_offset + partial_frag_remainder; +} + static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, struct scm_timestamping_internal *tss, @@ -1811,6 +1848,14 @@ static int receive_fallback_to_copy(struct sock *sk, return err; zc->copybuf_len = err; + if (likely(zc->copybuf_len)) { + struct sk_buff *skb; + u32 offset; + + skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset); + if (skb) + tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset); + } return 0; } From patchwork Thu Nov 12 19:02:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Arjun Roy X-Patchwork-Id: 1399257 Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=RgVdynPu; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CX9wY74ypz9sSn for ; Fri, 13 Nov 2020 06:03:09 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727192AbgKLTDJ (ORCPT ); Thu, 12 Nov 2020 14:03:09 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45280 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726478AbgKLTDG (ORCPT ); Thu, 12 Nov 2020 14:03:06 -0500 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) 
by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 71E9BC0613D1 for ; Thu, 12 Nov 2020 11:03:06 -0800 (PST) Received: by mail-pg1-x543.google.com with SMTP id f27so5000231pgl.1 for ; Thu, 12 Nov 2020 11:03:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=bxWPmmqXRCqqJnJ/gmtyQ3yO6P918qwJPyfH8rwd70M=; b=RgVdynPuJ3ZKfTYtlhq6PtODYm8fGXxfh9D9sHJIcN6gFy/NYVfs9X5dAAUHM4OHAG 7m0hth9rAyT/zopKi6mh3d06Q246vbxpGiMUyhtEdu1C4zTaifIzxFI6zQhteMylFaxn anWWmI8cBwfRJ2VZmpJKZGXXnQHa8NwNW5jeLDOTlp3wLBvf8m76fCzpnwp8b6Pfii+q BSHBNCq2Y4lSOs5DWDPdVoHeKZUaV4+mfQ3bKk/Wr4+SiE2EERDJnatM2tPITFcDfyf+ YDXCtDvd5UP3N4Lz8DkbGbQpB1+mJPSEPrLNvy6MSfYKA6WzZUfhmMxAOyVd0awVavaf 2ZVA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=bxWPmmqXRCqqJnJ/gmtyQ3yO6P918qwJPyfH8rwd70M=; b=F+i2dkF+1kn5bby7GUdCxIVPTrtVgCM5v2nXnNU8mE7mTn8LXl+cvSms+jVyi1CHf7 MY9cu7D9L2KPGwHubw5pDdumyVCJ1TbY7Xqioq+AsZbj81Q9PslEyPV9R7TpUXp1wxy2 AnED8Q5yafhUjvk3IOKdTLa4fcODTp4Rt5x6ABW6YJsWJ705Kjppz0xPjdUVS+JMErPt rWS8v4YwQxVEURMKHax9sCDMFuklANz2GGRYbMIjet6sD/kx8kAxONjlhgC9v+2EBnNi dF8ZeUkf51V9X8P0twAz+6s3svKfPS8NVBFBMEvEx3IiNDtDterOuOLRlG8+KkL/YTCK LqWA== X-Gm-Message-State: AOAM5307sLsYTUHzKSkIhcS55nSFWdwje0Q67q49074lMGbIJ5yZS0fq 63jFUiGLLJ5jLsIbID7BDRRFMLxZcNc= X-Google-Smtp-Source: ABdhPJzXJJvQ/SFpAhcFcqoFWMHQc/BhzURWdxp0WrcqJncZU2j5PG0R4a+i60Emrl6Dspd1OTfnaQ== X-Received: by 2002:a63:1f53:: with SMTP id q19mr797115pgm.286.1605207785968; Thu, 12 Nov 2020 11:03:05 -0800 (PST) Received: from phantasmagoria.svl.corp.google.com ([2620:15c:2c4:201:f693:9fff:feea:f0b9]) by smtp.gmail.com with ESMTPSA id z7sm7458809pfq.214.2020.11.12.11.03.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Nov 2020 11:03:05 -0800 
(PST) From: Arjun Roy To: davem@davemloft.net, netdev@vger.kernel.org Cc: arjunroy@google.com, edumazet@google.com, soheil@google.com Subject: [net-next 8/8] tcp: Defer vm zap unless actually needed for recv zerocopy. Date: Thu, 12 Nov 2020 11:02:05 -0800 Message-Id: <20201112190205.633640-9-arjunroy.kdev@gmail.com> X-Mailer: git-send-email 2.29.2.222.g5d2a92d10f8-goog In-Reply-To: <20201112190205.633640-1-arjunroy.kdev@gmail.com> References: <20201112190205.633640-1-arjunroy.kdev@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Arjun Roy Zapping pages is required only if we are calling vm_insert_page into a region where pages had previously been mapped. Receive zerocopy allows reusing such regions, and hitherto called zap_page_range() before calling vm_insert_page() in that range. zap_page_range() can also be triggered from userspace with madvise(MADV_DONTNEED). If userspace is configured to call this before reusing a segment, or if there was nothing mapped at this virtual address to begin with, we can avoid calling zap_page_range() under the socket lock. That said, if userspace does not do that, then we are still responsible for calling zap_page_range(). This patch adds a flag that the user can use to hint to the kernel that a zap is not required. If the flag is not set, or if an older user application does not have a flags field at all, then the kernel calls zap_page_range as before. Also, if the flag is set but a zap is still required, the kernel performs that zap as necessary. Thus, incorrectly indicating that a zap can be avoided does not affect the correctness of operation. This patch also increases the batch size for vm_insert_pages and prefetches the page struct for each page in the batch, since we're about to bump the refcount. An alternative mechanism could be to not have a flag, assume by default a zap is not needed, and fall back to zapping if needed. 
However, this would harm performance for older applications for which a zap is necessary, and thus we implement it with an explicit flag so newer applications can opt in. When using RPC-style traffic with medium-sized (tens of KB) RPCs, this change yields an efficiency improvement of about 30% for QPS/CPU usage. Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh --- include/uapi/linux/tcp.h | 2 + net/ipv4/tcp.c | 147 ++++++++++++++++++++++++++------------- 2 files changed, 99 insertions(+), 50 deletions(-) diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 62db78b9c1a0..13ceeb395eb8 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -343,6 +343,7 @@ struct tcp_diag_md5sig { /* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */ +#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1 struct tcp_zerocopy_receive { __u64 address; /* in: address of mapping */ __u32 length; /* in/out: number of bytes to map/mapped */ @@ -351,5 +352,6 @@ struct tcp_zerocopy_receive { __s32 err; /* out: socket error */ __u64 copybuf_address; /* in: copybuf address (small reads) */ __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ + __u32 flags; /* in: flags */ }; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index c06ba63f4caf..309fe0146bb4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1901,51 +1901,101 @@ static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc, return zc->copybuf_len < 0 ? 0 : copylen; } +static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma, + struct page **pending_pages, + unsigned long pages_remaining, + unsigned long *address, + u32 *length, + u32 *seq, + struct tcp_zerocopy_receive *zc, + u32 total_bytes_to_map, + int err) +{ + /* At least one page did not map. Try zapping if we skipped earlier. 
*/ + if (err == -EBUSY && + zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT) { + u32 maybe_zap_len; + + maybe_zap_len = total_bytes_to_map - /* All bytes to map */ + *length + /* Mapped or pending */ + (pages_remaining * PAGE_SIZE); /* Failed map. */ + zap_page_range(vma, *address, maybe_zap_len); + err = 0; + } + + if (!err) { + unsigned long leftover_pages = pages_remaining; + int bytes_mapped; + + /* We called zap_page_range, try to reinsert. */ + err = vm_insert_pages(vma, *address, + pending_pages, + &pages_remaining); + bytes_mapped = PAGE_SIZE * (leftover_pages - pages_remaining); + *seq += bytes_mapped; + *address += bytes_mapped; + } + if (err) { + /* Either we were unable to zap, OR we zapped, retried an + * insert, and still had an issue. Either way, pages_remaining + * is the number of pages we were unable to map, and we unroll + * some state we speculatively touched before. + */ + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; + + *length -= bytes_not_mapped; + zc->recv_skip_hint += bytes_not_mapped; + } + return err; +} + static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, struct page **pages, - unsigned long pages_to_map, - unsigned long *insert_addr, - u32 *length_with_pending, + unsigned int pages_to_map, + unsigned long *address, + u32 *length, u32 *seq, - struct tcp_zerocopy_receive *zc) + struct tcp_zerocopy_receive *zc, + u32 total_bytes_to_map) { unsigned long pages_remaining = pages_to_map; - int bytes_mapped; - int ret; + unsigned int pages_mapped; + unsigned int bytes_mapped; + int err; - ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); - bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); + err = vm_insert_pages(vma, *address, pages, &pages_remaining); + pages_mapped = pages_to_map - (unsigned int)pages_remaining; + bytes_mapped = PAGE_SIZE * pages_mapped; /* Even if vm_insert_pages fails, it may have partially succeeded in * mapping (some but not all of the pages). 
*/ *seq += bytes_mapped; - *insert_addr += bytes_mapped; - if (ret) { - /* But if vm_insert_pages did fail, we have to unroll some state - * we speculatively touched before. - */ - const int bytes_not_mapped = PAGE_SIZE * pages_remaining; - *length_with_pending -= bytes_not_mapped; - zc->recv_skip_hint += bytes_not_mapped; - } - return ret; + *address += bytes_mapped; + + if (likely(!err)) + return 0; + + /* Error: maybe zap and retry + rollback state for failed inserts. */ + return tcp_zerocopy_vm_insert_batch_error(vma, pages + pages_mapped, + pages_remaining, address, length, seq, zc, total_bytes_to_map, + err); } +#define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32 static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { - u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0; + u32 length = 0, offset, vma_len, avail_len, copylen = 0; unsigned long address = (unsigned long)zc->address; + struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE]; s32 copybuf_len = zc->copybuf_len; struct tcp_sock *tp = tcp_sk(sk); - #define PAGE_BATCH_SIZE 8 - struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; + unsigned int pages_to_map = 0; struct vm_area_struct *vma; struct sk_buff *skb = NULL; - unsigned long pg_idx = 0; - unsigned long curr_addr; u32 seq = tp->copied_seq; + u32 total_bytes_to_map; int inq = tcp_inq(sk); int ret; @@ -1979,34 +2029,24 @@ static int tcp_zerocopy_receive(struct sock *sk, } vma_len = min_t(unsigned long, zc->length, vma->vm_end - address); avail_len = min_t(u32, vma_len, inq); - aligned_len = avail_len & ~(PAGE_SIZE - 1); - if (aligned_len) { - zap_page_range(vma, address, aligned_len); - zc->length = aligned_len; + total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1); + if (total_bytes_to_map) { + if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT)) + zap_page_range(vma, address, total_bytes_to_map); + zc->length = total_bytes_to_map; zc->recv_skip_hint = 0; } else { zc->length = avail_len; 
zc->recv_skip_hint = avail_len; } ret = 0; - curr_addr = address; while (length + PAGE_SIZE <= zc->length) { int mappable_offset; + struct page *page; if (zc->recv_skip_hint < PAGE_SIZE) { u32 offset_frag; - /* If we're here, finish the current batch. */ - if (pg_idx) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, - pg_idx, - &curr_addr, - &length, - &seq, zc); - if (ret) - goto out; - pg_idx = 0; - } if (skb) { if (zc->recv_skip_hint > 0) break; @@ -2027,24 +2067,31 @@ static int tcp_zerocopy_receive(struct sock *sk, zc->recv_skip_hint = mappable_offset; break; } - pages[pg_idx] = skb_frag_page(frags); - pg_idx++; + page = skb_frag_page(frags); + prefetchw(page); + pages[pages_to_map++] = page; length += PAGE_SIZE; zc->recv_skip_hint -= PAGE_SIZE; frags++; - if (pg_idx == PAGE_BATCH_SIZE) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, - &curr_addr, &length, - &seq, zc); + if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE || + zc->recv_skip_hint < PAGE_SIZE) { + /* Either full batch, or we're about to go to next skb + * (and we cannot unroll failed ops across skbs). + */ + ret = tcp_zerocopy_vm_insert_batch(vma, pages, + pages_to_map, + &address, &length, + &seq, zc, + total_bytes_to_map); if (ret) goto out; - pg_idx = 0; + pages_to_map = 0; } } - if (pg_idx) { - ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, - &curr_addr, &length, &seq, - zc); + if (pages_to_map) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map, + &address, &length, &seq, + zc, total_bytes_to_map); } out: mmap_read_unlock(current->mm);
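As a worked example of the maybe_zap_len arithmetic in tcp_zerocopy_vm_insert_batch_error() above, here is a small userspace model. It is only a sketch under stated assumptions: model_maybe_zap_len() and MODEL_PAGE_SIZE are hypothetical names that do not exist in the kernel, and a 4 KiB page size is assumed, whereas the kernel uses the architecture's PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096u

/*
 * Hypothetical helper mirroring the maybe_zap_len expression in the patch:
 *   total_bytes_to_map        - all bytes this call intends to map
 *   length                    - bytes already mapped or queued in a batch
 *   pages_remaining           - pages of the current batch that failed to map
 */
static uint32_t model_maybe_zap_len(uint32_t total_bytes_to_map,
				    uint32_t length,
				    uint32_t pages_remaining)
{
	/* The receive loop never queues more than the total, so this holds. */
	assert(length <= total_bytes_to_map);

	/* Bytes not yet queued, plus the queued pages that failed to map. */
	return total_bytes_to_map - length + pages_remaining * MODEL_PAGE_SIZE;
}
```

For instance, with 64 pages to map in total, a 32-page batch already counted in length, and 22 pages of that batch left unmapped, the model returns 54 pages' worth of bytes. Since the address has already advanced past the 10 pages that did map, that is the whole remainder of the target range.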