From patchwork Wed Jan 17 12:19:58 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sowmini Varadhan
X-Patchwork-Id: 862248
From: Sowmini Varadhan
To: netdev@vger.kernel.org, willemdebruijn.kernel@gmail.com
Cc: davem@davemloft.net, rds-devel@oss.oracle.com,
 sowmini.varadhan@oracle.com, santosh.shilimkar@oracle.com
Subject: [PATCH RFC net-next 0/6] rds: zerocopy support
Date: Wed, 17 Jan 2018 04:19:58 -0800
X-Mailer: git-send-email 1.7.1
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

This patch series provides support for MSG_ZEROCOPY on a PF_RDS socket,
based on the APIs and infrastructure added by f214f915e7db
("tcp: enable MSG_ZEROCOPY").

For single-threaded rds-stress testing using rds-tcp with the ixgbe
driver and 1M message sizes (-a 1M -q 1M), preliminary results show a
significant reduction in latency: about 90 usec with zerocopy, compared
with 200 usec without zerocopy.

Additional testing/debugging is ongoing, but I am sharing the current
patchset to get some feedback on API design choices, especially for the
send-completion notification for multi-threaded datagram socket
applications.

Brief RDS Architectural overview:
PF_RDS sockets implement message-bounded datagram semantics over a
reliable transport. The RDS socket layer tracks message boundaries and
uses an underlying transport like TCP to segment/reassemble the message
into MTU sized frames. In addition to the reliable, ordered delivery
semantics provided by the transport, the RDS layer also retains the
datagram in its retransmit queue, to be resent in case of transport
failure/restart events.

This patchset modifies the above for zerocopy in the following manner.
- if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
- if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned. The pinning
uses the accounting infrastructure added by a91dbff551a6
("sock: ulimit on MSG_ZEROCOPY pages").

The message is unpinned after we get back an ACK (TCP ACK, in the case
of rds-tcp) indicating that the RDS module at the receiver has received
the datagram, and it is safe for the sender to free the message from
its (RDS) retransmit queue.

The payload bytes in the message may not be modified for as long as
the message is pinned. A multi-threaded application using this
infrastructure thus needs to be notified about send-completion, and
that notification must uniquely identify the message to the application
so that the application buffers may be freed/reused.

Unique identification of the message in the completion notification is
done in the following manner:
- application passes down a 32 bit cookie as ancillary data with
  rds_sendmsg(). The ancillary data in this case has
  cmsg_level == SOL_RDS and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
- upon send-completion, the rds module passes up a batch of cookies on
  the sk_error_queue associated with the PF_RDS socket. The message
  thus received will have a batch of N cookies in the data, with the
  number of cookies (N) specified in the ancillary data passed with
  recvmsg(). The current patchset sets up the ancillary data as a
  sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY and
  ee_data == N, based on 52267790ef52 ("sock: add MSG_ZEROCOPY").
  Alternate suggestions for designing this API are invited. The
  important point here is that the notification needs to be able to
  carry an arbitrary number of cookies, where each cookie allows the
  application to uniquely identify a buffer used with sendmsg().

Note that cookie-batching on send-completion notification means that
the application may not know the buffering requirements a priori, and
the buffer sent down with recvmsg() on the MSG_ERRQUEUE may be smaller
than the size required for the notifications to be sent. To accommodate
this case, sk_error_queue has been enhanced to support MSG_PEEK
semantics (so that the application can retry with a larger buffer).

Work in progress
- additional testing: when we test this with rds-stress with 8 sockets
  and a send depth of 64 (i.e. each socket can have at most 64
  outstanding requests), some data corruption is reported by
  rds-stress. We are working on drilling down to the root cause.
- optimizing the send-completion notification API: our use-cases are
  multi-threaded, and we want to be able to reuse buffers as soon as
  possible (instead of waiting for the req-resp transaction to
  complete). A sub-optimal design of the completion notification can
  actually cause a perf deterioration (system-call overhead to reap
  notifications; throughput can go down because the application does
  not send "fast enough", even though latency is small), so this area
  needs to be optimized carefully.
- additional test results beyond the rds-stress micro-benchmarks.

Sowmini Varadhan (6):
  sock: MSG_PEEK support for sk_error_queue
  skbuff: export mm_[un]account_pinned_pages for other modules
  rds: hold a sock ref from rds_message to the rds_sock
  sock: permit SO_ZEROCOPY on PF_RDS socket
  rds: support for zcopy completion notification
  rds: zerocopy Tx support.

 drivers/net/tun.c        |    2 +-
 include/linux/skbuff.h   |    3 +
 include/net/sock.h       |    2 +-
 include/uapi/linux/rds.h |    1 +
 net/core/skbuff.c        |    6 ++-
 net/core/sock.c          |   14 +++++-
 net/packet/af_packet.c   |    3 +-
 net/rds/af_rds.c         |    3 +
 net/rds/message.c        |  119 ++++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h            |   16 +++++-
 net/rds/recv.c           |    3 +
 net/rds/send.c           |   41 ++++++++++++----
 12 files changed, 192 insertions(+), 21 deletions(-)