From patchwork Wed Mar 7 01:12:12 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jesus Sanchez-Palencia
X-Patchwork-Id: 882342
Return-Path:
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Authentication-Results: ozlabs.org; spf=none (mailfrom)
 smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67;
 helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org;
 receiver=)
Authentication-Results: ozlabs.org;
 dmarc=none (p=none dis=none) header.from=intel.com
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by ozlabs.org (Postfix) with ESMTP id 3zwwh669FDz9sfy
 for ; Wed, 7 Mar 2018 12:16:06 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S933293AbeCGBPB (ORCPT ); Tue, 6 Mar 2018 20:15:01 -0500
Received: from mga09.intel.com ([134.134.136.24]:56493 "EHLO mga09.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S932804AbeCGBPA (ORCPT ); Tue, 6 Mar 2018 20:15:00 -0500
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from orsmga008.jf.intel.com ([10.7.209.65])
 by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 06 Mar 2018 17:14:59 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.47,433,1515484800"; d="scan'208";a="23096770"
Received: from darjeeling.jf.intel.com ([10.24.15.164])
 by orsmga008.jf.intel.com with ESMTP; 06 Mar 2018 17:14:59 -0800
From: Jesus Sanchez-Palencia
To: netdev@vger.kernel.org
Cc: jhs@mojatatu.com, xiyou.wangcong@gmail.com, jiri@resnulli.us,
 vinicius.gomes@intel.com, richardcochran@gmail.com,
 intel-wired-lan@lists.osuosl.org, anna-maria@linutronix.de,
 henrik@austad.us, tglx@linutronix.de, john.stultz@linaro.org,
 levi.pearson@harman.com, edumazet@google.com, willemb@google.com,
 mlichvar@redhat.com
Subject: [RFC v3 net-next 00/18] Time based packet transmission
Date: Tue, 6 Mar 2018 17:12:12 -0800
Message-Id:
 <20180307011230.24001-1-jesus.sanchez-palencia@intel.com>
X-Mailer: git-send-email 2.16.2
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).

It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports a SW best-effort mode that can be
used as a fallback.

The main changes since v2 can be found below.

Fixes since v2:
 - skb->tstamp is only cleared on the forwarding path;
 - ktime_t is no longer the type used for timestamps (s64 is);
 - get_unaligned() is now used for copying data from the cmsg header;
 - added getsockopt() support for SO_TXTIME;
 - restricted SO_TXTIME input range to [0,1];
 - removed ns_capable() check from __sock_cmsg_send();
 - the qdisc control struct now uses a 32-bit bitmap for config flags;
 - fixed qdisc backlog decrement bug;
 - 'overlimits' is now incremented on dequeue() drops in addition to the
   'dropped' counter;

Interface changes since v2:
 * CMSG interface:
   - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
   - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
 * tc-tbs:
   - clockid now receives a string;
     e.g.: CLOCK_REALTIME or /dev/ptp0
   - offload is now a standalone argument (i.e. no more offload 1);
   - sorting is now an argument that enables the txtime based sorting
     provided by the qdisc;

Design changes since v2:
 - Now on the dequeue() path, tbs only drops an expired packet if it has the
   skb->tc_drop_if_late flag set.
   In practical terms, this defines whether the semantics of txtime on a
   system are "not earlier than" or "not later than" a given timestamp;
 - Now on the enqueue() path, the qdisc will drop a packet if its clockid
   doesn't match the qdisc's one;
 - Sorting the packets based on their txtime is now an option for the qdisc.
   Effectively, this means it can be configured in 4 modes: HW offload or
   SW best-effort, sorting enabled or disabled;

The tbs qdisc is designed so it buffers packets until a configurable time
before their deadline (tx times). If sorting is enabled, regardless of HW
offload or SW fallback modes, the qdisc uses an rbtree internally so the
buffered packets are always 'ordered' by the earliest deadline.

If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW
best-effort, it will use a 'scheduled' FIFO.

The other configurable parameter of the tbs qdisc is the clockid to be
used. In order to provide that, this series adds a new API to pkt_sched.h
(i.e. qdisc_watchdog_init_clockid()).

The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance and configuring the delta parameter correctly for the system make
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used,
given that the timestamps won't be used by the NIC in this case.

Examples:

# SW best-effort with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
               clockid CLOCK_REALTIME sorting

In this example, first the mqprio qdisc is set up, then the tbs qdisc is
configured onto the first hw Tx queue using SW best-effort with sorting
enabled.
Also, it is configured so the timestamps on each packet are in reference to
the clockid CLOCK_REALTIME and so packets are dequeued from the qdisc
100000 nanoseconds before their transmission time.

# HW offload without sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. It is implicitly assumed
that the timestamps in skbuffs are in reference to the interface's PHC, and
setting any other valid clockid would be treated as an error. Because there
is no scheduling being performed in the qdisc, setting a delta != 0 would
also be considered an error.

# HW offload with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
               clockid CLOCK_REALTIME sorting

Here, the Qdisc will use HW offload for the txtime control again, but now
sorting will be enabled, and thus there will be scheduling being performed
by the qdisc. That is done based on the clockid CLOCK_REALTIME, and packets
leave the Qdisc "delta" (100000) nanoseconds before their transmission
time. Because this will be using HW offload and since dynamic clocks are
not supported by the hrtimer, the system clock and the PHC clock must be
synchronized for this mode to behave as expected.

For testing, we've followed an approach similar to the v1 and v2 testing,
and no significant changes in the results were observed. An updated version
of udp_tai.c is attached to this cover letter.
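
To make the enqueue()/dequeue() semantics described above easier to follow,
here is a small user-space model in Python. It is only a sketch and not the
code in sch_tbs.c: the names Packet and TbsSketch are made up for this
illustration, a binary heap (heapq) stands in for the rbtree, and time is
passed in explicitly as integer nanoseconds instead of being read from a
clock. It models the sorting-enabled, SW best-effort case and assumes that
a txtime already in the past is rejected on enqueue(), which is one reading
of the drop rule stated above.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Packet:
    txtime: int                 # requested transmission time, in ns
    clockid: str                # e.g. "CLOCK_REALTIME" or "/dev/ptp0"
    drop_if_late: bool = False  # models the per-packet drop_if_late flag


class TbsSketch:
    """Illustrative model of tbs with sorting enabled (hypothetical names)."""

    def __init__(self, clockid: str, delta: int) -> None:
        self.clockid = clockid
        self.delta = delta      # ns before txtime at which a packet may leave
        self._heap = []         # stand-in for the rbtree, earliest txtime first
        self._seq = count()     # tie-breaker so equal txtimes stay FIFO
        self.dropped = 0
        self.overlimits = 0

    def enqueue(self, pkt: Packet, now: int) -> bool:
        # Drop on clockid mismatch, or (assumption) if txtime is already past.
        if pkt.clockid != self.clockid or pkt.txtime < now:
            self.dropped += 1
            return False
        heapq.heappush(self._heap, (pkt.txtime, next(self._seq), pkt))
        return True

    def dequeue(self, now: int) -> Optional[Packet]:
        while self._heap:
            txtime, _, pkt = self._heap[0]
            if now < txtime - self.delta:
                # Head is not due yet; the real qdisc would arm its watchdog.
                return None
            heapq.heappop(self._heap)
            if now > txtime and pkt.drop_if_late:
                # Expired and flagged: count the drop and try the next packet.
                self.dropped += 1
                self.overlimits += 1
                continue
            return pkt
        return None
```

With delta set to 100000, a packet whose txtime is 500000 becomes eligible
for dequeue() at 400000. If dequeue() only runs after 500000, the packet is
dropped when drop_if_late is set ("not later than") and is still handed out
otherwise ("not earlier than").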

Lastly, most of the To Dos we still have before a final patchset are
related to further testing the igb support:
 - testing with L2 only talkers + AF_PACKET sockets;
 - testing tbs in conjunction with cbs;

Thanks for all the feedback so far,
Jesus

Jesus Sanchez-Palencia (12):
  sock: Fix SO_ZEROCOPY switch case
  net: Clear skb->tstamp only on the forwarding path
  posix-timers: Add CLOCKID_INVALID mask
  net: SO_TXTIME: Add clockid and drop_if_late params
  net: ipv4: raw: Handle remaining txtime parameters
  net: ipv4: udp: Handle remaining txtime parameters
  net: packet: Handle remaining txtime parameters
  net/sched: Add HW offloading capability to TBS
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Add support for TBS offload

Richard Cochran (4):
  net: Add a new socket option for a future transmit time.
  net: ipv4: raw: Hook into time based transmission.
  net: ipv4: udp: Hook into time based transmission.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the TBS Qdisc

 arch/alpha/include/uapi/asm/socket.h           |   5 +
 arch/frv/include/uapi/asm/socket.h             |   5 +
 arch/ia64/include/uapi/asm/socket.h            |   5 +
 arch/m32r/include/uapi/asm/socket.h            |   5 +
 arch/mips/include/uapi/asm/socket.h            |   5 +
 arch/mn10300/include/uapi/asm/socket.h         |   5 +
 arch/parisc/include/uapi/asm/socket.h          |   5 +
 arch/s390/include/uapi/asm/socket.h            |   5 +
 arch/sparc/include/uapi/asm/socket.h           |   5 +
 arch/xtensa/include/uapi/asm/socket.h          |   5 +
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +
 drivers/net/ethernet/intel/igb/igb.h           |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 239 +++++++---
 include/linux/netdevice.h                      |   2 +
 include/linux/posix-timers.h                   |   1 +
 include/linux/skbuff.h                         |   3 +
 include/net/pkt_sched.h                        |   7 +
 include/net/sock.h                             |   4 +
 include/uapi/asm-generic/socket.h              |   5 +
 include/uapi/linux/pkt_sched.h                 |  18 +
 net/core/skbuff.c                              |   1 -
 net/core/sock.c                                |  44 +-
 net/ipv4/raw.c                                 |   7 +
 net/ipv4/udp.c                                 |  10 +-
 net/packet/af_packet.c                         |  19 +
 net/sched/Kconfig                              |  11 +
 net/sched/Makefile                             |   1 +
 net/sched/sch_api.c                            |  11 +-
 net/sched/sch_tbs.c                            | 591 +++++++++++++++++++++++++
 29 files changed, 978 insertions(+), 63 deletions(-)
 create mode 100644 net/sched/sch_tbs.c