[v2,net-next,00/14] Scheduled packet Transmission: ETF

Message ID 20180703224300.25300-1-jesus.sanchez-palencia@intel.com

Message

Jesus Sanchez-Palencia July 3, 2018, 10:42 p.m. UTC
Changes since v1:
  - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
  - sk_clockid was changed from u16 to u8;
  - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
  - the socket option flags are now validated in sock_setsockopt();
  - added SO_EE_ORIGIN_TXTIME;
  - sockc.transmit_time is now initialized from all IPv4 Tx paths;
  - added support for the IPv6 Tx path;


Overview
========

This work consists of a set of kernel interfaces that can be used by
applications that require (time-based) Scheduled Tx of packets.
It comprises 3 new kernel components:

  - SO_TXTIME: socket option + cmsg programming interfaces.

  - etf: the "earliest txtime first" qdisc, that provides per-queue
	 TxTime-based scheduling. This has been renamed from 'tbs' to
	 'etf' to better describe its functionality.

  - taprio: the "time-aware priority scheduler" qdisc, that provides
	    per-port Time-Aware scheduling.

This patchset provides the first 2 components, which have been in
development for longer. The taprio qdisc will be shared as an RFC separately
(shortly).

Note that this series is a follow-up to the "Time based packet
transmission" RFCv3 [1].



etf (formerly known as 'tbs')
=============================

For applications/systems for which the concept of time slices isn't precise
enough, the etf qdisc allows applications to control the instant when
a packet should leave the network controller. When used in conjunction
with taprio, it can also be used when the application needs to control,
with a stronger guarantee, the offset into each time slice at which a
packet will be sent. Another use case for etf is when only a small number
of applications on a system are time sensitive, so it can then be used
with a more traditional root qdisc (like mqprio).

The etf qdisc buffers packets until a configurable time before their
deadline (Tx time). The qdisc uses an rbtree internally so the buffered
packets are always 'ordered' by their txtime (deadline) and are dequeued
earliest txtime first.
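
As a rough user-space illustration of the ordering described above, the
sketch below keeps packets sorted by txtime and always hands out the
earliest one. It is not the qdisc's code: a sorted singly-linked list stands
in for the rbtree, and struct pkt, etf_enqueue() and etf_dequeue() are
hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffered packet; only the txtime matters. */
struct pkt {
	uint64_t txtime_ns;	/* requested transmission time, nanoseconds */
	struct pkt *next;
};

/* Insert p so the list stays sorted by ascending txtime.
 * Packets with equal txtime keep their arrival order. */
void etf_enqueue(struct pkt **head, struct pkt *p)
{
	while (*head && (*head)->txtime_ns <= p->txtime_ns)
		head = &(*head)->next;
	p->next = *head;
	*head = p;
}

/* Remove and return the packet with the earliest txtime, or NULL. */
struct pkt *etf_dequeue(struct pkt **head)
{
	struct pkt *p = *head;

	if (p) {
		*head = p->next;
		p->next = NULL;
	}
	return p;
}
```

A list makes insertion O(n); a balanced tree such as the rbtree mentioned
above keeps it O(log n), which matters once many packets are buffered.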

It relies on the SO_TXTIME API set for receiving the per-packet timestamp
(txtime) as well as the config flags for each socket: the clockid to be
used as a reference, whether the expected mode of txtime for that socket
is deadline or strict mode, and whether packet drops should be reported on
the socket's error queue.
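
A minimal sketch of how an application might use the two halves of this
API: a setsockopt() call for the per-socket config and a control message
for the per-packet txtime. Assumptions are marked in the comments: struct
txtime_cfg and the TXTIME_FLAG_* macros are local mirrors of what I believe
struct sock_txtime and the SOF_TXTIME_* flags look like in net_tstamp.h,
the fallback SO_TXTIME value is the asm-generic one, and enable_txtime()
and attach_txtime() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallbacks for older userspace headers. 61 is the asm-generic value;
 * some architectures use a different number. */
#ifndef SO_TXTIME
#define SO_TXTIME 61
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/* Local mirror (assumption) of struct sock_txtime and its flags. */
struct txtime_cfg {
	int32_t  clockid;	/* reference clock, e.g. CLOCK_TAI */
	uint32_t flags;
};
#define TXTIME_FLAG_DEADLINE_MODE (1u << 0)	/* else strict mode */
#define TXTIME_FLAG_REPORT_ERRORS (1u << 1)	/* drops go to error queue */

/* Per-socket setup. Returns setsockopt()'s result (0, or -1 with errno). */
int enable_txtime(int fd, int32_t clockid, uint32_t flags)
{
	struct txtime_cfg cfg = { .clockid = clockid, .flags = flags };

	return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
}

/* Per-packet setup: attach a 64-bit txtime (nanoseconds on the socket's
 * clock) to msg as a control message, using a caller-provided buffer. */
void attach_txtime(struct msghdr *msg, void *buf, size_t buflen,
		   uint64_t txtime_ns)
{
	struct cmsghdr *cm;

	assert(buflen >= CMSG_SPACE(sizeof(txtime_ns)));
	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));
}
```

After enable_txtime(fd, CLOCK_TAI, TXTIME_FLAG_REPORT_ERRORS) succeeds, each
sendmsg() on that socket would carry a msghdr prepared with attach_txtime().
The setsockopt() may fail with EPERM or ENOPROTOOPT depending on privileges
and kernel version, so the return value should be checked.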

The qdisc will drop any packet with a Tx time in the past, or one that
expires while waiting to be dequeued. Drops can be reported as errors
back to userspace through the socket's error queue.

Example configuration:

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 200000 \
            clockid CLOCK_TAI

Here, the Qdisc will use HW offload for the txtime control.
Packets will be dequeued by the qdisc "delta" (200000) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by hrtimers, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.
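
The timing rule above reduces to simple arithmetic, sketched here under the
assumption that all values are nanoseconds on the same clock. The function
names are hypothetical and this is a simplification of what the qdisc does,
not its code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Instant at which the qdisc would release the packet to the device:
 * delta nanoseconds before its txtime (clamped at zero). */
uint64_t etf_release_time_ns(uint64_t txtime_ns, uint64_t delta_ns)
{
	return txtime_ns > delta_ns ? txtime_ns - delta_ns : 0;
}

/* A packet whose txtime already lies in the past would be dropped. */
bool etf_txtime_expired(uint64_t txtime_ns, uint64_t now_ns)
{
	return txtime_ns < now_ns;
}
```

With the example configuration, a packet stamped with txtime T would be
released at T - 200000 ns, leaving the hardware 200 us to launch it at T.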

A more complete example can be found here, with instructions on how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]


Note that we haven't modified the qdisc to use a timerqueue because the
required modification increased the number of cachelines of an sk_buff.



This series is also hosted on GitHub and can be found at [3].
The companion iproute2 patches can be found at [4].


[1] https://patchwork.ozlabs.org/cover/882342/

[2] GitHub doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f scheduled-tx-tests

[3] https://github.com/jeez/linux/tree/etf-v2

[4] https://github.com/jeez/iproute2/tree/etf-v2



Jesus Sanchez-Palencia (10):
  net: Clear skb->tstamp only on the forwarding path
  net: ipv4: Hook into time based transmission
  net: ipv6: Hook into time based transmission
  net/sched: Add HW offloading capability to ETF
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Only call skb_tx_timestamp after descriptors are ready
  igb: Add support for ETF offload
  net/sched: Make etf report drops on error_queue

Richard Cochran (2):
  net: Add a new socket option for a future transmit time.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the ETF Qdisc

 arch/alpha/include/uapi/asm/socket.h          |   3 +
 arch/ia64/include/uapi/asm/socket.h           |   3 +
 arch/mips/include/uapi/asm/socket.h           |   3 +
 arch/parisc/include/uapi/asm/socket.h         |   3 +
 arch/s390/include/uapi/asm/socket.h           |   3 +
 arch/sparc/include/uapi/asm/socket.h          |   3 +
 arch/xtensa/include/uapi/asm/socket.h         |   3 +
 .../net/ethernet/intel/igb/e1000_defines.h    |  16 +
 drivers/net/ethernet/intel/igb/igb.h          |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 256 ++++++---
 include/linux/netdevice.h                     |   1 +
 include/net/inet_sock.h                       |   1 +
 include/net/pkt_sched.h                       |   7 +
 include/net/sock.h                            |  11 +
 include/uapi/asm-generic/socket.h             |   3 +
 include/uapi/linux/errqueue.h                 |   4 +
 include/uapi/linux/net_tstamp.h               |  18 +
 include/uapi/linux/pkt_sched.h                |  18 +
 net/core/skbuff.c                             |   2 +-
 net/core/sock.c                               |  39 ++
 net/ipv4/icmp.c                               |   2 +
 net/ipv4/ip_output.c                          |   3 +
 net/ipv4/ping.c                               |   1 +
 net/ipv4/raw.c                                |   2 +
 net/ipv4/udp.c                                |   1 +
 net/ipv6/ip6_output.c                         |  11 +-
 net/ipv6/raw.c                                |   7 +-
 net/ipv6/udp.c                                |   1 +
 net/packet/af_packet.c                        |   6 +
 net/sched/Kconfig                             |  11 +
 net/sched/Makefile                            |   1 +
 net/sched/sch_api.c                           |  11 +-
 net/sched/sch_etf.c                           | 484 ++++++++++++++++++
 33 files changed, 864 insertions(+), 75 deletions(-)
 create mode 100644 net/sched/sch_etf.c

Comments

David Miller July 4, 2018, 1:37 p.m. UTC | #1
From: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Date: Tue,  3 Jul 2018 15:42:46 -0700

> Overview
> ========
> 
> This work consists of a set of kernel interfaces that can be used by
> applications that require (time-based) Scheduled Tx of packets.
> It is comprised by 3 new components to the kernel:
> 
>   - SO_TXTIME: socket option + cmsg programming interfaces.
> 
>   - etf: the "earliest txtime first" qdisc, that provides per-queue
> 	 TxTime-based scheduling. This has been renamed from 'tbs' to
> 	 'etf' to better describe its functionality.
> 
>   - taprio: the "time-aware priority scheduler" qdisc, that provides
> 	    per-port Time-Aware scheduling;
> 
> This patchset is providing the first 2 components, which have been
> developed for longer. The taprio qdisc will be shared as an RFC separately
> (shortly).
 ...

I don't have any problems with this, series applied, thanks.

Stephen Hemminger July 6, 2018, 9:38 p.m. UTC | #2
On Tue,  3 Jul 2018 15:42:46 -0700
Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> wrote:

> [...]

Why support different clockids in the API?
I think the clock used in the API should be either nanoseconds or USER_HZ (i.e. 100)
and the kernel components should use ktime. If you need to translate that to some
other value in the hardware driver, then let the device driver do it.

Exposing multiple choices in the userspace API leads to more error paths and does
not provide direct benefits.

Jesus Sanchez-Palencia July 9, 2018, 3:24 p.m. UTC | #3
Hi Stephen,


On 07/06/2018 02:38 PM, Stephen Hemminger wrote:
> On Tue,  3 Jul 2018 15:42:46 -0700
> Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> wrote:
> 
>> [...]
> 
> Why support different clockid's in the API? 
> I think the clock used in API should be either nanoseconds or USER_HZ (ie 100)
> and the kernel components should use ktime. If you need to translate that to some
> other value in the hardware driver,  then let the device driver do it.
> 
> Exposing multiple choices in userspace API, leads to more error paths and does
> not provide direct benefits.


The kernel components already use ktime_t. The clockid_t here defines the
time source (i.e. which clock the ktime must be read from), not the unit
of time.
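
To illustrate the distinction from user space: the sketch below reads any
clock through the same call and always returns nanoseconds, so the unit is
fixed and only the time source changes with the clockid. clock_now_ns() is
a hypothetical helper, and the CLOCK_TAI fallback value is the Linux one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux value; missing from some older headers */
#endif

/* Read the clock selected by id and return its value in nanoseconds,
 * or -1 if the clock cannot be read. */
int64_t clock_now_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

clock_now_ns(CLOCK_TAI) and clock_now_ns(CLOCK_MONOTONIC) both yield
nanoseconds, but from different epochs, which is why a txtime is only
meaningful together with the clockid it was computed against.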

I hope that clarifies.

Regards,
Jesus