
[RFC,net-next,0/5] TSN: Add qdisc-based config interfaces for traffic shapers

Message ID 20170901012625.14838-1-vinicius.gomes@intel.com

Message

Vinicius Costa Gomes Sept. 1, 2017, 1:26 a.m. UTC
Hi,

This patchset is an RFC on a proposal of how the Traffic Control subsystem can
be used to offload the configuration of traffic shapers into network devices
that provide support for them in HW. Our goal here is to start upstreaming
support for features related to the Time-Sensitive Networking (TSN) set of
standards into the kernel.

As part of this work, we've assessed previous public discussions related to TSN
enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

Please note that the patches provided as part of this RFC implement only what
is needed for 802.1Qav (FQTSS), but we'd like to take advantage of this
discussion and share our WIP ideas for the 802.1Qbv and 802.1Qbu interfaces
as well. The current patches only provide support for HW offload of the
configs.


Overview
========

Time-sensitive Networking (TSN) is a set of standards that aim to address
resource availability for providing bandwidth reservation and bounded latency
on Ethernet-based LANs. The proposal described here aims to cover mainly what
is needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
802.1Qbu.

The initial target of this work is the Intel i210 NIC, but other controllers'
datasheets were also taken into account, like the Renesas RZ/A1H and RZ/A1M
group and the Synopsys DesignWare Ethernet QoS controller.


Proposal
========

Feature-wise, what is covered here are configuration interfaces for HW
implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
(802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
Qbv and Qbu must be configured per port, with the configuration covering all
queues. Given that these features are related to traffic shaping, and that the
traffic control subsystem already provides a queueing discipline that offloads
config into the device driver (i.e. mqprio), designing new qdiscs for the
specific purpose of offloading the config for each shaper seemed like a good
fit.

For steering traffic into the correct queues, we use the socket option
SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues.
The qdisc mqprio is currently used in our tests.
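
As a reference for how an application might tag its traffic, here is a
minimal sketch in Python (the address, port and payload are made up for
illustration; the actual samples in this series are the C programs under
samples/tsn/):

    import socket

    # Tag this socket's traffic with priority 3, which the mqprio
    # mapping used in "Testing this RFC" below steers to traffic
    # class 0 (class A) / hardware queue 0.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, 3)
    sock.sendto(b'payload', ('192.168.0.2', 4321))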

As for the shapers config interface:

 * CBS (802.1Qav)

   This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
   $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
     idleslope I

   Note that the parameters for this qdisc are the ones defined by the
   802.1Q-2014 spec, so no hardware specific functionality is exposed here.


 * Time-aware shaper (802.1Qbv):

   The idea we are currently exploring is to add a "time-aware", priority based
   qdisc, that also exposes the Tx queues available and provides a mechanism for
   mapping priority <-> traffic class <-> Tx queues in a similar fashion as
   mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
        map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
        queues 0 1 2 3 \
        sched-file gates.sched [base-time <interval>] \
        [cycle-time <interval>] [extension-time <interval>]

   <file> is multi-line, with each line being of the following format:
   <cmd> <gate mask> <interval in nanoseconds>

   Qbv only defines one <cmd>: "S" for 'SetGates'

   For example:

   S 0x01 300
   S 0x03 500

   This means that there are two intervals, the first will have the gate
   for traffic class 0 open for 300 nanoseconds, the second will have
   both traffic classes open for 500 nanoseconds.
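
   To make the gate mask semantics concrete, here is a minimal, hypothetical
   sketch (not part of this series) of how such a schedule file could be
   interpreted, with bit N of the mask controlling the gate of traffic
   class N:

       # Hypothetical reader for the sched-file format described above.
       # Each line: <cmd> <gate mask> <interval in nanoseconds>.
       def parse_sched(path):
           entries = []
           with open(path) as f:
               for line in f:
                   cmd, mask, interval = line.split()
                   entries.append((cmd, int(mask, 16), int(interval)))
           return entries

       for cmd, mask, interval in parse_sched('gates.sched'):
           tcs = [tc for tc in range(8) if mask & (1 << tc)]
           print(cmd, 'opens gates of traffic classes', tcs, 'for', interval, 'ns')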

   Additionally, an option to set just one entry of the gate control list will
   also be provided by 'taprio':

   $ tc qdisc (...) \
        sched-row <row number> <cmd> <gate mask> <interval>  \
        [base-time <interval>] [cycle-time <interval>] \
        [extension-time <interval>]


 * Frame Preemption (802.1Qbu):

   To control latency even further, it may prove useful to signal which
   traffic classes are marked as preemptable. For that, 'taprio' provides the
   preemption command so you can set each traffic class as preemptable or not:

   $ tc qdisc (...) \
        preemption 0 1 1 1


 * Time-aware shaper + Preemption:

   As an example of how Qbv and Qbu can be used together, we may provide
   both the schedule and the preempt-mask; this way we can also express
   the Set-Gates-and-Hold and Set-Gates-and-Release commands defined in
   the Qbu spec:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
        map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
        queues 0 1 2 3 \
        preemption 0 1 1 1 \
        sched-file preempt_gates.sched

    <file> is multi-line, with each line being of the following format:
    <cmd> <gate mask> <interval in nanoseconds>

    For this case, two new commands are introduced:

    "H" for 'set gates and hold'
    "R" for 'set gates and release'

    H 0x01 300
    R 0x03 500



Testing this RFC
================

For testing the patches of this RFC only, you can refer to the samples and
helper script being added to samples/tsn/ and use the 'mqprio' qdisc to
set up the priorities to Tx queues mapping, together with the 'cbs' qdisc to
configure the HW shaper of the i210 controller:

1) Setup priorities to traffic classes to hardware queues mapping
$ tc qdisc replace dev enp3s0 parent root mqprio num_tc 3 \
     map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

2) Check the scheme. You want to get the inner qdiscs' IDs from the bottom up
$ tc -g  class show dev enp3s0

Ex.:
+---(802a:3) mqprio
|    +---(802a:6) mqprio
|    +---(802a:7) mqprio
|
+---(802a:2) mqprio
|    +---(802a:5) mqprio
|
+---(802a:1) mqprio
     +---(802a:4) mqprio

 * Here '802a:4' is Tx Queue #0 and '802a:5' is Tx Queue #1.

3) Calculate CBS parameters for classes A and B, e.g. BW for A is 20Mbps and
   for B is 10Mbps:
$ ./samples/tsn/calculate_cbs_params.py -A 20000 -a 1500 -B 10000 -b 1500
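
For reference, a simplified sketch of the class A arithmetic from 802.1Q-2014
Annex L (assuming a 1Gbps link and 1500-byte maximum frames, with worst-case
interference of one max-sized frame; this is not the actual script, which also
derives the class B credits while accounting for class A interference):

    # Sketch of the Annex L math for class A only.
    port_rate = 1000000   # portTransmitRate in kbit/s (1Gbps)
    idleslope = 20000     # reserved class A bandwidth in kbit/s (20Mbps)
    max_frame = 1500      # maximum frame size in bytes

    sendslope = idleslope - port_rate              # -980000 kbit/s
    locredit = max_frame * sendslope // port_rate  # -1470
    hicredit = max_frame * idleslope // port_rate  # 30

    print(sendslope, locredit, hicredit)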

4) Configure CBS for traffic class A (priority 3) as provided by the script:
$ tc qdisc replace dev enp3s0 parent 802a:4 cbs locredit -1470 \
     hicredit 30 sendslope -980000 idleslope 20000

5) Configure CBS for traffic class B (priority 2):
$ tc qdisc replace dev enp3s0 parent 802a:5 cbs \
     locredit -1485 hicredit 31 sendslope -990000 idleslope 10000

6) Run Listener, compiled from samples/tsn/listener.c
$ ./listener -i enp3s0

7) Run Talker for class A (prio 3 here), compiled from samples/tsn/talker.c
$ ./talker -i enp3s0 -p 3

 * The bandwidth displayed on the listener output at this stage should be very
   close to the one configured for class A.

8) You can also run a Talker for class B (prio 2 here)
$ ./talker -i enp3s0 -p 2

 * The bandwidth displayed on the listener output now should increase to very
   close to the one configured for class A + class B.

Authors
=======
 - Andre Guedes <andre.guedes@intel.com>
 - Ivan Briano <ivan.briano@intel.com>
 - Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
 - Vinicius Gomes <vinicius.gomes@intel.com>


Andre Guedes (2):
  igb: Add support for CBS offload
  samples/tsn: Add script for calculating CBS config

Jesus Sanchez-Palencia (1):
  sample: Add TSN Talker and Listener examples

Vinicius Costa Gomes (2):
  net/sched: Introduce the user API for the CBS shaper
  net/sched: Introduce Credit Based Shaper (CBS) qdisc

 drivers/net/ethernet/intel/igb/e1000_defines.h |  23 ++
 drivers/net/ethernet/intel/igb/e1000_regs.h    |   8 +
 drivers/net/ethernet/intel/igb/igb.h           |   6 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 349 +++++++++++++++++++++++++
 include/linux/netdevice.h                      |   1 +
 include/uapi/linux/pkt_sched.h                 |  29 ++
 net/sched/Kconfig                              |  11 +
 net/sched/Makefile                             |   1 +
 net/sched/sch_cbs.c                            | 286 ++++++++++++++++++++
 samples/tsn/calculate_cbs_params.py            | 112 ++++++++
 samples/tsn/listener.c                         | 254 ++++++++++++++++++
 samples/tsn/talker.c                           | 136 ++++++++++
 12 files changed, 1216 insertions(+)
 create mode 100644 net/sched/sch_cbs.c
 create mode 100755 samples/tsn/calculate_cbs_params.py
 create mode 100644 samples/tsn/listener.c
 create mode 100644 samples/tsn/talker.c

--
2.14.1

Comments

Richard Cochran Sept. 1, 2017, 1:03 p.m. UTC | #1
I'm happy to see this posted.  At first glance, it seems like a step in
the right direction.

On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>  * Time-aware shaper (802.1Qbv):
...
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.

The i210 doesn't support this in HW, or does it?

>  * Frame Preemption (802.1Qbu):
> 
>    To control latency even further, it may prove useful to signal which
>    traffic classes are marked as preemptable. For that, 'taprio' provides the
>    preemption command so you can set each traffic class as preemptable or not:
> 
>    $ tc qdisc (...) \
>         preemption 0 1 1 1

Neither can the i210 preempt frames, or what am I missing?

The timing of this RFC is good, as I am just finishing up an RFC that
implements time-based transmit using the i210.  I'll try and get that
out ASAP.

Thanks,
Richard
Jesus Sanchez-Palencia Sept. 1, 2017, 4:12 p.m. UTC | #2
Hi Richard,


On 09/01/2017 06:03 AM, Richard Cochran wrote:
> 
> I'm happy to see this posted.  At first glance, it seems like a step in
> the right direction.
> 
> On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>>  * Time-aware shaper (802.1Qbv):
> ...
>>    S 0x01 300
>>    S 0x03 500
>>
>>    This means that there are two intervals, the first will have the gate
>>    for traffic class 0 open for 300 nanoseconds, the second will have
>>    both traffic classes open for 500 nanoseconds.
> 
> The i210 doesn't support this in HW, or does it?


No, it does not. i210 only provides support for a per-packet feature called
LaunchTime that can be used to control both the fetch and the transmission time of
packets.


> 
>>  * Frame Preemption (802.1Qbu):
>>
>>    To control latency even further, it may prove useful to signal which
>>    traffic classes are marked as preemptable. For that, 'taprio' provides the
>>    preemption command so you can set each traffic class as preemptable or not:
>>
>>    $ tc qdisc (...) \
>>         preemption 0 1 1 1
> 
> Neither can the i210 preempt frames, or what am I missing?

No, it does not.

But when we started working on the shapers we decided to look ahead and try to
come up with interfaces that could cover beyond 802.1Qav. These are just some
ideas we've been prototyping here together with the 'cbs' qdisc.


> 
> The timing of this RFC is good, as I am just finishing up an RFC that
> implements time-based transmit using the i210.  I'll try and get that
> out ASAP.


Is it correct to assume you are referring to an interface for Launchtime here?


Thanks,
Jesus
Richard Cochran Sept. 1, 2017, 4:53 p.m. UTC | #3
On Fri, Sep 01, 2017 at 09:12:17AM -0700, Jesus Sanchez-Palencia wrote:
> Is it correct to assume you are referring to an interface for Launchtime here?

Yes.

Thanks,
Richard
Richard Cochran Sept. 5, 2017, 7:20 a.m. UTC | #4
On Fri, Sep 01, 2017 at 09:12:17AM -0700, Jesus Sanchez-Palencia wrote:
> On 09/01/2017 06:03 AM, Richard Cochran wrote:
> > The timing of this RFC is good, as I am just finishing up an RFC that
> > implements time-based transmit using the i210.  I'll try and get that
> > out ASAP.

I have an RFC series ready for net-next, but the merge window just
started.  I'll post it when the window closes again...

Thanks,
Richard
Henrik Austad Sept. 7, 2017, 5:34 a.m. UTC | #5
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> Hi,
> 
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.

Nice to see that others are working on this as well! :)

A short disclaimer; I'm pretty much anchored in the view "linux is the 
end-station in a TSN domain", is this your approach as well, or are you 
looking at this driver to be used in bridges as well? (because that will 
affect the comments on time-aware shaper and frame preemption)

Yet another disclaimer; I am not a linux networking subsystem expert. Not 
by a long shot! There is black magic happening in the internals of the
networking subsystem that I am not even aware of. So if something I say or 
ask does not make sense _at_all_, that's probably why..

I do know a tiny bit about TSN though, and I have been messing around 
with it for a little while, hence my comments below

> As part of this work, we've assessed previous public discussions related to TSN
> enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
> at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
> the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

/me eyes Cc ;p

> Overview
> ========
> 
> Time-sensitive Networking (TSN) is a set of standards that aim to address
> resource availability for providing bandwidth reservation and bounded latency
> on Ethernet-based LANs. The proposal described here aims to cover mainly what
> is needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
> 802.1Qbu.
> 
> The initial target of this work is the Intel i210 NIC, but other controllers'
> datasheets were also taken into account, like the Renesas RZ/A1H and RZ/A1M
> group and the Synopsys DesignWare Ethernet QoS controller.

NXP has a TSN aware chip on the i.MX7 sabre board as well </fyi>

> Proposal
> ========
> 
> Feature-wise, what is covered here are configuration interfaces for HW
> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
> Qbv and Qbu must be configured per port, with the configuration covering all
> queues. Given that these features are related to traffic shaping, and that the
> traffic control subsystem already provides a queueing discipline that offloads
> config into the device driver (i.e. mqprio), designing new qdiscs for the
> specific purpose of offloading the config for each shaper seemed like a good
> fit.

just to be clear, you register sch_cbs as a subclass to mqprio, not as a 
root class?

> For steering traffic into the correct queues, we use the socket option
> SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues.
> The qdisc mqprio is currently used in our tests.

Right, fair enough, I'd prefer the TSN qdisc to be the root-device and 
rather have mqprio for high priority traffic and another for 'everything 
else', but this would work too. This is not that relevant at this stage, I
guess :)

> As for the shapers config interface:
> 
>  * CBS (802.1Qav)
> 
>    This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
>    $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
>      idleslope I

So this confuses me a bit, why specify sendSlope?

    sendSlope = idleSlope - portTransmitRate

and portTransmitRate is the speed of the MAC (which you get from the 
driver). Adding sendSlope here is just redundant I think.

Also, does this mean that when you create the qdisc, you have locked the 
bandwidth for the scheduler? Meaning, if I later want to add another 
stream that requires more bandwidth, I have to close all active streams, 
reconfigure the qdisc and then restart?

>    Note that the parameters for this qdisc are the ones defined by the
>    802.1Q-2014 spec, so no hardware specific functionality is exposed here.

You do need to know if the link is brought up as 100 or 1000 though - which 
the driver already knows.

>  * Time-aware shaper (802.1Qbv):
> 
>    The idea we are currently exploring is to add a "time-aware", priority based
>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

As far as I know, this is not supported by i210, and if time-aware shaping 
is enabled in the network - you'll be queued on a bridge until the window 
opens as time-aware shaping is enforced on the tx-port and not on rx. Is 
this required in this driver?

>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>         queues 0 1 2 3 \
>         sched-file gates.sched [base-time <interval>] \
>         [cycle-time <interval>] [extension-time <interval>]

That was a lot of priorities! 802.1Q lists 8 priorities, where do these
16 come from?

You map pri 0,1 to queue 2, pri 2 to queue 1 (Class B), pri 3 to queue 0 
(class A) and everything else to queue 3. This is what I would expect,
except for the additional 8 priorities.

>    <file> is multi-line, with each line being of the following format:
>    <cmd> <gate mask> <interval in nanoseconds>
> 
>    Qbv only defines one <cmd>: "S" for 'SetGates'
> 
>    For example:
> 
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.

Are you aware of any hw except dedicated switching stuff that supports 
this? (meant as "I'm curious and would like to know")

>    Additionally, an option to set just one entry of the gate control list will
>    also be provided by 'taprio':
> 
>    $ tc qdisc (...) \
>         sched-row <row number> <cmd> <gate mask> <interval>  \
>         [base-time <interval>] [cycle-time <interval>] \
>         [extension-time <interval>]
> 
> 
>  * Frame Preemption (802.1Qbu):

So Frame preemption is nice, but my understanding of Qbu is that the real
benefit is at the bridges and not in the endpoints. As jumbo-frames are
explicitly disallowed in Qav, the maximum latency incurred by a frame in
flight is 12us on a 1Gbps link. I am not sure these 12us will be the main
delay in your application.

Or have I missed some crucial point here?
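
For reference, the 12us is just the wire time of one maximum-sized non-jumbo
frame on a 1Gbps link:

    # Worst-case blocking by a 1500-byte frame already on the wire:
    print(1500 * 8 / 1e9)  # 1.2e-05 s, i.e. 12 microseconds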

>    To control latency even further, it may prove useful to signal which
>    traffic classes are marked as preemptable. For that, 'taprio' provides the
>    preemption command so you can set each traffic class as preemptable or not:
> 
>    $ tc qdisc (...) \
>         preemption 0 1 1 1
> 
>  * Time-aware shaper + Preemption:
> 
>    As an example of how Qbv and Qbu can be used together, we may provide
>    both the schedule and the preempt-mask; this way we can also express
>    the Set-Gates-and-Hold and Set-Gates-and-Release commands defined in
>    the Qbu spec:
> 
>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>         queues 0 1 2 3 \
>         preemption 0 1 1 1 \
>         sched-file preempt_gates.sched
> 
>     <file> is multi-line, with each line being of the following format:
>     <cmd> <gate mask> <interval in nanoseconds>
> 
>     For this case, two new commands are introduced:
> 
>     "H" for 'set gates and hold'
>     "R" for 'set gates and release'
> 
>     H 0x01 300
>     R 0x03 500

So my understanding of all of this is that you configure the *total* 
bandwidth for each class when you load the qdisc and then let userspace
handle the rest. Is this correct?

In my view, it would be nice if the qdisc had some notion about streams so 
that you could create a stream, feed frames to it and let the driver pace 
them out. (The fewer you queue, the shorter the delay). This will also 
allow you to enforce per-stream bandwidth restrictions. I don't see how you 
can do this here unless you want to do this in userspace.

Do you have any plans for adding support for multiplexing streams? If you 
have multiple streams, how do you enforce that one stream does not eat into 
the bandwidth of another stream? AFAIK, this is something the network must 
enforce, but I see no option of doing so here.

> Testing this RFC
> ================
> 
> For testing the patches of this RFC only, you can refer to the samples and
> helper script being added to samples/tsn/ and use the 'mqprio' qdisc to
> set up the priorities to Tx queues mapping, together with the 'cbs' qdisc to
> configure the HW shaper of the i210 controller:

I will test it, feedback will be provided soon! :)

Thanks!
-Henrik

> 1) Setup priorities to traffic classes to hardware queues mapping
> $ tc qdisc replace dev enp3s0 parent root mqprio num_tc 3 \
>      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> 2) Check the scheme. You want to get the inner qdiscs' IDs from the bottom up
> $ tc -g  class show dev enp3s0
> 
> Ex.:
> +---(802a:3) mqprio
> |    +---(802a:6) mqprio
> |    +---(802a:7) mqprio
> |
> +---(802a:2) mqprio
> |    +---(802a:5) mqprio
> |
> +---(802a:1) mqprio
>      +---(802a:4) mqprio
> 
>  * Here '802a:4' is Tx Queue #0 and '802a:5' is Tx Queue #1.
> 
> 3) Calculate CBS parameters for classes A and B, e.g. BW for A is 20Mbps and
>    for B is 10Mbps:
> $ ./samples/tsn/calculate_cbs_params.py -A 20000 -a 1500 -B 10000 -b 1500
> 
> 4) Configure CBS for traffic class A (priority 3) as provided by the script:
> $ tc qdisc replace dev enp3s0 parent 802a:4 cbs locredit -1470 \
>      hicredit 30 sendslope -980000 idleslope 20000
> 
> 5) Configure CBS for traffic class B (priority 2):
> $ tc qdisc replace dev enp3s0 parent 802a:5 cbs \
>      locredit -1485 hicredit 31 sendslope -990000 idleslope 10000
> 
> 6) Run Listener, compiled from samples/tsn/listener.c
> $ ./listener -i enp3s0
> 
> 7) Run Talker for class A (prio 3 here), compiled from samples/tsn/talker.c
> $ ./talker -i enp3s0 -p 3
> 
>  * The bandwidth displayed on the listener output at this stage should be very
>    close to the one configured for class A.
> 
> 8) You can also run a Talker for class B (prio 2 here)
> $ ./talker -i enp3s0 -p 2
> 
>  * The bandwidth displayed on the listener output now should increase to very
>    close to the one configured for class A + class B.

Because you grab both class A *and* B, or because B will eat what A does 
not use?

-H

> Authors
> =======
>  - Andre Guedes <andre.guedes@intel.com>
>  - Ivan Briano <ivan.briano@intel.com>
>  - Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>  - Vinicius Gomes <vinicius.gomes@intel.com>
> 
> 
> Andre Guedes (2):
>   igb: Add support for CBS offload
>   samples/tsn: Add script for calculating CBS config
> 
> Jesus Sanchez-Palencia (1):
>   sample: Add TSN Talker and Listener examples
> 
> Vinicius Costa Gomes (2):
>   net/sched: Introduce the user API for the CBS shaper
>   net/sched: Introduce Credit Based Shaper (CBS) qdisc
> 
>  drivers/net/ethernet/intel/igb/e1000_defines.h |  23 ++
>  drivers/net/ethernet/intel/igb/e1000_regs.h    |   8 +
>  drivers/net/ethernet/intel/igb/igb.h           |   6 +
>  drivers/net/ethernet/intel/igb/igb_main.c      | 349 +++++++++++++++++++++++++
>  include/linux/netdevice.h                      |   1 +
>  include/uapi/linux/pkt_sched.h                 |  29 ++
>  net/sched/Kconfig                              |  11 +
>  net/sched/Makefile                             |   1 +
>  net/sched/sch_cbs.c                            | 286 ++++++++++++++++++++
>  samples/tsn/calculate_cbs_params.py            | 112 ++++++++
>  samples/tsn/listener.c                         | 254 ++++++++++++++++++
>  samples/tsn/talker.c                           | 136 ++++++++++
>  12 files changed, 1216 insertions(+)
>  create mode 100644 net/sched/sch_cbs.c
>  create mode 100755 samples/tsn/calculate_cbs_params.py
>  create mode 100644 samples/tsn/listener.c
>  create mode 100644 samples/tsn/talker.c
> 
> --
> 2.14.1
Richard Cochran Sept. 7, 2017, 12:40 p.m. UTC | #6
On Thu, Sep 07, 2017 at 07:34:11AM +0200, Henrik Austad wrote:
> Also, does this mean that when you create the qdisc, you have locked the 
> bandwidth for the scheduler? Meaning, if I later want to add another 
> stream that requires more bandwidth, I have to close all active streams, 
> reconfigure the qdisc and then restart?

No, just allocate enough bandwidth to accommodate all of the expected
streams.  The streams can start and stop at will.

> So my understanding of all of this is that you configure the *total* 
> bandwidth for each class when you load the qdisc and then let userspace
> handle the rest. Is this correct?

Nothing wrong with that.
 
> In my view, it would be nice if the qdisc had some notion about streams so 
> that you could create a stream, feed frames to it and let the driver pace 
> them out. (The fewer you queue, the shorter the delay). This will also 
> allow you to enforce per-stream bandwidth restrictions. I don't see how you 
> can do this here unless you want to do this in userspace.
> 
> Do you have any plans for adding support for multiplexing streams? If you 
> have multiple streams, how do you enforce that one stream does not eat into 
> the bandwidth of another stream? AFAIK, this is something the network must 
> enforce, but I see no option of doing so here.

Please, let's keep this simple.  Today we have exactly zero user space
applications using this kind of bandwidth reservation.  The case of
wanting the kernel to police individual stream usage does not exist,
and probably never will.

For serious TSN use cases, the bandwidth needed by each system and
indeed the entire network will be engineered, and we can reasonably
expect applications to cooperate in this regard.

Thanks,
Richard
Henrik Austad Sept. 7, 2017, 3:27 p.m. UTC | #7
On Thu, Sep 07, 2017 at 02:40:18PM +0200, Richard Cochran wrote:
> On Thu, Sep 07, 2017 at 07:34:11AM +0200, Henrik Austad wrote:
> > Also, does this mean that when you create the qdisc, you have locked the 
> > bandwidth for the scheduler? Meaning, if I later want to add another 
> > stream that requires more bandwidth, I have to close all active streams, 
> > reconfigure the qdisc and then restart?
> 
> No, just allocate enough bandwidth to accommodate all of the expected
> streams.  The streams can start and stop at will.

Sure, that'll work.

And if you want this driver to act as a bridge, how do you accommodate
changes in network requirements? (i.e. how does this work with switchdev?)
- Or am I overthinking this?

> > So my understanding of all of this is that you configure the *total* 
> > bandwidth for each class when you load the qdisc and then let userspace
> > handle the rest. Is this correct?
> 
> Nothing wrong with that.

Didn't mean to say it was wrong, just making sure I've understood the 
concept.

> > In my view, it would be nice if the qdisc had some notion about streams so 
> > that you could create a stream, feed frames to it and let the driver pace 
> > them out. (The fewer you queue, the shorter the delay). This will also 
> > allow you to enforce per-stream bandwidth restrictions. I don't see how you 
> > can do this here unless you want to do this in userspace.
> > 
> > Do you have any plans for adding support for multiplexing streams? If you 
> > have multiple streams, how do you enforce that one stream does not eat into 
> > the bandwidth of another stream? AFAIK, this is something the network must 
> > enforce, but I see no option of doing so here.
> 
> Please, let's keep this simple.

Simple is always good

> Today we have exactly zero user space
> applications using this kind of bandwidth reservation.  The case of
> wanting the kernel to police individual stream usage does not exist,
> and probably never will.

That we have *zero* userspace applications today is probably related to the 
fact that we have exactly *zero* drivers in the kernel that talk TSN :)

To rephrase a bit, what I'm worried about:

If you have more than 1 application in userspace that wants to send data 
using this scheduler, how do you ensure fair transmission of frames? (both 
how much bandwidth they use, but also ordering of frames from each 
application) Do you expect all of this to be handled in userspace?

> For serious TSN use cases, the bandwidth needed by each system and
> indeed the entire network will be engineered, and we can reasonably
> expect applications to cooperate in this regard.

yes.. that'll happen ;)

> Thanks,
> Richard

Don't get me wrong, I think it is great that others are working on this!
I'm just trying to fully understand the thought that has gone into this
and how it is intended to be used.

I'll get busy testing the code and wrapping my head around the different 
parameters.
Richard Cochran Sept. 7, 2017, 3:53 p.m. UTC | #8
On Thu, Sep 07, 2017 at 05:27:51PM +0200, Henrik Austad wrote:
> On Thu, Sep 07, 2017 at 02:40:18PM +0200, Richard Cochran wrote:
> And if you want this driver to act as a bridge, how do you accommodate
> changes in network requirements? (i.e. how does this work with switchdev?)

To my understanding, this Qdisc idea provides QoS for the host's
transmitted traffic, and nothing more.

> - Or am I overthinking this?

Being able to configure the external ports of a switchdev is probably
a nice feature, but that is another story.  (But maybe I misunderstood
the authors' intent!)

> If you have more than 1 application in userspace that wants to send data 
> using this scheduler, how do you ensure fair transmission of frames? (both 
> how much bandwidth they use,

There are many ways to handle this, and we shouldn't put any of that
policy into the kernel.  For example, there might be a monolithic
application with configurable threads, or an allocation server that
grants bandwidth to applications via IPC, or a multiplexing stream
server like jack, pulse, etc, and so on...

> but also ordering of frames from each application)

Not sure what you mean by this.

> Do you expect all of this to be handled in userspace?

Yes, I do.

Thanks,
Richard
Henrik Austad Sept. 7, 2017, 4:18 p.m. UTC | #9
On Thu, Sep 07, 2017 at 05:53:15PM +0200, Richard Cochran wrote:
> On Thu, Sep 07, 2017 at 05:27:51PM +0200, Henrik Austad wrote:
> > On Thu, Sep 07, 2017 at 02:40:18PM +0200, Richard Cochran wrote:
> > And if you want this driver to act as a bridge, how do you accommodate
> > changes in network requirements? (i.e. how does this work with switchdev?)
> 
> To my understanding, this Qdisc idea provides QoS for the host's
> transmitted traffic, and nothing more.

Ok, then we're on the same page.

> > - Or am I overthinking this?
> 
> Being able to configure the external ports of a switchdev is probably
> a nice feature, but that is another story.  (But maybe I misunderstood
> the authors' intent!)

ok, chalk that one up for later perhaps

> > If you have more than 1 application in userspace that wants to send data 
> > using this scheduler, how do you ensure fair transmission of frames? (both 
> > how much bandwidth they use,
> 
> There are many ways to handle this, and we shouldn't put any of that
> policy into the kernel.  For example, there might be a monolithic
> application with configurable threads, or an allocation server that
> grants bandwidth to applications via IPC, or a multiplexing stream
> server like jack, pulse, etc, and so on...

true

> > but also ordering of frames from each application)
> 
> Not sure what you mean by this.

Fair enough, I'm not that good at making myself clear :)

Let's see if I can make a better attempt:

If you have 2 separate applications that have their own streams going to 
different endpoints - but both are in the same class, then they will 
share the qdisc bandwidth.

So application 
- A sends frame A1, A2, A3, .. An
- B sends B1, B2, .. Bn

What I was trying to describe was: if application A send 2 frames, and B 
sends 2 frames at the same time, then you would hope that the order would 
be A1, B1, A2, B2, and not A1, A2, B1, B2.

None of this would be a problem if you expect a *single* user, like the 
allocation server you described above. Again, I think this is just me 
overthinking the problem right now :)

> > Do you expect all of this to be handled in userspace?
> 
> Yes, I do.

ok, fair enough

Thanks for answering my questions!
Andre Guedes Sept. 7, 2017, 7:58 p.m. UTC | #10
Hi Henrik,

Thanks for your feedback! I'll address some of your comments below.

On Thu, 2017-09-07 at 07:34 +0200, Henrik Austad wrote:
> > As for the shapers config interface:
> > 
> >  * CBS (802.1Qav)
> > 
> >    This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line
> > is:
> >    $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S
> > \
> >      idleslope I
> 
> So this confuses me a bit, why specify sendSlope?
> 
>     sendSlope = idleSlope - portTransmitRate
> 
> and portTransmitRate is the speed of the MAC (which you get from the 
> driver). Adding sendSlope here is just redundant I think.

Yes, this was something we spent quite some time discussing before this RFC
series. After reading Annex L from 802.1Q-2014 (operation of the CBS algorithm)
so many times, we came up with the rationale explained below.

The rationale here is that sendSlope is just another parameter of the CBS
algorithm, like idleSlope, hiCredit and loCredit. As such, its calculation
should be done at the same "layer" as the other parameters (in this case, user
space) in order to keep consistency. Moreover, in this design, the driver layer
is dead simple: all the device driver has to do is apply the CBS parameters to
the hardware. Having any CBS parameter calculation in the driver layer means
all device drivers must implement that calculation.
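
For instance, plugging in the class A numbers from the cover letter (assuming
a 1Gbps link), the relation is easy to check:

    # sendSlope is fully determined by idleSlope and the link speed:
    idleslope = 20000              # kbit/s (20Mbps for class A)
    port_transmit_rate = 1000000   # kbit/s (1Gbps link)
    print(idleslope - port_transmit_rate)  # -980000, as passed to 'tc'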

> Also, does this mean that when you create the qdisc, you have locked the 
> bandwidth for the scheduler? Meaning, if I later want to add another 
> stream that requires more bandwidth, I have to close all active streams, 
> reconfigure the qdisc and then restart?

If we want to reserve more bandwidth to "accommodate" a new stream, we don't
need to close all active streams. All we have to do is change the CBS qdisc
and pass the new CBS parameters. Here is what the command-line would look like:

$ tc qdisc change dev enp0s4 parent 8001:5 cbs locredit -1470 hicredit 30 \
     sendslope -980000 idleslope 20000

No application/stream is interrupted while new CBS parameters are applied.

> >    Note that the parameters for this qdisc are the ones defined by the
> >    802.1Q-2014 spec, so no hardware specific functionality is exposed here.
> 
> You do need to know if the link is brought up as 100 or 1000 though - which 
> the driver already knows.

User space knows that information via ethtool or /sys.

> > Testing this RFC
> > ================
> > 
> > For testing the patches of this RFC only, you can refer to the samples and
> > helper script being added to samples/tsn/ and use the 'mqprio' qdisc to
> > set up the priorities to Tx queues mapping, together with the 'cbs' qdisc to
> > configure the HW shaper of the i210 controller:
> 
> I will test it, feedback will be provided soon! :)

That's great! Please let us know if you find any issues, and thanks for your
support.

> > 8) You can also run a Talker for class B (prio 2 here)
> > $ ./talker -i enp3s0 -p 2
> > 
> >  * The bandwidth displayed on the listener output now should increase to
> > very
> >    close to the one configured for class A + class B.
> 
> Because you grab both class A *and* B, or because B will eat what A does 
> not use?

Because the listener application grabs both class A and B traffic.

Regards,

Andre
Andre Guedes Sept. 7, 2017, 9:51 p.m. UTC | #11
On Thu, 2017-09-07 at 18:18 +0200, Henrik Austad wrote:
> On Thu, Sep 07, 2017 at 05:53:15PM +0200, Richard Cochran wrote:
> > On Thu, Sep 07, 2017 at 05:27:51PM +0200, Henrik Austad wrote:
> > > On Thu, Sep 07, 2017 at 02:40:18PM +0200, Richard Cochran wrote:
> > > And if you want this driver to act as a bridge, how do you accommodate
> > > changes in network requirements? (i.e. how does this work with switchdev?)
> > 
> > To my understanding, this Qdisc idea provides QoS for the host's
> > transmitted traffic, and nothing more.
> 
> Ok, then we're on the same page.
> 
> > > - Or am I overthinking this?
> > 
> > Being able to configure the external ports of a switchdev is probably
> > a nice feature, but that is another story.  (But maybe I misunderstood
> > the authors' intent!)
> 
> ok, chalk that one up for later perhaps

Just to clarify, we've been most focused on end-station use-cases. We've
considered some bridge use-cases as well just to verify that the proposed
design won't be an issue if someone else goes for it.

- Andre
Vinicius Costa Gomes Sept. 8, 2017, 1:29 a.m. UTC | #12
Henrik Austad <henrik@austad.us> writes:

> On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>> Hi,
>>
>> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
>> be used to offload the configuration of traffic shapers into network devices
>> that provide support for them in HW. Our goal here is to start upstreaming
>> support for features related to the Time-Sensitive Networking (TSN) set of
>> standards into the kernel.
>
> Nice to see that others are working on this as well! :)
>
> A short disclaimer; I'm pretty much anchored in the view "linux is the
> end-station in a TSN domain", is this your approach as well, or are you
> looking at this driver to be used in bridges as well? (because that will
> affect the comments on time-aware shaper and frame preemption)
>
> Yet another disclaimer; I am not a linux networking subsystem expert. Not
> by a long shot! There is black magic happening in the internals of the
> networking subsystem that I am not even aware of. So if something I say or
> ask does not make sense _at_all_, that's probably why..
>
> I do know a tiny bit about TSN though, and I have been messing around
> with it for a little while, hence my comments below
>
>> As part of this work, we've assessed previous public discussions related to TSN
>> enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
>> at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
>> the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).
>
> /me eyes Cc ;p
>
>> Overview
>> ========
>>
>> Time-sensitive Networking (TSN) is a set of standards that aim to address
>> resource availability for providing bandwidth reservation and bounded latency
>> on Ethernet-based LANs. The proposal described here aims to cover mainly what
>> is needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
>> 802.1Qbu.
>>
>> The initial target of this work is the Intel i210 NIC, but other controllers'
>> datasheets were also taken into account, like the Renesas RZ/A1H and RZ/A1M
>> group and the Synopsys DesignWare Ethernet QoS controller.
>
> NXP has a TSN aware chip on the i.MX7 sabre board as well </fyi>

Cool. Will take a look.

>
>> Proposal
>> ========
>>
>> Feature-wise, what is covered here are configuration interfaces for HW
>> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
>> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
>> Qbv and Qbu must be configured per port, with the configuration covering all
>> queues. Given that these features are related to traffic shaping, and that the
>> traffic control subsystem already provides a queueing discipline that offloads
>> config into the device driver (i.e. mqprio), designing new qdiscs for the
>> specific purpose of offloading the config for each shaper seemed like a good
>> fit.
>
> just to be clear, you register sch_cbs as a subclass to mqprio, not as a
> root class?

That's right.

>
>> For steering traffic into the correct queues, we use the socket option
>> SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues.
>> The qdisc mqprio is currently used in our tests.
>
> Right, fair enough, I'd prefer the TSN qdisc to be the root-device and
> rather have mqprio for high priority traffic and another for 'everything
> else', but this would work too. This is not that relevant at this stage, I
> guess :)

That's a scenario I haven't considered, will give it some thought.

>
>> As for the shapers config interface:
>>
>>  * CBS (802.1Qav)
>>
>>    This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
>>    $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
>>      idleslope I
>
> So this confuses me a bit, why specify sendSlope?
>
>     sendSlope = idleSlope - portTransmitRate
>
> and portTransmitRate is the speed of the MAC (which you get from the
> driver). Adding sendSlope here is just redundant I think.
>
> Also, does this mean that when you create the qdisc, you have locked the
> bandwidth for the scheduler? Meaning, if I later want to add another
> stream that requires more bandwidth, I have to close all active streams,
> reconfigure the qdisc and then restart?
>
>>    Note that the parameters for this qdisc are the ones defined by the
>>    802.1Q-2014 spec, so no hardware specific functionality is exposed here.
>
> You do need to know if the link is brought up as 100 or 1000 though - which
> the driver already knows.
>
>>  * Time-aware shaper (802.1Qbv):
>>
>>    The idea we are currently exploring is to add a "time-aware", priority based
>>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
>
> As far as I know, this is not supported by i210, and if time-aware shaping
> is enabled in the network - you'll be queued on a bridge until the window
> opens as time-aware shaping is enforced on the tx-port and not on rx. Is
> this required in this driver?

Yeah, i210 doesn't support the time-aware shaper. I think the second
part of your question doesn't really apply, then.

>
>>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>>         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>>         queues 0 1 2 3 \
>>         sched-file gates.sched [base-time <interval>] \
>>         [cycle-time <interval>] [extension-time <interval>]
>
> That was a lot of priorities! 802.1Q lists 8 priorities, where do these
> 16 come from?

Even if the 802.1Q only defines 8 priorities, the Linux network stack
supports a lot more (and this command line is more than slightly
inspired by the mqprio equivalent).

>
> You map pri 0,1 to queue 2, pri 2 to queue 1 (Class B), pri 3 to queue 0
> (class A) and everything else to queue 3. This is what I would expect,
> except for the additional 8 priorities.
>
>>    <file> is multi-line, with each line being of the following format:
>>    <cmd> <gate mask> <interval in nanoseconds>
>>
>>    Qbv only defines one <cmd>: "S" for 'SetGates'
>>
>>    For example:
>>
>>    S 0x01 300
>>    S 0x03 500
>>
>>    This means that there are two intervals, the first will have the gate
>>    for traffic class 0 open for 300 nanoseconds, the second will have
>>    both traffic classes open for 500 nanoseconds.
>
> Are you aware of any hw except dedicated switching stuff that supports
> this? (meant as "I'm curious and would like to know")

Not really. I couldn't find any public documentation about products
destined for end stations that support this. I, too, would like to know
more.

>
>>    Additionally, an option to set just one entry of the gate control list will
>>    also be provided by 'taprio':
>>
>>    $ tc qdisc (...) \
>>         sched-row <row number> <cmd> <gate mask> <interval>  \
>>         [base-time <interval>] [cycle-time <interval>] \
>>         [extension-time <interval>]
>>
>>
>>  * Frame Preemption (802.1Qbu):
>
> So Frame preemption is nice, but my understanding of Qbu is that the real
> benefit is at the bridges and not in the endpoints. As jumbo-frames are
> explicitly disallowed in Qav, the maximum latency incurred by a frame in
> flight is 12us on a 1Gbps link. I am not sure these 12us will be the main
> delay in your application.
>
> Or have I missed some crucial point here?


You don't seem to have missed anything. What I see as the biggest point
of frame preemption is that, when it is used with scheduled traffic, you
could keep the gates of the preemptable traffic classes always open, have
a few time windows for periodic traffic, and still have predictable
behaviour for unscheduled "emergency" traffic.


Cheers,
--
Vinicius
Henrik Austad Sept. 8, 2017, 6:06 a.m. UTC | #13
On Thu, Sep 07, 2017 at 07:58:53PM +0000, Guedes, Andre wrote:
> Hi Henrik,
> 
> Thanks for your feedback! I'll address some of your comments below.
> 
> On Thu, 2017-09-07 at 07:34 +0200, Henrik Austad wrote:
> > > As for the shapers config interface:
> > > 
> > >  * CBS (802.1Qav)
> > > 
> > >    This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line
> > > is:
> > >    $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S
> > > \
> > >      idleslope I
> > 
> > So this confuses me a bit, why specify sendSlope?
> > 
> >     sendSlope = idleSlope - portTransmitRate
> > 
> > and portTransmitRate is the speed of the MAC (which you get from the 
> > driver). Adding sendSlope here is just redundant I think.
> 
> Yes, this was something we spent quite some time discussing before this RFC
> series. After reading Annex L from 802.1Q-2014 (operation of the CBS algorithm)
> so many times, we came up with the rationale explained below.
> 
> The rationale here is that sendSlope is just another parameter of the CBS
> algorithm, like idleSlope, hiCredit and loCredit. As such, its calculation
> should be done at the same "layer" as the other parameters (in this case, user
> space) in order to keep consistency. Moreover, in this design, the driver layer
> is dead simple: all the device driver has to do is apply the CBS parameters to
> the hardware. Having any CBS parameter calculation in the driver layer means
> all device drivers must implement that calculation.

Ok, that actually makes a lot of sense, and anything that keeps this kind 
of arithmetic outside the kernel is a good thing!

Thanks for the clarification!

> > Also, does this mean that when you create the qdisc, you have locked the 
> > bandwidth for the scheduler? Meaning, if I later want to add another 
> > stream that requires more bandwidth, I have to close all active streams, 
> > reconfigure the qdisc and then restart?
> 
> If we want to reserve more bandwidth to "accommodate" a new stream, we don't
> need to close all active streams. All we have to do is change the CBS qdisc
> and pass the new CBS parameters. Here is what the command-line would look like:
> 
> $ tc qdisc change dev enp0s4 parent 8001:5 cbs locredit -1470 hicredit 30 \
>      sendslope -980000 idleslope 20000
> 
> No application/stream is interrupted while new CBS parameters are applied.

Ah, good.

> > >    Note that the parameters for this qdisc are the ones defined by the
> > >    802.1Q-2014 spec, so no hardware specific functionality is exposed here.
> > 
> > You do need to know if the link is brought up as 100 or 1000 though - which 
> > the driver already knows.
> 
> User space knows that information via ethtool or /sys.

Fair point.

> > > Testing this RFC
> > > ================
> > > 
> > > For testing the patches of this RFC only, you can refer to the samples and
> > > helper script being added to samples/tsn/ and use the 'mqprio' qdisc to
> > > set up the priorities to Tx queues mapping, together with the 'cbs' qdisc to
> > > configure the HW shaper of the i210 controller:
> > 
> > I will test it, feedback will be provided soon! :)
> 
> That's great! Please let us know if you find any issues, and thanks for your
> support.
> 
> > > 8) You can also run a Talker for class B (prio 2 here)
> > > $ ./talker -i enp3s0 -p 2
> > > 
> > >  * The bandwidth displayed on the listener output now should increase to
> > > very
> > >    close to the one configured for class A + class B.
> > 
> > Because you grab both class A *and* B, or because B will eat what A does 
> > not use?
> 
> Because the listener application grabs both class A and B traffic.

Right, got it.

Thanks for the feedback, I'm getting really excited about this! :D
Richard Cochran Sept. 12, 2017, 4:56 a.m. UTC | #14
On Thu, Sep 07, 2017 at 06:29:00PM -0700, Vinicius Costa Gomes wrote:
> >>  * Time-aware shaper (802.1Qbv):
> >>
> >>    The idea we are currently exploring is to add a "time-aware", priority based
> >>    qdisc, that also exposes the Tx queues available and provides a mechanism for
> >>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
> >>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> >
> > As far as I know, this is not supported by i210, and if time-aware shaping
> > is enabled in the network - you'll be queued on a bridge until the window
> > opens as time-aware shaping is enforced on the tx-port and not on rx. Is
> > this required in this driver?
> 
> Yeah, i210 doesn't support the time-aware shaper. I think the second
> part of your question doesn't really apply, then.

Actually, you can implement 802.1Qbv (as an end station) quite easily
using the i210.  I'll show how by posting a series after net-next
opens up again.

Thanks,
Richard
Richard Cochran Sept. 18, 2017, 8:02 a.m. UTC | #15
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>  * Time-aware shaper (802.1Qbv):

I just posted a working alternative showing how to handle 802.1Qbv and
many other Ethernet field buses.
 
>    The idea we are currently exploring is to add a "time-aware", priority based
>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> 
>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>         map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>         queues 0 1 2 3 \
>         sched-file gates.sched [base-time <interval>] \
>         [cycle-time <interval>] [extension-time <interval>]
> 
>    <file> is multi-line, with each line being of the following format:
>    <cmd> <gate mask> <interval in nanoseconds>
> 
>    Qbv only defines one <cmd>: "S" for 'SetGates'
> 
>    For example:
> 
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.

The idea of the schedule file will not work in practice.  Consider the
fact that the application wants to deliver time critical data in a
particular slot.  How can it find out a) what the time slots are and
b) when the next slot is scheduled?  With this Qdisc, it cannot do
this, AFAICT.  The admin might delete the file after configuring the
Qdisc!

Using the SO_TXTIME option, the application has total control over the
scheduling.  The great advantages of this approach is that we can
support any possible combination of periodic or aperiodic scheduling
and we can support any priority scheme user space dreams up.

For example, one can imagine running two or more loops that only
occasionally collide.  When they do collide, which packet should be
sent first?  Just let user space decide.

Thanks,
Richard
Richard Cochran Sept. 18, 2017, 8:12 a.m. UTC | #16
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.

Just for the record, here is my score card showing the current status
of TSN support in Linux.  Comments and corrections are most welcome.

Thanks,
Richard


 | FEATURE                                        | STANDARD            | STATUS                       |
 |------------------------------------------------+---------------------+------------------------------|
 | Synchronization                                | 802.1AS-2011        | Implemented in               |
 |                                                |                     | - Linux kernel PHC subsystem |
 |                                                |                     | - linuxptp (userspace)       |
 |------------------------------------------------+---------------------+------------------------------|
 | Forwarding and Queuing Enhancements            | 802.1Q-2014 sec. 34 | RFC posted (this thread)     |
 | for Time-Sensitive Streams (FQTSS)             |                     |                              |
 |------------------------------------------------+---------------------+------------------------------|
 | Stream Reservation Protocol (SRP)              | 802.1Q-2014 sec. 35 | in Open-AVB [1]              |
 |------------------------------------------------+---------------------+------------------------------|
 | Audio Video Transport Protocol (AVTP)          | IEEE 1722-2011      | DNE                          |
 |------------------------------------------------+---------------------+------------------------------|
 | Audio/Video Device Discovery, Enumeration,     | IEEE 1722.1-2013    | jdksavdecc-c [2]             |
 | Connection Management and Control (AVDECC)     |                     |                              |
 | AVDECC Connection Management Protocol (ACMP)   |                     |                              |
 | AVDECC Enumeration and Control Protocol (AECP) |                     |                              |
 | MAC Address Acquisition Protocol (MAAP)        |                     | in Open-AVB                  |
 |------------------------------------------------+---------------------+------------------------------|
 | Frame Preemption                               | P802.1Qbu           | DNE                          |
 | Scheduled Traffic                              | P802.1Qbv           | RFC posted (SO_TXTIME)       |
 | SRP Enhancements and Performance Improvements  | P802.1Qcc           | DNE                          |

 DNE = Does Not Exist (to my knowledge)

1. https://github.com/Avnu/OpenAvnu

   (DISCLAIMER from the website:)

   It is planned to eventually include the various packet encapsulation types,
   protocol discovery daemons, libraries to convert media clocks to AVB clocks
   and vice versa, and drivers.

   This repository does not include all components required to build a full
   production AVB/TSN system (e.g. a turnkey solution to stream stored or live audio
   or video content). Some simple example applications are provided which
   illustrate the flow - but a professional Audio/Video system requires a full media stack
   - including audio and video inputs and outputs, media processing elements, and
   various graphical user interfaces. Various companies provide such integrated
   solutions.

2. https://github.com/jdkoftinoff/jdksavdecc-c
Henrik Austad Sept. 18, 2017, 11:46 a.m. UTC | #17
On Mon, Sep 18, 2017 at 10:02:14AM +0200, Richard Cochran wrote:
> On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> >  * Time-aware shaper (802.1Qbv):
> 
> I just posted a working alternative showing how to handle 802.1Qbv and
> many other Ethernet field buses.

Yes, I saw them, grabbing them for testing now - thanks!

> >    The idea we are currently exploring is to add a "time-aware", priority based
> >    qdisc, that also exposes the Tx queues available and provides a mechanism for
> >    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
> >    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> > 
> >    $ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4    \
> >      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
> > 	   queues 0 1 2 3                                              \
> >      	   sched-file gates.sched [base-time <interval>]               \
> >            [cycle-time <interval>] [extension-time <interval>]
> > 
> >    <file> is multi-line, with each line being of the following format:
> >    <cmd> <gate mask> <interval in nanoseconds>
> > 
> >    Qbv only defines one <cmd>: "S" for 'SetGates'
> > 
> >    For example:
> > 
> >    S 0x01 300
> >    S 0x03 500
> > 
> >    This means that there are two intervals, the first will have the gate
> >    for traffic class 0 open for 300 nanoseconds, the second will have
> >    both traffic classes open for 500 nanoseconds.
> 
> The idea of the schedule file will not work in practice.  Consider the
> fact that the application wants to deliver time critical data in a
> particular slot.  How can it find out a) what the time slots are and
> b) when the next slot is scheduled?  With this Qdisc, it cannot do
> this, AFAICT.  The admin might delete the file after configuring the
> Qdisc!
> 
> Using the SO_TXTIME option, the application has total control over the
> scheduling.  The great advantages of this approach is that we can
> support any possible combination of periodic or aperiodic scheduling
> and we can support any priority scheme user space dreams up.

Using SO_TXTIME makes a lot of sense. TSN has a presentation_time, from
which you can deduce the time a frame should be transmitted (Class A has
a 2 ms latency guarantee, Class B 50 ms), but given how TSN encodes the
timestamp, it wraps every ~4.3 seconds; SO_TXTIME allows you to schedule
transmission at a much later time. It should also lessen the dependency
on a specific protocol, which is also good.
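
Roughly, the unwrap arithmetic would look something like this (just a
sketch, untested; it assumes the AVTP timestamp is gPTP time in
nanoseconds modulo 2^32 and that the presentation time lies in the near
future):

    #include <stdint.h>

    /* Sketch: recover a full 64-bit PTP launch time (in ns) from the
     * 32-bit AVTP presentation timestamp, which wraps every 2^32 ns
     * (~4.295 s).
     */
    static uint64_t avtp_unwrap(uint64_t now_ns, uint32_t avtp_ts)
    {
        uint64_t t = (now_ns & ~0xffffffffULL) | avtp_ts;

        if (t < now_ns)              /* candidate already passed, so */
            t += 0x100000000ULL;     /* it belongs to the next wrap  */
        return t;
    }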

> For example, one can imaging running two or more loops that only
> occasionally collide.  When they do collide, which packet should be
> sent first?  Just let user space decide.

If two userspace apps send to the same Tx queue with the same priority,
would it not make sense to just do FIFO? For all practical purposes, they
have the same importance (same SO_PRIORITY, same SO_TXTIME). If the
priority differs, then they would be directed to different queues, where
one queue will take precedence anyway.

How far into the future would it make sense to schedule packets anyway?

I'll have a look at the other series you just posted!
Vinicius Costa Gomes Sept. 18, 2017, 11:06 p.m. UTC | #18
Hi Richard,

Richard Cochran <richardcochran@gmail.com> writes:

> On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
>>  * Time-aware shaper (802.1Qbv):
>
> I just posted a working alternative showing how to handle 802.1Qbv and
> many other Ethernet field buses.
>
>>    The idea we are currently exploring is to add a "time-aware", priority based
>>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
>>
>>    $ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4    \
>>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
>> 	   queues 0 1 2 3                                              \
>>      	   sched-file gates.sched [base-time <interval>]               \
>>            [cycle-time <interval>] [extension-time <interval>]
>>
>>    <file> is multi-line, with each line being of the following format:
>>    <cmd> <gate mask> <interval in nanoseconds>
>>
>>    Qbv only defines one <cmd>: "S" for 'SetGates'
>>
>>    For example:
>>
>>    S 0x01 300
>>    S 0x03 500
>>
>>    This means that there are two intervals, the first will have the gate
>>    for traffic class 0 open for 300 nanoseconds, the second will have
>>    both traffic classes open for 500 nanoseconds.
>
> The idea of the schedule file will not work in practice.  Consider the
> fact that the application wants to deliver time critical data in a
> particular slot.  How can it find out a) what the time slots are and
> b) when the next slot is scheduled?  With this Qdisc, it cannot do
> this, AFAICT.  The admin might delete the file after configuring the
> Qdisc!

That's the point, the application does not need to know that, and asking
that would be stupid. From the point of view of the Qbv specification,
applications only need to care about their basic bandwidth requirements:
interval, frame size, frames per interval (using the terms of the SRP
section of 802.1Q). The traffic schedule is provided (out of band) by a
"god box" which knows the requirements of all applications on all the
nodes and how they are connected.

(And that's another nice point of how 802.1Qbv works, applications do
not need to be changed to use it, and I think we should work to achieve
this on the Linux side)

That being said, this only works for kinds of traffic that map well to
this configure-in-advance model, which is the model that the IEEE
(see 802.1Qcc) and the AVnu Alliance[1] are pushing for.

In the real world, I can see multiple types of applications, some using
something like TXTIME, and some configured in advance.

>
> Using the SO_TXTIME option, the application has total control over the
> scheduling.  The great advantages of this approach is that we can
> support any possible combination of periodic or aperiodic scheduling
> and we can support any priority scheme user space dreams up.

It has the disadvantage that the scheduling information has to be
in-band with the data. I *really* think that for scheduled traffic
there should be a clear separation: we should not mix the data flow with
scheduling. In short, an application in the network need not have all
the information necessary to schedule its own traffic well.

I have two points here: 1. I see both "solutions" (taprio and SO_TXTIME)
as orthogonal, and both as useful; 2. trying to make one do the job of
the other, however, looks like "if all I have is a hammer, everything
looks like a nail".

In short, I see a per-packet transmission time and a per-queue schedule
as solutions to different problems.

>
> For example, one can imaging running two or more loops that only
> occasionally collide.  When they do collide, which packet should be
> sent first?  Just let user space decide.
>
> Thanks,
> Richard

Cheers,
--
Vinicius

[1]
http://avnu.org/theory-of-operation-for-tsn-enabled-industrial-systems/
Richard Cochran Sept. 19, 2017, 5:22 a.m. UTC | #19
On Mon, Sep 18, 2017 at 04:06:28PM -0700, Vinicius Costa Gomes wrote:
> That's the point, the application does not need to know that, and asking
> that would be stupid.

On the contrary, this information is essential to the application.
Probably you have never seen an actual Ethernet field bus in
operation?  In any case, you are missing the point.

> (And that's another nice point of how 802.1Qbv works, applications do
> not need to be changed to use it, and I think we should work to achieve
> this on the Linux side)

Once you start to care about real-time performance, then you need to
consider the applications.  This is industrial control, not streaming
your tunes from your iPod.
 
> That being said, that only works for kinds of traffic that maps well to
> this configuration in advance model, which is the model that the IEEE
> (see 802.1Qcc) and the AVNU Alliance[1] are pushing for.

Again, you are missing the point of what they are aiming for.  I have
looked at a number of production systems, and in each case the
developers want total control over the transmission, in order to
reduce latency to an absolute minimum.  Typically the data to be sent
are available only microseconds before the transmission deadline.

Consider OpenAVB on github that people are already using.  Take a look
at simple_talker.c and explain how "applications do not need to be
changed to use it."

> [1]
> http://avnu.org/theory-of-operation-for-tsn-enabled-industrial-systems/

Did you even read this?

    [page 24]

    As described in section 2, some industrial control systems require
    predictable, very low latency and cycle-to-cycle variation to meet
    hard real-time application requirements. In these systems,
    multiple distributed controllers commonly synchronize their
    sensor/actuator operations with other controllers by scheduling
    these operations in time, typically using a repeating control
    cycle.
    ...
    The gate control mechanism is itself a time-aware PTP application
    operating within a bridge or end station port.

It is an application, not a "god box."

> In short, I see a per-packet transmission time and a per-queue schedule
> as solutions to different problems.

Well, I can agree with that.  For some non-real-time applications,
bandwidth shaping is enough, and your Qdisc idea is sufficient.  For
the really challenging TSN targets (industrial control, automotive),
your idea of an opaque schedule file won't fly.

Thanks,
Richard
Henrik Austad Sept. 19, 2017, 1:14 p.m. UTC | #20
Hi all,

On Tue, Sep 19, 2017 at 07:22:44AM +0200, Richard Cochran wrote:
> On Mon, Sep 18, 2017 at 04:06:28PM -0700, Vinicius Costa Gomes wrote:
> > That's the point, the application does not need to know that, and asking
> > that would be stupid.
> 
> On the contrary, this information is essential to the application.
> Probably you have never seen an actual Ethernet field bus in
> operation?  In any case, you are missing the point.
> 
> > (And that's another nice point of how 802.1Qbv works, applications do
> > not need to be changed to use it, and I think we should work to achieve
> > this on the Linux side)
> 
> Once you start to care about real time performance, then you need to
> consider the applications.  This is industrial control, not streaming
> your tunes from your ipod.

Do not underestimate the need for media over TSN. I fully see your point
about real-time systems, but they are not the only valid use case for TSN.

> > That being said, that only works for kinds of traffic that maps well to
> > this configuration in advance model, which is the model that the IEEE
> > (see 802.1Qcc) and the AVNU Alliance[1] are pushing for.
> 
> Again, you are missing the point of what they aiming for.  I have
> looked at a number of production systems, and in each case the
> developers want total control over the transmission, in order to
> reduce latency to an absolute minimum.  Typically the data to be sent
> are available only microseconds before the transmission deadline.
> 
> Consider OpenAVB on github that people are already using.  Take a look
> at simple_talker.c and explain how "applications do not need to be
> changed to use it."

I do not think simple_talker was ever intended to show how users of AVB
should implement things, but rather to demonstrate what the protocol could do.

ALSA/V4L2 should supply some interface to this so that you can attach
media applications to it without the application itself having to be "TSN
aware".

> > [1]
> > http://avnu.org/theory-of-operation-for-tsn-enabled-industrial-systems/
> 
> Did you even read this?
> 
>     [page 24]
> 
>     As described in section 2, some industrial control systems require
>     predictable, very low latency and cycle-to-cycle variation to meet
>     hard real-time application requirements. In these systems,
>     multiple distributed controllers commonly synchronize their
>     sensor/actuator operations with other controllers by scheduling
>     these operations in time, typically using a repeating control
>     cycle.
>     ...
>     The gate control mechanism is itself a time-aware PTP application
>     operating within a bridge or end station port.
> 
> It is an application, not a "god box."
>
> > In short, I see a per-packet transmission time and a per-queue schedule
> > as solutions to different problems.
> 
> Well, I can agree with that.  For some non real-time applications,
> bandwidth shaping is enough, and your Qdisc idea is sufficient.  For
> the really challenging TSN targets (industrial control, automotive),
> your idea of an opaque schedule file won't fly.

Would it make sense to adapt the proposed Qdisc here, as well as the
back-of-the-napkin idea in the other thread, to use a per-socket queue for
each priority and then sort those sockets based on SO_TXTIME?

TSN operates on a per-StreamID basis, and that should map fairly well to a
per-socket approach, I think (let us just assume that an application that
sends TSN traffic will open up a separate socket for each stream).

This should allow a userspace application that is _very_ aware of its 
timing constraints to send frames exactly when it needs to as you have 
SO_TXTIME available. It would also let applications that basically want a 
fine-grained rate control (audio and video comes to mind) to use the same 
qdisc.

For those sockets that do not use SO_TXTIME, but still map to a
priority handled by sch_cbs (or whatever it'll end up being called), you
can set the transmit time to be the time of the last skb in the queue
plus a delta that gives you the correct rate (TSN operates on observation
intervals, which you can specify via tc when you create the queues); see
the sketch below.
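
A sketch of that fallback (untested; 'idleslope_bps' stands for whatever
reserved-rate parameter the queue ends up carrying):

    #include <stdint.h>

    /* Sketch: pick a transmit time for a packet from a socket that did
     * not set SO_TXTIME, so that the queue drains at the reserved rate.
     */
    static uint64_t next_txtime(uint64_t last_txtime, uint64_t now_ns,
                                uint32_t pkt_bytes, uint64_t idleslope_bps)
    {
        uint64_t delta = (uint64_t)pkt_bytes * 8 * 1000000000ULL /
                         idleslope_bps;
        uint64_t t = last_txtime + delta;

        return t > now_ns ? t : now_ns; /* never schedule in the past */
    }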

Then you can have, as you propose in your other series, an hrtimer that
fires when the next SO_TXTIME deadline arrives, grabs an skb and moves it
to the HW queue. This should also allow you to keep a sorted per-socket
queue should an application send frames in the wrong order, without having
to rearrange descriptors for the DMA machinery.

If this makes sense, I am more than happy to give it a stab and see how it 
goes.

-Henrik
Vinicius Costa Gomes Sept. 20, 2017, 12:19 a.m. UTC | #21
Hi Richard,

Richard Cochran <richardcochran@gmail.com> writes:

> On Mon, Sep 18, 2017 at 04:06:28PM -0700, Vinicius Costa Gomes wrote:
>> That's the point, the application does not need to know that, and asking
>> that would be stupid.
>
> On the contrary, this information is essential to the application.
> Probably you have never seen an actual Ethernet field bus in
> operation?  In any case, you are missing the point.
>
>> (And that's another nice point of how 802.1Qbv works, applications do
>> not need to be changed to use it, and I think we should work to achieve
>> this on the Linux side)
>
> Once you start to care about real time performance, then you need to
> consider the applications.  This is industrial control, not streaming
> your tunes from your ipod.
>
>> That being said, that only works for kinds of traffic that maps well to
>> this configuration in advance model, which is the model that the IEEE
>> (see 802.1Qcc) and the AVNU Alliance[1] are pushing for.
>
> Again, you are missing the point of what they aiming for.  I have
> looked at a number of production systems, and in each case the
> developers want total control over the transmission, in order to
> reduce latency to an absolute minimum.  Typically the data to be sent
> are available only microseconds before the transmission deadline.
>
> Consider OpenAVB on github that people are already using.  Take a look
> at simple_talker.c and explain how "applications do not need to be
> changed to use it."

Just let me use the mention of OpenAVNU as a hook to explain what we
(the team I am part of) are working on; perhaps it will make our
choices and designs clearer.

One of the problems with OpenAVNU is that it's too tightly coupled to
the i210 NIC. One of the things we want is to decouple OpenAVNU from the
controller. The way we thought best was to propose interfaces (that
would work alongside the Linux networking stack) as close as possible
to what the current standards define, meaning the IEEE 802.1Q family of
specifications, in the hope that network controller vendors would also
look at the specifications when designing their controllers.

Our objective with the Qdiscs we are proposing (both cbs and taprio) is
to provide a sane way to configure controllers that support TSN features
(we were looking specifically at the IEEE specs).

After we have some rough consensus on the interfaces to use, then we can
start working on OpenAVNU.

>
>> [1]
>> http://avnu.org/theory-of-operation-for-tsn-enabled-industrial-systems/
>
> Did you even read this?
>
>     [page 24]
>
>     As described in section 2, some industrial control systems require
>     predictable, very low latency and cycle-to-cycle variation to meet
>     hard real-time application requirements. In these systems,
>     multiple distributed controllers commonly synchronize their
>     sensor/actuator operations with other controllers by scheduling
>     these operations in time, typically using a repeating control
>     cycle.
>     ...
>     The gate control mechanism is itself a time-aware PTP application
>     operating within a bridge or end station port.
>
> It is an application, not a "god box."
>
>> In short, I see a per-packet transmission time and a per-queue schedule
>> as solutions to different problems.
>
> Well, I can agree with that.  For some non real-time applications,
> bandwidth shaping is enough, and your Qdisc idea is sufficient.  For
> the really challenging TSN targets (industrial control, automotive),
> your idea of an opaque schedule file won't fly.

(Sorry if I am being annoying here, but the idea of an opaque schedule
is not ours; it comes from the people who wrote the Qbv specification.)

I have a question: what about a controller that doesn't provide a way to
set a per-packet transmission time, but supports Qbv/Qbu? What would be
your proposal to configure it?

(I think LaunchTime is something specific to the i210, right?)


Cheers,
--
Vinicius
Levi Pearson Sept. 20, 2017, 1:59 a.m. UTC | #22
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> Hi,
> 
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.

I'm very excited to see these features moving into the kernel! I am one of the
maintainers of the OpenAvnu project and I've been involved in building AVB/TSN
systems and working on the standards for around 10 years, so the support that's
been slowly making it into more silicon and now Linux drivers is very
encouraging.

My team at Harman is working on endpoint code based on what's in the OpenAvnu
project and a few Linux-based platforms. The Qav interface you've proposed will
fit nicely with our traffic shaper management daemon, which already uses mqprio
as a base but uses the htb shaper to approximate the Qav credit-based shaper on
platforms where launch time scheduling isn't available.

I've applied your patches and plan on testing them in conjunction with our
shaper manager to see if we run into any hitches, but I don't expect any
problems.

> As part of this work, we've assessed previous public discussions related to TSN
> enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
> at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
> the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).
> 
> Please note that the patches provided as part of this RFC are implementing what
> is needed only for 802.1Qav (FQTSS) only, but we'd like to take advantage of
> this discussion and share our WIP ideas for the 802.1Qbv and 802.1Qbu interfaces
> as well. The current patches are only providing support for HW offload of the
> configs.
> 
> 
> Overview
> ========
> 
> Time-sensitive Networking (TSN) is a set of standards that aim to address
> resources availability for providing bandwidth reservation and bounded latency
> on Ethernet based LANs. The proposal described here aims to cover mainly what is
> needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
> 802.1Qbu.
> 
> The initial target of this work is the Intel i210 NIC, but other controllers'
> datasheet were also taken into account, like the Renesas RZ/A1H RZ/A1M group and
> the Synopsis DesignWare Ethernet QoS controller.

Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8 parts) support
Qav shaping as well as scheduled launch functionality; these are the parts I 
have been mostly working with. Marvell silicon (some subset of Armada processors
and Link Street DSA switches) generally supports traffic shaping as well.

I think a lack of an interface like this has probably slowed upstream driver
support for this functionality where it exists; most vendors have an out-of-
tree version of their driver with TSN functionality enabled via non-standard
interfaces. Hopefully making it available will encourage vendors to upstream
their driver support!

> Proposal
> ========
> 
> Feature-wise, what is covered here are configuration interfaces for HW
> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
> Qbv and Qbu must be configured per port, with the configuration covering all
> queues. Given that these features are related to traffic shaping, and that the
> traffic control subsystem already provides a queueing discipline that offloads
> config into the device driver (i.e. mqprio), designing new qdiscs for the
> specific purpose of offloading the config for each shaper seemed like a good
> fit.

This makes sense to me too. The 802.1Q standards are all based on the sort of
mappings between priority, traffic class, and hardware queues that the existing
tc infrastructure seems to be modeling. I believe the mqprio module's mapping
scheme is flexible enough to meet any TSN needs in conjunction with the other
parts of the kernel qdisc system.
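
For what it's worth, the application side of that steering is a single
setsockopt per socket; a minimal sketch:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Sketch: steer a socket into the priority that mqprio/taprio maps
     * to the desired traffic class / Tx queue. Priorities outside 0-6
     * require CAP_NET_ADMIN.
     */
    static int set_stream_priority(int fd, int prio)
    {
        if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY,
                       &prio, sizeof(prio)) < 0) {
            perror("setsockopt(SO_PRIORITY)");
            return -1;
        }
        return 0;
    }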

> For steering traffic into the correct queues, we use the socket option
> SO_PRIORITY and then a mechanism to map priority to traffic classes / Txqueues.
> The qdisc mqprio is currently used in our tests.
> 
> As for the shapers config interface:
> 
>  * CBS (802.1Qav)
> 
>    This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
>    $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
>      idleslope I
> 
>    Note that the parameters for this qdisc are the ones defined by the
>    802.1Q-2014 spec, so no hardware specific functionality is exposed here.

These parameters look good to me as a baseline; some additional optional
parameters may be useful for software-based implementations--such as setting an
interval at which to recalculate queues--but those can be discussed later.
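
For reference, my reading of how these four values are derived from a
bandwidth reservation (cf. 802.1Q-2014 Annex L); a sketch only, and the
units should be double-checked against the final qdisc:

    #include <stdint.h>

    /* Sketch: derive CBS parameters from a reservation. Rates in
     * bits/s, frame sizes in bytes; credits come out in bytes
     * (locredit is negative, since sendslope is negative).
     */
    struct cbs_params {
        int64_t idleslope, sendslope, hicredit, locredit;
    };

    static struct cbs_params cbs_compute(int64_t reserved_bps,
                                         int64_t port_bps,
                                         int64_t max_frame,
                                         int64_t max_interference)
    {
        struct cbs_params p;

        p.idleslope = reserved_bps;            /* credit gained while idle */
        p.sendslope = reserved_bps - port_bps; /* credit lost while sending */
        p.hicredit  = max_interference * p.idleslope / port_bps;
        p.locredit  = max_frame * p.sendslope / port_bps;
        return p;
    }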

>  * Time-aware shaper (802.1Qbv):

I haven't come across any specific NIC or SoC MAC that does Qbv, but I have
been experimenting with an EspressoBin board, which has a "Topaz" DSA switch
in it that has some features intended for Qbv support, although they were done
with a draft version in mind.

I haven't looked at the interaction between the qdisc subsystem and DSA yet,
but this mechanism might be useful to configure Qbv on the slave ports in
that context. I've got both the board and the documentation, so I might be
able to work on an implementation at some point.

If some endpoint device shows up with direct Qbv support, this interface would
probably work well there too, although a talker would need to be able to
schedule its transmits pretty precisely to achieve the lowest possible latency.

>    The idea we are currently exploring is to add a "time-aware", priority based
>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> 
>    $ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4    \
>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
> 	   queues 0 1 2 3                                              \
>      	   sched-file gates.sched [base-time <interval>]               \
>            [cycle-time <interval>] [extension-time <interval>]

One concern here is calling the base-time parameter an interval; it's really
an absolute time with respect to the PTP timescale. Good documentation will
be important to this one, since the specification discusses some subtleties
regarding the impact of different time values chosen here.

The format for specifying the actual intervals such as cycle-time could prove
to be an important detail as well; Qbv specifies cycle-time as a ratio of two
integers expressed in seconds, while extension-time is specified as an integer
number of nanoseconds.

Precision with the cycle-time is especially important, since base-time can be
almost arbitrarily far in the past or future, and any given cycle start should
be calculable from the base-time plus/minus some integer multiple of cycle-
time.
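
In other words, an implementation should be able to compute any cycle
start with plain integer math; a sketch (all values in nanoseconds on
the PTP timescale):

    #include <stdint.h>

    /* Sketch: first cycle start at or after 'now', for a base-time
     * that may lie arbitrarily far in the past or the future.
     */
    static int64_t next_cycle_start(int64_t base_time, int64_t cycle_time,
                                    int64_t now)
    {
        int64_t n;

        if (now <= base_time)
            return base_time;
        n = (now - base_time + cycle_time - 1) / cycle_time; /* ceil */
        return base_time + n * cycle_time;
    }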

>    <file> is multi-line, with each line being of the following format:
>    <cmd> <gate mask> <interval in nanoseconds>
> 
>    Qbv only defines one <cmd>: "S" for 'SetGates'
> 
>    For example:
> 
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.
> 
>    Additionally, an option to set just one entry of the gate control list will
>    also be provided by 'taprio':
> 
>    $ tc qdisc (...) \
>         sched-row <row number> <cmd> <gate mask> <interval>  \
>         [base-time <interval>] [cycle-time <interval>] \
>         [extension-time <interval>]

If I understand correctly, 'sched-row' is meant to be usable multiple times in
a single command and the 'sched-file' option is just a shorthand version for
large tables? Or is it meant to update an existing schedule table? It doesn't
seem very useful if it can only be specified once when the whole taprio
instance is being established.

>  * Frame Preemption (802.1Qbu):
> 
>    To control even further the latency, it may prove useful to signal which
>    traffic classes are marked as preemptable. For that, 'taprio' provides the
>    preemption command so you set each traffic class as preemptable or not:
> 
>    $ tc qdisc (...) \
>         preemption 0 1 1 1
> 
> 
>  * Time-aware shaper + Preemption:
> 
>    As an example of how Qbv and Qbu can be used together, we may specify
>    both the schedule and the preempt-mask, and this way we may also
>    specify the Set-Gates-and-Hold and Set-Gates-and-Release commands as
>    specified in the Qbu spec:
> 
>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                    \
> 	   queues 0 1 2 3                                         \
>      	   preemption 0 1 1 1                                     \
> 	   sched-file preempt_gates.sched
> 
>     <file> is multi-line, with each line being of the following format:
>     <cmd> <gate mask> <interval in nanoseconds>
> 
>     For this case, two new commands are introduced:
> 
>     "H" for 'set gates and hold'
>     "R" for 'set gates and release'
> 
>     H 0x01 300
>     R 0x03 500
> 

The new Hold and Release gate commands look right, but I'm not sure about the
preemption flags. Qbu describes a preemption parameter table indexed by
*priority* rather than traffic class or queue. These select which of two MAC
service interfaces is used by the frame at the ISS layer, either express or
preemptable, at the time the frame is selected for transmit. If my
understanding is correct, it's possible to map a preemptable priority as well
as an express priority to the same queue, so flagging preemptability at the
queue level is not correct.
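
Conceptually, the Qbu table is something like the following (a sketch of
my reading of the spec, not a proposed kernel interface):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: Qbu's preemption parameter table is indexed by priority;
     * each entry selects the express or the preemptable MAC service
     * interface for frames of that priority at transmit time.
     */
    enum fp_status { FP_EXPRESS, FP_PREEMPTABLE };

    static enum fp_status fp_table[8] = {
        [0] = FP_EXPRESS,
        [1] = FP_PREEMPTABLE,
        /* remaining priorities default to FP_EXPRESS */
    };

    static bool frame_is_preemptable(uint8_t prio)
    {
        return fp_table[prio & 7] == FP_PREEMPTABLE;
    }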

I'm not aware of any endpoint interfaces that support Qbu either, nor do I 
know of any switches that support it that someone could experiment with right
now, so there's no pressure on getting that interface nailed down yet.

Hopefully you find this feedback useful, and I appreciate the effort taken to
get the RFC posted here!

Levi
Levi Pearson Sept. 20, 2017, 5:17 a.m. UTC | #23
On Mon, Sep 18, 2017, Richard Cochran wrote:
> Just for the record, here is my score card showing the current status
> of TSN support in Linux.  Comments and corrections are more welcome.
> 
> Thanks,
> Richard
> 
> 
>  | FEATURE                                        | STANDARD            | STATUS                       |
>  |------------------------------------------------+---------------------+------------------------------|
>  | Synchronization                                | 802.1AS-2011        | Implemented in               |
>  |                                                |                     | - Linux kernel PHC subsystem |
>  |                                                |                     | - linuxptp (userspace)       |
>  |------------------------------------------------+---------------------+------------------------------|

An alternate implementation of the userspace portion of gPTP is also available at [1].

>  | Forwarding and Queuing Enhancements            | 802.1Q-2014 sec. 34 | RFC posted (this thread)     |
>  | for Time-Sensitive Streams (FQTSS)             |                     |                              |
>  |------------------------------------------------+---------------------+------------------------------|
>  | Stream Reservation Protocol (SRP)              | 802.1Q-2014 sec. 35 | in Open-AVB [1]              |
>  |------------------------------------------------+---------------------+------------------------------|
>  | Audio Video Transport Protocol (AVTP)          | IEEE 1722-2011      | DNE                          |
>  |------------------------------------------------+---------------------+------------------------------|
>  | Audio/Video Device Discovery, Enumeration,     | IEEE 1722.1-2013    | jdksavdecc-c [2]             |
>  | Connection Management and Control (AVDECC)     |                     |                              |
>  | AVDECC Connection Management Protocol (ACMP)   |                     |                              |
>  | AVDECC Enumeration and Control Protocol (AECP) |                     |                              |
>  | MAC Address Acquisition Protocol (MAAP)        |                     | in Open-AVB                  |
>  |------------------------------------------------+---------------------+------------------------------|

All of the above are available to some degree in the AVTP Pipeline part of [1], specifically at this
location: https://github.com/AVnu/OpenAvnu/tree/master/lib/avtp_pipeline

The code is very modular and configurable, although some parts are in better shape than others. The AVTP
portion can use the custom userspace driver for the i210, which can be configured to use launch scheduling,
or it can use standard kernel interfaces via sendmsg or PACKET_MMAP. It runs as-is when configured for
standard interfaces with any network hardware that supports gPTP. I previously implemented a CMSG-based
launch time scheduling mechanism like the one you have proposed, and I have a socket backend for it that
could easily be ported to your proposal. It is not part of the repository yet since there's no kernel
support for it outside of my prototype and your RFC.

It is currently tied to the OpenAvnu gPTP daemon rather than linuxptp, as it uses a shared memory interface
to get the current rate-ratio and offset information between the various clocks. There may be better ways
to do this, but that's how the initial port of the codebase was done. It would be nice to get it working
with linuxptp's userspace tools at some point as well, though.

The libraries under avtp_pipeline are designed to be used separately, but a simple integrated application
is provided and is built by the CI system.

In addition to OpenAvnu, Renesas has a number of github repositories with what looks like a fairly
complete media streaming system:

https://github.com/renesas-rcar/avb-mse
https://github.com/renesas-rcar/avb-streaming
https://github.com/renesas-rcar/avb-applications

I haven't examined them in great detail yet, though.


>  | Frame Preemption                               | P802.1Qbu           | DNE                          |
>  | Scheduled Traffic                              | P802.1Qbv           | RFC posted (SO_TXTIME)       |
>  | SRP Enhancements and Performance Improvements  | P802.1Qcc           | DNE                          |
> 
>  DNE = Does Not Exist (to my knowledge)

Although your SO_TXTIME proposal could certainly form the basis of an endpoint's implementation of Qbv, I
think it is a stretch to consider it a Qbv implementation in itself, if that's what you mean by this.

I have been working with colleagues on some experiments relating to a Linux-controlled DSA switch
(a Marvell Topaz) that are a part of this effort in TSN: 

http://ieee802.org/1/files/public/docs2017/tsn-cgunther-802-3cg-multidrop-0917-v01.pdf

The proper interfaces for Qbv configuration and for managing switch-level PTP timestamps are not yet
in place, so there's nothing even at the RFC stage to present yet, but Qbv-capable Linux-managed switch
hardware is available and we hope to get some reusable code published even if it's not yet ready to be
integrated into the kernel.

> 
> 1. https://github.com/Avnu/OpenAvnu
> 
>    (DISCLAIMER from the website:)
> 
>    It is planned to eventually include the various packet encapsulation types,
>    protocol discovery daemons, libraries to convert media clocks to AVB clocks
>    and vice versa, and drivers.
> 
>    This repository does not include all components required to build a full
>    production AVB/TSN system (e.g. a turnkey solution to stream stored or live audio
>    or video content). Some simple example applications are provided which
>    illustrate the flow - but a professional Audio/Video system requires a full media stack
>    - including audio and video inputs and outputs, media processing elements, and
>    various graphical user interfaces. Various companies provide such integrated
>    solutions.

A bit of progress has been made since that was written, although it is true that it's still not
quite complete and certainly not turnkey. The most glaring absence at the moment is the media
clock recovery portion of AVTP, but I am actively working on this.

> 
> 2. https://github.com/jdkoftinoff/jdksavdecc-c

This is pulled in as a dependency of the AVDECC code in OpenAvnu; it's used in the command line driven
controller, but not in the avtp_pipeline code that implements the endpoint AVDECC behavior. I don't think
either is complete by any means, but they are complete enough to be mostly compliant and usable in the
subset of behavior they support.

The bulk of the command line controller is a clone of: https://github.com/audioscience/avdecc-lib 

Things are maybe a bit farther along than they seemed, but there is still important kernel work to be
done to reduce the need for out-of-tree drivers and to get everyone on the same interfaces. I plan
to be an active participant going forward.


Levi
Richard Cochran Sept. 20, 2017, 5:25 a.m. UTC | #24
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> One of the problems with OpenAVNU is that it's too coupled with the i210
> NIC. One of the things we want is to decouple OpenAVNU from the
> controller.

Yes, I want that, too.

> The way we thought best was to propose interfaces (that
> would work along side to the Linux networking stack) as close as
> possible to what the current standards define, that means the IEEE
> 802.1Q family of specifications, in the hope that network controller
> vendors would also look at the specifications when designing their
> controllers.

These standards define the *behavior*, not the programming APIs.  Our
task as kernel developers is to invent the best interfaces for
supporting 802.1Q and other standards, the hardware capabilities, and
the widest range of applications (not just AVB).

> Our objective with the Qdiscs we are proposing (both cbs and taprio) is
> to provide a sane way to configure controllers that support TSN features
> (we were looking specifically at the IEEE specs).

I can see how your proposed Qdiscs are inspired by the IEEE standards.
However, in the case of time based transmission, I think there is a
better way to do it, namely with SO_TXTIME (which BTW was originally
proposed by Eric Mann).
 
> After we have some rough consensus on the interfaces to use, then we can
> start working on OpenAVNU.

Did you see my table in the other mail?  Any comments?

> (Sorry if I am being annoying here, but the idea of an opaque schedule
> is not ours, that comes from the people who wrote the Qbv specification)

The schedule is easy to implement using SO_TXTIME.
 
> I have a question, what about a controller that doesn't provide a way to
> set a per-packet transmission time, but it supports Qbv/Qbu. What would
> be your proposal to configure it?

SO_TXTIME will have a generic SW fallback.

BTW, regarding the i210, there is no sensible way to configure both
CBS and time based transmission at the same time.  The card performs a
logical AND to make the launch decision.  The effect of this is that
each and every packet needs a LaunchTime, and the driver would be
forced to guess the time for a packet before entering it into its
queue.

So if we end up merging CBS and SO_TXTIME, then we'll have to make
them exclusive of each other (in the case of the i210) and manage the
i210 queue configurations correctly.

> (I think LaunchTime is something specific to the i210, right?)

To my knowledge yes.  However, if TSN does take hold, then other MAC
vendors will copy it.

Thanks,
Richard
Richard Cochran Sept. 20, 2017, 5:49 a.m. UTC | #25
On Tue, Sep 19, 2017 at 11:17:54PM -0600, levipearson@gmail.com wrote:
> In addition to OpenAvnu, Renesas has a number of github repositories with what looks like a fairly
> complete media streaming system:

Is it a generic stack or a set of hacks for their HW?

> Although your SO_TXTIME proposal could certainly form the basis of an endpoint's implementation of Qbv, I
> think it is a stretch to consider it a Qbv implementation in itself, if that's what you mean by this.

No, that is not what I meant.  We need some minimal additional kernel
support in order to fully implement the TSN family of standards.  Of
course, the bulk will have to be done in user space.  It would be a
mistake to cram the stuff that belongs in userland into the kernel.

Looking at the table, and reading your descriptions of the state of
OpenAVB, I remained convinced that the kernel needs only three
additions:

1. SO_TXTIME
2. CBS Qdisc
3. ALSA support for DAC clock control (but that is another story)

> The proper interfaces for the Qbv configuration and managing of switch-level PTP timestamps are not yet
> in place, so there's nothing even at RFC stage to present yet, but Qbv-capable Linux-managed switch
> hardware is available and we hope to get some reusable code published even if it's not yet ready to be
> integrated in the kernel.

Right, configuring Qbv in an attached DSA switch needs its own
interface.

Regarding PHC support for DSA switches, I have something in the works
to be published soon.

> A bit of progress has been made since that was written, although it is true that it's still not
> quite complete and certainly not turnkey.

So OpenAVB is neither complete nor turnkey.  That was my impression,
too.

> Things are maybe a bit farther along than they seemed, but there is still important kernel work to be
> done to reduce the need for out-of-tree drivers and to get everyone on the same interfaces. I plan
> to be an active participant going forward.

You mentioned a couple of different kernel things you implemented.
I would encourage you to post the work already done.

Thanks,
Richard
Richard Cochran Sept. 20, 2017, 5:56 a.m. UTC | #26
On Tue, Sep 19, 2017 at 07:59:11PM -0600, levipearson@gmail.com wrote:
> If some endpoint device shows up with direct Qbv support, this interface would
> probably work well there too, although a talker would need to be able to
> schedule its transmits pretty precisely to achieve the lowest possible latency.

This is an argument for SO_TXTIME.

> One concern here is calling the base-time parameter an interval; it's really
> an absolute time with respect to the PTP timescale. Good documentation will
> be important to this one, since the specification discusses some subtleties
> regarding the impact of different time values chosen here.
> 
> The format for specifying the actual intervals such as cycle-time could prove
> to be an important detail as well; Qbv specifies cycle-time as a ratio of two
> integers expressed in seconds, while extension-time is specified as an integer
> number of nanoseconds.
> 
> Precision with the cycle-time is especially important, since base-time can be
> almost arbitrarily far in the past or future, and any given cycle start should
> be calculable from the base-time plus/minus some integer multiple of cycle-
> time.

The above three points also.

Thanks,
Richard
Richard Cochran Sept. 20, 2017, 5:58 a.m. UTC | #27
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> (I think LaunchTime is something specific to the i210, right?)

Levi just told us:

   Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8
   parts) support Qav shaping as well as scheduled launch
   functionality;

Thanks,
Richard
Jesus Sanchez-Palencia Sept. 20, 2017, 9:29 p.m. UTC | #28
Hi,


On 09/19/2017 10:49 PM, Richard Cochran wrote:
(...)

> 
> No, that is not what I meant.  We need some minimal additional kernel
> support in order to fully implement the TSN family of standards.  Of
> course, the bulk will have to be done in user space.  It would be a
> mistake to cram the stuff that belongs in userland into the kernel.
> 
> Looking at the table, and reading your descriptions of the state of
> OpenAVB, I remained convinced that the kernel needs only three
> additions:
> 
> 1. SO_TXTIME
> 2. CBS Qdisc
> 3. ALSA support for DAC clock control (but that is another story)


We'll be posting the CBS v1 series for review soon.

The current SO_TXTIME RFC for the purpose of Launchtime looks great, and we are
looking forward to the v1 + its companion qdisc so we can test, review and
provide feedback.

We are still under the impression that a config interface for HW offload of the
Qbv / Qbu configs will be needed, but we'll be deferring the 'taprio' proposal
until NICs (end stations) that support these standards are available. We can
revisit it if that ever happens, and if it's still needed, but then taking
SO_TXTIME (and its related qdisc) into account.

Thanks everyone for all the feedback so far.

Regards,
Jesus
Jesus Sanchez-Palencia Oct. 18, 2017, 10:37 p.m. UTC | #29
Hi Richard,


On 09/19/2017 10:25 PM, Richard Cochran wrote:
(...)
>  
>> I have a question, what about a controller that doesn't provide a way to
>> set a per-packet transmission time, but it supports Qbv/Qbu. What would
>> be your proposal to configure it?
> 
> SO_TXTIME will have a generic SW fallback.
> 
> BTW, regarding the i210, there is no sensible way to configure both
> CBS and time based transmission at the same time.  The card performs a
> logical AND to make the launch decision.  The effect of this is that
> each and every packet needs a LaunchTime, and the driver would be
> forced to guess the time for a packet before entering it into its
> queue.
> 
> So if we end up merging CBS and SO_TXTIME, then we'll have to make
> them exclusive of each other (in the case of the i210) and manage the
> i210 queue configurations correctly.
> 

I've run some quick tests here with launch time enabled on the i210 plus our
cbs patchset. When valid launch times are set on each packet you still get the
expected behavior, so I'm not sure we should just make them exclusive of each
other.

I also did some tests where valid launch times are not set, using your idea
from above: the driver calculates a valid launch time (i.e. current NIC time +
X ns, varying X across tests) for packets that didn't have one set by the
user, and I wasn't too happy with its reliability. It could definitely be
improved, but it has left me wondering: instead, what about documenting that
if you enable TXTIME, then you *must* provide a valid launch time for all
packets on the affected traffic classes?

With the SO_TXTIME qdisc idea in place, that could even be enforced before
packets were enqueued into the netdevice.


Regards,
Jesus
Richard Cochran Oct. 19, 2017, 8:39 p.m. UTC | #30
On Wed, Oct 18, 2017 at 03:37:35PM -0700, Jesus Sanchez-Palencia wrote:
> I also did some tests with when you don't set valid launch times, but here using
> your idea from above, so with the driver calculating a valid launch time (i.e.
> current NIC time + X ns, varying X across tests) for packets that didn't have it
> set by the user, and I wasn't too happy with its reliability. It could
> definitely be improved, but it has left me wondering: instead, what about
> documenting that if you enable TXTIME, then you *must* provide a valid Launch
> time for all packets on traffic classes that are affected?

If txtime is enabled, then CBS is pointless because the txtime already
specifies the bandwidth implicitly.

The problem is that when one program uses txtime and another uses CBS,
the CBS user will experience badly skewed performance.

Thanks,
Richard
Jesus Sanchez-Palencia Oct. 23, 2017, 5:18 p.m. UTC | #31
Hi,

On 10/19/2017 01:39 PM, Richard Cochran wrote:
> On Wed, Oct 18, 2017 at 03:37:35PM -0700, Jesus Sanchez-Palencia wrote:
>> I also did some tests with when you don't set valid launch times, but here using
>> your idea from above, so with the driver calculating a valid launch time (i.e.
>> current NIC time + X ns, varying X across tests) for packets that didn't have it
>> set by the user, and I wasn't too happy with its reliability. It could
>> definitely be improved, but it has left me wondering: instead, what about
>> documenting that if you enable TXTIME, then you *must* provide a valid Launch
>> time for all packets on traffic classes that are affected?
> 
> If txtime is enabled, then CBS is pointless because the txtime already
> specifies the bandwidth implicitly.


Assuming there is no "interfering" traffic on that traffic class, yes.
Otherwise, CBS could be configured just to ensure that outbound traffic
never goes beyond the reserved bandwidth.


> 
> The problem is when one program uses txtime and another uses CBS, then
> the CBS user will experience really wrong performance.


Good point. We'll need to adjust the launch time for controllers that behave
like the i210 then, imo.


Thanks,
Jesus


> 
> Thanks,
> Richard
>