[OpenWrt-Devel,RFC,v4] lantiq: IRQ balancing, ethernet driver, wave300

Message ID 40efd247-c72d-c341-de31-b46ac9b3ad69@gmail.com
State RFC

Commit Message

Petr Cvek March 14, 2019, 5:46 a.m. UTC
Hello again,

I've managed to enhance a few drivers for the lantiq platform. They are
still in an ugly, commented form (the ethernet part especially), but I
need some hints before the final version. The patches are based on
kernel 4.14.99. Copy them into target/linux/lantiq/patches-4.14 (after
removing any of my previous patches).

The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
setup; normal/reverse direction in iperf3). The iperf3 benchmark (2
passes for both the vanilla and the changed versions) together with the
script is in the attachment.

1) IRQ with SMP and balancing support:

	0901-add-icu-smp-support.patch
	0902-enable-external-irqs-for-second-vpe.patch
	0903-add-icu1-node-for-smp.patch

As requested, I've changed the patch heavily. The original locking from
the k3b source code (probably from UGW) didn't work, and under heavy
load the system could freeze (an smp affinity change during irq
handling). This version fixes that by using generic raw spinlocks with
irqs disabled.

The SMP IRQ now works in such a way that before every irq_enable (which
serves as unmask too) the VPE is switched. This can be limited by
writing into /proc/irq/X/smp_affinity (it could possibly be balanced
from userspace too).
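
One way to picture the rotation is as a round-robin pick over the VPEs
allowed by the affinity mask. The following is a minimal userspace
sketch, not code from the patch: next_vpe and the plain bitmask
representation are hypothetical, and the real driver may choose the
next VPE differently.

```c
#include <assert.h>

/*
 * Hypothetical helper: return the next VPE after `current` whose bit is
 * set in `affinity_mask`, wrapping around after `nr_vpes - 1`.
 * Returns `current` unchanged when the mask allows no VPE at all.
 */
static int next_vpe(unsigned int affinity_mask, int current, int nr_vpes)
{
	int step;

	assert(nr_vpes > 0 && current >= 0 && current < nr_vpes);

	for (step = 1; step <= nr_vpes; step++) {
		int candidate = (current + step) % nr_vpes;

		if (affinity_mask & (1u << candidate))
			return candidate;
	}
	return current;
}
```

With a mask of 3 (both VPEs allowed) successive calls alternate between
VPE 0 and VPE 1; with a mask of 1 the interrupt stays on VPE 0.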

I've rewritten the device tree reg fields so there are only 2 arrays
now, one per ICU controller. The original one-per-module layout was
redundant, as the ranges were contiguous. The modules of a single ICU
are now explicitly computed in a macro:

	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))

Before, there was a pointer for every 0x28 block (there shouldn't be a
speed penalty, only a multiplication and an addition for every
register access).
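
The address arithmetic behind these macros can be modelled in plain C.
This is a minimal userspace sketch, not the patch itself: icu_reg_addr
is a hypothetical helper name and the base address is whatever the
caller passes in. Only the 0x28 module stride and the module range 0-4
come from this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one interrupt module's register block, from the macros above. */
#define ICU_MODULE_STRIDE 0x28u
/* Module numbers 0-4 are mentioned in the mail, so 5 modules per ICU. */
#define ICU_MODULES 5u

/*
 * Hypothetical helper: byte address of register offset `reg` in module
 * `m` of the ICU whose register window starts at `icu_base`.
 */
static inline uintptr_t icu_reg_addr(uintptr_t icu_base, unsigned int m,
				     unsigned int reg)
{
	assert(m < ICU_MODULES);
	return icu_base + (uintptr_t)m * ICU_MODULE_STRIDE + reg;
}
```

Each register access therefore costs one multiplication and one
addition instead of a lookup in a per-module pointer table.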

Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
as "IM0" (module) was confusing (the real module number 0-4 was a part
of the macro).

The code is written in a way that it should work fine in a uniprocessor
configuration (as the for_each_present_cpu etc. macros will iterate
over a single VPE on a uniprocessor). I haven't tested a build without
CONFIG_SMP yet, but I did check it with the "nosmp" kernel parameter.
It works.

Anyway, please test if you have a board where the second VPE is used
for FXS.

The new device tree structure is now incompatible with the old version
of the driver (and the old device tree with the new driver too). It
seems the icu driver is used in the Danube, AR9, AmazonSE and Falcon
chipsets too. I don't know the hardware of these boards, so before a
final patch I would like to know if they have a second ICU too (at
offset 0x80300).

More development could probably be done on the locking, as only
accesses within a single module (= 1 set of registers) can cause a race
condition. But as the most contended interrupts are in the same module,
there won't be much speed increase IMO. I can add it if requested (just
a spinlock array and some lookup code).
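
A per-module lock array with its lookup could look roughly like the
userspace model below. It is only a sketch under assumptions: the 32
interrupt lines per module are not stated in this mail, all names are
hypothetical, and C11 atomic_flag stands in for the kernel's raw
spinlock (there is no irq disabling here).

```c
#include <assert.h>
#include <stdatomic.h>

#define ICU_MODULES      5u  /* modules 0-4, as in the mail */
#define IRQS_PER_MODULE 32u  /* assumption: 32 interrupt lines per module */

/* One tiny spinlock per module; userspace stand-in for raw_spinlock_t. */
static atomic_flag module_lock[ICU_MODULES] = {
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Lookup: which module (and therefore which lock) covers this irq line. */
static unsigned int irq_to_module(unsigned int hwirq)
{
	assert(hwirq < ICU_MODULES * IRQS_PER_MODULE);
	return hwirq / IRQS_PER_MODULE;
}

/* Spin until the lock of the module owning `hwirq` is taken. */
static unsigned int module_lock_acquire(unsigned int hwirq)
{
	unsigned int m = irq_to_module(hwirq);

	while (atomic_flag_test_and_set_explicit(&module_lock[m],
						 memory_order_acquire))
		; /* busy-wait */
	return m;
}

/* Release the lock of module `m`. */
static void module_lock_release(unsigned int m)
{
	atomic_flag_clear_explicit(&module_lock[m], memory_order_release);
}
```

Two interrupts in different modules would then never wait on each
other, while two in the same module still serialize.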

2) Reworked lantiq xrx200 ethernet driver:

	0904-backport-vanilla-eth-driver.patch
	0905-increase-dma-descriptors.patch
	0906-increase-dma-burst-size.patch

The code is still ugly, but stable now. There is fragmented skb support
and napi polling. The DMA ring buffer was enlarged so it can handle
faster speeds, and I've fixed some code weirdness. I can split the
changes into separate patches in the future.
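
Why a larger ring helps can be seen in a toy model of a descriptor ring
drained by a budgeted poll. This is a generic sketch, not the driver
code: the ring size, the struct and the function names are all
hypothetical, and real descriptors carry buffer addresses and ownership
bits that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ring size; the real descriptor count is set by the patch. */
#define RING_SIZE 8u

struct desc_ring {
	unsigned int head; /* next slot the producer (hardware) fills */
	unsigned int tail; /* next slot the consumer (driver poll) drains */
};

/* Number of filled descriptors waiting for the consumer. */
static unsigned int ring_used(const struct desc_ring *r)
{
	return (r->head + RING_SIZE - r->tail) % RING_SIZE;
}

/* Producer side: returns false when the ring is full (frame is dropped). */
static bool ring_produce(struct desc_ring *r)
{
	if (ring_used(r) == RING_SIZE - 1)
		return false;
	r->head = (r->head + 1) % RING_SIZE;
	return true;
}

/* Consumer side: drain at most `budget` descriptors, NAPI-poll style. */
static unsigned int ring_poll(struct desc_ring *r, unsigned int budget)
{
	unsigned int done = 0;

	while (done < budget && ring_used(r) > 0) {
		r->tail = (r->tail + 1) % RING_SIZE;
		done++;
	}
	return done;
}
```

The more slots the ring has, the longer a burst it can absorb before
ring_produce starts failing while the poll has not yet run.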

I didn't test the ICU and eth patches separately, but I've tested the
ethernet driver on a single VPE only (by setting smp affinity and
nosmp). This version of the ethernet driver was used for root over NFS
on the debug setup for about two weeks (without problems).

Tell me if we should pursue a second DMA channel to the PPE so both
VPEs can send frames at the same time.

3) WAVE300

In the past two weeks I've tried to fix up a mash-up of various
versions of the wave300 wifi driver (there are partial versions in the
GPL sources from router vendors), and I've managed to put the driver
into a "not immediately crashing" mode. If you are interested in the
development, there is a thread on the openwrt forum. The source repos
are here:

https://repo.or.cz/wave300.git
https://repo.or.cz/wave300_rflib.git

(the second one must be copied into the first one)

The driver will often crash when meeting an unknown packet, a request
for encryption (no encryption support), an unusual combination of
configuration options, or just on module unloading. The code is
_really_ ugly and it will serve only as a hardware specification for
the development of a better GPL driver. If you want to help or you have
some tips, you can join the forum (there are links to firmwares and
intensive research of the available source code from vendors).

Links:
https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25

Petr
+ : ':::::::[' configuration vanilla ']:::::::' :
+ iperf3 -c 10.0.0.80
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 51814 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec   27   72.1 KBytes       
[  4]   1.00-2.00   sec  20.6 MBytes   173 Mbits/sec   29   70.7 KBytes       
[  4]   2.00-3.00   sec  20.8 MBytes   174 Mbits/sec   35   60.8 KBytes       
[  4]   3.00-4.00   sec  20.8 MBytes   174 Mbits/sec   29   73.5 KBytes       
[  4]   4.00-5.00   sec  20.8 MBytes   174 Mbits/sec   32   70.7 KBytes       
[  4]   5.00-6.00   sec  20.7 MBytes   174 Mbits/sec   35   69.3 KBytes       
[  4]   6.00-7.00   sec  20.8 MBytes   174 Mbits/sec   36   60.8 KBytes       
[  4]   7.00-8.00   sec  20.8 MBytes   175 Mbits/sec   29   59.4 KBytes       
[  4]   8.00-9.00   sec  20.8 MBytes   175 Mbits/sec   41   46.7 KBytes       
[  4]   9.00-10.00  sec  20.8 MBytes   175 Mbits/sec   28   50.9 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec  321             sender
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 51862 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  9.63 MBytes  80.7 Mbits/sec                  
[  4]   1.00-2.00   sec  9.65 MBytes  81.0 Mbits/sec                  
[  4]   2.00-3.00   sec  9.52 MBytes  79.9 Mbits/sec                  
[  4]   3.00-4.00   sec  9.69 MBytes  81.3 Mbits/sec                  
[  4]   4.00-5.00   sec  9.68 MBytes  81.2 Mbits/sec                  
[  4]   5.00-6.00   sec  9.66 MBytes  81.0 Mbits/sec                  
[  4]   6.00-7.00   sec  9.68 MBytes  81.2 Mbits/sec                  
[  4]   7.00-8.00   sec  9.70 MBytes  81.4 Mbits/sec                  
[  4]   8.00-9.00   sec  9.69 MBytes  81.3 Mbits/sec                  
[  4]   9.00-10.00  sec  9.79 MBytes  82.1 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  97.0 MBytes  81.4 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  97.0 MBytes  81.4 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 51957 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  16.4 MBytes   138 Mbits/sec  2101  
[  4]   1.00-2.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   2.00-3.00   sec  17.9 MBytes   150 Mbits/sec  2285  
[  4]   3.00-4.00   sec  17.9 MBytes   150 Mbits/sec  2292  
[  4]   4.00-5.00   sec  17.9 MBytes   150 Mbits/sec  2287  
[  4]   5.00-6.00   sec  17.9 MBytes   150 Mbits/sec  2291  
[  4]   6.00-7.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec  2291  
[  4]   8.00-9.00   sec  17.8 MBytes   150 Mbits/sec  2282  
[  4]   9.00-10.00  sec  17.9 MBytes   150 Mbits/sec  2292  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   177 MBytes   149 Mbits/sec  136434.385 ms  1349/1417 (95%)  
[  4] Sent 1417 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 46317 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.7 MBytes   124 Mbits/sec  0.077 ms  0/1885 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.072 ms  0/1942 (0%)  
[  4]   2.00-3.00   sec  15.0 MBytes   126 Mbits/sec  0.074 ms  0/1924 (0%)  
[  4]   3.00-4.00   sec  14.3 MBytes   120 Mbits/sec  0.080 ms  0/1825 (0%)  
[  4]   4.00-5.00   sec  14.4 MBytes   120 Mbits/sec  0.079 ms  0/1837 (0%)  
[  4]   5.00-6.00   sec  14.8 MBytes   124 Mbits/sec  0.065 ms  0/1888 (0%)  
[  4]   6.00-7.00   sec  15.3 MBytes   128 Mbits/sec  0.076 ms  0/1956 (0%)  
[  4]   7.00-8.00   sec  15.2 MBytes   128 Mbits/sec  0.095 ms  0/1948 (0%)  
[  4]   8.00-9.00   sec  15.1 MBytes   127 Mbits/sec  0.092 ms  0/1932 (0%)  
[  4]   9.00-10.00  sec  15.1 MBytes   127 Mbits/sec  0.095 ms  0/1938 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   149 MBytes   125 Mbits/sec  0.085 ms  0/19082 (0%)  
[  4] Sent 19082 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 48172 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  58.5 MBytes   491 Mbits/sec  7494  
[  4]   1.00-2.00   sec  60.6 MBytes   508 Mbits/sec  7756  
[  4]   2.00-3.00   sec  58.7 MBytes   492 Mbits/sec  7508  
[  4]   3.00-4.00   sec  60.2 MBytes   505 Mbits/sec  7710  
[  4]   4.00-5.00   sec  59.0 MBytes   495 Mbits/sec  7556  
[  4]   5.00-6.00   sec  60.5 MBytes   508 Mbits/sec  7744  
[  4]   6.00-7.00   sec  58.7 MBytes   492 Mbits/sec  7508  
[  4]   7.00-8.00   sec  59.1 MBytes   496 Mbits/sec  7565  
[  4]   8.00-9.00   sec  60.4 MBytes   507 Mbits/sec  7730  
[  4]   9.00-10.00  sec  59.9 MBytes   502 Mbits/sec  7664  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   596 MBytes   500 Mbits/sec  2051749.337 ms  64268/64294 (1e+02%)  
[  4] Sent 64294 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 35361 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.3 MBytes   120 Mbits/sec  0.097 ms  0/1830 (0%)  
[  4]   1.00-2.00   sec  14.3 MBytes   120 Mbits/sec  0.101 ms  0/1830 (0%)  
[  4]   2.00-3.00   sec  14.3 MBytes   120 Mbits/sec  0.072 ms  0/1827 (0%)  
[  4]   3.00-4.00   sec  14.2 MBytes   119 Mbits/sec  0.081 ms  0/1819 (0%)  
[  4]   4.00-5.00   sec  14.3 MBytes   120 Mbits/sec  0.070 ms  0/1834 (0%)  
[  4]   5.00-6.00   sec  14.3 MBytes   120 Mbits/sec  0.085 ms  0/1833 (0%)  
[  4]   6.00-7.00   sec  14.3 MBytes   120 Mbits/sec  0.082 ms  0/1835 (0%)  
[  4]   7.00-8.00   sec  14.3 MBytes   120 Mbits/sec  0.109 ms  0/1836 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.080 ms  0/1822 (0%)  
[  4]   9.00-10.00  sec  14.3 MBytes   120 Mbits/sec  0.090 ms  0/1825 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   143 MBytes   120 Mbits/sec  0.104 ms  0/18298 (0%)  
[  4] Sent 18298 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 53231 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec   107 MBytes   902 Mbits/sec  13759  
[  4]   1.00-2.00   sec   107 MBytes   896 Mbits/sec  13675  
[  4]   2.00-3.00   sec   107 MBytes   901 Mbits/sec  13753  
[  4]   3.00-4.00   sec   107 MBytes   898 Mbits/sec  13700  
[  4]   4.00-5.00   sec   107 MBytes   902 Mbits/sec  13759  
[  4]   5.00-6.00   sec   108 MBytes   902 Mbits/sec  13762  
[  4]   6.00-7.00   sec   107 MBytes   899 Mbits/sec  13719  
[  4]   7.00-8.00   sec   108 MBytes   902 Mbits/sec  13760  
[  4]   8.00-9.00   sec   107 MBytes   901 Mbits/sec  13753  
[  4]   9.00-10.00  sec   107 MBytes   902 Mbits/sec  13756  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  1.05 GBytes   900 Mbits/sec  5762140.265 ms  210/220 (95%)  
[  4] Sent 220 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 34296 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.3 MBytes   120 Mbits/sec  0.084 ms  0/1835 (0%)  
[  4]   1.00-2.00   sec  14.3 MBytes   120 Mbits/sec  0.075 ms  0/1835 (0%)  
[  4]   2.00-3.00   sec  14.5 MBytes   122 Mbits/sec  0.062 ms  0/1858 (0%)  
[  4]   3.00-4.00   sec  15.1 MBytes   127 Mbits/sec  0.060 ms  0/1935 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.076 ms  0/1958 (0%)  
[  4]   5.00-6.00   sec  14.5 MBytes   122 Mbits/sec  0.078 ms  0/1861 (0%)  
[  4]   6.00-7.00   sec  14.4 MBytes   120 Mbits/sec  0.100 ms  0/1837 (0%)  
[  4]   7.00-8.00   sec  14.3 MBytes   120 Mbits/sec  0.098 ms  0/1835 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.085 ms  0/1821 (0%)  
[  4]   9.00-10.00  sec  14.3 MBytes   120 Mbits/sec  0.110 ms  0/1825 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   145 MBytes   122 Mbits/sec  0.101 ms  0/18606 (0%)  
[  4] Sent 18606 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 52130 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 52132 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 52134 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  7.47 MBytes  62.6 Mbits/sec   73   17.0 KBytes       
[  6]   0.00-1.00   sec  7.21 MBytes  60.5 Mbits/sec   78   19.8 KBytes       
[  9]   0.00-1.00   sec  7.14 MBytes  59.9 Mbits/sec   76   31.1 KBytes       
[SUM]   0.00-1.00   sec  21.8 MBytes   183 Mbits/sec  227             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  7.95 MBytes  66.7 Mbits/sec   61   12.7 KBytes       
[  6]   1.00-2.00   sec  5.84 MBytes  49.0 Mbits/sec   99   35.4 KBytes       
[  9]   1.00-2.00   sec  7.08 MBytes  59.4 Mbits/sec   78   32.5 KBytes       
[SUM]   1.00-2.00   sec  20.9 MBytes   175 Mbits/sec  238             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  6.09 MBytes  51.1 Mbits/sec   73   31.1 KBytes       
[  6]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec   64   22.6 KBytes       
[  9]   2.00-3.00   sec  6.09 MBytes  51.1 Mbits/sec   81   18.4 KBytes       
[SUM]   2.00-3.00   sec  21.1 MBytes   177 Mbits/sec  218             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  6.71 MBytes  56.3 Mbits/sec   80   11.3 KBytes       
[  6]   3.00-4.00   sec  8.26 MBytes  69.3 Mbits/sec   76   17.0 KBytes       
[  9]   3.00-4.00   sec  6.28 MBytes  52.7 Mbits/sec   77   42.4 KBytes       
[SUM]   3.00-4.00   sec  21.3 MBytes   178 Mbits/sec  233             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  6.59 MBytes  55.3 Mbits/sec   94   12.7 KBytes       
[  6]   4.00-5.00   sec  7.58 MBytes  63.6 Mbits/sec   63   28.3 KBytes       
[  9]   4.00-5.00   sec  6.84 MBytes  57.3 Mbits/sec   62   11.3 KBytes       
[SUM]   4.00-5.00   sec  21.0 MBytes   176 Mbits/sec  219             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  8.76 MBytes  73.5 Mbits/sec   57   22.6 KBytes       
[  6]   5.00-6.00   sec  6.28 MBytes  52.6 Mbits/sec   80   38.2 KBytes       
[  9]   5.00-6.00   sec  6.28 MBytes  52.6 Mbits/sec   90   7.07 KBytes       
[SUM]   5.00-6.00   sec  21.3 MBytes   179 Mbits/sec  227             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  7.33 MBytes  61.5 Mbits/sec   72   18.4 KBytes       
[  6]   6.00-7.00   sec  7.02 MBytes  58.9 Mbits/sec   66   35.4 KBytes       
[  9]   6.00-7.00   sec  6.77 MBytes  56.8 Mbits/sec   67   17.0 KBytes       
[SUM]   6.00-7.00   sec  21.1 MBytes   177 Mbits/sec  205             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  8.45 MBytes  70.9 Mbits/sec   72   25.5 KBytes       
[  6]   7.00-8.00   sec  6.71 MBytes  56.3 Mbits/sec   82   35.4 KBytes       
[  9]   7.00-8.00   sec  5.90 MBytes  49.5 Mbits/sec   74   17.0 KBytes       
[SUM]   7.00-8.00   sec  21.1 MBytes   177 Mbits/sec  228             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  6.46 MBytes  54.2 Mbits/sec   77   36.8 KBytes       
[  6]   8.00-9.00   sec  6.90 MBytes  57.9 Mbits/sec   78   11.3 KBytes       
[  9]   8.00-9.00   sec  7.89 MBytes  66.2 Mbits/sec   68   11.3 KBytes       
[SUM]   8.00-9.00   sec  21.3 MBytes   178 Mbits/sec  223             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  6.59 MBytes  55.3 Mbits/sec   79   38.2 KBytes       
[  6]   9.00-10.00  sec  8.76 MBytes  73.5 Mbits/sec   58   24.0 KBytes       
[  9]   9.00-10.00  sec  5.72 MBytes  48.0 Mbits/sec   77   7.07 KBytes       
[SUM]   9.00-10.00  sec  21.1 MBytes   177 Mbits/sec  214             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  72.4 MBytes  60.7 Mbits/sec  738             sender
[  4]   0.00-10.00  sec  72.0 MBytes  60.4 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  73.5 MBytes  61.7 Mbits/sec  744             sender
[  6]   0.00-10.00  sec  73.2 MBytes  61.4 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  66.0 MBytes  55.4 Mbits/sec  750             sender
[  9]   0.00-10.00  sec  65.6 MBytes  55.1 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   212 MBytes   178 Mbits/sec  2232             sender
[SUM]   0.00-10.00  sec   211 MBytes   177 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 52178 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 52180 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 52182 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  4.03 MBytes  33.8 Mbits/sec                  
[  6]   0.00-1.00   sec  2.81 MBytes  23.6 Mbits/sec                  
[  9]   0.00-1.00   sec  2.79 MBytes  23.4 Mbits/sec                  
[SUM]   0.00-1.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   1.00-2.00   sec  3.22 MBytes  27.0 Mbits/sec                  
[SUM]   1.00-2.00   sec  9.72 MBytes  81.5 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   2.00-3.00   sec  3.30 MBytes  27.6 Mbits/sec                  
[SUM]   2.00-3.00   sec  9.80 MBytes  82.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   3.00-4.00   sec  3.12 MBytes  26.2 Mbits/sec                  
[  9]   3.00-4.00   sec  3.19 MBytes  26.8 Mbits/sec                  
[SUM]   3.00-4.00   sec  9.57 MBytes  80.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  2.22 MBytes  18.6 Mbits/sec                  
[  6]   4.00-5.00   sec  2.38 MBytes  19.9 Mbits/sec                  
[  9]   4.00-5.00   sec  2.28 MBytes  19.2 Mbits/sec                  
[SUM]   4.00-5.00   sec  6.88 MBytes  57.7 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  3.30 MBytes  27.7 Mbits/sec                  
[  6]   5.00-6.00   sec  3.37 MBytes  28.3 Mbits/sec                  
[  9]   5.00-6.00   sec  3.18 MBytes  26.7 Mbits/sec                  
[SUM]   5.00-6.00   sec  9.85 MBytes  82.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  2.99 MBytes  25.1 Mbits/sec                  
[  6]   6.00-7.00   sec  2.88 MBytes  24.1 Mbits/sec                  
[  9]   6.00-7.00   sec  3.00 MBytes  25.2 Mbits/sec                  
[SUM]   6.00-7.00   sec  8.87 MBytes  74.4 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  3.14 MBytes  26.3 Mbits/sec                  
[  6]   7.00-8.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   7.00-8.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   7.00-8.00   sec  9.64 MBytes  80.9 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  3.25 MBytes  27.2 Mbits/sec                  
[  6]   8.00-9.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   8.00-9.00   sec  3.16 MBytes  26.5 Mbits/sec                  
[SUM]   8.00-9.00   sec  9.65 MBytes  81.0 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   9.00-10.00  sec  3.21 MBytes  26.9 Mbits/sec                  
[  9]   9.00-10.00  sec  3.22 MBytes  27.0 Mbits/sec                  
[SUM]   9.00-10.00  sec  9.68 MBytes  81.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  32.2 MBytes  27.0 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  32.2 MBytes  27.0 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  31.1 MBytes  26.1 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  31.1 MBytes  26.1 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  30.9 MBytes  25.9 Mbits/sec    0             sender
[  9]   0.00-10.00  sec  30.9 MBytes  25.9 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  94.2 MBytes  79.0 Mbits/sec    0             sender
[SUM]   0.00-10.00  sec  94.2 MBytes  79.0 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -u -b 800M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 36791 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 51969 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 39473 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[  6]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[  9]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[SUM]   0.00-1.00   sec   114 MBytes   958 Mbits/sec  14613  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   1.00-2.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  6]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  9]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[SUM]   2.00-3.00   sec   114 MBytes   958 Mbits/sec  14625  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   3.00-4.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[  6]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[  9]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[SUM]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  14592  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  6]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  9]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[SUM]   5.00-6.00   sec   114 MBytes   958 Mbits/sec  14625  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   6.00-7.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[  6]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[  9]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[SUM]   7.00-8.00   sec   114 MBytes   958 Mbits/sec  14628  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   8.00-9.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[  6]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[  9]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[SUM]   9.00-10.00  sec   114 MBytes   955 Mbits/sec  14568  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.391 ms  0/3 (0%)  
[  4] Sent 3 datagrams
[  6]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.281 ms  0/3 (0%)  
[  6] Sent 3 datagrams
[  9]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.181 ms  0/3 (0%)  
[  9] Sent 3 datagrams
[SUM]   0.00-10.00  sec  1.11 GBytes   958 Mbits/sec  9052841.285 ms  0/9 (0%)  

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R -u -b 800M
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 43263 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 49331 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 60542 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  4.92 MBytes  41.3 Mbits/sec  0.156 ms  0/630 (0%)  
[  6]   0.00-1.00   sec  4.92 MBytes  41.3 Mbits/sec  0.170 ms  0/630 (0%)  
[  9]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec  0.237 ms  0/629 (0%)  
[SUM]   0.00-1.00   sec  14.8 MBytes   124 Mbits/sec  0.188 ms  0/1889 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  4.92 MBytes  41.3 Mbits/sec  0.173 ms  0/630 (0%)  
[  6]   1.00-2.00   sec  4.91 MBytes  41.2 Mbits/sec  0.191 ms  0/629 (0%)  
[  9]   1.00-2.00   sec  4.91 MBytes  41.2 Mbits/sec  0.192 ms  0/629 (0%)  
[SUM]   1.00-2.00   sec  14.8 MBytes   124 Mbits/sec  0.185 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  4.96 MBytes  41.6 Mbits/sec  0.246 ms  0/635 (0%)  
[  6]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec  0.167 ms  0/636 (0%)  
[  9]   2.00-3.00   sec  4.95 MBytes  41.5 Mbits/sec  0.232 ms  0/634 (0%)  
[SUM]   2.00-3.00   sec  14.9 MBytes   125 Mbits/sec  0.215 ms  0/1905 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  4.97 MBytes  41.7 Mbits/sec  0.189 ms  0/636 (0%)  
[  6]   3.00-4.00   sec  4.96 MBytes  41.6 Mbits/sec  0.121 ms  0/635 (0%)  
[  9]   3.00-4.00   sec  4.97 MBytes  41.7 Mbits/sec  0.195 ms  0/636 (0%)  
[SUM]   3.00-4.00   sec  14.9 MBytes   125 Mbits/sec  0.168 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.97 MBytes  41.7 Mbits/sec  0.180 ms  0/636 (0%)  
[  6]   4.00-5.00   sec  4.97 MBytes  41.7 Mbits/sec  0.185 ms  0/636 (0%)  
[  9]   4.00-5.00   sec  4.96 MBytes  41.6 Mbits/sec  0.132 ms  0/635 (0%)  
[SUM]   4.00-5.00   sec  14.9 MBytes   125 Mbits/sec  0.166 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.178 ms  0/636 (0%)  
[  6]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.209 ms  0/636 (0%)  
[  9]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.167 ms  0/636 (0%)  
[SUM]   5.00-6.00   sec  14.9 MBytes   125 Mbits/sec  0.185 ms  0/1908 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.141 ms  0/628 (0%)  
[  6]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.211 ms  0/628 (0%)  
[  9]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.152 ms  0/629 (0%)  
[SUM]   6.00-7.00   sec  14.7 MBytes   124 Mbits/sec  0.168 ms  0/1885 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  4.92 MBytes  41.3 Mbits/sec  0.290 ms  0/630 (0%)  
[  6]   7.00-8.00   sec  4.91 MBytes  41.2 Mbits/sec  0.167 ms  0/629 (0%)  
[  9]   7.00-8.00   sec  4.91 MBytes  41.2 Mbits/sec  0.367 ms  0/629 (0%)  
[SUM]   7.00-8.00   sec  14.8 MBytes   124 Mbits/sec  0.275 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  4.93 MBytes  41.4 Mbits/sec  0.147 ms  0/631 (0%)  
[  6]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.170 ms  0/628 (0%)  
[  9]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.137 ms  0/628 (0%)  
[SUM]   8.00-9.00   sec  14.7 MBytes   124 Mbits/sec  0.151 ms  0/1887 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  4.97 MBytes  41.7 Mbits/sec  0.215 ms  0/636 (0%)  
[  6]   9.00-10.00  sec  4.98 MBytes  41.7 Mbits/sec  0.150 ms  0/637 (0%)  
[  9]   9.00-10.00  sec  4.96 MBytes  41.6 Mbits/sec  0.272 ms  0/635 (0%)  
[SUM]   9.00-10.00  sec  14.9 MBytes   125 Mbits/sec  0.212 ms  0/1908 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.227 ms  0/6335 (0%)  
[  4] Sent 6335 datagrams
[  6]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.183 ms  0/6331 (0%)  
[  6] Sent 6331 datagrams
[  9]   0.00-10.00  sec  49.4 MBytes  41.5 Mbits/sec  0.261 ms  0/6327 (0%)  
[  9] Sent 6327 datagrams
[SUM]   0.00-10.00  sec   148 MBytes   124 Mbits/sec  0.224 ms  0/18993 (0%)  

iperf Done.
+ : ':::::::[' configuration vanilla ']:::::::' :
+ iperf3 -c 10.0.0.80
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 52728 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec   24   66.5 KBytes       
[  4]   1.00-2.00   sec  20.8 MBytes   175 Mbits/sec   36   67.9 KBytes       
[  4]   2.00-3.00   sec  20.5 MBytes   172 Mbits/sec   40   52.3 KBytes       
[  4]   3.00-4.00   sec  20.7 MBytes   174 Mbits/sec   37   45.2 KBytes       
[  4]   4.00-5.00   sec  20.4 MBytes   171 Mbits/sec   27   46.7 KBytes       
[  4]   5.00-6.00   sec  20.6 MBytes   173 Mbits/sec   36   66.5 KBytes       
[  4]   6.00-7.00   sec  20.8 MBytes   174 Mbits/sec   31   56.6 KBytes       
[  4]   7.00-8.00   sec  20.8 MBytes   174 Mbits/sec   46   43.8 KBytes       
[  4]   8.00-9.00   sec  20.9 MBytes   176 Mbits/sec   31   48.1 KBytes       
[  4]   9.00-10.00  sec  20.9 MBytes   175 Mbits/sec   28   65.0 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec  336             sender
[  4]   0.00-10.00  sec   207 MBytes   174 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 52768 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  9.95 MBytes  83.5 Mbits/sec                  
[  4]   1.00-2.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   2.00-3.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   3.00-4.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   4.00-5.00   sec  13.1 MBytes   110 Mbits/sec                  
[  4]   5.00-6.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   6.00-7.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   7.00-8.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   8.00-9.00   sec  13.3 MBytes   112 Mbits/sec                  
[  4]   9.00-10.00  sec  10.2 MBytes  85.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   125 MBytes   105 Mbits/sec    0             sender
[  4]   0.00-10.00  sec   125 MBytes   105 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 40911 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  16.4 MBytes   138 Mbits/sec  2100  
[  4]   1.00-2.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   2.00-3.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   3.00-4.00   sec  17.9 MBytes   150 Mbits/sec  2292  
[  4]   4.00-5.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   5.00-6.00   sec  17.9 MBytes   150 Mbits/sec  2286  
[  4]   6.00-7.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   8.00-9.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   9.00-10.00  sec  17.9 MBytes   150 Mbits/sec  2289  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   177 MBytes   149 Mbits/sec  136432.924 ms  1354/1422 (95%)  
[  4] Sent 1422 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 45555 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.3 MBytes   128 Mbits/sec  0.077 ms  0/1960 (0%)  
[  4]   1.00-2.00   sec  15.3 MBytes   129 Mbits/sec  0.068 ms  0/1962 (0%)  
[  4]   2.00-3.00   sec  15.2 MBytes   127 Mbits/sec  0.051 ms  0/1942 (0%)  
[  4]   3.00-4.00   sec  15.2 MBytes   127 Mbits/sec  0.074 ms  0/1940 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.064 ms  0/1959 (0%)  
[  4]   5.00-6.00   sec  14.7 MBytes   123 Mbits/sec  0.093 ms  0/1879 (0%)  
[  4]   6.00-7.00   sec  15.5 MBytes   130 Mbits/sec  0.086 ms  0/1990 (0%)  
[  4]   7.00-8.00   sec  17.5 MBytes   146 Mbits/sec  0.089 ms  0/2235 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.060 ms  0/1816 (0%)  
[  4]   9.00-10.00  sec  15.0 MBytes   126 Mbits/sec  0.095 ms  0/1917 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   153 MBytes   128 Mbits/sec  0.101 ms  0/19606 (0%)  
[  4] Sent 19606 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 33642 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  59.3 MBytes   497 Mbits/sec  7585  
[  4]   1.00-2.00   sec  58.5 MBytes   491 Mbits/sec  7492  
[  4]   2.00-3.00   sec  60.0 MBytes   503 Mbits/sec  7677  
[  4]   3.00-4.00   sec  59.3 MBytes   498 Mbits/sec  7596  
[  4]   4.00-5.00   sec  60.9 MBytes   511 Mbits/sec  7794  
[  4]   5.00-6.00   sec  59.0 MBytes   495 Mbits/sec  7556  
[  4]   6.00-7.00   sec  59.7 MBytes   501 Mbits/sec  7639  
[  4]   7.00-8.00   sec  59.2 MBytes   496 Mbits/sec  7574  
[  4]   8.00-9.00   sec  60.4 MBytes   507 Mbits/sec  7736  
[  4]   9.00-10.00  sec  58.7 MBytes   493 Mbits/sec  7517  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   595 MBytes   499 Mbits/sec  1147799.599 ms  64273/64308 (1e+02%)  
[  4] Sent 64308 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 42014 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.2 MBytes   127 Mbits/sec  0.086 ms  0/1942 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.099 ms  0/1940 (0%)  
[  4]   2.00-3.00   sec  15.1 MBytes   127 Mbits/sec  0.087 ms  0/1932 (0%)  
[  4]   3.00-4.00   sec  15.0 MBytes   126 Mbits/sec  0.059 ms  0/1920 (0%)  
[  4]   4.00-5.00   sec  15.1 MBytes   127 Mbits/sec  0.070 ms  0/1931 (0%)  
[  4]   5.00-6.00   sec  15.2 MBytes   127 Mbits/sec  0.109 ms  0/1942 (0%)  
[  4]   6.00-7.00   sec  15.2 MBytes   127 Mbits/sec  0.102 ms  0/1941 (0%)  
[  4]   7.00-8.00   sec  15.2 MBytes   127 Mbits/sec  0.069 ms  0/1943 (0%)  
[  4]   8.00-9.00   sec  15.0 MBytes   126 Mbits/sec  0.074 ms  0/1926 (0%)  
[  4]   9.00-10.00  sec  15.0 MBytes   126 Mbits/sec  0.082 ms  0/1919 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   151 MBytes   127 Mbits/sec  0.089 ms  0/19342 (0%)  
[  4] Sent 19342 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 55639 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  96.9 MBytes   813 Mbits/sec  12402  
[  4]   1.00-2.00   sec   107 MBytes   897 Mbits/sec  13693  
[  4]   2.00-3.00   sec   107 MBytes   898 Mbits/sec  13701  
[  4]   3.00-4.00   sec   107 MBytes   898 Mbits/sec  13698  
[  4]   4.00-5.00   sec   107 MBytes   897 Mbits/sec  13689  
[  4]   5.00-6.00   sec   107 MBytes   896 Mbits/sec  13679  
[  4]   6.00-7.00   sec   107 MBytes   898 Mbits/sec  13710  
[  4]   7.00-8.00   sec   107 MBytes   899 Mbits/sec  13719  
[  4]   8.00-9.00   sec   107 MBytes   894 Mbits/sec  13635  
[  4]   9.00-10.00  sec   107 MBytes   899 Mbits/sec  13725  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  1.03 GBytes   889 Mbits/sec  3022016.748 ms  1257/1277 (98%)  
[  4] Sent 1277 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 54887 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.2 MBytes   127 Mbits/sec  0.087 ms  0/1942 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.092 ms  0/1945 (0%)  
[  4]   2.00-3.00   sec  15.1 MBytes   126 Mbits/sec  0.078 ms  0/1930 (0%)  
[  4]   3.00-4.00   sec  15.1 MBytes   127 Mbits/sec  0.078 ms  0/1938 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.080 ms  0/1954 (0%)  
[  4]   5.00-6.00   sec  15.3 MBytes   128 Mbits/sec  0.088 ms  0/1959 (0%)  
[  4]   6.00-7.00   sec  15.3 MBytes   129 Mbits/sec  0.084 ms  0/1961 (0%)  
[  4]   7.00-8.00   sec  15.3 MBytes   128 Mbits/sec  0.266 ms  0/1956 (0%)  
[  4]   8.00-9.00   sec  15.2 MBytes   128 Mbits/sec  0.079 ms  0/1949 (0%)  
[  4]   9.00-10.00  sec  15.1 MBytes   127 Mbits/sec  0.069 ms  0/1939 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   152 MBytes   128 Mbits/sec  0.063 ms  0/19480 (0%)  
[  4] Sent 19480 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 53060 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53062 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 53064 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  8.14 MBytes  68.3 Mbits/sec   65   11.3 KBytes       
[  6]   0.00-1.00   sec  6.65 MBytes  55.8 Mbits/sec   81   12.7 KBytes       
[  9]   0.00-1.00   sec  7.20 MBytes  60.4 Mbits/sec   60   49.5 KBytes       
[SUM]   0.00-1.00   sec  22.0 MBytes   184 Mbits/sec  206             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  6.84 MBytes  57.3 Mbits/sec   76   25.5 KBytes       
[  6]   1.00-2.00   sec  6.59 MBytes  55.3 Mbits/sec   89   24.0 KBytes       
[  9]   1.00-2.00   sec  7.83 MBytes  65.7 Mbits/sec   60   18.4 KBytes       
[SUM]   1.00-2.00   sec  21.3 MBytes   178 Mbits/sec  225             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  7.83 MBytes  65.7 Mbits/sec   78   15.6 KBytes       
[  6]   2.00-3.00   sec  6.84 MBytes  57.3 Mbits/sec   69   29.7 KBytes       
[  9]   2.00-3.00   sec  6.77 MBytes  56.8 Mbits/sec   72   19.8 KBytes       
[SUM]   2.00-3.00   sec  21.4 MBytes   180 Mbits/sec  219             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  6.52 MBytes  54.7 Mbits/sec  120   18.4 KBytes       
[  6]   3.00-4.00   sec  7.08 MBytes  59.4 Mbits/sec   90   26.9 KBytes       
[  9]   3.00-4.00   sec  6.77 MBytes  56.8 Mbits/sec   77   31.1 KBytes       
[SUM]   3.00-4.00   sec  20.4 MBytes   171 Mbits/sec  287             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  6.28 MBytes  52.6 Mbits/sec   82   21.2 KBytes       
[  6]   4.00-5.00   sec  7.15 MBytes  59.9 Mbits/sec   61   19.8 KBytes       
[  9]   4.00-5.00   sec  7.71 MBytes  64.6 Mbits/sec   61   22.6 KBytes       
[SUM]   4.00-5.00   sec  21.1 MBytes   177 Mbits/sec  204             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  8.95 MBytes  75.1 Mbits/sec   61   11.3 KBytes       
[  6]   5.00-6.00   sec  5.84 MBytes  49.0 Mbits/sec  105   39.6 KBytes       
[  9]   5.00-6.00   sec  5.10 MBytes  42.7 Mbits/sec  112   21.2 KBytes       
[SUM]   5.00-6.00   sec  19.9 MBytes   167 Mbits/sec  278             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  7.95 MBytes  66.7 Mbits/sec   78   46.7 KBytes       
[  6]   6.00-7.00   sec  6.77 MBytes  56.8 Mbits/sec  110   14.1 KBytes       
[  9]   6.00-7.00   sec  5.65 MBytes  47.4 Mbits/sec  112   18.4 KBytes       
[SUM]   6.00-7.00   sec  20.4 MBytes   171 Mbits/sec  300             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  8.51 MBytes  71.4 Mbits/sec   69   32.5 KBytes       
[  6]   7.00-8.00   sec  6.52 MBytes  54.7 Mbits/sec  109   12.7 KBytes       
[  9]   7.00-8.00   sec  4.97 MBytes  41.7 Mbits/sec   81   19.8 KBytes       
[SUM]   7.00-8.00   sec  20.0 MBytes   168 Mbits/sec  259             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  8.58 MBytes  71.9 Mbits/sec   65   14.1 KBytes       
[  6]   8.00-9.00   sec  5.72 MBytes  48.0 Mbits/sec  104   28.3 KBytes       
[  9]   8.00-9.00   sec  6.59 MBytes  55.3 Mbits/sec   85   11.3 KBytes       
[SUM]   8.00-9.00   sec  20.9 MBytes   175 Mbits/sec  254             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  9.94 MBytes  83.4 Mbits/sec   63   48.1 KBytes       
[  6]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec   75   26.9 KBytes       
[  9]   9.00-10.00  sec  5.03 MBytes  42.2 Mbits/sec  122   43.8 KBytes       
[SUM]   9.00-10.00  sec  19.5 MBytes   163 Mbits/sec  260             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  79.5 MBytes  66.7 Mbits/sec  757             sender
[  4]   0.00-10.00  sec  79.1 MBytes  66.4 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  63.6 MBytes  53.4 Mbits/sec  893             sender
[  6]   0.00-10.00  sec  63.3 MBytes  53.1 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  63.6 MBytes  53.4 Mbits/sec  842             sender
[  9]   0.00-10.00  sec  63.2 MBytes  53.1 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   207 MBytes   173 Mbits/sec  2492             sender
[SUM]   0.00-10.00  sec   206 MBytes   172 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 53114 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53116 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 53118 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  3.58 MBytes  30.1 Mbits/sec                  
[  6]   0.00-1.00   sec  3.68 MBytes  30.9 Mbits/sec                  
[  9]   0.00-1.00   sec  2.38 MBytes  19.9 Mbits/sec                  
[SUM]   0.00-1.00   sec  9.64 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  3.21 MBytes  26.9 Mbits/sec                  
[  6]   1.00-2.00   sec  3.19 MBytes  26.8 Mbits/sec                  
[  9]   1.00-2.00   sec  3.23 MBytes  27.1 Mbits/sec                  
[SUM]   1.00-2.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  3.21 MBytes  26.9 Mbits/sec                  
[  6]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   2.00-3.00   sec  3.17 MBytes  26.6 Mbits/sec                  
[SUM]   2.00-3.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  3.38 MBytes  28.4 Mbits/sec                  
[  6]   3.00-4.00   sec  3.27 MBytes  27.4 Mbits/sec                  
[  9]   3.00-4.00   sec  3.32 MBytes  27.8 Mbits/sec                  
[SUM]   3.00-4.00   sec  9.97 MBytes  83.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.84 MBytes  40.6 Mbits/sec                  
[  6]   4.00-5.00   sec  4.20 MBytes  35.2 Mbits/sec                  
[  9]   4.00-5.00   sec  4.67 MBytes  39.2 Mbits/sec                  
[SUM]   4.00-5.00   sec  13.7 MBytes   115 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.12 MBytes  34.6 Mbits/sec                  
[  6]   5.00-6.00   sec  2.81 MBytes  23.6 Mbits/sec                  
[  9]   5.00-6.00   sec  2.76 MBytes  23.1 Mbits/sec                  
[SUM]   5.00-6.00   sec  9.69 MBytes  81.3 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  3.25 MBytes  27.2 Mbits/sec                  
[  6]   6.00-7.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   6.00-7.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   6.00-7.00   sec  9.75 MBytes  81.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  3.27 MBytes  27.4 Mbits/sec                  
[  6]   7.00-8.00   sec  3.37 MBytes  28.3 Mbits/sec                  
[  9]   7.00-8.00   sec  3.24 MBytes  27.2 Mbits/sec                  
[SUM]   7.00-8.00   sec  9.88 MBytes  82.9 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   8.00-9.00   sec  3.20 MBytes  26.8 Mbits/sec                  
[  9]   8.00-9.00   sec  3.18 MBytes  26.7 Mbits/sec                  
[SUM]   8.00-9.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   9.00-10.00  sec  3.23 MBytes  27.1 Mbits/sec                  
[  9]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   9.00-10.00  sec  9.73 MBytes  81.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  35.7 MBytes  29.9 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  35.7 MBytes  29.9 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  33.8 MBytes  28.3 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  33.8 MBytes  28.3 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  32.8 MBytes  27.5 Mbits/sec    0             sender
[  9]   0.00-10.00  sec  32.8 MBytes  27.5 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   102 MBytes  85.8 Mbits/sec    0             sender
[SUM]   0.00-10.00  sec   102 MBytes  85.8 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -u -b 800M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 54179 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53430 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 57739 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[  6]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[  9]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[SUM]   0.00-1.00   sec   114 MBytes   955 Mbits/sec  14580  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  6]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  9]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[SUM]   1.00-2.00   sec   114 MBytes   958 Mbits/sec  14616  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[  6]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[  9]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[SUM]   2.00-3.00   sec   114 MBytes   958 Mbits/sec  14610  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  6]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  9]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[SUM]   3.00-4.00   sec   114 MBytes   958 Mbits/sec  14616  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   4.00-5.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   5.00-6.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   6.00-7.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   7.00-8.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   8.00-9.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   9.00-10.00  sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.300 ms  0/4 (0%)  
[  4] Sent 4 datagrams
[  6]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.639 ms  365/369 (99%)  
[  6] Sent 369 datagrams
[  9]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.544 ms  472/476 (99%)  
[  9] Sent 476 datagrams
[SUM]   0.00-10.00  sec  1.12 GBytes   958 Mbits/sec  8487039.495 ms  837/849 (99%)  

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R -u -b 800M
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 58078 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 50018 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 46349 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  4.97 MBytes  41.7 Mbits/sec  0.205 ms  0/636 (0%)  
[  6]   0.00-1.00   sec  4.96 MBytes  41.6 Mbits/sec  0.154 ms  0/635 (0%)  
[  9]   0.00-1.00   sec  4.96 MBytes  41.6 Mbits/sec  0.248 ms  0/635 (0%)  
[SUM]   0.00-1.00   sec  14.9 MBytes   125 Mbits/sec  0.202 ms  0/1906 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.162 ms  0/634 (0%)  
[  6]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.216 ms  0/634 (0%)  
[  9]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.125 ms  0/633 (0%)  
[SUM]   1.00-2.00   sec  14.9 MBytes   125 Mbits/sec  0.168 ms  0/1901 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  4.91 MBytes  41.2 Mbits/sec  0.148 ms  0/629 (0%)  
[  6]   2.00-3.00   sec  4.92 MBytes  41.3 Mbits/sec  0.213 ms  0/630 (0%)  
[  9]   2.00-3.00   sec  4.91 MBytes  41.2 Mbits/sec  0.123 ms  0/629 (0%)  
[SUM]   2.00-3.00   sec  14.8 MBytes   124 Mbits/sec  0.161 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.226 ms  0/632 (0%)  
[  6]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.151 ms  0/632 (0%)  
[  9]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.171 ms  0/632 (0%)  
[SUM]   3.00-4.00   sec  14.8 MBytes   124 Mbits/sec  0.183 ms  0/1896 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.98 MBytes  41.8 Mbits/sec  0.191 ms  0/638 (0%)  
[  6]   4.00-5.00   sec  4.98 MBytes  41.7 Mbits/sec  0.129 ms  0/637 (0%)  
[  9]   4.00-5.00   sec  4.98 MBytes  41.7 Mbits/sec  0.154 ms  0/637 (0%)  
[SUM]   4.00-5.00   sec  14.9 MBytes   125 Mbits/sec  0.158 ms  0/1912 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.98 MBytes  41.7 Mbits/sec  0.162 ms  0/637 (0%)  
[  6]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.085 ms  0/636 (0%)  
[  9]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.157 ms  0/636 (0%)  
[SUM]   5.00-6.00   sec  14.9 MBytes   125 Mbits/sec  0.135 ms  0/1909 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  4.97 MBytes  41.7 Mbits/sec  0.221 ms  0/636 (0%)  
[  6]   6.00-7.00   sec  4.97 MBytes  41.7 Mbits/sec  0.115 ms  0/636 (0%)  
[  9]   6.00-7.00   sec  4.96 MBytes  41.6 Mbits/sec  0.193 ms  0/635 (0%)  
[SUM]   6.00-7.00   sec  14.9 MBytes   125 Mbits/sec  0.176 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.226 ms  0/632 (0%)  
[  6]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.134 ms  0/632 (0%)  
[  9]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.180 ms  0/632 (0%)  
[SUM]   7.00-8.00   sec  14.8 MBytes   124 Mbits/sec  0.180 ms  0/1896 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  4.92 MBytes  41.3 Mbits/sec  0.196 ms  0/630 (0%)  
[  6]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.161 ms  0/629 (0%)  
[  9]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.161 ms  0/629 (0%)  
[SUM]   8.00-9.00   sec  14.8 MBytes   124 Mbits/sec  0.173 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  4.95 MBytes  41.5 Mbits/sec  0.194 ms  0/633 (0%)  
[  6]   9.00-10.00  sec  4.95 MBytes  41.5 Mbits/sec  0.132 ms  0/633 (0%)  
[  9]   9.00-10.00  sec  4.94 MBytes  41.4 Mbits/sec  0.107 ms  0/632 (0%)  
[SUM]   9.00-10.00  sec  14.8 MBytes   124 Mbits/sec  0.144 ms  0/1898 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  49.6 MBytes  41.6 Mbits/sec  0.223 ms  0/6344 (0%)  
[  4] Sent 6344 datagrams
[  6]   0.00-10.00  sec  49.5 MBytes  41.6 Mbits/sec  0.153 ms  0/6341 (0%)  
[  6] Sent 6341 datagrams
[  9]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.125 ms  0/6337 (0%)  
[  9] Sent 6337 datagrams
[SUM]   0.00-10.00  sec   149 MBytes   125 Mbits/sec  0.167 ms  0/19022 (0%)  

iperf Done.
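
When comparing the summary lines above, note that iperf3 prints transfer in binary units (1 MByte = 2^20 bytes) but bandwidth in decimal units (1 Mbit = 10^6 bits). A minimal sketch of the conversion in C; the helper name mbytes_to_mbits_per_sec() is hypothetical and only serves as a cross-check of the log:

```c
#include <assert.h>

/*
 * iperf3 reports transfer in MBytes (2^20 bytes) and bandwidth in
 * Mbits/sec (10^6 bits per second). Hypothetical helper for checking
 * a summary line by hand.
 */
static double mbytes_to_mbits_per_sec(double mbytes, double seconds)
{
	assert(seconds > 0.0);
	return mbytes * 1048576.0 * 8.0 / seconds / 1e6;
}
```

For the vanilla TCP run, 208 MBytes in 10 s gives about 174.5 Mbits/sec, and 125 MBytes in the reverse direction gives about 104.9 Mbits/sec, consistent with the reported 174 and 105 Mbits/sec.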

Comments

Hauke Mehrtens March 25, 2019, 11:24 p.m. UTC | #1
Hi Petr

On 3/14/19 6:46 AM, Petr Cvek wrote:
> Hello again,
> 
> I've managed to enhance few drivers for lantiq platform. They are still
> in ugly commented form (ethernet part especially). But I need some hints
> before the final version. The patches are based on a kernel 4.14.99.
> Copy them into target/linux/lantiq/patches-4.14 (cleaned from any of my
> previous patch).

Thanks for working on this.

> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
> setup). The iperf3 benchmark (2 passes for both vanilla and changed
> versions) altogether with script are in the attachment.
> 
> 1) IRQ with SMP and balancing support:
> 
> 	0901-add-icu-smp-support.patch
> 	0902-enable-external-irqs-for-second-vpe.patch
> 	0903-add-icu1-node-for-smp.patch
> 
> As requested I've changed the patch heavily. The original locking from
> k3b source code (probably from UGW) didn't work and in heavy load the
> system could have frozen (smp affinity change during irq handling). This
> version has this fixed by using generic raw spinlocks with irq.
> 
> The SMP IRQ now works in a way that before every irq_enable (serves as
> unmask too) the VPE will be switched. This can be limited by writing
> into /proc/irq/X/smp_affinity (it can be possibly balanced from
> userspace too).
> 
> I've rewritten the device tree reg fields so there are only 2 arrays
> now. One per an icu controller. The original one per module was
> redundant as the ranges were continuous. The modules of a single ICU are
> now explicitly computed in a macro:
> 
> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
> 
> before there was a pointer for every 0x28 block (there shouldn't be
> speed downgrade, only a multiplication and an addition for every
> register access).
> 
> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
> as "IM0" (module) was confusing (the real module number 0-4 was a part
> of the macro).
> 
> The code is written in a way it should work fine on a uniprocessor
> configuration (as the for_each_present_cpu etc macros will cycle on a
> single VPE on uniprocessor). I didn't test the no CONFIG_SMP yet, but I
> did check it with "nosmp" kernel parameter. It works.
> 
> Anyway please test if you have the board where the second VPE is used
> for FXS.
> 
> The new device tree structure is now incompatible with an old version of
> the driver (and old device tree with the new driver too). It seems icu
> driver is used in Danube, AR9, AmazonSE and Falcon chipset too. I don't
> know the hardware for these boards so before a final patch I would like
> to know if they have a second ICU too (at 0x80300 offset).

Normally the device tree should stay stable, but I already thought about
the same change and I am not aware that any device ships a U-Boot with
an embedded device tree, so this should be fine.

The Amazon and Amazon SE only have one ICU block because they only have
one CPU with one VPE.
The Danube SoC has two ICU blocks, one for each CPU; each CPU has only
one VPE. The CPUs are not cache coherent, so SMP is not possible.

Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks, one
for each VPE of the single CPU.
GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
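
For reference, the addressing scheme discussed above (one register window
per ICU, with the module selected by a 0x28 stride) can be sketched in
plain user-space C. This is only an illustration, not the code from the
patch: the array fake_icu, the helpers icu_reg_offset(), icu_w32() and
icu_r32(), and the register offsets are assumptions made for the example;
only the m*0x28 stride, the module numbers 0-4 and the one-ICU-per-VPE
layout come from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICU_COUNT        2      /* one ICU per VPE, as described above */
#define ICU_MODULES      5      /* modules 0-4 */
#define ICU_MODULE_SIZE  0x28   /* stride between modules */

/* Register offsets inside one module: assumed values for illustration. */
#define ICU_ISR  0x00
#define ICU_IER  0x08

/* Stand-in for the ioremapped register windows (hypothetical). */
static uint32_t fake_icu[ICU_COUNT][ICU_MODULES * ICU_MODULE_SIZE / 4];

/* Byte offset of register `reg` of module `m` inside one ICU window. */
static size_t icu_reg_offset(unsigned int m, unsigned int reg)
{
	assert(m < ICU_MODULES);
	assert(reg < ICU_MODULE_SIZE && reg % 4 == 0);
	return (size_t)m * ICU_MODULE_SIZE + reg;
}

/* Write `val` to register `reg` of module `m` on the ICU of `vpe`. */
static void icu_w32(unsigned int vpe, unsigned int m, uint32_t val,
		    unsigned int reg)
{
	assert(vpe < ICU_COUNT);
	fake_icu[vpe][icu_reg_offset(m, reg) / 4] = val;
}

/* Read register `reg` of module `m` on the ICU of `vpe`. */
static uint32_t icu_r32(unsigned int vpe, unsigned int m, unsigned int reg)
{
	assert(vpe < ICU_COUNT);
	return fake_icu[vpe][icu_reg_offset(m, reg) / 4];
}
```

With these helpers, icu_w32(1, 3, mask, ICU_IER) touches module 3 of the
second ICU at byte offset 3*0x28 + 0x08 = 0x80 inside that window.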

> More development could be done with locking probably. As only the
> accesses in a single module (= 1 set of registers) would cause a race
> condition. But as the most contended interrupts are in the same module
> there won't be much speed increase IMO. I can add it if requested (just
> spinlock array and some lookup code).

I do not think that this would improve the performance significantly; I
assume that the CPUs only have to wait there in rare conditions anyway.
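
In case the finer-grained locking is revisited later, the "spinlock array
and some lookup code" idea could look roughly like the user-space sketch
below. It uses C11 atomic_flag as a stand-in for the kernel's raw
spinlocks; the names module_lock, irq_to_module(), icu_lock() and
icu_unlock() and the figure of 32 interrupt lines per module are
assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>

#define ICU_MODULES      5   /* modules 0-4 */
#define IRQS_PER_MODULE  32  /* assumed number of interrupt lines per module */

/* One lock per module instead of one global lock (hypothetical layout). */
static atomic_flag module_lock[ICU_MODULES] = {
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Map a hardware interrupt number to the module that owns its registers. */
static unsigned int irq_to_module(unsigned int hwirq)
{
	unsigned int m = hwirq / IRQS_PER_MODULE;

	assert(m < ICU_MODULES);
	return m;
}

/* Busy-wait until the lock of the module owning `hwirq` is free. */
static void icu_lock(unsigned int hwirq)
{
	while (atomic_flag_test_and_set_explicit(
			&module_lock[irq_to_module(hwirq)],
			memory_order_acquire))
		; /* spin */
}

/* Release the lock of the module owning `hwirq`. */
static void icu_unlock(unsigned int hwirq)
{
	atomic_flag_clear_explicit(&module_lock[irq_to_module(hwirq)],
				   memory_order_release);
}
```

Two interrupts in different modules would no longer serialize on each
other, but since the most contended interrupts share a module, the gain
is probably small. A kernel version would additionally have to disable
local interrupts while holding the lock, which this sketch does not model.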

> 2) Reworked lantiq xrx200 ethernet driver:
> 
> 	0904-backport-vanilla-eth-driver.patch
> 	0905-increase-dma-descriptors.patch
> 	0906-increase-dma-burst-size.patch
> 
> The code is still ugly, but stable now. There is a fragmented skb
> support and napi polling. DMA ring buffer was increased so it can handle
> faster speeds and I've fixed some code weirdness. I can split the
> changes in the future into separate patches.

It would be nice if you could also make the same changes to the upstream
driver in the mainline Linux kernel and send them for inclusion there.

> I didn't test the ICU and eth patches separate, but I've tested the
> ethernet driver on a single VPE only (by setting smp affinity and
> nosmp). This version of the ethernet driver was used for root over NFS
> on the debug setup for like two weeks (without problems).
> 
> Tell me if we should pursue the way for the second DMA channel to PPE so
> both VPEs can send frames at the same time.

I think it should be ok to use both DMA channels for the CPU traffic.

> 3) WAVE300
> 
> In the past two weeks I've tried to fix and mash together various versions
> of the wave300 wifi driver (there are partial versions in GPL sources from
> router vendors). And I've managed to put the driver into "not
> immediately crashing" mode. If you are interested in the development,
> there is a thread in the OpenWrt forum. The source repos are here:
> 
> https://repo.or.cz/wave300.git
> https://repo.or.cz/wave300_rflib.git
> 
> (the second one must be copied into the first one)
> 
> The driver will often crash when it meets an unknown packet, a request for
> encryption (there is no encryption support), an unusual combination of
> configuration options, or just on module unloading. The code is _really_
> ugly and it will serve only as a hardware specification for better GPL
> driver development. If you want to help or you have some tips, you can
> join the forum (there are links to firmware and intensive research into
> the source code available from vendors).
> 
> Links:
> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
> 
> Petr
Hauke
Hauke Mehrtens March 25, 2019, 11:45 p.m. UTC | #2
It would be nice if you could send your patches as single mails and
inline so I can easily comment on them.

The DMA handling in the OpenWrt Ethernet driver is only more flexible so
that it can handle an arbitrary number of DMA channels, but I think this
is not needed.

The DMA memory is already 16 byte aligned, see the byte_offset variable
in xmit, so it should not be a problem to use the 4W DMA mode, I assume
that the hardware also takes care of this.

Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
strange to me.

Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
think your increase should not harm significantly.

Hauke
Petr Cvek March 26, 2019, 12:24 a.m. UTC | #3
On 26. 03. 19 at 0:45, Hauke Mehrtens wrote:

Hi

> It would be nice if you could send your patches as single mails and
> inline so I can easily comment on them.

OK

> 
> The DMA handling in the OpenWrt Ethernet driver is only more flexible so
> that it can handle an arbitrary number of DMA channels, but I think this
> is not needed.
> 
> The DMA memory is already 16 byte aligned, see the byte_offset variable
> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
> that the hardware also takes care of this.
> 

Yes, it is 16 byte aligned in the original driver, but my patched driver
is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
alignment caused crashes.
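
To illustrate why a buffer that is fine for 4W bursts can still be wrong for 8W bursts, here is a small stand-alone sketch in plain C. It is not code from either driver: the helper names are made up, and it only assumes that a burst of N 32-bit words wants the start address aligned to N * 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of 'addr' inside a burst of 'burst_bytes' (must be a power of two). */
static inline uint32_t dma_byte_offset(uintptr_t addr, uint32_t burst_bytes)
{
	assert(burst_bytes != 0 && (burst_bytes & (burst_bytes - 1)) == 0);
	return (uint32_t)(addr & (burst_bytes - 1));
}

/* Start of the burst-aligned block that contains 'addr'. */
static inline uintptr_t dma_aligned_base(uintptr_t addr, uint32_t burst_bytes)
{
	return addr - dma_byte_offset(addr, burst_bytes);
}
```

An address such as 0x1010 has offset 0 for 16 byte (4W) bursts but offset 16 for 32 byte (8W) bursts, so the offset handling has to follow the burst size.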

> Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
> strange to me.
> 

That is interrupt masking. IP0 and IP1 are (I think) software interrupts
used for IPI communication, IP6/7 are the timer (and something else), and
the IP2-IP5 range, which is not enabled there, carries the external IRQ
signals from the ICU. Without these bits set, the second VPE receives
only IPIs and no ICU events.

Basically I've set this MIPS C0 Status register to the same value as the
C0 Status register for the first VPE.
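
As a rough illustration of the masks involved (a sketch, not the actual smp-mt.c change): in the MIPS32 CP0 Status register the interrupt mask bits IM0..IM7 sit at bits 8..15. The helper names below are made up, and the assignment of IP lines follows the tentative description above.

```c
#include <assert.h>
#include <stdint.h>

#define ST0_IM_SHIFT 8  /* IM0..IM7 occupy bits 8..15 of CP0 Status */

/* Mask bit for interrupt line IPn (n = 0..7). */
static inline uint32_t st0_ip(unsigned int n)
{
	assert(n < 8);
	return UINT32_C(1) << (ST0_IM_SHIFT + n);
}

/* Only the software interrupts (IPI) and IP6/IP7 enabled. */
static inline uint32_t st0_im_ipi_and_timer(void)
{
	return st0_ip(0) | st0_ip(1) | st0_ip(6) | st0_ip(7);
}

/* Additionally enable IP2-IP5, the lines described as carrying ICU interrupts. */
static inline uint32_t st0_im_with_icu(void)
{
	return st0_im_ipi_and_timer() |
	       st0_ip(2) | st0_ip(3) | st0_ip(4) | st0_ip(5);
}
```

With IP2-IP5 added, the mask field is fully set (0xff00), which would correspond to copying the first VPE's setting if that VPE has all lines enabled.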

> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
> think your increase should not harm significantly.

Yeah, I've tested it; there is some minor impact on the maximal
bandwidth. However, I cannot set the value correctly without a model of
the xrx200 SoC (I assume this register controls how often the OWN bit of
the first descriptor is checked). I don't even know the clock and width
of the bus between DMA and RAM (or between DMA and the ethernet FIFO). But
if the original value DMA_CLK_DIV4 means "every fourth clock", it seems
too often to me (if a packet has about 1500 bytes, it would be checked
many times before the packet is transferred). Empirically, the highest
values make the DMA descriptor ring lag.

> 
> Hauke
>
Hauke Mehrtens March 26, 2019, 1:23 a.m. UTC | #4
On 3/26/19 1:24 AM, Petr Cvek wrote:
> 
> 
> On 26. 03. 19 at 0:45, Hauke Mehrtens wrote:
> 
> Hi
> 
>> It would be nice if you could send your patches as single mails and
>> inline so I can easily comment on them.
> 
> OK
> 
>>
>> The DMA handling in the OpenWrt Ethernet driver is only more flexible so
>> that it can handle an arbitrary number of DMA channels, but I think this
>> is not needed.
>>
>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>> that the hardware also takes care of this.
>>
> 
> Yes, it is 16 byte aligned in the original driver, but my patched driver
> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
> alignment caused crashes.
> 
>> Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
>> strange to me.
>>
> 
> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
> used for IPI communication, IP6/7 are the timer (and something else), and
> the IP2-IP5 range, which is not enabled there, carries the external IRQ
> signals from the ICU. Without these bits set, the second VPE receives
> only IPIs and no ICU events.
>
> Basically I've set this MIPS C0 Status register to the same value as the
> C0 Status register for the first VPE.

Hmm, strange. It looks like there are not many SoCs with multiple VPEs
which have their own IRQ controller.

>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>> think your increase should not harm significantly.
> 
> Yeah I've tested it, there is some minor impact on the maximal
> bandwidth. However I cannot set the value correctly without the model of
> xrx200 SoC (I assume this register controls the check frequency of the
> OWN bit of the first descriptor).

Yes, this is the polling frequency in fDMA/16. This value is global and
not per channel. The DMA controller will check the OWN bit on all
descriptors for all DMA channels where polling is activated with this
frequency. fDMA is the same as the FPI frequency, probably 250MHz.

> I don't even know the clock and width
> of the bus between DMA and RAM (or between DMA and the ethernet FIFO). But
> if the original value DMA_CLK_DIV4 means "every fourth clock", it seems
> too often to me (if a packet has about 1500 bytes, it would be checked
> many times before the packet is transferred). Empirically, the highest
> values make the DMA descriptor ring lag.

The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
word addresses, a word for the DMA controller is 32 bit.
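
To put these numbers next to the "1500 byte packet" question, here is a back-of-the-envelope sketch in plain C. The helper names are made up, and it assumes (optimistically, this is not a documented figure) that the DMA moves one 32 bit word per fDMA clock at 250 MHz.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_WORD_BYTES 4u  /* a DMA word is 32 bit */

/* Number of 32-bit DMA words needed for 'bytes' (rounded up). */
static inline uint32_t dma_words_for_bytes(uint32_t bytes)
{
	return (bytes + DMA_WORD_BYTES - 1) / DMA_WORD_BYTES;
}

/* Duration of 'cycles' clock cycles at 'clock_hz', in nanoseconds (truncated). */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint64_t clock_hz)
{
	assert(clock_hz != 0);
	return cycles * UINT64_C(1000000000) / clock_hz;
}
```

Under these assumptions a 1500 byte packet is 375 words, i.e. at least 1500 ns on the bus, while 16 fDMA clocks last 64 ns.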

The DMA controller can handle some priorities between the ports and
channels. When you activate PKTARB (BIT(31)) in DMA_CTRL, the DMA
controller will transfer the complete packet before the arbitration is
changed. With MBRSTCNT (bits 25:16) in DMA_CTRL you can control after how
many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
DMA_CTRL is activated. Both apply to TX and RX.
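
The bit positions above could be written down as follows. This is a sketch only: the macro and helper names are hypothetical (they are not taken from the driver), and the functions only compute the new register value; reading and writing DMA_CTRL itself is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_CTRL_PKTARB         (UINT32_C(1) << 31)  /* arbitrate per packet */
#define DMA_CTRL_MBRSTARB       (UINT32_C(1) << 30)  /* arbitrate after N bursts */
#define DMA_CTRL_MBRSTCNT_SHIFT 16
#define DMA_CTRL_MBRSTCNT_MASK  (UINT32_C(0x3ff) << DMA_CTRL_MBRSTCNT_SHIFT)  /* bits 25:16 */

/* Value of DMA_CTRL with packet-based arbitration switched on. */
static inline uint32_t dma_ctrl_set_packet_arb(uint32_t ctrl)
{
	return ctrl | DMA_CTRL_PKTARB;
}

/* Value of DMA_CTRL with arbitration changing after 'bursts' bursts. */
static inline uint32_t dma_ctrl_set_burst_arb(uint32_t ctrl, uint32_t bursts)
{
	assert(bursts <= 0x3ff);
	ctrl &= ~DMA_CTRL_MBRSTCNT_MASK;
	ctrl |= bursts << DMA_CTRL_MBRSTCNT_SHIFT;
	return ctrl | DMA_CTRL_MBRSTARB;
}
```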

Hauke
Petr Cvek May 18, 2019, 2:08 a.m. UTC | #5
Hi again,

I'm finishing the ethernet driver and it is still sort of slow for my
taste, but it seems I've reached the hardware limit.

As someone who knows the internals of the SoC well, could you estimate
the maximum TX bandwidth the hardware can reach (roughly, saturated
traffic of big UDP packets)?

If I'm evaluating this correctly, there is a DDR2 controller @250MHz... I
don't know if 250MHz is the bus speed, as my modem has a DDR2-800 chip,
which means a 400MHz bus speed (a pretty big 150MHz reserve).

But if I'm right, that would mean the data is transferred at 500MT/s
over a 16bit bus. So the continuous construction of UDP packets in the CPU
(500MHz@32bit) would eat the whole RAM bandwidth.
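
The estimate above as plain arithmetic, in a small C sketch (the helper name is made up; these are theoretical peak figures that ignore refresh, bus turnaround and caches):

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical peak rate in Mbit/s for a given transfer rate (in millions of
 * transfers per second) and bus width in bits. */
static inline uint64_t peak_mbit_per_s(uint64_t mega_transfers_per_s,
				       uint32_t width_bits)
{
	assert(width_bits != 0);
	return mega_transfers_per_s * width_bits;
}
```

For 500MT/s over 16 bit this gives 8000 Mbit/s for the RAM, against 16000 Mbit/s if the CPU really moved 32 bit on every one of its 500 million clocks per second.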

This result seems wrong, as the VPE needs to load instructions too and
there are up to 4 threads. And most importantly there is the gigabit
switch with multiple ports, and PCI(e) peripherals too.

Anyway, my measurements show that saturated UDP traffic on the localhost
interface only reaches around 400Mbit/s, and those are only mem/cache
transfers.

Am I right? Is it impossible to obtain the full 1Gbit/s with vrx-268?

Best regards,

Petr

On 26. 03. 19 at 2:23, Hauke Mehrtens wrote:
> On 3/26/19 1:24 AM, Petr Cvek wrote:
>>
>>
>> On 26. 03. 19 at 0:45, Hauke Mehrtens wrote:
>>> On 3/26/19 12:24 AM, Hauke Mehrtens wrote:
>>>> Hi Petr
>>>>
>>>> On 3/14/19 6:46 AM, Petr Cvek wrote:
>>>>> Hello again,
>>>>>
>>>>> I've managed to enhance few drivers for lantiq platform. They are still
>>>>> in ugly commented form (ethernet part especially). But I need some hints
>>>>> before the final version. The patches are based on a kernel 4.14.99.
>>>>> Copy them into target/linux/lantiq/patches-4.14 (cleaned from any of my
>>>>> previous patch).
>>>>
>>>> Thanks for working on this.
>>>>
>>>>> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
>>>>> setup). The iperf3 benchmark (2 passes for both vanilla and changed
>>>>> versions) altogether with script are in the attachment.
>>>>>
>>>>> 1) IRQ with SMP and balancing support:
>>>>>
>>>>> 	0901-add-icu-smp-support.patch
>>>>> 	0902-enable-external-irqs-for-second-vpe.patch
>>>>> 	0903-add-icu1-node-for-smp.patch
>>>>>
>>>>> As requested I've changed the patch heavily. The original locking from
>>>>> k3b source code (probably from UGW) didn't work and in heavy load the
>>>>> system could have froze (smp affinity change during irq handling). This
>>>>> version has this fixed by using generic raw spinlocks with irq.
>>>>>
>>>>> The SMP IRQ now works in a way that before every irq_enable (serves as
>>>>> unmask too) the VPE will be switched. This can be limited by writing
>>>>> into /proc/irq/X/smp_affinity (it can be possibly balanced from
>>>>> userspace too).
>>>>>
>>>>> I've rewritten the device tree reg fields so there are only 2 arrays
>>>>> now. One per an icu controller. The original one per module was
>>>>> redundant as the ranges were continuous. The modules of a single ICU are
>>>>> now explicitly computed in a macro:
>>>>>
>>>>> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
>>>>> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
>>>>>
>>>>> before there was a pointer for every 0x28 block (there shouldn't be
>>>>> speed downgrade, only a multiplication and an addition for every
>>>>> register access).
>>>>>
>>>>> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
>>>>> as "IM0" (module) was confusing (the real module number 0-4 was a part
>>>>> of the macro).
>>>>>
>>>>> The code is written in a way it should work fine on a uniprocessor
>>>>> configuration (as the for_each_present_cpu etc macros will cycle on a
>>>>> single VPE on uniprocessor). I didn't test the no CONFIG_SMP yet, but I
>>>>> did check it with "nosmp" kernel parameter. It works.
>>>>>
>>>>> Anyway please test if you have the board where the second VPE is used
>>>>> for FXS.
>>>>>
>>>>> The new device tree structure is now incompatible with an old version of
>>>>> the driver (and old device tree with the new driver too). It seems icu
>>>>> driver is used in Danube, AR9, AmazonSE and Falcon chipset too. I don't
>>>>> know the hardware for these boards so before a final patch I would like
>>>>> to know if they have a second ICU too (at 0x80300 offset).
>>>>
>>>> Normally the device tree should stay stable, but I already though about
>>>> the same change and I am not aware that any device ships a U-Boot with
>>>> an embedded device tree, so this should be fine.
>>>>
>>>> The Amazon and Amazon SE only have one ICU block because they only have
>>>> one CPU with one VPE.
>>>> The Danube SoC has two ICU blocks one for each CPU, each CPU only has
>>>> one VPE. The CPUs are not cache coherent, SMP is not possible.
>>>>
>>>> Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks one
>>>> for each VPE of the single CPU.
>>>> GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
>>>>
>>>>> More development could be done with locking probably. As only the
>>>>> accesses in a single module (= 1 set of registers) would cause a race
>>>>> condition. But as the most contented interrupts are in the same module
>>>>> there won't be much speed increase IMO. I can add it if requested (just
>>>>> spinlock array and some lookup code).
>>>>
>>>> I do not think that this improves the performance significantly, I
>>>> assume that the CPUs only have to wait there in rare conditions anyway.
>>>>
>>>>> 2) Reworked lantiq xrx200 ethernet driver:
>>>>>
>>>>> 	0904-backport-vanilla-eth-driver.patch
>>>>> 	0905-increase-dma-descriptors.patch
>>>>> 	0906-increase-dma-burst-size.patch
>>>>>
>>>>> The code is still ugly, but stable now. There is a fragmented skb
>>>>> support and napi polling. DMA ring buffer was increased so it handle
>>>>> faster speeds and I've fixed some code weirdness. A can split the
>>>>> changes in the future into separate patches.
>>>>
>>>> It would be nice if you could also do the same changes to the upstream
>>>> driver in mainline Linux kernel and send this for inclusion to mainline
>>>> Linux.
>>>>
>>>>> I didn't test the ICU and eth patches separate, but I've tested the
>>>>> ethernet driver on a single VPE only (by setting smp affinity and
>>>>> nosmp). This version of the ethernet driver was used for root over NFS
>>>>> on the debug setup for like two weeks (without problems).
>>>>>
>>>>> Tell me if we should pursue the way for the second DMA channel to PPE so
>>>>> both VPEs can send frames at the same time.
>>>>
>>>> I think it should be ok to use both DMA channels for the CPU traffic.
>>>>
>>>>> 3) WAVE300
>>>>>
>>>>> In the past two weeks I've tried to fix and mash together various versions
>>>>> of the wave300 wifi driver (there are partial versions in GPL sources from
>>>>> router vendors). And I've managed to put the driver into "not
>>>>> immediately crashing" mode. If you are interested in the development,
>>>>> there is a thread in openwrt forum. The source repo here:
>>>>>
>>>>> https://repo.or.cz/wave300.git
>>>>> https://repo.or.cz/wave300_rflib.git
>>>>>
>>>>> (the second one must be copied into the first one)
>>>>>
>>>>> The driver will often crash when meeting an unknown packet, request for
>>>>> encryption (no encryption support), unusual combination of configuration
>>>>> or just by module unloading. The code is _really_ ugly and it will
>>>>> serve only as a hardware specification for better GPL driver development.
>>>>> If you want to help or you have some tips you can join the forum (there
>>>>> are links for firmwares and intensive research of available source codes
>>>>> from vendors).
>>>>>
>>>>> Links:
>>>>> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
>>>>> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
>>>>> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
>>>>>
>>>>> Petr
>>>> Hauke
>>>
>>
>> Hi
>>
>>> It would be nice if you could send your patches as single mails and
>>> inline so I can easily comment on them.
>>
>> OK
>>
>>>
>>> The DMA handling in the OpenWrt Ethernet driver is only more flexible to
>>> handle arbitrary number of DMA channels, but I think this is not needed.
>>>
>>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>>> that the hardware also takes care of this.
>>>
>>
>> Yes it is 16 byte aligned in the original driver, but my patched driver
>> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
>> alignment caused crashing.
>>
>>> Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
>>> strange to me.
>>>
>>
>> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
>> for IPI communications, IP6/7 are timer (and something), and the IP2-IP5
>> range, which is not enabled there, carries the external IRQ signals for
>> the ICU. Without this set, the second VPE only receives IPI and not ICU
>> events.
>>
>> Basically I've set this MIPS C0 Status register to the same value as the
>> C0 Status register for the first VPE.
> 
> hmm strange, looks like there are not so many SoCs with multiple VPEs
> which have an own IRQ controller.
> 
>>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>>> think your increase should not harm significantly.
>>
>> Yeah I've tested it, there is some minor impact on the maximal
>> bandwidth. However I cannot set the value correctly without the model of
>> xrx200 SoC (I assume this register controls the check frequency of the
>> OWN bit of the first descriptor).
> 
> Yes this is the polling frequency in fDMA/16, this value is global and
> not per channel. The DMA controller will check the OWN bit on all
> descriptors for all DMA channels where polling is activated with this
> frequency. fDMA is the same as the FPI frequency, probably 250MHz.
> 
>> I don't even know the clock and width
>> of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
>> if the original value DMA_CLK_DIV4 means "every fourth clock" it seems
>> too often for me (if a packet has like 1500 bytes, it would check many
>> times before the packet is transferred). The highest values empirically
>> lag the DMA descriptor ring.
> 
> The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
> word addresses, a word for the DMA controller is 32 bit.
> 
> The DMA controller can handle some priorities between the ports and
> channels. When you activate PKTARB (BIT(31)) in DMA_CTRL the DMA
> controller will transfer the complete packet before the arbitration is
> changed. With MBRSTCNT (bit 25:16) in DMA_CTRL you can control after how
> many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
> DMA_CTRL is activated. Both apply to TX and RX.
> 
> Hauke
>
Hauke Mehrtens May 19, 2019, 9:24 a.m. UTC | #6
On 5/18/19 4:08 AM, Petr Cvek wrote:
> Hi again,
> 
> I'm finishing the ethernet driver and it is still sort of slow for my
> taste, but it seems I've reached the hardware limit.

Will you send these patches also to the upstream kernel? I would like to
see the improvements to the DMA controller and the scatter DMA in the
mainline kernel; then we do not have to maintain this separately in
OpenWrt any more.

> As someone who knows the internals of the SoC well, could you guess the
> maximum TX bandwidth the hardware can reach (roughly, saturated with big
> UDP packets)?
> 
> If I'm evaluating this correctly, there is DDR2 controller @250MHz... I
> don't know if 250MHz is the bus speed as my modem has DDR2-800 chip,
> which means 400MHz bus speed (pretty big 150MHz reserve).

I would not be surprised if the RAM is running with a lower frequency
than what would be supported by the RAM chips, but I haven't checked
what is the maximum supported frequency by the SoC itself.

> But if I'm right that would mean the data are transferred at 500MT/s
> over 16bit bus. So the continuous construction of the UDP packets in CPU
> (500MHz@32bit) would eat the whole RAM bandwidth.
> 
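To make the arithmetic in the question above explicit, here is a minimal
sketch. It assumes the figures from the mail (DDR2 clocked at 250MHz, so
500MT/s, on a 16 bit bus, and a CPU that could store 32 bit per cycle at
500MHz). These are theoretical peaks only, refresh and bus turnaround are
ignored, and peak_mbit_per_s() is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Theoretical peak rate in Mbit/s of a bus that moves `width_bits` per
 * transfer at `mega_transfers` million transfers per second.
 */
static inline uint64_t peak_mbit_per_s(uint32_t mega_transfers,
				       uint32_t width_bits)
{
	assert(width_bits > 0);
	return (uint64_t)mega_transfers * width_bits;
}
```

With these assumptions peak_mbit_per_s(500, 16) gives 8000 Mbit/s for the
RAM and peak_mbit_per_s(500, 32) gives 16000 Mbit/s for the CPU side, so
on paper the CPU alone could ask for twice what the RAM can deliver.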
> This result seems wrong as the VPE needs to load instructions too and
> there are up to 4 threads. And most importantly there is the gigabit
> switch with multiple ports and PCI(e) peripherals too.
> 
> Anyway my measurements show that saturated UDP traffic on the localhost
> interface reaches only around 400Mbit/s, and those are only mem/cache
> transfers.
> 
> Am I right? Is it impossible to obtain the full 1Gbit/s with vrx-268?

This SoC and many of the competing SoCs are not built to handle all the
traffic in Linux. This SoC is designed so that the data traffic is
handled by the hardware or some specialized FW. There is even some SRAM
in the chip which is used by these HW blocks to avoid copying the data to
the RAM.

The VRX200 line has the GSWIP which can handle the layer 2 switching at
line rate (1 GBit/s), at least for normal packet sizes.

NAT, PPPoE and some other L3 handling is done by the PP32 hardware block
which runs a separate FW and also has some specialized HW blocks. This
block can also directly take packets from the DSL and wifi and forward
packets to these peripherals.

The CPU path is only used to learn a flow which is then later offloaded
to the hardware.

Hauke

> 
> Best regards,
> 
> Petr
> 
> Dne 26. 03. 19 v 2:23 Hauke Mehrtens napsal(a):
>> On 3/26/19 1:24 AM, Petr Cvek wrote:
>>>
>>>
>>> Dne 26. 03. 19 v 0:45 Hauke Mehrtens napsal(a):
>>>> On 3/26/19 12:24 AM, Hauke Mehrtens wrote:
>>>>> Hi Petr
>>>>>
>>>>> On 3/14/19 6:46 AM, Petr Cvek wrote:
>>>>>> Hello again,
>>>>>>
>>>>>> I've managed to enhance few drivers for lantiq platform. They are still
>>>>>> in ugly commented form (ethernet part especially). But I need some hints
>>>>>> before the final version. The patches are based on a kernel 4.14.99.
>>>>>> Copy them into target/linux/lantiq/patches-4.14 (cleaned from any of my
>>>>>> previous patch).
>>>>>
>>>>> Thanks for working on this.
>>>>>
>>>>>> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
>>>>>> setup). The iperf3 benchmark (2 passes for both vanilla and changed
>>>>>> versions) altogether with script are in the attachment.
>>>>>>
>>>>>> 1) IRQ with SMP and balancing support:
>>>>>>
>>>>>> 	0901-add-icu-smp-support.patch
>>>>>> 	0902-enable-external-irqs-for-second-vpe.patch
>>>>>> 	0903-add-icu1-node-for-smp.patch
>>>>>>
>>>>>> As requested I've changed the patch heavily. The original locking from
>>>>>> k3b source code (probably from UGW) didn't work and in heavy load the
>>>>>> system could have frozen (smp affinity change during irq handling). This
>>>>>> version has this fixed by using generic raw spinlocks with irq.
>>>>>>
>>>>>> The SMP IRQ now works in a way that before every irq_enable (serves as
>>>>>> unmask too) the VPE will be switched. This can be limited by writing
>>>>>> into /proc/irq/X/smp_affinity (it can be possibly balanced from
>>>>>> userspace too).
>>>>>>
>>>>>> I've rewritten the device tree reg fields so there are only 2 arrays
>>>>>> now. One per ICU controller. The original one per module was
>>>>>> redundant as the ranges were continuous. The modules of a single ICU are
>>>>>> now explicitly computed in a macro:
>>>>>>
>>>>>> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
>>>>>> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
>>>>>>
>>>>>> before there was a pointer for every 0x28 block (there shouldn't be
>>>>>> speed downgrade, only a multiplication and an addition for every
>>>>>> register access).
>>>>>>
>>>>>> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
>>>>>> as "IM0" (module) was confusing (the real module number 0-4 was a part
>>>>>> of the macro).
>>>>>>
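The register address computation described above can be sketched as
follows. This is only an illustration: the 0x28 stride and the module
range 0-4 come from the mail, while icu_reg_offset() and the constant
names are hypothetical and not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define ICU_MODULE_STRIDE	0x28u	/* one register block per module */
#define ICU_MODULES		5u	/* modules 0-4, as in the mail */

/*
 * Byte offset of register `reg` of module `module`, relative to the
 * base of one ICU (what ltq_icu_membase[vpe] would point to).
 */
static inline uint32_t icu_reg_offset(uint32_t module, uint32_t reg)
{
	assert(module < ICU_MODULES);
	assert(reg < ICU_MODULE_STRIDE);
	return module * ICU_MODULE_STRIDE + reg;
}
```

The cost per access is one multiplication and one addition, which is
what replaces the former per-module pointer lookup.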
>>>>>> The code is written in a way it should work fine on a uniprocessor
>>>>>> configuration (as the for_each_present_cpu etc macros will cycle on a
>>>>>> single VPE on uniprocessor). I didn't test the no CONFIG_SMP yet, but I
>>>>>> did check it with "nosmp" kernel parameter. It works.
>>>>>>
>>>>>> Anyway please test if you have the board where the second VPE is used
>>>>>> for FXS.
>>>>>>
>>>>>> The new device tree structure is now incompatible with an old version of
>>>>>> the driver (and old device tree with the new driver too). It seems icu
>>>>>> driver is used in Danube, AR9, AmazonSE and Falcon chipset too. I don't
>>>>>> know the hardware for these boards so before a final patch I would like
>>>>>> to know if they have a second ICU too (at 0x80300 offset).
>>>>>
>>>>> Normally the device tree should stay stable, but I already thought about
>>>>> the same change and I am not aware that any device ships a U-Boot with
>>>>> an embedded device tree, so this should be fine.
>>>>>
>>>>> The Amazon and Amazon SE only have one ICU block because they only have
>>>>> one CPU with one VPE.
>>>>> The Danube SoC has two ICU blocks one for each CPU, each CPU only has
>>>>> one VPE. The CPUs are not cache coherent, SMP is not possible.
>>>>>
>>>>> Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks one
>>>>> for each VPE of the single CPU.
>>>>> GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
>>>>>
>>>>>> More development could be done with locking probably. As only the
>>>>>> accesses in a single module (= 1 set of registers) would cause a race
>>>>>> condition. But as the most contended interrupts are in the same module
>>>>>> there won't be much speed increase IMO. I can add it if requested (just
>>>>>> spinlock array and some lookup code).
>>>>>
>>>>> I do not think that this improves the performance significantly, I
>>>>> assume that the CPUs only have to wait there in rare conditions anyway.
>>>>>
>>>>>> 2) Reworked lantiq xrx200 ethernet driver:
>>>>>>
>>>>>> 	0904-backport-vanilla-eth-driver.patch
>>>>>> 	0905-increase-dma-descriptors.patch
>>>>>> 	0906-increase-dma-burst-size.patch
>>>>>>
>>>>>> The code is still ugly, but stable now. There is a fragmented skb
>>>>>> support and napi polling. DMA ring buffer was increased so it can handle
>>>>>> faster speeds and I've fixed some code weirdness. I can split the
>>>>>> changes in the future into separate patches.
>>>>>
>>>>> It would be nice if you could also do the same changes to the upstream
>>>>> driver in mainline Linux kernel and send this for inclusion to mainline
>>>>> Linux.
>>>>>
>>>>>> I didn't test the ICU and eth patches separately, but I've tested the
>>>>>> ethernet driver on a single VPE only (by setting smp affinity and
>>>>>> nosmp). This version of the ethernet driver was used for root over NFS
>>>>>> on the debug setup for like two weeks (without problems).
>>>>>>
>>>>>> Tell me if we should pursue the way for the second DMA channel to PPE so
>>>>>> both VPEs can send frames at the same time.
>>>>>
>>>>> I think it should be ok to use both DMA channels for the CPU traffic.
>>>>>
>>>>>> 3) WAVE300
>>>>>>
>>>>>> In the past two weeks I've tried to fix and mash together various versions
>>>>>> of the wave300 wifi driver (there are partial versions in GPL sources from
>>>>>> router vendors). And I've managed to put the driver into "not
>>>>>> immediately crashing" mode. If you are interested in the development,
>>>>>> there is a thread in openwrt forum. The source repo here:
>>>>>>
>>>>>> https://repo.or.cz/wave300.git
>>>>>> https://repo.or.cz/wave300_rflib.git
>>>>>>
>>>>>> (the second one must be copied into the first one)
>>>>>>
>>>>>> The driver will often crash when meeting an unknown packet, request for
>>>>>> encryption (no encryption support), unusual combination of configuration
>>>>>> or just by module unloading. The code is _really_ ugly and it will
>>>>>> serve only as a hardware specification for better GPL driver development.
>>>>>> If you want to help or you have some tips you can join the forum (there
>>>>>> are links for firmwares and intensive research of available source codes
>>>>>> from vendors).
>>>>>>
>>>>>> Links:
>>>>>> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
>>>>>> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
>>>>>> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
>>>>>>
>>>>>> Petr
>>>>> Hauke
>>>>
>>>
>>> Hi
>>>
>>>> It would be nice if you could send your patches as single mails and
>>>> inline so I can easily comment on them.
>>>
>>> OK
>>>
>>>>
>>>> The DMA handling in the OpenWrt Ethernet driver is only more flexible to
>>>> handle arbitrary number of DMA channels, but I think this is not needed.
>>>>
>>>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>>>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>>>> that the hardware also takes care of this.
>>>>
>>>
>>> Yes it is 16 byte aligned in the original driver, but my patched driver
>>> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
>>> alignment caused crashing.
>>>
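The alignment requirement can be checked with a few lines. This is a
minimal sketch assuming a DMA word is 4 bytes (32 bit), so an 8W burst
needs 32 byte alignment and a 4W burst 16 bytes. Both helper names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_WORD_BYTES	4u	/* a DMA word is 32 bit */

/* Non-zero if `addr` is aligned to a burst of `burst_words` words. */
static inline int dma_addr_burst_aligned(uintptr_t addr, unsigned burst_words)
{
	uintptr_t burst_bytes = (uintptr_t)burst_words * DMA_WORD_BYTES;

	/* burst lengths are powers of two: 1, 2, 4 or 8 words */
	assert(burst_words && !(burst_words & (burst_words - 1)));
	return (addr & (burst_bytes - 1)) == 0;
}

/* Round `addr` up to the next burst boundary. */
static inline uintptr_t dma_addr_align_up(uintptr_t addr, unsigned burst_words)
{
	uintptr_t burst_bytes = (uintptr_t)burst_words * DMA_WORD_BYTES;

	assert(burst_words && !(burst_words & (burst_words - 1)));
	return (addr + burst_bytes - 1) & ~(burst_bytes - 1);
}
```

An address like 0x1010 passes the check for 4W bursts but fails it for
8W bursts, which matches the crash seen with 32B bursts on 16B aligned
buffers.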
>>>> Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
>>>> strange to me.
>>>>
>>>
>>> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
>>> for IPI communications, IP6/7 are timer (and something), and the IP2-IP5
>>> range, which is not enabled there, carries the external IRQ signals for
>>> the ICU. Without this set, the second VPE only receives IPI and not ICU
>>> events.
>>>
>>> Basically I've set this MIPS C0 Status register to the same value as the
>>> C0 Status register for the first VPE.
>>
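For reference, the interrupt mask bits IM0-IM7 sit in bits 8-15 of the
MIPS C0 Status register, one per IP line. A minimal sketch of the mask
for a range of IP lines follows. status_im_mask() is a hypothetical
helper, not the code from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define ST0_IM_SHIFT	8	/* IM0 is bit 8 of C0 Status */

/* Mask of the Status.IM bits for IP lines first_ip..last_ip (0-7). */
static inline uint32_t status_im_mask(unsigned first_ip, unsigned last_ip)
{
	uint32_t nbits;

	assert(first_ip <= last_ip && last_ip <= 7);
	nbits = last_ip - first_ip + 1;
	return ((1u << nbits) - 1u) << (ST0_IM_SHIFT + first_ip);
}
```

With the split described above, status_im_mask(2, 5) is the part that has
to be set on the second VPE so that it sees the ICU interrupts in
addition to the IPIs (IP0/IP1) and the timer (IP6/7).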
>> hmm strange, looks like there are not so many SoCs with multiple VPEs
>> which have an own IRQ controller.
>>
>>>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>>>> think your increase should not harm significantly.
>>>
>>> Yeah I've tested it, there is some minor impact on the maximal
>>> bandwidth. However I cannot set the value correctly without the model of
>>> xrx200 SoC (I assume this register controls the check frequency of the
>>> OWN bit of the first descriptor).
>>
>> Yes this is the polling frequency in fDMA/16, this value is global and
>> not per channel. The DMA controller will check the OWN bit on all
>> descriptors for all DMA channels where polling is activated with this
>> frequency. fDMA is the same as the FPI frequency, probably 250MHz.
>>
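A rough way to compare the poll counter values is to convert them to
time. The sketch below rests on two assumptions that are not confirmed
by a datasheet: one counter step equals 16 fDMA clock cycles, and the
counter field starts at bit 4 of LTQ_DMA_CPOLL (which is where
DMA_CLK_DIV4 = BIT(6) and the 0x10 << 4 from the patch would put it).
All helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_POLL	(1u << 31)	/* turn on channel polling */
#define DMA_POLL_SHIFT	4		/* assumed start of the counter field */

/* Register value for LTQ_DMA_CPOLL with the given poll counter. */
static inline uint32_t cpoll_value(uint32_t poll_cnt)
{
	return DMA_POLL | (poll_cnt << DMA_POLL_SHIFT);
}

/* Assumed model: the OWN bit is checked every poll_cnt * 16 fDMA cycles. */
static inline uint64_t poll_interval_ns(uint32_t poll_cnt, uint32_t fdma_hz)
{
	assert(fdma_hz > 0);
	return (uint64_t)poll_cnt * 16u * 1000000000ull / fdma_hz;
}

/* Time on the wire for `bytes` at `link_bps` bits per second. */
static inline uint64_t frame_time_ns(uint32_t bytes, uint64_t link_bps)
{
	assert(link_bps > 0);
	return (uint64_t)bytes * 8u * 1000000000ull / link_bps;
}
```

Under these assumptions and fDMA = 250MHz, the old value (counter 4)
polls every 256 ns and the new one (counter 0x10) every 1024 ns, while a
1500 byte frame needs 12000 ns at 1 Gbit/s.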
>>> I don't even know the clock and width
>>> of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
>>> if the original value DMA_CLK_DIV4 means "every fourth clock" it seems
>>> too often for me (if a packet has like 1500 bytes, it would check many
>>> times before the packet is transferred). The highest values empirically
>>> lag the DMA descriptor ring.
>>
>> The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
>> word addresses, a word for the DMA controller is 32 bit.
>>
>> The DMA controller can handle some priorities between the ports and
>> channels. When you activate PKTARB (BIT(31)) in DMA_CTRL the DMA
>> controller will transfer the complete packet before the arbitration is
>> changed. With MBRSTCNT (bit 25:16) in DMA_CTRL you can control after how
>> many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
>> DMA_CTRL is activated. Both apply to TX and RX.
>>
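The DMA_CTRL fields named above can be put together like this. It is a
minimal sketch using only the bit positions given in the mail.
dma_ctrl_value() and the macro names are hypothetical, and the resulting
value is untested on hardware:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_CTRL_PKTARB		(1u << 31)	/* packet arbitration */
#define DMA_CTRL_MBRSTARB	(1u << 30)	/* multiple burst arbitration */
#define DMA_CTRL_MBRSTCNT_SHIFT	16
#define DMA_CTRL_MBRSTCNT_MASK	0x3ffu		/* bits 25:16, 10 bit wide */

/* Compose a DMA_CTRL value from the arbitration settings. */
static inline uint32_t dma_ctrl_value(int pktarb, int mbrstarb,
				      uint32_t burst_cnt)
{
	uint32_t val = 0;

	assert(burst_cnt <= DMA_CTRL_MBRSTCNT_MASK);
	if (pktarb)
		val |= DMA_CTRL_PKTARB;
	if (mbrstarb)
		val |= DMA_CTRL_MBRSTARB;
	val |= burst_cnt << DMA_CTRL_MBRSTCNT_SHIFT;
	return val;
}
```

For example dma_ctrl_value(1, 0, 4) yields 0x80040000, the same value as
the (1 << 31) | 0x40000 that is commented out in the patch.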
>> Hauke
>>

Patch

--- a/arch/mips/lantiq/xway/dma.c	2019-02-12 19:46:14.000000000 +0100
+++ b/arch/mips/lantiq/xway/dma.c	2019-02-15 12:51:56.781495450 +0100
@@ -49,7 +49,10 @@ 
 #define DMA_IRQ_ACK		0x7e		/* IRQ status register */
 #define DMA_POLL		BIT(31)		/* turn on channel polling */
 #define DMA_CLK_DIV4		BIT(6)		/* polling clock divider */
-#define DMA_2W_BURST		BIT(1)		/* 2 word burst length */
+#define DMA_1W_BURST		0x0		/* 1 word burst length/no burst */
+#define DMA_2W_BURST		0x1		/* 2 word burst length */
+#define DMA_4W_BURST		0x2		/* 4 word burst length */
+#define DMA_8W_BURST		0x3		/* 8 word burst length */
 #define DMA_MAX_CHANNEL		20		/* the soc has 20 channels */
 #define DMA_ETOP_ENDIANNESS	(0xf << 8) /* endianness swap etop channels */
 #define DMA_WEIGHT	(BIT(17) | BIT(16))	/* default channel wheight */
@@ -138,7 +141,7 @@ 
 	spin_lock_irqsave(&ltq_dma_lock, flags);
 	ltq_dma_w32(ch->nr, LTQ_DMA_CS);
 	ltq_dma_w32(ch->phys, LTQ_DMA_CDBA);
-	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);
+	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);	//0xff mask
 	ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	wmb();
 	ltq_dma_w32_mask(0, DMA_CHAN_RST, LTQ_DMA_CCTRL);
@@ -155,7 +158,13 @@ 
 	ltq_dma_alloc(ch);
 
 	spin_lock_irqsave(&ltq_dma_lock, flags);
-	ltq_dma_w32(DMA_DESCPT, LTQ_DMA_CIE);
+
+//DMA_DESCPT BIT(3) //end of descriptor
+//BIT(1)	//end of packet
+//	ltq_dma_w32(DMA_DESCPT, LTQ_DMA_CIE);
+	ltq_dma_w32(BIT(1), LTQ_DMA_CIE);
+	
+	
 	ltq_dma_w32_mask(0, 1 << ch->nr, LTQ_DMA_IRNEN);
 	ltq_dma_w32(DMA_WEIGHT | DMA_TX, LTQ_DMA_CCTRL);
 	spin_unlock_irqrestore(&ltq_dma_lock, flags);
@@ -194,6 +203,12 @@ 
 	ltq_dma_w32(p, LTQ_DMA_PS);
 	switch (p) {
 	case DMA_PORT_ETOP:
+
+		/* 8 words burst, data must be aligned on 4*N bytes or freeze */
+//TODO? different bursts for TX and RX (RX is fine at 1G eth)		
+		ltq_dma_w32_mask(0x3c, (DMA_8W_BURST << 4) | (DMA_8W_BURST << 2),
+			LTQ_DMA_PCTRL);
+
 		/*
 		 * Tell the DMA engine to swap the endianness of data frames and
 		 * drop packets if the channel arbitration fails.
@@ -241,10 +256,18 @@ 
 	for (i = 0; i < DMA_MAX_CHANNEL; i++) {
 		ltq_dma_w32(i, LTQ_DMA_CS);
 		ltq_dma_w32(DMA_CHAN_RST, LTQ_DMA_CCTRL);
-		ltq_dma_w32(DMA_POLL | DMA_CLK_DIV4, LTQ_DMA_CPOLL);
 		ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	}
 
+//TODO 0x100 << 4 fastest TX without fragments
+// 0x100 for fragments timeouts, 0x10 only under really _heavy_ load
+//TODO not dependent on channel select (LTQ_DMA_CS), why it was in for cycle
+	ltq_dma_w32(DMA_POLL | (0x10 << 4), LTQ_DMA_CPOLL);
+
+//TODO packet arbitration ???, test different values
+//0x3ff << 16 multiple burst count, 1<<30 multiple burst arbitration, 1<<31 packet arbitration, 1<<0 reset (!)
+//	ltq_dma_w32((1 << 31) | 0x40000, LTQ_DMA_CTRL);
+
 	id = ltq_dma_r32(LTQ_DMA_ID);
 	dev_info(&pdev->dev,
 		"Init done - hw rev: %X, ports: %d, channels: %d\n",