[OpenWrt-Devel,RFC,v4] lantiq: IRQ balancing, ethernet driver, wave300

Message ID 40efd247-c72d-c341-de31-b46ac9b3ad69@gmail.com
State RFC

Commit Message

Petr Cvek March 14, 2019, 5:46 a.m. UTC
Hello again,

I've managed to enhance a few drivers for the lantiq platform. They are
still in an ugly, commented form (the ethernet part especially), but I
need some hints before the final version. The patches are based on
kernel 4.14.99. Copy them into target/linux/lantiq/patches-4.14 (after
removing any of my previous patches).

With the eth+irq changes the throughput is up to 360/260 Mbps (vanilla
was 170/80 Mbps on my setup). The iperf3 benchmark results (2 passes for
both the vanilla and the changed versions), together with the script,
are in the attachment.

1) IRQ with SMP and balancing support:

	0901-add-icu-smp-support.patch
	0902-enable-external-irqs-for-second-vpe.patch
	0903-add-icu1-node-for-smp.patch

As requested, I've changed the patch heavily. The original locking from
the k3b source code (probably from UGW) didn't work, and under heavy
load the system could freeze (smp affinity change during irq handling).
This version fixes that by using the irq variants of the generic raw
spinlocks.

SMP IRQ handling now works so that the VPE is switched before every
irq_enable (which serves as unmask too). This can be limited by writing
to /proc/irq/X/smp_affinity (so it could possibly be balanced from
userspace too).
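
To illustrate the rotation described above, here is a minimal userspace
sketch, not the code from the patch. The helper name next_vpe() and the
plain bitmask representation of smp_affinity are assumptions made for
this example; the kernel driver would use the cpumask helpers instead.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): return the next VPE after
 * `current` whose bit is set in `affinity`, wrapping around after
 * `nr_vpes`. Bit n of `affinity` stands for VPE n, like in
 * /proc/irq/X/smp_affinity.
 */
static unsigned int next_vpe(unsigned int current, unsigned int affinity,
			     unsigned int nr_vpes)
{
	unsigned int i;

	assert(nr_vpes > 0 && nr_vpes <= 32);

	/* Try every VPE once, starting with the one after `current`. */
	for (i = 1; i <= nr_vpes; i++) {
		unsigned int candidate = (current + i) % nr_vpes;

		if (affinity & (1u << candidate))
			return candidate;
	}

	/* Empty mask: nothing is allowed, so stay where we are. */
	return current;
}
```

With an affinity of 0x3 on two VPEs the interrupt alternates between
VPE 0 and VPE 1; with 0x1 it always stays on VPE 0.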

I've rewritten the device tree reg fields so there are only 2 arrays
now, one per ICU controller. The original layout, one per module, was
redundant as the ranges were contiguous. The module offsets within a
single ICU are now computed explicitly in a macro:

	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))

Before, there was a pointer for every 0x28 block (there shouldn't be a
slowdown, only a multiplication and an addition for every register
access).
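
The address arithmetic of the macros above can be sketched in plain C as
follows. This is only an illustration: the function name and the two
constant names are made up for this example, and the module count of 5
is taken from the "module number 0-4" remark in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Distance between two consecutive modules of one ICU (from the macros). */
#define ICU_MODULE_STRIDE	0x28u
/* Assumption for this sketch: modules 0-4, i.e. five per ICU. */
#define ICU_MODULES		5u

/*
 * Hypothetical helper: byte offset of register `reg` of module `m`,
 * relative to the base address of one ICU (ltq_icu_membase[vpe]).
 */
static size_t icu_reg_offset(unsigned int m, size_t reg)
{
	assert(m < ICU_MODULES);
	assert(reg < ICU_MODULE_STRIDE);

	/* One multiplication and one addition per register access. */
	return (size_t)m * ICU_MODULE_STRIDE + reg;
}
```

For example, a register at offset 0x10 in module 3 ends up at
3 * 0x28 + 0x10 = 0x88 from the ICU base.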

Also, I've simplified the register names from LTQ_ICU_IM0_ISR to
LTQ_ICU_ISR, as "IM0" (module) was confusing (the real module number 0-4
is a part of the macro).

The code is written so that it should work fine on a uniprocessor
configuration (as for_each_present_cpu and similar macros will iterate
over a single VPE there). I haven't tested a build without CONFIG_SMP
yet, but I did check it with the "nosmp" kernel parameter, and it works.

Anyway, please test if you have a board where the second VPE is used
for FXS.

The new device tree structure is now incompatible with the old version
of the driver (and the old device tree with the new driver too). It
seems the icu driver is used on the Danube, AR9, AmazonSE and Falcon
chipsets too. I don't know the hardware of these boards, so before a
final patch I would like to know if they have a second ICU too (at
offset 0x80300).

The locking could probably be refined further, as only accesses within
a single module (= 1 set of registers) can race with each other. But as
the most contended interrupts are in the same module, there won't be
much of a speed increase IMO. I can add it if requested (just a
spinlock array and some lookup code).
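
A rough userspace sketch of what such a spinlock array with lookup code
could look like is below. It is not part of the patches. All names are
made up for this example, the value of 32 interrupt lines per module is
an assumption, and C11 atomic_flag stands in for the kernel's raw
spinlocks (which would also have to disable interrupts).

```c
#include <assert.h>
#include <stdatomic.h>

#define ICU_MODULES	5u	/* modules 0-4 */
#define IRQS_PER_MODULE	32u	/* assumption: 32 lines per module */

/* One lock per module, i.e. per set of registers. */
static atomic_flag icu_module_lock[ICU_MODULES] = {
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Lookup: which module does a hardware interrupt number belong to? */
static unsigned int icu_module_of(unsigned int hwirq)
{
	assert(hwirq < ICU_MODULES * IRQS_PER_MODULE);
	return hwirq / IRQS_PER_MODULE;
}

/* Spin until the lock of the module owning `hwirq` is taken. */
static void icu_lock(unsigned int hwirq)
{
	atomic_flag *lock = &icu_module_lock[icu_module_of(hwirq)];

	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		; /* busy-wait */
}

/* Release the lock of the module owning `hwirq`. */
static void icu_unlock(unsigned int hwirq)
{
	atomic_flag_clear_explicit(&icu_module_lock[icu_module_of(hwirq)],
				   memory_order_release);
}
```

Two interrupts from different modules (for example 5 and 40) would then
take different locks, while two interrupts from the same module still
serialize on one lock.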

2) Reworked lantiq xrx200 ethernet driver:

	0904-backport-vanilla-eth-driver.patch
	0905-increase-dma-descriptors.patch
	0906-increase-dma-burst-size.patch

The code is still ugly, but stable now. There is fragmented skb support
and napi polling. The DMA ring buffer was enlarged so it can handle
higher speeds, and I've fixed some code weirdness. I can split the
changes into separate patches in the future.

I didn't test the ICU and eth patches separately, but I've tested the
ethernet driver on a single VPE only (by setting smp affinity and
nosmp). This version of the ethernet driver was used for root over NFS
on the debug setup for about two weeks (without problems).

Tell me if we should pursue a second DMA channel to the PPE so that
both VPEs can send frames at the same time.

3) WAVE300

In the past two weeks I've tried to fix and mash together various
versions of the wave300 wifi driver (there are partial versions in the
GPL sources from router vendors), and I've managed to get the driver
into a "not immediately crashing" mode. If you are interested in the
development, there is a thread on the OpenWrt forum. The source repos
are here:

https://repo.or.cz/wave300.git
https://repo.or.cz/wave300_rflib.git

(the second one must be copied into the first one)

The driver will often crash when it meets an unknown packet, on a
request for encryption (there is no encryption support), with an
unusual combination of configuration options, or just on module
unloading. The code is _really_ ugly and it will serve only as a
hardware specification for the development of a better GPL driver. If
you want to help or you have some tips, you can join the forum (there
are links to firmwares and extensive research into the source code
available from vendors).

Links:
https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25

Petr
+ : ':::::::[' configuration vanilla ']:::::::' :
+ iperf3 -c 10.0.0.80
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 51814 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec   27   72.1 KBytes       
[  4]   1.00-2.00   sec  20.6 MBytes   173 Mbits/sec   29   70.7 KBytes       
[  4]   2.00-3.00   sec  20.8 MBytes   174 Mbits/sec   35   60.8 KBytes       
[  4]   3.00-4.00   sec  20.8 MBytes   174 Mbits/sec   29   73.5 KBytes       
[  4]   4.00-5.00   sec  20.8 MBytes   174 Mbits/sec   32   70.7 KBytes       
[  4]   5.00-6.00   sec  20.7 MBytes   174 Mbits/sec   35   69.3 KBytes       
[  4]   6.00-7.00   sec  20.8 MBytes   174 Mbits/sec   36   60.8 KBytes       
[  4]   7.00-8.00   sec  20.8 MBytes   175 Mbits/sec   29   59.4 KBytes       
[  4]   8.00-9.00   sec  20.8 MBytes   175 Mbits/sec   41   46.7 KBytes       
[  4]   9.00-10.00  sec  20.8 MBytes   175 Mbits/sec   28   50.9 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec  321             sender
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 51862 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  9.63 MBytes  80.7 Mbits/sec                  
[  4]   1.00-2.00   sec  9.65 MBytes  81.0 Mbits/sec                  
[  4]   2.00-3.00   sec  9.52 MBytes  79.9 Mbits/sec                  
[  4]   3.00-4.00   sec  9.69 MBytes  81.3 Mbits/sec                  
[  4]   4.00-5.00   sec  9.68 MBytes  81.2 Mbits/sec                  
[  4]   5.00-6.00   sec  9.66 MBytes  81.0 Mbits/sec                  
[  4]   6.00-7.00   sec  9.68 MBytes  81.2 Mbits/sec                  
[  4]   7.00-8.00   sec  9.70 MBytes  81.4 Mbits/sec                  
[  4]   8.00-9.00   sec  9.69 MBytes  81.3 Mbits/sec                  
[  4]   9.00-10.00  sec  9.79 MBytes  82.1 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  97.0 MBytes  81.4 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  97.0 MBytes  81.4 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 51957 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  16.4 MBytes   138 Mbits/sec  2101  
[  4]   1.00-2.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   2.00-3.00   sec  17.9 MBytes   150 Mbits/sec  2285  
[  4]   3.00-4.00   sec  17.9 MBytes   150 Mbits/sec  2292  
[  4]   4.00-5.00   sec  17.9 MBytes   150 Mbits/sec  2287  
[  4]   5.00-6.00   sec  17.9 MBytes   150 Mbits/sec  2291  
[  4]   6.00-7.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec  2291  
[  4]   8.00-9.00   sec  17.8 MBytes   150 Mbits/sec  2282  
[  4]   9.00-10.00  sec  17.9 MBytes   150 Mbits/sec  2292  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   177 MBytes   149 Mbits/sec  136434.385 ms  1349/1417 (95%)  
[  4] Sent 1417 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 46317 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.7 MBytes   124 Mbits/sec  0.077 ms  0/1885 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.072 ms  0/1942 (0%)  
[  4]   2.00-3.00   sec  15.0 MBytes   126 Mbits/sec  0.074 ms  0/1924 (0%)  
[  4]   3.00-4.00   sec  14.3 MBytes   120 Mbits/sec  0.080 ms  0/1825 (0%)  
[  4]   4.00-5.00   sec  14.4 MBytes   120 Mbits/sec  0.079 ms  0/1837 (0%)  
[  4]   5.00-6.00   sec  14.8 MBytes   124 Mbits/sec  0.065 ms  0/1888 (0%)  
[  4]   6.00-7.00   sec  15.3 MBytes   128 Mbits/sec  0.076 ms  0/1956 (0%)  
[  4]   7.00-8.00   sec  15.2 MBytes   128 Mbits/sec  0.095 ms  0/1948 (0%)  
[  4]   8.00-9.00   sec  15.1 MBytes   127 Mbits/sec  0.092 ms  0/1932 (0%)  
[  4]   9.00-10.00  sec  15.1 MBytes   127 Mbits/sec  0.095 ms  0/1938 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   149 MBytes   125 Mbits/sec  0.085 ms  0/19082 (0%)  
[  4] Sent 19082 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 48172 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  58.5 MBytes   491 Mbits/sec  7494  
[  4]   1.00-2.00   sec  60.6 MBytes   508 Mbits/sec  7756  
[  4]   2.00-3.00   sec  58.7 MBytes   492 Mbits/sec  7508  
[  4]   3.00-4.00   sec  60.2 MBytes   505 Mbits/sec  7710  
[  4]   4.00-5.00   sec  59.0 MBytes   495 Mbits/sec  7556  
[  4]   5.00-6.00   sec  60.5 MBytes   508 Mbits/sec  7744  
[  4]   6.00-7.00   sec  58.7 MBytes   492 Mbits/sec  7508  
[  4]   7.00-8.00   sec  59.1 MBytes   496 Mbits/sec  7565  
[  4]   8.00-9.00   sec  60.4 MBytes   507 Mbits/sec  7730  
[  4]   9.00-10.00  sec  59.9 MBytes   502 Mbits/sec  7664  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   596 MBytes   500 Mbits/sec  2051749.337 ms  64268/64294 (1e+02%)  
[  4] Sent 64294 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 35361 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.3 MBytes   120 Mbits/sec  0.097 ms  0/1830 (0%)  
[  4]   1.00-2.00   sec  14.3 MBytes   120 Mbits/sec  0.101 ms  0/1830 (0%)  
[  4]   2.00-3.00   sec  14.3 MBytes   120 Mbits/sec  0.072 ms  0/1827 (0%)  
[  4]   3.00-4.00   sec  14.2 MBytes   119 Mbits/sec  0.081 ms  0/1819 (0%)  
[  4]   4.00-5.00   sec  14.3 MBytes   120 Mbits/sec  0.070 ms  0/1834 (0%)  
[  4]   5.00-6.00   sec  14.3 MBytes   120 Mbits/sec  0.085 ms  0/1833 (0%)  
[  4]   6.00-7.00   sec  14.3 MBytes   120 Mbits/sec  0.082 ms  0/1835 (0%)  
[  4]   7.00-8.00   sec  14.3 MBytes   120 Mbits/sec  0.109 ms  0/1836 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.080 ms  0/1822 (0%)  
[  4]   9.00-10.00  sec  14.3 MBytes   120 Mbits/sec  0.090 ms  0/1825 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   143 MBytes   120 Mbits/sec  0.104 ms  0/18298 (0%)  
[  4] Sent 18298 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 53231 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec   107 MBytes   902 Mbits/sec  13759  
[  4]   1.00-2.00   sec   107 MBytes   896 Mbits/sec  13675  
[  4]   2.00-3.00   sec   107 MBytes   901 Mbits/sec  13753  
[  4]   3.00-4.00   sec   107 MBytes   898 Mbits/sec  13700  
[  4]   4.00-5.00   sec   107 MBytes   902 Mbits/sec  13759  
[  4]   5.00-6.00   sec   108 MBytes   902 Mbits/sec  13762  
[  4]   6.00-7.00   sec   107 MBytes   899 Mbits/sec  13719  
[  4]   7.00-8.00   sec   108 MBytes   902 Mbits/sec  13760  
[  4]   8.00-9.00   sec   107 MBytes   901 Mbits/sec  13753  
[  4]   9.00-10.00  sec   107 MBytes   902 Mbits/sec  13756  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  1.05 GBytes   900 Mbits/sec  5762140.265 ms  210/220 (95%)  
[  4] Sent 220 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 34296 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  14.3 MBytes   120 Mbits/sec  0.084 ms  0/1835 (0%)  
[  4]   1.00-2.00   sec  14.3 MBytes   120 Mbits/sec  0.075 ms  0/1835 (0%)  
[  4]   2.00-3.00   sec  14.5 MBytes   122 Mbits/sec  0.062 ms  0/1858 (0%)  
[  4]   3.00-4.00   sec  15.1 MBytes   127 Mbits/sec  0.060 ms  0/1935 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.076 ms  0/1958 (0%)  
[  4]   5.00-6.00   sec  14.5 MBytes   122 Mbits/sec  0.078 ms  0/1861 (0%)  
[  4]   6.00-7.00   sec  14.4 MBytes   120 Mbits/sec  0.100 ms  0/1837 (0%)  
[  4]   7.00-8.00   sec  14.3 MBytes   120 Mbits/sec  0.098 ms  0/1835 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.085 ms  0/1821 (0%)  
[  4]   9.00-10.00  sec  14.3 MBytes   120 Mbits/sec  0.110 ms  0/1825 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   145 MBytes   122 Mbits/sec  0.101 ms  0/18606 (0%)  
[  4] Sent 18606 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 52130 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 52132 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 52134 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  7.47 MBytes  62.6 Mbits/sec   73   17.0 KBytes       
[  6]   0.00-1.00   sec  7.21 MBytes  60.5 Mbits/sec   78   19.8 KBytes       
[  9]   0.00-1.00   sec  7.14 MBytes  59.9 Mbits/sec   76   31.1 KBytes       
[SUM]   0.00-1.00   sec  21.8 MBytes   183 Mbits/sec  227             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  7.95 MBytes  66.7 Mbits/sec   61   12.7 KBytes       
[  6]   1.00-2.00   sec  5.84 MBytes  49.0 Mbits/sec   99   35.4 KBytes       
[  9]   1.00-2.00   sec  7.08 MBytes  59.4 Mbits/sec   78   32.5 KBytes       
[SUM]   1.00-2.00   sec  20.9 MBytes   175 Mbits/sec  238             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  6.09 MBytes  51.1 Mbits/sec   73   31.1 KBytes       
[  6]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec   64   22.6 KBytes       
[  9]   2.00-3.00   sec  6.09 MBytes  51.1 Mbits/sec   81   18.4 KBytes       
[SUM]   2.00-3.00   sec  21.1 MBytes   177 Mbits/sec  218             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  6.71 MBytes  56.3 Mbits/sec   80   11.3 KBytes       
[  6]   3.00-4.00   sec  8.26 MBytes  69.3 Mbits/sec   76   17.0 KBytes       
[  9]   3.00-4.00   sec  6.28 MBytes  52.7 Mbits/sec   77   42.4 KBytes       
[SUM]   3.00-4.00   sec  21.3 MBytes   178 Mbits/sec  233             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  6.59 MBytes  55.3 Mbits/sec   94   12.7 KBytes       
[  6]   4.00-5.00   sec  7.58 MBytes  63.6 Mbits/sec   63   28.3 KBytes       
[  9]   4.00-5.00   sec  6.84 MBytes  57.3 Mbits/sec   62   11.3 KBytes       
[SUM]   4.00-5.00   sec  21.0 MBytes   176 Mbits/sec  219             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  8.76 MBytes  73.5 Mbits/sec   57   22.6 KBytes       
[  6]   5.00-6.00   sec  6.28 MBytes  52.6 Mbits/sec   80   38.2 KBytes       
[  9]   5.00-6.00   sec  6.28 MBytes  52.6 Mbits/sec   90   7.07 KBytes       
[SUM]   5.00-6.00   sec  21.3 MBytes   179 Mbits/sec  227             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  7.33 MBytes  61.5 Mbits/sec   72   18.4 KBytes       
[  6]   6.00-7.00   sec  7.02 MBytes  58.9 Mbits/sec   66   35.4 KBytes       
[  9]   6.00-7.00   sec  6.77 MBytes  56.8 Mbits/sec   67   17.0 KBytes       
[SUM]   6.00-7.00   sec  21.1 MBytes   177 Mbits/sec  205             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  8.45 MBytes  70.9 Mbits/sec   72   25.5 KBytes       
[  6]   7.00-8.00   sec  6.71 MBytes  56.3 Mbits/sec   82   35.4 KBytes       
[  9]   7.00-8.00   sec  5.90 MBytes  49.5 Mbits/sec   74   17.0 KBytes       
[SUM]   7.00-8.00   sec  21.1 MBytes   177 Mbits/sec  228             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  6.46 MBytes  54.2 Mbits/sec   77   36.8 KBytes       
[  6]   8.00-9.00   sec  6.90 MBytes  57.9 Mbits/sec   78   11.3 KBytes       
[  9]   8.00-9.00   sec  7.89 MBytes  66.2 Mbits/sec   68   11.3 KBytes       
[SUM]   8.00-9.00   sec  21.3 MBytes   178 Mbits/sec  223             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  6.59 MBytes  55.3 Mbits/sec   79   38.2 KBytes       
[  6]   9.00-10.00  sec  8.76 MBytes  73.5 Mbits/sec   58   24.0 KBytes       
[  9]   9.00-10.00  sec  5.72 MBytes  48.0 Mbits/sec   77   7.07 KBytes       
[SUM]   9.00-10.00  sec  21.1 MBytes   177 Mbits/sec  214             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  72.4 MBytes  60.7 Mbits/sec  738             sender
[  4]   0.00-10.00  sec  72.0 MBytes  60.4 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  73.5 MBytes  61.7 Mbits/sec  744             sender
[  6]   0.00-10.00  sec  73.2 MBytes  61.4 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  66.0 MBytes  55.4 Mbits/sec  750             sender
[  9]   0.00-10.00  sec  65.6 MBytes  55.1 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   212 MBytes   178 Mbits/sec  2232             sender
[SUM]   0.00-10.00  sec   211 MBytes   177 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 52178 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 52180 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 52182 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  4.03 MBytes  33.8 Mbits/sec                  
[  6]   0.00-1.00   sec  2.81 MBytes  23.6 Mbits/sec                  
[  9]   0.00-1.00   sec  2.79 MBytes  23.4 Mbits/sec                  
[SUM]   0.00-1.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   1.00-2.00   sec  3.22 MBytes  27.0 Mbits/sec                  
[SUM]   1.00-2.00   sec  9.72 MBytes  81.5 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   2.00-3.00   sec  3.30 MBytes  27.6 Mbits/sec                  
[SUM]   2.00-3.00   sec  9.80 MBytes  82.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   3.00-4.00   sec  3.12 MBytes  26.2 Mbits/sec                  
[  9]   3.00-4.00   sec  3.19 MBytes  26.8 Mbits/sec                  
[SUM]   3.00-4.00   sec  9.57 MBytes  80.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  2.22 MBytes  18.6 Mbits/sec                  
[  6]   4.00-5.00   sec  2.38 MBytes  19.9 Mbits/sec                  
[  9]   4.00-5.00   sec  2.28 MBytes  19.2 Mbits/sec                  
[SUM]   4.00-5.00   sec  6.88 MBytes  57.7 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  3.30 MBytes  27.7 Mbits/sec                  
[  6]   5.00-6.00   sec  3.37 MBytes  28.3 Mbits/sec                  
[  9]   5.00-6.00   sec  3.18 MBytes  26.7 Mbits/sec                  
[SUM]   5.00-6.00   sec  9.85 MBytes  82.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  2.99 MBytes  25.1 Mbits/sec                  
[  6]   6.00-7.00   sec  2.88 MBytes  24.1 Mbits/sec                  
[  9]   6.00-7.00   sec  3.00 MBytes  25.2 Mbits/sec                  
[SUM]   6.00-7.00   sec  8.87 MBytes  74.4 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  3.14 MBytes  26.3 Mbits/sec                  
[  6]   7.00-8.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   7.00-8.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   7.00-8.00   sec  9.64 MBytes  80.9 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  3.25 MBytes  27.2 Mbits/sec                  
[  6]   8.00-9.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   8.00-9.00   sec  3.16 MBytes  26.5 Mbits/sec                  
[SUM]   8.00-9.00   sec  9.65 MBytes  81.0 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   9.00-10.00  sec  3.21 MBytes  26.9 Mbits/sec                  
[  9]   9.00-10.00  sec  3.22 MBytes  27.0 Mbits/sec                  
[SUM]   9.00-10.00  sec  9.68 MBytes  81.2 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  32.2 MBytes  27.0 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  32.2 MBytes  27.0 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  31.1 MBytes  26.1 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  31.1 MBytes  26.1 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  30.9 MBytes  25.9 Mbits/sec    0             sender
[  9]   0.00-10.00  sec  30.9 MBytes  25.9 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  94.2 MBytes  79.0 Mbits/sec    0             sender
[SUM]   0.00-10.00  sec  94.2 MBytes  79.0 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -u -b 800M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 36791 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 51969 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 39473 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[  6]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[  9]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec  4871  
[SUM]   0.00-1.00   sec   114 MBytes   958 Mbits/sec  14613  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   1.00-2.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  6]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  9]   2.00-3.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[SUM]   2.00-3.00   sec   114 MBytes   958 Mbits/sec  14625  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   3.00-4.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[  6]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[  9]   4.00-5.00   sec  38.0 MBytes   319 Mbits/sec  4864  
[SUM]   4.00-5.00   sec   114 MBytes   956 Mbits/sec  14592  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  6]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[  9]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4875  
[SUM]   5.00-6.00   sec   114 MBytes   958 Mbits/sec  14625  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   6.00-7.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[  6]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[  9]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4876  
[SUM]   7.00-8.00   sec   114 MBytes   958 Mbits/sec  14628  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   8.00-9.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[  6]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[  9]   9.00-10.00  sec  37.9 MBytes   318 Mbits/sec  4856  
[SUM]   9.00-10.00  sec   114 MBytes   955 Mbits/sec  14568  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.391 ms  0/3 (0%)  
[  4] Sent 3 datagrams
[  6]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.281 ms  0/3 (0%)  
[  6] Sent 3 datagrams
[  9]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  9052841.181 ms  0/3 (0%)  
[  9] Sent 3 datagrams
[SUM]   0.00-10.00  sec  1.11 GBytes   958 Mbits/sec  9052841.285 ms  0/9 (0%)  

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R -u -b 800M
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 43263 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 49331 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 60542 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  4.92 MBytes  41.3 Mbits/sec  0.156 ms  0/630 (0%)  
[  6]   0.00-1.00   sec  4.92 MBytes  41.3 Mbits/sec  0.170 ms  0/630 (0%)  
[  9]   0.00-1.00   sec  4.91 MBytes  41.2 Mbits/sec  0.237 ms  0/629 (0%)  
[SUM]   0.00-1.00   sec  14.8 MBytes   124 Mbits/sec  0.188 ms  0/1889 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  4.92 MBytes  41.3 Mbits/sec  0.173 ms  0/630 (0%)  
[  6]   1.00-2.00   sec  4.91 MBytes  41.2 Mbits/sec  0.191 ms  0/629 (0%)  
[  9]   1.00-2.00   sec  4.91 MBytes  41.2 Mbits/sec  0.192 ms  0/629 (0%)  
[SUM]   1.00-2.00   sec  14.8 MBytes   124 Mbits/sec  0.185 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  4.96 MBytes  41.6 Mbits/sec  0.246 ms  0/635 (0%)  
[  6]   2.00-3.00   sec  4.97 MBytes  41.7 Mbits/sec  0.167 ms  0/636 (0%)  
[  9]   2.00-3.00   sec  4.95 MBytes  41.5 Mbits/sec  0.232 ms  0/634 (0%)  
[SUM]   2.00-3.00   sec  14.9 MBytes   125 Mbits/sec  0.215 ms  0/1905 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  4.97 MBytes  41.7 Mbits/sec  0.189 ms  0/636 (0%)  
[  6]   3.00-4.00   sec  4.96 MBytes  41.6 Mbits/sec  0.121 ms  0/635 (0%)  
[  9]   3.00-4.00   sec  4.97 MBytes  41.7 Mbits/sec  0.195 ms  0/636 (0%)  
[SUM]   3.00-4.00   sec  14.9 MBytes   125 Mbits/sec  0.168 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.97 MBytes  41.7 Mbits/sec  0.180 ms  0/636 (0%)  
[  6]   4.00-5.00   sec  4.97 MBytes  41.7 Mbits/sec  0.185 ms  0/636 (0%)  
[  9]   4.00-5.00   sec  4.96 MBytes  41.6 Mbits/sec  0.132 ms  0/635 (0%)  
[SUM]   4.00-5.00   sec  14.9 MBytes   125 Mbits/sec  0.166 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.178 ms  0/636 (0%)  
[  6]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.209 ms  0/636 (0%)  
[  9]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.167 ms  0/636 (0%)  
[SUM]   5.00-6.00   sec  14.9 MBytes   125 Mbits/sec  0.185 ms  0/1908 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.141 ms  0/628 (0%)  
[  6]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.211 ms  0/628 (0%)  
[  9]   6.00-7.00   sec  4.91 MBytes  41.2 Mbits/sec  0.152 ms  0/629 (0%)  
[SUM]   6.00-7.00   sec  14.7 MBytes   124 Mbits/sec  0.168 ms  0/1885 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  4.92 MBytes  41.3 Mbits/sec  0.290 ms  0/630 (0%)  
[  6]   7.00-8.00   sec  4.91 MBytes  41.2 Mbits/sec  0.167 ms  0/629 (0%)  
[  9]   7.00-8.00   sec  4.91 MBytes  41.2 Mbits/sec  0.367 ms  0/629 (0%)  
[SUM]   7.00-8.00   sec  14.8 MBytes   124 Mbits/sec  0.275 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  4.93 MBytes  41.4 Mbits/sec  0.147 ms  0/631 (0%)  
[  6]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.170 ms  0/628 (0%)  
[  9]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.137 ms  0/628 (0%)  
[SUM]   8.00-9.00   sec  14.7 MBytes   124 Mbits/sec  0.151 ms  0/1887 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  4.97 MBytes  41.7 Mbits/sec  0.215 ms  0/636 (0%)  
[  6]   9.00-10.00  sec  4.98 MBytes  41.7 Mbits/sec  0.150 ms  0/637 (0%)  
[  9]   9.00-10.00  sec  4.96 MBytes  41.6 Mbits/sec  0.272 ms  0/635 (0%)  
[SUM]   9.00-10.00  sec  14.9 MBytes   125 Mbits/sec  0.212 ms  0/1908 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.227 ms  0/6335 (0%)  
[  4] Sent 6335 datagrams
[  6]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.183 ms  0/6331 (0%)  
[  6] Sent 6331 datagrams
[  9]   0.00-10.00  sec  49.4 MBytes  41.5 Mbits/sec  0.261 ms  0/6327 (0%)  
[  9] Sent 6327 datagrams
[SUM]   0.00-10.00  sec   148 MBytes   124 Mbits/sec  0.224 ms  0/18993 (0%)  

iperf Done.
+ : ':::::::[' configuration vanilla ']:::::::' :
+ iperf3 -c 10.0.0.80
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 52728 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  21.2 MBytes   178 Mbits/sec   24   66.5 KBytes       
[  4]   1.00-2.00   sec  20.8 MBytes   175 Mbits/sec   36   67.9 KBytes       
[  4]   2.00-3.00   sec  20.5 MBytes   172 Mbits/sec   40   52.3 KBytes       
[  4]   3.00-4.00   sec  20.7 MBytes   174 Mbits/sec   37   45.2 KBytes       
[  4]   4.00-5.00   sec  20.4 MBytes   171 Mbits/sec   27   46.7 KBytes       
[  4]   5.00-6.00   sec  20.6 MBytes   173 Mbits/sec   36   66.5 KBytes       
[  4]   6.00-7.00   sec  20.8 MBytes   174 Mbits/sec   31   56.6 KBytes       
[  4]   7.00-8.00   sec  20.8 MBytes   174 Mbits/sec   46   43.8 KBytes       
[  4]   8.00-9.00   sec  20.9 MBytes   176 Mbits/sec   31   48.1 KBytes       
[  4]   9.00-10.00  sec  20.9 MBytes   175 Mbits/sec   28   65.0 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   208 MBytes   174 Mbits/sec  336             sender
[  4]   0.00-10.00  sec   207 MBytes   174 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 52768 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  9.95 MBytes  83.5 Mbits/sec                  
[  4]   1.00-2.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   2.00-3.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   3.00-4.00   sec  13.0 MBytes   109 Mbits/sec                  
[  4]   4.00-5.00   sec  13.1 MBytes   110 Mbits/sec                  
[  4]   5.00-6.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   6.00-7.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   7.00-8.00   sec  13.2 MBytes   111 Mbits/sec                  
[  4]   8.00-9.00   sec  13.3 MBytes   112 Mbits/sec                  
[  4]   9.00-10.00  sec  10.2 MBytes  85.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   125 MBytes   105 Mbits/sec    0             sender
[  4]   0.00-10.00  sec   125 MBytes   105 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 40911 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  16.4 MBytes   138 Mbits/sec  2100  
[  4]   1.00-2.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   2.00-3.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   3.00-4.00   sec  17.9 MBytes   150 Mbits/sec  2292  
[  4]   4.00-5.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   5.00-6.00   sec  17.9 MBytes   150 Mbits/sec  2286  
[  4]   6.00-7.00   sec  17.9 MBytes   150 Mbits/sec  2288  
[  4]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   8.00-9.00   sec  17.9 MBytes   150 Mbits/sec  2289  
[  4]   9.00-10.00  sec  17.9 MBytes   150 Mbits/sec  2289  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   177 MBytes   149 Mbits/sec  136432.924 ms  1354/1422 (95%)  
[  4] Sent 1422 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 150M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 45555 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.3 MBytes   128 Mbits/sec  0.077 ms  0/1960 (0%)  
[  4]   1.00-2.00   sec  15.3 MBytes   129 Mbits/sec  0.068 ms  0/1962 (0%)  
[  4]   2.00-3.00   sec  15.2 MBytes   127 Mbits/sec  0.051 ms  0/1942 (0%)  
[  4]   3.00-4.00   sec  15.2 MBytes   127 Mbits/sec  0.074 ms  0/1940 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.064 ms  0/1959 (0%)  
[  4]   5.00-6.00   sec  14.7 MBytes   123 Mbits/sec  0.093 ms  0/1879 (0%)  
[  4]   6.00-7.00   sec  15.5 MBytes   130 Mbits/sec  0.086 ms  0/1990 (0%)  
[  4]   7.00-8.00   sec  17.5 MBytes   146 Mbits/sec  0.089 ms  0/2235 (0%)  
[  4]   8.00-9.00   sec  14.2 MBytes   119 Mbits/sec  0.060 ms  0/1816 (0%)  
[  4]   9.00-10.00  sec  15.0 MBytes   126 Mbits/sec  0.095 ms  0/1917 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   153 MBytes   128 Mbits/sec  0.101 ms  0/19606 (0%)  
[  4] Sent 19606 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 33642 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  59.3 MBytes   497 Mbits/sec  7585  
[  4]   1.00-2.00   sec  58.5 MBytes   491 Mbits/sec  7492  
[  4]   2.00-3.00   sec  60.0 MBytes   503 Mbits/sec  7677  
[  4]   3.00-4.00   sec  59.3 MBytes   498 Mbits/sec  7596  
[  4]   4.00-5.00   sec  60.9 MBytes   511 Mbits/sec  7794  
[  4]   5.00-6.00   sec  59.0 MBytes   495 Mbits/sec  7556  
[  4]   6.00-7.00   sec  59.7 MBytes   501 Mbits/sec  7639  
[  4]   7.00-8.00   sec  59.2 MBytes   496 Mbits/sec  7574  
[  4]   8.00-9.00   sec  60.4 MBytes   507 Mbits/sec  7736  
[  4]   9.00-10.00  sec  58.7 MBytes   493 Mbits/sec  7517  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   595 MBytes   499 Mbits/sec  1147799.599 ms  64273/64308 (1e+02%)  
[  4] Sent 64308 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 500M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 42014 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.2 MBytes   127 Mbits/sec  0.086 ms  0/1942 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.099 ms  0/1940 (0%)  
[  4]   2.00-3.00   sec  15.1 MBytes   127 Mbits/sec  0.087 ms  0/1932 (0%)  
[  4]   3.00-4.00   sec  15.0 MBytes   126 Mbits/sec  0.059 ms  0/1920 (0%)  
[  4]   4.00-5.00   sec  15.1 MBytes   127 Mbits/sec  0.070 ms  0/1931 (0%)  
[  4]   5.00-6.00   sec  15.2 MBytes   127 Mbits/sec  0.109 ms  0/1942 (0%)  
[  4]   6.00-7.00   sec  15.2 MBytes   127 Mbits/sec  0.102 ms  0/1941 (0%)  
[  4]   7.00-8.00   sec  15.2 MBytes   127 Mbits/sec  0.069 ms  0/1943 (0%)  
[  4]   8.00-9.00   sec  15.0 MBytes   126 Mbits/sec  0.074 ms  0/1926 (0%)  
[  4]   9.00-10.00  sec  15.0 MBytes   126 Mbits/sec  0.082 ms  0/1919 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   151 MBytes   127 Mbits/sec  0.089 ms  0/19342 (0%)  
[  4] Sent 19342 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 55639 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  96.9 MBytes   813 Mbits/sec  12402  
[  4]   1.00-2.00   sec   107 MBytes   897 Mbits/sec  13693  
[  4]   2.00-3.00   sec   107 MBytes   898 Mbits/sec  13701  
[  4]   3.00-4.00   sec   107 MBytes   898 Mbits/sec  13698  
[  4]   4.00-5.00   sec   107 MBytes   897 Mbits/sec  13689  
[  4]   5.00-6.00   sec   107 MBytes   896 Mbits/sec  13679  
[  4]   6.00-7.00   sec   107 MBytes   898 Mbits/sec  13710  
[  4]   7.00-8.00   sec   107 MBytes   899 Mbits/sec  13719  
[  4]   8.00-9.00   sec   107 MBytes   894 Mbits/sec  13635  
[  4]   9.00-10.00  sec   107 MBytes   899 Mbits/sec  13725  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  1.03 GBytes   889 Mbits/sec  3022016.748 ms  1257/1277 (98%)  
[  4] Sent 1277 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -u -b 1000M -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 54887 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  15.2 MBytes   127 Mbits/sec  0.087 ms  0/1942 (0%)  
[  4]   1.00-2.00   sec  15.2 MBytes   127 Mbits/sec  0.092 ms  0/1945 (0%)  
[  4]   2.00-3.00   sec  15.1 MBytes   126 Mbits/sec  0.078 ms  0/1930 (0%)  
[  4]   3.00-4.00   sec  15.1 MBytes   127 Mbits/sec  0.078 ms  0/1938 (0%)  
[  4]   4.00-5.00   sec  15.3 MBytes   128 Mbits/sec  0.080 ms  0/1954 (0%)  
[  4]   5.00-6.00   sec  15.3 MBytes   128 Mbits/sec  0.088 ms  0/1959 (0%)  
[  4]   6.00-7.00   sec  15.3 MBytes   129 Mbits/sec  0.084 ms  0/1961 (0%)  
[  4]   7.00-8.00   sec  15.3 MBytes   128 Mbits/sec  0.266 ms  0/1956 (0%)  
[  4]   8.00-9.00   sec  15.2 MBytes   128 Mbits/sec  0.079 ms  0/1949 (0%)  
[  4]   9.00-10.00  sec  15.1 MBytes   127 Mbits/sec  0.069 ms  0/1939 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   152 MBytes   128 Mbits/sec  0.063 ms  0/19480 (0%)  
[  4] Sent 19480 datagrams

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 53060 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53062 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 53064 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  8.14 MBytes  68.3 Mbits/sec   65   11.3 KBytes       
[  6]   0.00-1.00   sec  6.65 MBytes  55.8 Mbits/sec   81   12.7 KBytes       
[  9]   0.00-1.00   sec  7.20 MBytes  60.4 Mbits/sec   60   49.5 KBytes       
[SUM]   0.00-1.00   sec  22.0 MBytes   184 Mbits/sec  206             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  6.84 MBytes  57.3 Mbits/sec   76   25.5 KBytes       
[  6]   1.00-2.00   sec  6.59 MBytes  55.3 Mbits/sec   89   24.0 KBytes       
[  9]   1.00-2.00   sec  7.83 MBytes  65.7 Mbits/sec   60   18.4 KBytes       
[SUM]   1.00-2.00   sec  21.3 MBytes   178 Mbits/sec  225             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  7.83 MBytes  65.7 Mbits/sec   78   15.6 KBytes       
[  6]   2.00-3.00   sec  6.84 MBytes  57.3 Mbits/sec   69   29.7 KBytes       
[  9]   2.00-3.00   sec  6.77 MBytes  56.8 Mbits/sec   72   19.8 KBytes       
[SUM]   2.00-3.00   sec  21.4 MBytes   180 Mbits/sec  219             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  6.52 MBytes  54.7 Mbits/sec  120   18.4 KBytes       
[  6]   3.00-4.00   sec  7.08 MBytes  59.4 Mbits/sec   90   26.9 KBytes       
[  9]   3.00-4.00   sec  6.77 MBytes  56.8 Mbits/sec   77   31.1 KBytes       
[SUM]   3.00-4.00   sec  20.4 MBytes   171 Mbits/sec  287             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  6.28 MBytes  52.6 Mbits/sec   82   21.2 KBytes       
[  6]   4.00-5.00   sec  7.15 MBytes  59.9 Mbits/sec   61   19.8 KBytes       
[  9]   4.00-5.00   sec  7.71 MBytes  64.6 Mbits/sec   61   22.6 KBytes       
[SUM]   4.00-5.00   sec  21.1 MBytes   177 Mbits/sec  204             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  8.95 MBytes  75.1 Mbits/sec   61   11.3 KBytes       
[  6]   5.00-6.00   sec  5.84 MBytes  49.0 Mbits/sec  105   39.6 KBytes       
[  9]   5.00-6.00   sec  5.10 MBytes  42.7 Mbits/sec  112   21.2 KBytes       
[SUM]   5.00-6.00   sec  19.9 MBytes   167 Mbits/sec  278             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  7.95 MBytes  66.7 Mbits/sec   78   46.7 KBytes       
[  6]   6.00-7.00   sec  6.77 MBytes  56.8 Mbits/sec  110   14.1 KBytes       
[  9]   6.00-7.00   sec  5.65 MBytes  47.4 Mbits/sec  112   18.4 KBytes       
[SUM]   6.00-7.00   sec  20.4 MBytes   171 Mbits/sec  300             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  8.51 MBytes  71.4 Mbits/sec   69   32.5 KBytes       
[  6]   7.00-8.00   sec  6.52 MBytes  54.7 Mbits/sec  109   12.7 KBytes       
[  9]   7.00-8.00   sec  4.97 MBytes  41.7 Mbits/sec   81   19.8 KBytes       
[SUM]   7.00-8.00   sec  20.0 MBytes   168 Mbits/sec  259             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  8.58 MBytes  71.9 Mbits/sec   65   14.1 KBytes       
[  6]   8.00-9.00   sec  5.72 MBytes  48.0 Mbits/sec  104   28.3 KBytes       
[  9]   8.00-9.00   sec  6.59 MBytes  55.3 Mbits/sec   85   11.3 KBytes       
[SUM]   8.00-9.00   sec  20.9 MBytes   175 Mbits/sec  254             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  9.94 MBytes  83.4 Mbits/sec   63   48.1 KBytes       
[  6]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec   75   26.9 KBytes       
[  9]   9.00-10.00  sec  5.03 MBytes  42.2 Mbits/sec  122   43.8 KBytes       
[SUM]   9.00-10.00  sec  19.5 MBytes   163 Mbits/sec  260             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  79.5 MBytes  66.7 Mbits/sec  757             sender
[  4]   0.00-10.00  sec  79.1 MBytes  66.4 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  63.6 MBytes  53.4 Mbits/sec  893             sender
[  6]   0.00-10.00  sec  63.3 MBytes  53.1 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  63.6 MBytes  53.4 Mbits/sec  842             sender
[  9]   0.00-10.00  sec  63.2 MBytes  53.1 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   207 MBytes   173 Mbits/sec  2492             sender
[SUM]   0.00-10.00  sec   206 MBytes   172 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 53114 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53116 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 53118 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  3.58 MBytes  30.1 Mbits/sec                  
[  6]   0.00-1.00   sec  3.68 MBytes  30.9 Mbits/sec                  
[  9]   0.00-1.00   sec  2.38 MBytes  19.9 Mbits/sec                  
[SUM]   0.00-1.00   sec  9.64 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  3.21 MBytes  26.9 Mbits/sec                  
[  6]   1.00-2.00   sec  3.19 MBytes  26.8 Mbits/sec                  
[  9]   1.00-2.00   sec  3.23 MBytes  27.1 Mbits/sec                  
[SUM]   1.00-2.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  3.21 MBytes  26.9 Mbits/sec                  
[  6]   2.00-3.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   2.00-3.00   sec  3.17 MBytes  26.6 Mbits/sec                  
[SUM]   2.00-3.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  3.38 MBytes  28.4 Mbits/sec                  
[  6]   3.00-4.00   sec  3.27 MBytes  27.4 Mbits/sec                  
[  9]   3.00-4.00   sec  3.32 MBytes  27.8 Mbits/sec                  
[SUM]   3.00-4.00   sec  9.97 MBytes  83.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.84 MBytes  40.6 Mbits/sec                  
[  6]   4.00-5.00   sec  4.20 MBytes  35.2 Mbits/sec                  
[  9]   4.00-5.00   sec  4.67 MBytes  39.2 Mbits/sec                  
[SUM]   4.00-5.00   sec  13.7 MBytes   115 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.12 MBytes  34.6 Mbits/sec                  
[  6]   5.00-6.00   sec  2.81 MBytes  23.6 Mbits/sec                  
[  9]   5.00-6.00   sec  2.76 MBytes  23.1 Mbits/sec                  
[SUM]   5.00-6.00   sec  9.69 MBytes  81.3 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  3.25 MBytes  27.2 Mbits/sec                  
[  6]   6.00-7.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  9]   6.00-7.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   6.00-7.00   sec  9.75 MBytes  81.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  3.27 MBytes  27.4 Mbits/sec                  
[  6]   7.00-8.00   sec  3.37 MBytes  28.3 Mbits/sec                  
[  9]   7.00-8.00   sec  3.24 MBytes  27.2 Mbits/sec                  
[SUM]   7.00-8.00   sec  9.88 MBytes  82.9 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   8.00-9.00   sec  3.20 MBytes  26.8 Mbits/sec                  
[  9]   8.00-9.00   sec  3.18 MBytes  26.7 Mbits/sec                  
[SUM]   8.00-9.00   sec  9.63 MBytes  80.8 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[  6]   9.00-10.00  sec  3.23 MBytes  27.1 Mbits/sec                  
[  9]   9.00-10.00  sec  3.25 MBytes  27.3 Mbits/sec                  
[SUM]   9.00-10.00  sec  9.73 MBytes  81.6 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  35.7 MBytes  29.9 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  35.7 MBytes  29.9 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  33.8 MBytes  28.3 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  33.8 MBytes  28.3 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  32.8 MBytes  27.5 Mbits/sec    0             sender
[  9]   0.00-10.00  sec  32.8 MBytes  27.5 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec   102 MBytes  85.8 Mbits/sec    0             sender
[SUM]   0.00-10.00  sec   102 MBytes  85.8 Mbits/sec                  receiver

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -u -b 800M
Connecting to host 10.0.0.80, port 5201
[  4] local 10.0.0.1 port 54179 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 53430 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 57739 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[  6]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[  9]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec  4860  
[SUM]   0.00-1.00   sec   114 MBytes   955 Mbits/sec  14580  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  6]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  9]   1.00-2.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[SUM]   1.00-2.00   sec   114 MBytes   958 Mbits/sec  14616  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[  6]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[  9]   2.00-3.00   sec  38.0 MBytes   319 Mbits/sec  4870  
[SUM]   2.00-3.00   sec   114 MBytes   958 Mbits/sec  14610  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  6]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[  9]   3.00-4.00   sec  38.1 MBytes   319 Mbits/sec  4872  
[SUM]   3.00-4.00   sec   114 MBytes   958 Mbits/sec  14616  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   4.00-5.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   4.00-5.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   5.00-6.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   5.00-6.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  6]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[  9]   6.00-7.00   sec  38.1 MBytes   319 Mbits/sec  4873  
[SUM]   6.00-7.00   sec   114 MBytes   958 Mbits/sec  14619  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   7.00-8.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   7.00-8.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   8.00-9.00   sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   8.00-9.00   sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[  6]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[  9]   9.00-10.00  sec  38.1 MBytes   319 Mbits/sec  4874  
[SUM]   9.00-10.00  sec   114 MBytes   958 Mbits/sec  14622  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.300 ms  0/4 (0%)  
[  4] Sent 4 datagrams
[  6]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.639 ms  365/369 (99%)  
[  6] Sent 369 datagrams
[  9]   0.00-10.00  sec   381 MBytes   319 Mbits/sec  8487039.544 ms  472/476 (99%)  
[  9] Sent 476 datagrams
[SUM]   0.00-10.00  sec  1.12 GBytes   958 Mbits/sec  8487039.495 ms  837/849 (99%)  

iperf Done.
+ sleep 10
+ iperf3 -c 10.0.0.80 -P3 -R -u -b 800M
Connecting to host 10.0.0.80, port 5201
Reverse mode, remote host 10.0.0.80 is sending
[  4] local 10.0.0.1 port 58078 connected to 10.0.0.80 port 5201
[  6] local 10.0.0.1 port 50018 connected to 10.0.0.80 port 5201
[  9] local 10.0.0.1 port 46349 connected to 10.0.0.80 port 5201
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-1.00   sec  4.97 MBytes  41.7 Mbits/sec  0.205 ms  0/636 (0%)  
[  6]   0.00-1.00   sec  4.96 MBytes  41.6 Mbits/sec  0.154 ms  0/635 (0%)  
[  9]   0.00-1.00   sec  4.96 MBytes  41.6 Mbits/sec  0.248 ms  0/635 (0%)  
[SUM]   0.00-1.00   sec  14.9 MBytes   125 Mbits/sec  0.202 ms  0/1906 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.162 ms  0/634 (0%)  
[  6]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.216 ms  0/634 (0%)  
[  9]   1.00-2.00   sec  4.95 MBytes  41.5 Mbits/sec  0.125 ms  0/633 (0%)  
[SUM]   1.00-2.00   sec  14.9 MBytes   125 Mbits/sec  0.168 ms  0/1901 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   2.00-3.00   sec  4.91 MBytes  41.2 Mbits/sec  0.148 ms  0/629 (0%)  
[  6]   2.00-3.00   sec  4.92 MBytes  41.3 Mbits/sec  0.213 ms  0/630 (0%)  
[  9]   2.00-3.00   sec  4.91 MBytes  41.2 Mbits/sec  0.123 ms  0/629 (0%)  
[SUM]   2.00-3.00   sec  14.8 MBytes   124 Mbits/sec  0.161 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.226 ms  0/632 (0%)  
[  6]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.151 ms  0/632 (0%)  
[  9]   3.00-4.00   sec  4.94 MBytes  41.4 Mbits/sec  0.171 ms  0/632 (0%)  
[SUM]   3.00-4.00   sec  14.8 MBytes   124 Mbits/sec  0.183 ms  0/1896 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   4.00-5.00   sec  4.98 MBytes  41.8 Mbits/sec  0.191 ms  0/638 (0%)  
[  6]   4.00-5.00   sec  4.98 MBytes  41.7 Mbits/sec  0.129 ms  0/637 (0%)  
[  9]   4.00-5.00   sec  4.98 MBytes  41.7 Mbits/sec  0.154 ms  0/637 (0%)  
[SUM]   4.00-5.00   sec  14.9 MBytes   125 Mbits/sec  0.158 ms  0/1912 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   5.00-6.00   sec  4.98 MBytes  41.7 Mbits/sec  0.162 ms  0/637 (0%)  
[  6]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.085 ms  0/636 (0%)  
[  9]   5.00-6.00   sec  4.97 MBytes  41.7 Mbits/sec  0.157 ms  0/636 (0%)  
[SUM]   5.00-6.00   sec  14.9 MBytes   125 Mbits/sec  0.135 ms  0/1909 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   6.00-7.00   sec  4.97 MBytes  41.7 Mbits/sec  0.221 ms  0/636 (0%)  
[  6]   6.00-7.00   sec  4.97 MBytes  41.7 Mbits/sec  0.115 ms  0/636 (0%)  
[  9]   6.00-7.00   sec  4.96 MBytes  41.6 Mbits/sec  0.193 ms  0/635 (0%)  
[SUM]   6.00-7.00   sec  14.9 MBytes   125 Mbits/sec  0.176 ms  0/1907 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.226 ms  0/632 (0%)  
[  6]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.134 ms  0/632 (0%)  
[  9]   7.00-8.00   sec  4.94 MBytes  41.4 Mbits/sec  0.180 ms  0/632 (0%)  
[SUM]   7.00-8.00   sec  14.8 MBytes   124 Mbits/sec  0.180 ms  0/1896 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   8.00-9.00   sec  4.92 MBytes  41.3 Mbits/sec  0.196 ms  0/630 (0%)  
[  6]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.161 ms  0/629 (0%)  
[  9]   8.00-9.00   sec  4.91 MBytes  41.2 Mbits/sec  0.161 ms  0/629 (0%)  
[SUM]   8.00-9.00   sec  14.8 MBytes   124 Mbits/sec  0.173 ms  0/1888 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  4.95 MBytes  41.5 Mbits/sec  0.194 ms  0/633 (0%)  
[  6]   9.00-10.00  sec  4.95 MBytes  41.5 Mbits/sec  0.132 ms  0/633 (0%)  
[  9]   9.00-10.00  sec  4.94 MBytes  41.4 Mbits/sec  0.107 ms  0/632 (0%)  
[SUM]   9.00-10.00  sec  14.8 MBytes   124 Mbits/sec  0.144 ms  0/1898 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  49.6 MBytes  41.6 Mbits/sec  0.223 ms  0/6344 (0%)  
[  4] Sent 6344 datagrams
[  6]   0.00-10.00  sec  49.5 MBytes  41.6 Mbits/sec  0.153 ms  0/6341 (0%)  
[  6] Sent 6341 datagrams
[  9]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  0.125 ms  0/6337 (0%)  
[  9] Sent 6337 datagrams
[SUM]   0.00-10.00  sec   149 MBytes   125 Mbits/sec  0.167 ms  0/19022 (0%)  

iperf Done.

Comments

Hauke Mehrtens March 25, 2019, 11:24 p.m. UTC | #1
Hi Petr

On 3/14/19 6:46 AM, Petr Cvek wrote:
> Hello again,
> 
> I've managed to enhance few drivers for lantiq platform. They are still
> in ugly commented form (ethernet part especially). But I need some hints
> before the final version. The patches are based on a kernel 4.14.99.
> Copy them into target/linux/lantiq/patches-4.14 (cleaned from any of my
> previous patch).

Thanks for working on this.

> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
> setup). The iperf3 benchmark (2 passes for both vanilla and changed
> versions) altogether with script are in the attachment.
> 
> 1) IRQ with SMP and balancing support:
> 
> 	0901-add-icu-smp-support.patch
> 	0902-enable-external-irqs-for-second-vpe.patch
> 	0903-add-icu1-node-for-smp.patch
> 
> As requested I've changed the patch heavily. The original locking from
> k3b source code (probably from UGW) didn't work and in heavy load the
> system could have frozen (smp affinity change during irq handling). This
> version has this fixed by using generic raw spinlocks with irq.
> 
> The SMP IRQ now works in a way that before every irq_enable (serves as
> unmask too) the VPE will be switched. This can be limited by writing
> into /proc/irq/X/smp_affinity (it can be possibly balanced from
> userspace too).
> 
> I've rewritten the device tree reg fields so there are only 2 arrays
> now. One per an icu controller. The original one per module was
> redundant as the ranges were continuous. The modules of a single ICU are
> now explicitly computed in a macro:
> 
> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
> 
> before there was a pointer for every 0x28 block (there shouldn't be
> speed downgrade, only a multiplication and an addition for every
> register access).
> 
> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
> as "IM0" (module) was confusing (the real module number 0-4 was a part
> of the macro).
> 
> The code is written in a way it should work fine on a uniprocessor
> configuration (as the for_each_present_cpu etc macros will cycle on a
> single VPE on uniprocessor). I didn't test the no CONFIG_SMP yet, but I
> did check it with "nosmp" kernel parameter. It works.
> 
> Anyway please test if you have the board where the second VPE is used
> for FXS.
> 
> The new device tree structure is now incompatible with an old version of
> the driver (and old device tree with the new driver too). It seems icu
> driver is used in Danube, AR9, AmazonSE and Falcon chipset too. I don't
> know the hardware for these boards so before a final patch I would like
> to know if they have a second ICU too (at 0x80300 offset).

Normally the device tree should stay stable, but I already thought about
the same change, and I am not aware of any device that ships a U-Boot
with an embedded device tree, so this should be fine.

The Amazon and Amazon SE only have one ICU block because they only have
one CPU with one VPE.
The Danube SoC has two ICU blocks, one for each CPU; each CPU has only
one VPE. The CPUs are not cache coherent, so SMP is not possible.

Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks, one
for each VPE of the single CPU.
GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
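
For reference, the per-module addressing from the quoted macros can be
sketched in plain C as below. This is only an illustration under
assumptions: the 0x28 module stride comes from the quoted macros, the two
register offsets are placeholder values, and icu_reg_offset() is a
hypothetical helper, not something taken from the patch.

```c
#include <stddef.h>

/* Stride between two interrupt modules of one ICU (from the quoted macros). */
#define LTQ_ICU_MODULE_STRIDE	0x28

/* Register offsets inside one module. ASSUMPTION: illustrative values only. */
#define LTQ_ICU_ISR	0x00
#define LTQ_ICU_IER	0x08

/*
 * Hypothetical helper: byte offset of register `reg` in module `m`,
 * relative to the base of one ICU, i.e. the value that would be added
 * to ltq_icu_membase[vpe] before the actual read or write.
 */
static inline size_t icu_reg_offset(unsigned int m, size_t reg)
{
	return (size_t)m * LTQ_ICU_MODULE_STRIDE + reg;
}
```

With one base address per ICU, the only per-access cost is this
multiplication and addition, which is why no slowdown is expected
compared to keeping a pointer per module.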

> More development could be done with locking probably. As only the
> accesses in a single module (= 1 set of registers) would cause a race
> condition. But as the most contented interrupts are in the same module
> there won't be much speed increase IMO. I can add it if requested (just
> spinlock array and some lookup code).

I do not think that this improves the performance significantly; I
assume that the CPUs only have to wait there in rare cases anyway.
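
If per-module locking were ever wanted, the lookup could be as small as
the sketch below. Assumptions: 32 interrupt lines per module and five
modules (0-4, as in the quoted text); icu_lock_index() and both constants
are hypothetical names, and a plain array index stands in for the
spinlock array.

```c
/* ASSUMPTION: 32 interrupt lines per module, modules 0-4. */
#define ICU_IRQS_PER_MODULE	32
#define ICU_NUM_MODULES		5

/*
 * Hypothetical lookup: map an ICU-relative interrupt number to the index
 * of the lock guarding its module. Returns -1 for an out-of-range number.
 */
static inline int icu_lock_index(unsigned int icu_irq)
{
	unsigned int m = icu_irq / ICU_IRQS_PER_MODULE;

	return m < ICU_NUM_MODULES ? (int)m : -1;
}
```

Two interrupts that live in the same module map to the same index, so
they would still serialize on one lock. If the busiest interrupts share a
module, that is consistent with expecting only a small gain.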

> 2) Reworked lantiq xrx200 ethernet driver:
> 
> 	0904-backport-vanilla-eth-driver.patch
> 	0905-increase-dma-descriptors.patch
> 	0906-increase-dma-burst-size.patch
> 
> The code is still ugly, but stable now. There is a fragmented skb
> support and napi polling. DMA ring buffer was increased so it handles
> faster speeds and I've fixed some code weirdness. I can split the
> changes in the future into separate patches.

It would be nice if you could also make the same changes to the upstream
driver in the mainline Linux kernel and send them for inclusion in
mainline Linux.

> I didn't test the ICU and eth patches separate, but I've tested the
> ethernet driver on a single VPE only (by setting smp affinity and
> nosmp). This version of the ethernet driver was used for root over NFS
> on the debug setup for like two weeks (without problems).
> 
> Tell me if we should pursue the way for the second DMA channel to PPE so
> both VPEs can send frames at the same time.

I think it should be ok to use both DMA channels for the CPU traffic.

> 3) WAVE300
> 
> In the past two weeks I've tried to fix and mash together various versions
> of the wave300 wifi driver (there are partial versions in GPL sources from
> router vendors), and I've managed to put the driver into "not
> immediately crashing" mode. If you are interested in the development,
> there is a thread in the OpenWrt forum. The source repos are here:
> 
> https://repo.or.cz/wave300.git
> https://repo.or.cz/wave300_rflib.git
> 
> (the second one must be copied into the first one)
> 
> The driver will often crash when it meets an unknown packet, a request for
> encryption (no encryption support), an unusual combination of configuration
> options, or just on module unloading. The code is _really_ ugly and it will
> serve only as a hardware specification for developing a better GPL driver.
> If you want to help or you have some tips you can join the forum (there
> are links to firmware and extensive research into the available source code
> from vendors).
> 
> Links:
> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
> 
> Petr
Hauke
Hauke Mehrtens March 25, 2019, 11:45 p.m. UTC | #2

It would be nice if you could send your patches as single mails and
inline so I can easily comment on them.

The DMA handling in the OpenWrt Ethernet driver differs only in being
flexible enough to handle an arbitrary number of DMA channels, but I
think this is not needed.

The DMA memory is already 16 byte aligned (see the byte_offset variable
in xmit), so it should not be a problem to use the 4W DMA mode. I assume
that the hardware also takes care of this.

Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
strange to me.

Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
think your increase should not do significant harm.

Hauke
Petr Cvek March 26, 2019, 12:24 a.m. UTC | #3
On 26. 03. 19 at 0:45, Hauke Mehrtens wrote:

Hi

> It would be nice if you could send your patches as single mails and
> inline so I can easily comment on them.

OK

> 
> The DMA handling in the OpenWrt Ethernet driver is only more flexible to
> handle arbitrary number of DMA channels, but I think this is not needed.
> 
> The DMA memory is already 16 byte aligned, see the byte_offset variable
> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
> that the hardware also takes care of this.
> 

Yes, it is 16 byte aligned in the original driver, but my patched driver
uses 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
alignment caused crashes.

> Why are the changes in arch/mips/kernel/smp-mt.c needed? this looks
> strange to me.
> 

That is interrupt masking. IP0 and IP1 are (I think) software interrupts
used for IPI communication, IP6/7 are the timer (and something else), and
the IP2-IP5 range, which is not enabled there, carries the external IRQ
signals from the ICU. Without this change the second VPE receives only IPIs
and no ICU events.

Basically, I've set the MIPS C0 Status register of the second VPE to the
same value as the C0 Status register of the first VPE.

> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
> think your increase should not harm significantly.

Yeah, I've tested it; there is a minor impact on the maximum
bandwidth. However, I cannot set the value correctly without a model of
the xrx200 SoC (I assume this register controls how often the OWN bit
of the first descriptor is checked). I don't even know the clock and width
of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
if the original value DMA_CLK_DIV4 means "every fourth clock", it seems
too often to me (for a packet of about 1500 bytes, it would check many
times before the packet is transferred). Empirically, the highest values
make the DMA descriptor ring lag.

> 
> Hauke
>
Hauke Mehrtens March 26, 2019, 1:23 a.m. UTC | #4
On 3/26/19 1:24 AM, Petr Cvek wrote:
> 
> 
> Hi
> 
>> It would be nice if you could send your patches as single mails and
>> inline so I can easily comment on them.
> 
> OK
> 
>>
>> The DMA handling in the OpenWrt Ethernet driver is only more flexible to
>> handle arbitrary number of DMA channels, but I think this is not needed.
>>
>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>> that the hardware also takes care of this.
>>
> 
> Yes it is 16 byte aligned in the original driver, but my patched driver
> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
> alignment caused crashing.
> 
>> Why are the changes in arch/mips/kernel/smp-mt.c needed? this looks
>> strange to me.
>>
> 
> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
> for IPI communications, IP6/7 are timer (and something) and in IP2-IP5
> range, which is not enabled there are external IRQ signals for ICU.
> Without this set the second VPE only receives IPI and not ICU events.
>
> Basically I've set this MIPS C0 Status register to the same value as the
> C0 Status register for the first VPE.

Hmm, strange. It looks like there are not many SoCs with multiple VPEs
that have their own IRQ controller.

>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>> think your increase should not harm significantly.
> 
> Yeah I've tested it, there is some minor impact on the maximal
> bandwidth. However I cannot set the value correctly without the model of
> xrx200 SoC (I assume this register controls the check frequency of the
> OWN bit of the first descriptor).

Yes, this is the polling frequency in fDMA/16. This value is global and
not per channel. The DMA controller will check the OWN bit on all
descriptors for all DMA channels where polling is activated at this
frequency. fDMA is the same as the FPI frequency, probably 250MHz.

> I don't even know the clock and width
> of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
> if the original value DMA_CLK_DIV4 means "every fourth clock" it seems
> too often for me (if a packet has like 1500 bytes, it would check many
> times before the packet is transferred). The highest values empirically
> lags the DMA descriptor ring.

The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
word addresses; a word for the DMA controller is 32 bits.

The DMA controller can handle some priorities between the ports and
channels. When you activate PKTARB (BIT(31)) in DMA_CTRL, the DMA
controller will transfer the complete packet before the arbitration is
changed. With MBRSTCNT (bits 25:16) in DMA_CTRL you can control after how
many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
DMA_CTRL is activated. Both apply to TX and RX.

Hauke
Petr Cvek May 18, 2019, 2:08 a.m. UTC | #5
Hi again,

I'm finishing the ethernet driver and it is still somewhat slow for my
taste, but it seems I've reached the hardware limit.

As someone who knows the internals of the SoC well, could you estimate the
maximum TX bandwidth the hardware can achieve (roughly, for large
saturated UDP packets)?

If I'm evaluating this correctly, there is a DDR2 controller @250MHz... I
don't know if 250MHz is the bus speed, as my modem has a DDR2-800 chip,
which means a 400MHz bus speed (a pretty big 150MHz reserve).

But if I'm right, that would mean the data is transferred at 500MT/s
over a 16bit bus. So the continuous construction of UDP packets in the CPU
(500MHz@32bit) would eat the whole RAM bandwidth.

This result seems wrong, as the VPE needs to load instructions too and
there are up to 4 threads. And most importantly, there are the gigabit
switch with multiple ports and the PCI(e) peripherals too.

Anyway, my measurements show that saturated UDP traffic on the localhost
interface reaches only around 400Mbit/s, and those are only mem/cache
transfers.

Am I right? Is it impossible to obtain the full 1Gbit/s with vrx-268?

Best regards,

Petr

On 26. 03. 19 at 2:23, Hauke Mehrtens wrote:
> On 3/26/19 1:24 AM, Petr Cvek wrote:
>>
>>
>> Dne 26. 03. 19 v 0:45 Hauke Mehrtens napsal(a):
>>> On 3/26/19 12:24 AM, Hauke Mehrtens wrote:
>>>> Hi Petr
>>>>
>>>> On 3/14/19 6:46 AM, Petr Cvek wrote:
>>>>> Hello again,
>>>>>
>>>>> I've managed to enhance few drivers for lantiq platform. They are still
>>>>> in ugly commented form (ethernet part especially). But I need some hints
>>>>> before the final version. The patches are based on a kernel 4.14.99.
>>>>> Copy them into target/linux/lantiq/patches-4.14 (cleaned from any of my
>>>>> previous patch).
>>>>
>>>> Thanks for working on this.
>>>>
>>>>> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
>>>>> setup). The iperf3 benchmark (2 passes for both vanilla and changed
>>>>> versions) altogether with script are in the attachment.
>>>>>
>>>>> 1) IRQ with SMP and balancing support:
>>>>>
>>>>> 	0901-add-icu-smp-support.patch
>>>>> 	0902-enable-external-irqs-for-second-vpe.patch
>>>>> 	0903-add-icu1-node-for-smp.patch
>>>>>
>>>>> As requested I've changed the patch heavily. The original locking from
>>>>> k3b source code (probably from UGW) didn't work and in heavy load the
>>>>> system could have froze (smp affinity change during irq handling). This
>>>>> version has this fixed by using generic raw spinlocks with irq.
>>>>>
>>>>> The SMP IRQ now works in a way that before every irq_enable (serves as
>>>>> unmask too) the VPE will be switched. This can be limited by writing
>>>>> into /proc/irq/X/smp_affinity (it can be possibly balanced from
>>>>> userspace too).
>>>>>
>>>>> I've rewritten the device tree reg fields so there are only 2 arrays
>>>>> now. One per an icu controller. The original one per module was
>>>>> redundant as the ranges were continuous. The modules of a single ICU are
>>>>> now explicitly computed in a macro:
>>>>>
>>>>> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
>>>>> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
>>>>>
>>>>> before there was a pointer for every 0x28 block (there shouldn't be
>>>>> speed downgrade, only a multiplication and an addition for every
>>>>> register access).
>>>>>
>>>>> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
>>>>> as "IM0" (module) was confusing (the real module number 0-4 was a part
>>>>> of the macro).
>>>>>
>>>>> The code is written in a way it should work fine on a uniprocessor
>>>>> configuration (as the for_each_present_cpu etc macros will cycle on a
>>>>> single VPE on uniprocessor). I didn't test the no CONFIG_SMP yet, but I
>>>>> did check it with "nosmp" kernel parameter. It works.
>>>>>
>>>>> Anyway please test if you have the board where the second VPE is used
>>>>> for FXS.
>>>>>
>>>>> The new device tree structure is now incompatible with an old version of
>>>>> the driver (and old device tree with the new driver too). It seems icu
>>>>> driver is used in Danube, AR9, AmazonSE and Falcon chipset too. I don't
>>>>> know the hardware for these boards so before a final patch I would like
>>>>> to know if they have a second ICU too (at 0x80300 offset).
>>>>
>>>> Normally the device tree should stay stable, but I already though about
>>>> the same change and I am not aware that any device ships a U-Boot with
>>>> an embedded device tree, so this should be fine.
>>>>
>>>> The Amazon and Amazon SE only have one ICU block because they only have
>>>> one CPU with one VPE.
>>>> The Danube SoC has two ICU blocks one for each CPU, each CPU only has
>>>> one VPE. The CPUs are not cache coherent, SMP is not possible.
>>>>
>>>> Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks one
>>>> for each VPE of the single CPU.
>>>> GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
>>>>
>>>>> More development could be done with locking probably. As only the
>>>>> accesses in a single module (= 1 set of registers) would cause a race
>>>>> condition. But as the most contented interrupts are in the same module
>>>>> there won't be much speed increase IMO. I can add it if requested (just
>>>>> spinlock array and some lookup code).
>>>>
>>>> I do not think that this improves the performance significantly, I
>>>> assume that the CPUs only have to wait there in rare conditions anyway.
>>>>
>>>>> 2) Reworked lantiq xrx200 ethernet driver:
>>>>>
>>>>> 	0904-backport-vanilla-eth-driver.patch
>>>>> 	0905-increase-dma-descriptors.patch
>>>>> 	0906-increase-dma-burst-size.patch
>>>>>
>>>>> The code is still ugly, but stable now. There is a fragmented skb
>>>>> support and napi polling. DMA ring buffer was increased so it handle
>>>>> faster speeds and I've fixed some code weirdness. A can split the
>>>>> changes in the future into separate patches.
>>>>
>>>> It would be nice if you could also do the same changes to the upstream
>>>> driver in mainline Linux kernel and send this for inclusion to mainline
>>>> Linux.
>>>>
>>>>> I didn't test the ICU and eth patches separate, but I've tested the
>>>>> ethernet driver on a single VPE only (by setting smp affinity and
>>>>> nosmp). This version of the ethernet driver was used for root over NFS
>>>>> on the debug setup for like two weeks (without problems).
>>>>>
>>>>> Tell me if we should pursue the way for the second DMA channel to PPE so
>>>>> both VPEs can send frames at the same time.
>>>>
>>>> I think it should be ok to use both DMA channels for the CPU traffic.
>>>>
>>>>> 3) WAVE300
>>>>>
>>>>> In the past two weeks I've tried to fix and mash together various versions
>>>>> of the wave300 wifi driver (there are partial versions in GPL sources from
>>>>> router vendors). And I've managed to put the driver into "not
>>>>> immediately crashing" mode. If you are interested in the development,
>>>>> there is a thread in the openwrt forum. The source repos are here:
>>>>>
>>>>> https://repo.or.cz/wave300.git
>>>>> https://repo.or.cz/wave300_rflib.git
>>>>>
>>>>> (the second one must be copied into the first one)
>>>>>
>>>>> The driver will often crash when meeting an unknown packet, a request for
>>>>> encryption (no encryption support), an unusual combination of configuration
>>>>> options, or just on module unloading. The code is _really_ ugly and it will
>>>>> serve only as a hardware specification for better GPL driver development.
>>>>> If you want to help or you have some tips you can join the forum (there
>>>>> are links to firmware and intensive research of available source code
>>>>> from vendors).
>>>>>
>>>>> Links:
>>>>> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
>>>>> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
>>>>> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
>>>>>
>>>>> Petr
>>>> Hauke
>>>
>>
>> Hi
>>
>>> It would be nice if you could send your patches as single mails and
>>> inline so I can easily comment on them.
>>
>> OK
>>
>>>
>>> The DMA handling in the OpenWrt Ethernet driver is only more flexible to
>>> handle arbitrary number of DMA channels, but I think this is not needed.
>>>
>>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>>> that the hardware also takes care of this.
>>>
>>
>> Yes it is 16 byte aligned in the original driver, but my patched driver
>> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
>> alignment caused crashing.
>>
>>> Why are the changes in arch/mips/kernel/smp-mt.c needed? this looks
>>> strange to me.
>>>
>>
>> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
>> for IPI communication, IP6/7 are the timer (and something else), and the
>> IP2-IP5 range, which is not enabled there, carries the external IRQ
>> signals from the ICU. Without this set, the second VPE only receives
>> IPIs and not ICU events.
>>
>> Basically I've set this MIPS C0 Status register to the same value as the
>> C0 Status register for the first VPE.
> 
> Hmm, strange, it looks like there are not many SoCs with multiple VPEs
> which have their own IRQ controller.
> 
>>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>>> think your increase should not harm significantly.
>>
>> Yeah I've tested it, there is some minor impact on the maximal
>> bandwidth. However I cannot set the value correctly without the model of
>> xrx200 SoC (I assume this register controls the check frequency of the
>> OWN bit of the first descriptor).
> 
> Yes this is the polling frequency in fDMA/16, this value is global and
> not per channel. The DMA controller will check the OWN bit on all
> descriptors for all DMA channels where polling is activated with this
> frequency. fDMA is the same as the FPI frequency, probably 250MHz.
> 
>> I don't even know the clock and width
>> of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
>> if the original value DMA_CLK_DIV4 means "every fourth clock", it seems
>> too often to me (if a packet has around 1500 bytes, it would check many
>> times before the packet is transferred). Empirically, the highest values
>> make the DMA descriptor ring lag.
> 
> The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
> word addresses, a word for the DMA controller is 32 bit.
> 
> The DMA controller can handle some priorities between the ports and
> channels. When you activate PKTARB (BIT(31)) in DMA_CTRL the DMA
> controller will transfer the complete packet before the arbitration is
> changed. With MBRSTCNT (bits 25:16) in DMA_CTRL you can control after how
> many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
> DMA_CTRL is activated. Both apply to TX and RX.
> 
> Hauke
>
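The DMA_CTRL arbitration bits described above can be sketched as plain bit
manipulation. This is only an illustration: the bit positions (PKTARB =
bit 31, MBRSTARB = bit 30, MBRSTCNT = bits 25:16) come from the description
in this thread, while the macro and function names are made up for the
example and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as described in this thread; the names are hypothetical. */
#define DMA_CTRL_PKTARB		(1u << 31)	/* finish the packet before re-arbitrating */
#define DMA_CTRL_MBRSTARB	(1u << 30)	/* enable multi-burst arbitration */
#define DMA_CTRL_MBRSTCNT_SHIFT	16
#define DMA_CTRL_MBRSTCNT_MAX	0x3ffu		/* bits 25:16 are 10 bits wide */
#define DMA_CTRL_MBRSTCNT_MASK	(DMA_CTRL_MBRSTCNT_MAX << DMA_CTRL_MBRSTCNT_SHIFT)

/* Return ctrl with arbitration changed only after 'bursts' bursts. */
static uint32_t dma_ctrl_burst_arb(uint32_t ctrl, uint32_t bursts)
{
	assert(bursts <= DMA_CTRL_MBRSTCNT_MAX);

	ctrl &= ~DMA_CTRL_MBRSTCNT_MASK;		/* drop the old burst count */
	ctrl |= bursts << DMA_CTRL_MBRSTCNT_SHIFT;	/* insert the new one */
	return ctrl | DMA_CTRL_MBRSTARB;
}

/* Return ctrl with per-packet arbitration enabled. */
static uint32_t dma_ctrl_packet_arb(uint32_t ctrl)
{
	return ctrl | DMA_CTRL_PKTARB;
}
```

Both helpers only compute a new register value from an old one, so a driver
would use them in a read-modify-write sequence that keeps the remaining
DMA_CTRL bits intact.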
Hauke Mehrtens May 19, 2019, 9:24 a.m. UTC | #6
On 5/18/19 4:08 AM, Petr Cvek wrote:
> Hi again,
> 
> I'm finishing the ethernet driver and it is still sort of slow for my
> taste, but it seems I've reached the hardware limit.

Will you also send these patches to the upstream kernel? I would like to
see the improvements to the DMA controller and the scatter DMA in the
mainline kernel, so that we do not have to maintain them separately in
OpenWrt any more.

> As someone who knows the internals of the SoC well, could you guess the
> maximum TX bandwidth the hardware can reach (roughly, big
> saturated UDP packets)?
> 
> If I'm evaluating this correctly, there is a DDR2 controller @250MHz... I
> don't know if 250MHz is the bus speed, as my modem has a DDR2-800 chip,
> which means 400MHz bus speed (a pretty big 150MHz reserve).

I would not be surprised if the RAM is running at a lower frequency
than the RAM chips would support, but I haven't checked the maximum
frequency supported by the SoC itself.

> But if I'm right that would mean the data is transferred at 500MT/s
> over a 16bit bus. So the continuous construction of UDP packets in the CPU
> (500MHz@32bit) would eat the whole RAM bandwidth.
> 
> This result seems wrong as the VPE needs to load instructions too and
> there are up to 4 threads. And most importantly there is the gigabit
> switch with multiple ports and PCI(e) peripherals too.
> 
> Anyway my measurements show that saturated UDP traffic on the localhost
> interface reaches only around 400Mbit/s, and those are only mem/cache
> transfers.
> 
> Am I right? Is it impossible to obtain the full 1Gbit/s with vrx-268?

The SoC and many of the competing SoCs are not built to handle all the
traffic in Linux. This SoC is designed so that the data traffic is
handled by the hardware or some specialized FW. There is even some SRAM
in the chip which is used by these HW blocks to avoid copying the data to
the RAM.

The VRX200 line has the GSWIP, which can handle layer 2 switching at
line rate (1 GBit/s), at least for normal packet sizes.

NAT, PPPoE and some other L3 handling is done by the PP32 hardware block,
which runs a separate FW and also has some specialized HW blocks. This
block can also directly take packets from the DSL and wifi and forward
packets to these peripherals.

The CPU path is only used to learn a flow, which is then later offloaded
to the hardware.

Hauke

> 
> Best regards,
> 
> Petr
> 
Petr Cvek May 24, 2019, 5:15 a.m. UTC | #7
On 19. 05. 19 at 11:24, Hauke Mehrtens wrote:
> On 5/18/19 4:08 AM, Petr Cvek wrote:
>> Hi again,
>>
>> I'm finishing the ethernet driver and it is still sort of slow for my
>> taste, but it seems I've reached the hardware limit.
> 
> Will you also send these patches to the upstream kernel? I would like to
> see the improvements to the DMA controller and the scatter DMA in the
> mainline kernel, so that we do not have to maintain them separately in
> OpenWrt any more.

Yeah, eventually, but the patches will be untested (I don't think I can run linux-next in openwrt on a lantiq modem without big changes from the current 4.14).

I didn't add scatter-gather DMA to the kernel; I'm just using individual descriptors for skb fragments. The DMA patches are only for FIFO length and some register tuning.
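
The "individual descriptors for skb fragments" idea could look roughly like the sketch below. This is not the code from the attached patch: the descriptor bit layout is an assumption (modeled on the LTQ_DMA_* flags of the lantiq DMA header and to be checked against the kernel tree), the 32-byte mask mirrors XRX200_DMA_TX_ALIGN, and struct frag, struct desc and fill_tx_descs() are hypothetical names. The first fragment gets the start-of-packet flag, the last one the end-of-packet flag, and ownership is handed to the hardware back to front so that the first descriptor becomes valid last. Ring wrap-around and DMA mapping are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed descriptor control bits; verify against the real header. */
#define DESC_OWN		(1u << 31)	/* descriptor belongs to the DMA engine */
#define DESC_SOP		(1u << 29)	/* start of packet */
#define DESC_EOP		(1u << 28)	/* end of packet */
#define DESC_TX_OFFSET(x)	(((uint32_t)(x) & 0x1f) << 23)
#define DESC_SIZE_MASK		0xffffu
#define TX_ALIGN_MASK		(32u - 1)	/* 8W bursts need 32 byte alignment */

struct frag { uint32_t bus_addr; uint32_t len; };	/* one mapped skb fragment */
struct desc { uint32_t ctl; uint32_t addr; };		/* one DMA ring entry */

/* Fill n consecutive descriptors for an n-fragment packet (n >= 1). */
static void fill_tx_descs(struct desc *ring, const struct frag *frags, size_t n)
{
	size_t i;

	assert(n >= 1);

	for (i = 0; i < n; i++) {
		uint32_t offset = frags[i].bus_addr & TX_ALIGN_MASK;
		uint32_t ctl = DESC_TX_OFFSET(offset) |
			       (frags[i].len & DESC_SIZE_MASK);

		if (i == 0)
			ctl |= DESC_SOP;
		if (i == n - 1)
			ctl |= DESC_EOP;

		/* The engine gets an aligned address plus a byte offset. */
		ring[i].addr = frags[i].bus_addr & ~TX_ALIGN_MASK;
		ring[i].ctl = ctl;
	}

	/* Real driver code needs a write barrier here before handing over. */
	for (i = n; i-- > 0; )
		ring[i].ctl |= DESC_OWN;
}
```

Setting the OWN bit in reverse order is a design choice of this sketch: the engine polls the first descriptor, so it should not see a valid start of packet while the later fragments are still being filled in.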

> 
>> As someone who knows the internals of the SoC well, could you guess the
>> maximum TX bandwidth the hardware can reach (roughly, big
>> saturated UDP packets)?
>>
>> If I'm evaluating this correctly, there is a DDR2 controller @250MHz... I
>> don't know if 250MHz is the bus speed, as my modem has a DDR2-800 chip,
>> which means 400MHz bus speed (a pretty big 150MHz reserve).
> 
> I would not be surprised if the RAM is running at a lower frequency
> than the RAM chips would support, but I haven't checked the maximum
> frequency supported by the SoC itself.

I was just poking around the ugw sources from tplink and it seems they may use 600/300 MHz (CPU/RAM) settings. So if the chip stays within its limits at those clocks, it could make the network even faster.

> 
>> But if I'm right that would mean the data is transferred at 500MT/s
>> over a 16bit bus. So the continuous construction of UDP packets in the CPU
>> (500MHz@32bit) would eat the whole RAM bandwidth.
>>
>> This result seems wrong as the VPE needs to load instructions too and
>> there are up to 4 threads. And most importantly there is the gigabit
>> switch with multiple ports and PCI(e) peripherals too.
>>
>> Anyway my measurements show that saturated UDP traffic on the localhost
>> interface reaches only around 400Mbit/s, and those are only mem/cache
>> transfers.
>>
>> Am I right? Is it impossible to obtain the full 1Gbit/s with vrx-268?
> 
> The SoC and many of the competing SoCs are not built to handle all the
> traffic in Linux. This SoC is designed so that the data traffic is
> handled by the hardware or some specialized FW. There is even some SRAM
> in the chip which is used by these HW blocks to avoid copying the data to
> the RAM.
> 
> The VRX200 line has the GSWIP, which can handle layer 2 switching at
> line rate (1 GBit/s), at least for normal packet sizes.
> 
> NAT, PPPoE and some other L3 handling is done by the PP32 hardware block,
> which runs a separate FW and also has some specialized HW blocks. This
> block can also directly take packets from the DSL and wifi and forward
> packets to these peripherals.
> 

Yeah, DSL is fine; it is within the software limits of my driver, but I was worried about wifi speeds.

Anyway, I was just wondering where the weak spot is and whether it is in the driver, because a bus speed of 250MHz@32bit is fine for multiple 1G ethernets. But if it is the CPU, then I'm fine ;-).
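
As a rough sanity check of the "250MHz@32bit" figure, the theoretical peak rates can be computed directly. The sketch below only multiplies transfer rate by bus width; it ignores refresh, arbitration, burst overhead and everything else that lowers real throughput, and the helper name peak_mbit() is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Theoretical peak rate in Mbit/s for a bus doing 'mtransfers' million
 * transfers per second with 'width_bits' bits per transfer.
 */
static uint64_t peak_mbit(uint64_t mtransfers, uint64_t width_bits)
{
	assert(width_bits > 0);

	return mtransfers * width_bits;
}
```

peak_mbit(250, 32) gives 8000 Mbit/s for a 32-bit path at 250 MHz, and peak_mbit(500, 16) gives the same 8000 Mbit/s for DDR2 at 500 MT/s on a 16-bit bus, so on paper either one is several times a single gigabit link.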


BTW in Dlink GPL kernel source (probably UGW 6.x), there is this table:

                const struct ifx_dma_chan_map dma_map[28] = {
                /* portnum, device name, channel direction, class value,
                 * IRQ number, relative channel number */
                {0, "PPE",      IFX_DMA_RX_CH,  0,  DMA_CH0_INT,    0},
                {0, "PPE",      IFX_DMA_TX_CH,  0,  DMA_CH1_INT,    0},
                {0, "PPE",      IFX_DMA_RX_CH,  1,  DMA_CH2_INT,    1},
                {0, "PPE",      IFX_DMA_TX_CH,  1,  DMA_CH3_INT,    1},
                {0, "PPE",      IFX_DMA_RX_CH,  2,  DMA_CH4_INT,    2},
                {0, "PPE",      IFX_DMA_TX_CH,  2,  DMA_CH5_INT,    2},
                {0, "PPE",      IFX_DMA_RX_CH,  3,  DMA_CH6_INT,    3},
                {0, "PPE",      IFX_DMA_TX_CH,  3,  DMA_CH7_INT,    3},
                {1, "DEU",      IFX_DMA_RX_CH,  0,  DMA_CH8_INT,    0},
                {1, "DEU",      IFX_DMA_TX_CH,  0,  DMA_CH9_INT,    0},
                {1, "DEU",      IFX_DMA_RX_CH,  1,  DMA_CH10_INT,   1},
                {1, "DEU",      IFX_DMA_TX_CH,  1,  DMA_CH11_INT,   1},
                {2, "SPI",      IFX_DMA_RX_CH,  0,  DMA_CH12_INT,   0},
                {2, "SPI",      IFX_DMA_TX_CH,  0,  DMA_CH13_INT,   0},
                {3, "SDIO",     IFX_DMA_RX_CH,  0,  DMA_CH14_INT,   0},
                {3, "SDIO",     IFX_DMA_TX_CH,  0,  DMA_CH15_INT,   0},
                {4, "MCTRL",    IFX_DMA_RX_CH,  0,  DMA_CH16_INT,   0},
                {4, "MCTRL",    IFX_DMA_TX_CH,  0,  DMA_CH17_INT,   0},
                {4, "MCTRL",    IFX_DMA_RX_CH,  1,  DMA_CH18_INT,   1},
                {4, "MCTRL",    IFX_DMA_TX_CH,  1,  DMA_CH19_INT,   1},
                {0, "PPE",      IFX_DMA_RX_CH,  4,  DMA_CH20_INT,   4},
                {0, "PPE",      IFX_DMA_RX_CH,  5,  DMA_CH21_INT,   5},
                {0, "PPE",      IFX_DMA_RX_CH,  6,  DMA_CH22_INT,   6},
                {0, "PPE",      IFX_DMA_RX_CH,  7,  DMA_CH23_INT,   7},
                {5, "USIF",     IFX_DMA_RX_CH,  0,  DMA_CH24_INT,   0},
                {5, "USIF",     IFX_DMA_TX_CH,  0,  DMA_CH25_INT,   0},
                {6, "HSNAND",   IFX_DMA_RX_CH,  0,  DMA_CH26_INT,   0},
                {6, "HSNAND",   IFX_DMA_TX_CH,  0,  DMA_CH27_INT,   0},

Are there 6 TX and 6 RX DMA channels for PPE? 
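
The PPE rows can be counted mechanically. The sketch below copies only the channel number and direction of the "PPE" entries from the table as quoted above into a simplified array (struct ppe_chan and count_ppe() are hypothetical names, not the ifx_dma_chan_map type) and counts each direction. Whether the table matches the real xrx200 hardware is not verified here.

```c
#include <assert.h>
#include <stddef.h>

enum dir { DIR_RX, DIR_TX };

struct ppe_chan { int chan; enum dir dir; };

/* "PPE" entries of the quoted dma_map: DMA channel number and direction. */
static const struct ppe_chan ppe_map[] = {
	{  0, DIR_RX }, {  1, DIR_TX }, {  2, DIR_RX }, {  3, DIR_TX },
	{  4, DIR_RX }, {  5, DIR_TX }, {  6, DIR_RX }, {  7, DIR_TX },
	{ 20, DIR_RX }, { 21, DIR_RX }, { 22, DIR_RX }, { 23, DIR_RX },
};

/* Count the PPE channels listed for the given direction. */
static size_t count_ppe(enum dir dir)
{
	size_t i, n = 0;

	assert(dir == DIR_RX || dir == DIR_TX);

	for (i = 0; i < sizeof(ppe_map) / sizeof(ppe_map[0]); i++)
		if (ppe_map[i].dir == dir)
			n++;

	return n;
}
```

With the rows as quoted, count_ppe(DIR_RX) is 8 and count_ppe(DIR_TX) is 4: channels 20-23 are listed as RX only.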

In the current code I'm using ch1 and ch3 (TXs) for the first and second VPE, so the driver can choose which core will do the cleaning of the TX rings (no speed drops). The problem is that the EASY80920 device has two eth interfaces, LAN and WAN, and uses ch1 for one and ch3 for the other. So to make the driver universal I would need to use 4 TX channels.

If these channels are functionally equivalent, I could change the driver to allow mapping the channels to interfaces and ports from the device tree, so that support for EASY80920 and for SMP can coexist (also useful if somebody wants to make a port exclusive or reserve a PPE channel).

Is this fine?

Also, I couldn't figure out from the original driver how to assign an ethernet port to an RX DMA channel (or whether the hardware can somehow switch between them).

The current (in development) state of the driver is attached. It does not work on EASY80920 (multiple ethX). Each TX DMA IRQ should be limited to a single VPE, a different one for each, or the speeds will be lower. Without some netfilter kernel modules the TX speeds can be about 30 Mbps higher.

TCP from host to lantiq	= 323 Mbits/sec
TCP from lantiq to host	= 273 Mbits/sec
UDP from host to lantiq	= 845 Mbits/sec (varies on my slow machine)
UDP from lantiq to host	= 308 Mbits/sec (this one limits the TCP, roughly raw TX traffic)

P.S. I hope the patch survives my thunderbird. I've tried to reconfigure its wrapping settings.

Petr

---
--- a/drivers/net/ethernet/lantiq_xrx200.c	2019-03-10 20:44:58.797133801 +0100
+++ b/drivers/net/ethernet/lantiq_xrx200.c	2019-05-24 04:48:02.217779380 +0200
@@ -36,16 +36,14 @@
 #include "lantiq_pce.h"
 #include "lantiq_xrx200_sw.h"
 
-#define SW_POLLING
-#define SW_ROUTING
+#define SW_POLLING	//polls phy
+#define SW_ROUTING	//adds vlan field
+#define NUM_TX_QUEUES		2	/* set number of TX queues: 1-2 */
 
-#ifdef SW_ROUTING
-#define XRX200_MAX_DEV		2
-#else
-#define XRX200_MAX_DEV		1
-#endif
+#define mystats 1	//TODO tests how locking slows the DMA rings
 
 #define XRX200_MAX_VLAN		64
+
 #define XRX200_PCE_ACTVLAN_IDX	0x01
 #define XRX200_PCE_VLANMAP_IDX	0x02
 
@@ -54,7 +52,8 @@
 
 #define XRX200_HEADROOM		4
 
-#define XRX200_TX_TIMEOUT	(10 * HZ)
+//TODO fine tune
+#define XRX200_TX_TIMEOUT	(30 * HZ)
 
 /* port type */
 #define XRX200_PORT_TYPE_PHY	1
@@ -62,12 +61,12 @@
 
 /* DMA */
 #define XRX200_DMA_DATA_LEN	0x600
+#define XRX200_DMA_TX_ALIGN	(32 - 1)
+
 #define XRX200_DMA_IRQ		INT_NUM_IM2_IRL0
 #define XRX200_DMA_RX		0
 #define XRX200_DMA_TX		1
 #define XRX200_DMA_TX_2		3
-#define XRX200_DMA_IS_TX(x)	(x%2)
-#define XRX200_DMA_IS_RX(x)	(!XRX200_DMA_IS_TX(x))
 
 /* fetch / store dma */
 #define FDMA_PCTRL0		0x2A00
@@ -188,6 +187,54 @@
 #define MDIO_DEVAD_NONE		(-1)
 #define ADVERTIZE_MPD		(1 << 10)
 
+/* this is used in DMA ring to match skb during cleanup */
+struct xrx200_skb {
+	/* skb in use reference */
+	struct sk_buff *skb;
+
+	/* saved dma address for unmap */
+	dma_addr_t addr;
+
+	/* saved length for unmap */
+	size_t size;
+};
+
+struct xrx200_tx_queue {
+	struct xrx200_skb dma_skb[LTQ_DESC_NUM];
+
+	struct napi_struct napi;
+
+	struct ltq_dma_channel dma;
+
+	struct u64_stats_sync syncp;
+	__u64 tx_packets;
+	__u64 tx_bytes;
+	__u64 tx_errors;
+	__u64 tx_dropped;
+
+	struct xrx200_priv *priv;
+
+	/* ring buffer tail pointer */
+	unsigned int tx_free ____cacheline_aligned_in_smp;
+
+	u8 queue_id;	/* which TX queue is it */
+};
+
+struct xrx200_rx_queue {
+	//TODO NUM per channel
+	struct xrx200_skb dma_skb[LTQ_DESC_NUM];
+
+	struct napi_struct napi;
+
+	struct ltq_dma_channel dma;
+
+	struct u64_stats_sync syncp;
+	__u64 rx_packets;
+	__u64 rx_bytes;
+
+	struct xrx200_priv *priv;
+};
+
 struct xrx200_port {
 	u8 num;
 	u8 phy_addr;
@@ -202,53 +249,39 @@
 	struct device_node *phy_node;
 };
 
-struct xrx200_chan {
-	int idx;
-	int refcount;
-	int tx_free;
+struct xrx200_priv {
+	//TODO dynamic?
+	struct xrx200_tx_queue txq[NUM_TX_QUEUES];
+	//TODO dynamic?
+	struct xrx200_rx_queue rxq;
 
-	struct net_device dummy_dev;
-	struct net_device *devs[XRX200_MAX_DEV];
+	struct clk *clk;
 
-	struct tasklet_struct tasklet;
-	struct napi_struct napi;
-	struct ltq_dma_channel dma;
-	struct sk_buff *skb[LTQ_DESC_NUM];
+	struct net_device *net_dev;
+	struct device *dev;
 
-	spinlock_t lock;
-};
+	struct u64_stats_sync syncp;
+	__u64 tx_errors;
+
+	struct xrx200_port port[XRX200_MAX_PORT];
+	int num_port;
+	bool wan;
+	bool sw;
+	unsigned short d_port_map;
+	unsigned char mac[6];
 
-struct xrx200_hw {
-	struct clk *clk;
 	struct mii_bus *mii_bus;
 
-	struct xrx200_chan chan[XRX200_MAX_DMA];
 	u16 vlan_vid[XRX200_MAX_VLAN];
 	u16 vlan_port_map[XRX200_MAX_VLAN];
 
-	struct net_device *devs[XRX200_MAX_DEV];
-	int num_devs;
-
+	// TODO pc2005 not implemented multiple ports, EASY80920 "lantiq,xrx200-pdi"
 	int port_map[XRX200_MAX_PORT];
 	unsigned short wan_map;
 
 	struct switch_dev swdev;
 };
 
-struct xrx200_priv {
-	struct net_device_stats stats;
-	int id;
-
-	struct xrx200_port port[XRX200_MAX_PORT];
-	int num_port;
-	bool wan;
-	bool sw;
-	unsigned short port_map;
-	unsigned char mac[6];
-
-	struct xrx200_hw *hw;
-};
-
 static __iomem void *xrx200_switch_membase;
 static __iomem void *xrx200_mii_membase;
 static __iomem void *xrx200_mdio_membase;
@@ -470,14 +503,14 @@
 }
 
 // swconfig interface
-static void xrx200_hw_init(struct xrx200_hw *hw);
+static void xrx200_hw_init(struct xrx200_priv *priv);
 
 // global
 static int xrx200sw_reset_switch(struct switch_dev *dev)
 {
-	struct xrx200_hw *hw = container_of(dev, struct xrx200_hw, swdev);
+	struct xrx200_priv *priv = container_of(dev, struct xrx200_priv, swdev);
 
-	xrx200_hw_init(hw);
+	xrx200_hw_init(priv);
 
 	return 0;
 }
@@ -523,7 +556,7 @@
 static int xrx200sw_set_vlan_vid(struct switch_dev *dev, const struct switch_attr *attr,
 				 struct switch_val *val)
 {
-	struct xrx200_hw *hw = container_of(dev, struct xrx200_hw, swdev);
+	struct xrx200_priv *priv = container_of(dev, struct xrx200_priv, swdev);
 	int i;
 	struct xrx200_pce_table_entry tev;
 	struct xrx200_pce_table_entry tem;
@@ -538,7 +571,7 @@
 			return -EINVAL;
 	}
 
-	hw->vlan_vid[val->port_vlan] = val->value.i;
+	priv->vlan_vid[val->port_vlan] = val->value.i;
 
 	tev.index = val->port_vlan;
 	xrx200_pce_table_entry_read(&tev);
@@ -571,7 +604,7 @@
 
 static int xrx200sw_set_vlan_ports(struct switch_dev *dev, struct switch_val *val)
 {
-	struct xrx200_hw *hw = container_of(dev, struct xrx200_hw, swdev);
+	struct xrx200_priv *priv = container_of(dev, struct xrx200_priv, swdev);
 	int i, portmap, tagmap, untagged;
 	struct xrx200_pce_table_entry tem;
 
@@ -624,7 +657,7 @@
 
 	ltq_switch_w32_mask(0, portmap, PCE_PMAP2);
 	ltq_switch_w32_mask(0, portmap, PCE_PMAP3);
-	hw->vlan_port_map[val->port_vlan] = portmap;
+	priv->vlan_port_map[val->port_vlan] = portmap;
 
 	xrx200sw_fixup_pvids();
 
@@ -722,8 +755,8 @@
 
 	link->duplex = xrx200sw_read_x(XRX200_MAC_PSTAT_FDUP, port);
 
-	link->rx_flow = !!(xrx200sw_read_x(XRX200_MAC_CTRL_0_FCON, port) && 0x0010);
-	link->tx_flow = !!(xrx200sw_read_x(XRX200_MAC_CTRL_0_FCON, port) && 0x0020);
+	link->rx_flow = !!(xrx200sw_read_x(XRX200_MAC_CTRL_0_FCON, port) & 0x0010);
+	link->tx_flow = !!(xrx200sw_read_x(XRX200_MAC_CTRL_0_FCON, port) & 0x0020);
 	link->aneg = !(xrx200sw_read_x(XRX200_MAC_CTRL_0_FCON, port));
 
 	link->speed = SWITCH_PORT_SPEED_10;
@@ -834,30 +867,42 @@
 //	.get_port_stats = xrx200sw_get_port_stats, //TODO
 };
 
-static int xrx200sw_init(struct xrx200_hw *hw)
+static void xrx200sw_init(struct xrx200_priv *priv)
 {
-	int netdev_num;
 
-	for (netdev_num = 0; netdev_num < hw->num_devs; netdev_num++)
-	{
-		struct switch_dev *swdev;
-		struct net_device *dev = hw->devs[netdev_num];
-		struct xrx200_priv *priv = netdev_priv(dev);
-		if (!priv->sw)
-			continue;
+	struct switch_dev *swdev;
+	if (!priv->sw) {
+		return;
+	}
 
-		swdev = &hw->swdev;
+	swdev = &priv->swdev;
 
-		swdev->name = "Lantiq XRX200 Switch";
-		swdev->vlans = XRX200_MAX_VLAN;
-		swdev->ports = XRX200_MAX_PORT;
-		swdev->cpu_port = 6;
-		swdev->ops = &xrx200sw_ops;
+	swdev->name = "Lantiq XRX200 Switch";
+	swdev->vlans = XRX200_MAX_VLAN;
+	swdev->ports = XRX200_MAX_PORT;
+	swdev->cpu_port = 6;
+	swdev->ops = &xrx200sw_ops;
 
-		register_switch(swdev, dev);
-		return 0; // enough switches
+	register_switch(swdev, priv->net_dev);
+	return;
+}
+
+/* drop all the packets from the DMA ring */
+static void xrx200_flush_dma(struct ltq_dma_channel *dma)
+{
+	int i;
+
+	for (i = 0; i < LTQ_DESC_NUM; i++) {
+		struct ltq_dma_desc *desc = &dma->desc_base[dma->desc];
+
+		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) != LTQ_DMA_C)
+			break;
+
+		desc->ctl = LTQ_DMA_OWN | LTQ_DMA_RX_OFFSET(NET_IP_ALIGN) |
+				XRX200_DMA_DATA_LEN;
+
+		dma->desc = (dma->desc + 1) % LTQ_DESC_NUM;
 	}
-	return 0;
 }
 
 static int xrx200_open(struct net_device *dev)
@@ -865,22 +910,32 @@
 	struct xrx200_priv *priv = netdev_priv(dev);
 	int i;
 
-	for (i = 0; i < XRX200_MAX_DMA; i++) {
-		if (!priv->hw->chan[i].dma.irq)
-			continue;
-		spin_lock_bh(&priv->hw->chan[i].lock);
-		if (!priv->hw->chan[i].refcount) {
-			if (XRX200_DMA_IS_RX(i))
-				napi_enable(&priv->hw->chan[i].napi);
-			ltq_dma_open(&priv->hw->chan[i].dma);
-		}
-		priv->hw->chan[i].refcount++;
-		spin_unlock_bh(&priv->hw->chan[i].lock);
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		napi_enable(&priv->txq[i].napi);
+		ltq_dma_open(&priv->txq[i].dma);
+		ltq_dma_enable_irq(&priv->txq[i].dma);
 	}
+
+	napi_enable(&priv->rxq.napi);
+	ltq_dma_open(&priv->rxq.dma);
+
+	/* The boot loader does not always deactivate the receiving of frames
+	 * on the ports and then some packets queue up in the PPE buffers.
+	 * They already passed the PMAC so they do not have the tags
+	 * configured here. Read these packets here and drop them.
+	 * The HW should have written them into memory after 10us
+	 */
+	usleep_range(20, 40);
+	xrx200_flush_dma(&priv->rxq.dma);
+
+	ltq_dma_enable_irq(&priv->rxq.dma);
+
 	for (i = 0; i < priv->num_port; i++)
 		if (priv->port[i].phydev)
 			phy_start(priv->port[i].phydev);
-	netif_wake_queue(dev);
+
+	/* works with a single tx queue too */
+	netif_tx_wake_all_queues(dev);
 
 	return 0;
 }
@@ -890,198 +945,314 @@
 	struct xrx200_priv *priv = netdev_priv(dev);
 	int i;
 
-	netif_stop_queue(dev);
+	netif_tx_stop_all_queues(dev);
 
 	for (i = 0; i < priv->num_port; i++)
 		if (priv->port[i].phydev)
 			phy_stop(priv->port[i].phydev);
 
-	for (i = 0; i < XRX200_MAX_DMA; i++) {
-		if (!priv->hw->chan[i].dma.irq)
-			continue;
+	napi_disable(&priv->rxq.napi);
+	ltq_dma_close(&priv->rxq.dma);
 
-		priv->hw->chan[i].refcount--;
-		if (!priv->hw->chan[i].refcount) {
-			if (XRX200_DMA_IS_RX(i))
-				napi_disable(&priv->hw->chan[i].napi);
-			spin_lock_bh(&priv->hw->chan[i].lock);
-			ltq_dma_close(&priv->hw->chan[XRX200_DMA_RX].dma);
-			spin_unlock_bh(&priv->hw->chan[i].lock);
-		}
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		napi_disable(&priv->txq[i].napi);
+		ltq_dma_close(&priv->txq[i].dma);
 	}
 
 	return 0;
 }
 
-static int xrx200_alloc_skb(struct xrx200_chan *ch)
+static int xrx200_alloc_skb(struct xrx200_priv *priv,
+			    struct ltq_dma_channel *dma,
+			    struct xrx200_skb *dma_skb)
 {
+	struct ltq_dma_desc *base = &dma->desc_base[dma->desc];
+	struct sk_buff *skb;
+
 #define DMA_PAD	(NET_IP_ALIGN + NET_SKB_PAD)
-	ch->skb[ch->dma.desc] = dev_alloc_skb(XRX200_DMA_DATA_LEN + DMA_PAD);
-	if (!ch->skb[ch->dma.desc])
+
+
+	skb = napi_alloc_skb(&priv->rxq.napi, XRX200_DMA_DATA_LEN);
+//   	skb = netdev_alloc_skb(priv->net_dev, XRX200_DMA_DATA_LEN + DMA_PAD);
+//	pr_info("idx:%i %px\n",dma->desc, skb->data);
+
+	//TODO fix fail path
+	if (unlikely(!skb)) {
+		pr_err("skb alloc failed\n");
+
+		/* leave the old skb if not enough memory */
 		goto skip;
+	}
+#if 1
+	dma_unmap_single(priv->dev, dma_skb->addr, XRX200_DMA_DATA_LEN,
+			 DMA_FROM_DEVICE);
+#endif
+	// 	skb_reserve(skb, NET_SKB_PAD);
+	skb_reserve(skb, -NET_IP_ALIGN);
+
+	base->addr = dma_skb->addr =
+		dma_map_single(priv->dev, skb->data,
+			       XRX200_DMA_DATA_LEN, DMA_FROM_DEVICE);
+
+
+// 		if (dma_mapping_error(&cp->pdev->dev, new_mapping)) {
+// 			dev->stats.rx_dropped++;
+// 			kfree_skb(new_skb);
+// 			goto rx_next;
+// 		}
+
 
-	skb_reserve(ch->skb[ch->dma.desc], NET_SKB_PAD);
-	ch->dma.desc_base[ch->dma.desc].addr = dma_map_single(NULL,
-		ch->skb[ch->dma.desc]->data, XRX200_DMA_DATA_LEN,
-			DMA_FROM_DEVICE);
-	ch->dma.desc_base[ch->dma.desc].addr =
-		CPHYSADDR(ch->skb[ch->dma.desc]->data);
-	skb_reserve(ch->skb[ch->dma.desc], NET_IP_ALIGN);
+	skb_reserve(skb, NET_IP_ALIGN);
+
+	dma_skb->skb = skb;
+
+	wmb();
 
 skip:
-	ch->dma.desc_base[ch->dma.desc].ctl =
-		LTQ_DMA_OWN | LTQ_DMA_RX_OFFSET(NET_IP_ALIGN) |
+	base->ctl = LTQ_DMA_OWN | LTQ_DMA_RX_OFFSET(NET_IP_ALIGN) |
 		XRX200_DMA_DATA_LEN;
 
+	dma->desc = (dma->desc + 1) % LTQ_DESC_NUM;
+
 	return 0;
 }
 
-static void xrx200_hw_receive(struct xrx200_chan *ch, int id)
+static void xrx200_hw_receive(struct xrx200_rx_queue *rxq,
+			      struct ltq_dma_channel *dma,
+			      struct xrx200_skb *dma_skb, int id)
 {
-	struct net_device *dev = ch->devs[id];
-	struct xrx200_priv *priv = netdev_priv(dev);
-	struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
-	struct sk_buff *skb = ch->skb[ch->dma.desc];
+	//	struct net_device *dev = rxq->priv->net_dev;
+	struct net_device *dev = rxq->napi.dev;
+	struct ltq_dma_desc *desc = &dma->desc_base[dma->desc];
 	int len = (desc->ctl & LTQ_DMA_SIZE_MASK);
 	int ret;
+	/* struct value will get overwritten by xrx200_alloc_skb */
+	struct sk_buff *filled_skb = dma_skb->skb;
 
-	ret = xrx200_alloc_skb(ch);
-
-	ch->dma.desc++;
-	ch->dma.desc %= LTQ_DESC_NUM;
+	/* alloc new skb first so DMA ring can work during netif_receive_skb */
+	ret = xrx200_alloc_skb(rxq->priv, dma, dma_skb);
 
 	if (ret) {
 		netdev_err(dev,
 			"failed to allocate new rx buffer\n");
+
+		//TODO
 		return;
 	}
 
-	skb_put(skb, len);
+	/* set skb length for netdev */
+	skb_put(filled_skb, len);
 #ifdef SW_ROUTING
-	skb_pull(skb, 8);
+	/* remove special tag */
+	skb_pull(filled_skb, 8);
+#endif
+
+	filled_skb->dev = dev;
+	filled_skb->protocol = eth_type_trans(filled_skb, dev);
+
+	netif_receive_skb(filled_skb);
+////////	napi_gro_receive(&rxq->napi, filled_skb);
+
+#ifdef mystats
+	u64_stats_update_begin(&rxq->syncp);
+	rxq->rx_bytes += len;
+	rxq->rx_packets++;
+	u64_stats_update_end(&rxq->syncp);
 #endif
-	skb->dev = dev;
-	skb->protocol = eth_type_trans(skb, dev);
-	netif_receive_skb(skb);
-	priv->stats.rx_packets++;
-	priv->stats.rx_bytes+=len;
 }
 
 static int xrx200_poll_rx(struct napi_struct *napi, int budget)
 {
-	struct xrx200_chan *ch = container_of(napi,
-				struct xrx200_chan, napi);
-	struct xrx200_priv *priv = netdev_priv(ch->devs[0]);
+	struct xrx200_rx_queue *rxq = container_of(napi,
+						  struct xrx200_rx_queue, napi);
+	struct ltq_dma_channel *dma = &rxq->dma;
 	int rx = 0;
-	int complete = 0;
 
-	while ((rx < budget) && !complete) {
-		struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
-		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C) {
+	while (rx < budget) {
+		struct ltq_dma_desc *desc = &dma->desc_base[dma->desc];
+		if (likely((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C)) {
+			struct xrx200_skb *dma_skb = &rxq->dma_skb[dma->desc];
+
 #ifdef SW_ROUTING
-			struct sk_buff *skb = ch->skb[ch->dma.desc];
-			u8 *special_tag = (u8*)skb->data;
+			u8 *special_tag = (u8*)dma_skb->skb->data;
 			int port = (special_tag[7] >> SPPID_SHIFT) & SPPID_MASK;
-			xrx200_hw_receive(ch, priv->hw->port_map[port]);
+
+			xrx200_hw_receive(rxq, dma, dma_skb, rxq->priv->port_map[port]);
 #else
-			xrx200_hw_receive(ch, 0);
+			xrx200_hw_receive(rxq, dma, dma_skb, 0);
 #endif
 			rx++;
 		} else {
-			complete = 1;
+			break;
 		}
 	}
 
-	if (complete || !rx) {
-		napi_complete(&ch->napi);
-		ltq_dma_enable_irq(&ch->dma);
+//pr_info("R %i\n",rx);
+
+	if (rx < budget) {
+		if (napi_complete_done(napi, rx)) {
+//can an unacked irq event wait here now?
+			ltq_dma_enable_irq(dma);
+		}
+	} else {
+// pr_info("F\n");
+
 	}
 
 	return rx;
 }
 
-static void xrx200_tx_housekeeping(unsigned long ptr)
-{
-	struct xrx200_chan *ch = (struct xrx200_chan *) ptr;
+
+#define TX_BUFFS_AVAIL(tail, head)		\
+	((tail <= head) ?			\
+	  tail + (LTQ_DESC_NUM - 1) - head :	\
+	  tail - head - 1)
+
+static int xrx200_tx_housekeeping(struct napi_struct *napi, int budget)
+{
+	struct xrx200_tx_queue *txq =
+		container_of(napi, struct xrx200_tx_queue, napi);
+		//	struct net_device *net_dev = txq->priv->net_dev;
+	struct net_device *net_dev = napi->dev;
 	int pkts = 0;
-	int i;
+	unsigned long bytes = 0;
 
-	spin_lock_bh(&ch->lock);
-	ltq_dma_ack_irq(&ch->dma);
-	while ((ch->dma.desc_base[ch->tx_free].ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C) {
-		struct sk_buff *skb = ch->skb[ch->tx_free];
+//	while (1) {
+	while (pkts < budget) {
+		struct ltq_dma_desc *desc = &txq->dma.desc_base[txq->tx_free];
 
-		pkts++;
-		ch->skb[ch->tx_free] = NULL;
-		dev_kfree_skb(skb);
-		memset(&ch->dma.desc_base[ch->tx_free], 0,
-			sizeof(struct ltq_dma_desc));
-		ch->tx_free++;
-		ch->tx_free %= LTQ_DESC_NUM;
+		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C) {
+			struct xrx200_skb *dma_skb = &txq->dma_skb[txq->tx_free];
+
+			bytes += dma_skb->size;
+
+#if 1
+//TODO use it, but this ate ~4Mbps in one test, old version is missing it
+			dma_unmap_single(txq->priv->dev, dma_skb->addr,
+					 dma_skb->size, DMA_TO_DEVICE);
+#endif
+			/* Consume skb only at last fragment */
+			if (desc->ctl & LTQ_DMA_EOP) {
+				dev_consume_skb_irq(dma_skb->skb);
+				pkts++;
+			}
+
+			dma_skb->skb = NULL;
+//only control word must be erased, rest is fine
+//			memset(desc, 0, sizeof(struct ltq_dma_desc));
+			desc->ctl = 0;
+
+			txq->tx_free = (txq->tx_free + 1) % LTQ_DESC_NUM;
+		} else {
+			break;
+		}
 	}
-	ltq_dma_enable_irq(&ch->dma);
-	spin_unlock_bh(&ch->lock);
 
-	if (!pkts)
-		return;
+#ifdef mystats
+	u64_stats_update_begin(&txq->syncp);
+	txq->tx_packets += pkts;
+	txq->tx_bytes += bytes;
+	u64_stats_update_end(&txq->syncp);
+#endif
 
-	for (i = 0; i < XRX200_MAX_DEV && ch->devs[i]; i++)
-		netif_wake_queue(ch->devs[i]);
-}
+	// HACK, free all descriptors, even over budget (else there will be queue stalls, slow CPU)
+//  	pkts = pkts ? (budget - 1) : 0;
 
-static struct net_device_stats *xrx200_get_stats (struct net_device *dev)
-{
-	struct xrx200_priv *priv = netdev_priv(dev);
+// pr_info("ch->tx_free %i %i, %i %i\n",ch->tx_free,ch->dma.desc,pkts,budget);
+
+	if (pkts < budget) {
+		if (napi_complete_done(napi, pkts)) {
+			ltq_dma_enable_irq(&txq->dma);
+		}
+	}
 
-	return &priv->stats;
+	if (netif_tx_queue_stopped(netdev_get_tx_queue(net_dev, txq->queue_id))) {
+		if (unlikely(TX_BUFFS_AVAIL(txq->tx_free, txq->dma.desc) > (MAX_SKB_FRAGS + 1))) {
+			netif_tx_wake_queue(netdev_get_tx_queue(net_dev, txq->queue_id));
+		}
+	}
+
+	return pkts;
 }
 
 static void xrx200_tx_timeout(struct net_device *dev)
 {
 	struct xrx200_priv *priv = netdev_priv(dev);
 
-	printk(KERN_ERR "%s: transmit timed out, disable the dma channel irq\n", dev->name);
+	netdev_err(dev, "transmit timed out!\n");
+
+	u64_stats_update_begin(&priv->syncp);
+	priv->tx_errors++;
+	u64_stats_update_end(&priv->syncp);
+
+//TODO this should be enough, but timeout messages usually mean driver bugs
+	netif_tx_wake_all_queues(dev);
 
-	priv->stats.tx_errors++;
-	netif_wake_queue(dev);
+// 	if (netif_queue_stopped(dev)) {
+// 		netif_wake_queue(dev);
+// 	} else {
+// 		netdev_warn(dev, "high transmit load\n");
+// 	}
+}
+
+static void xrx200_unwind_mapped_tx_skb(struct xrx200_tx_queue *txq,
+					int tail,
+					int head)
+{
+	for (; tail != head; tail = (tail + 1) % LTQ_DESC_NUM) {
+		dma_unmap_single(txq->priv->dev, txq->dma_skb[tail].addr,
+				 txq->dma_skb[tail].size, DMA_TO_DEVICE);
+		txq->dma.desc_base[tail].ctl = 0;
+	}
 }
 
-static int xrx200_start_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t xrx200_start_xmit(struct sk_buff *skb,
+				     struct net_device *dev)
 {
 	struct xrx200_priv *priv = netdev_priv(dev);
-	struct xrx200_chan *ch;
+	struct xrx200_tx_queue *txq;
+	unsigned int skb_idx;
 	struct ltq_dma_desc *desc;
-	u32 byte_offset;
 	int ret = NETDEV_TX_OK;
 	int len;
+	int i;
+	dma_addr_t mapping;
+	u8 queue_id;
+	struct netdev_queue *netq;
 #ifdef SW_ROUTING
 	u32 special_tag = (SPID_CPU_PORT << SPID_SHIFT) | DPID_ENABLE;
 #endif
-	if(priv->id)
-		ch = &priv->hw->chan[XRX200_DMA_TX_2];
-	else
-		ch = &priv->hw->chan[XRX200_DMA_TX];
 
-	desc = &ch->dma.desc_base[ch->dma.desc];
+	queue_id = skb_get_queue_mapping(skb);
+	netq = netdev_get_tx_queue(dev, queue_id);
+
+	// TX2 is always queue 1, TX is always queue 0
+	txq = &priv->txq[queue_id];
+
+	if (skb_put_padto(skb, ETH_ZLEN)) {
 
-	skb->dev = dev;
-	len = skb->len < ETH_ZLEN ? ETH_ZLEN : skb->len;
+		u64_stats_update_begin(&txq->syncp);
+		txq->tx_dropped++;
+		u64_stats_update_end(&txq->syncp);
+
+		return NETDEV_TX_OK;
+	}
 
 #ifdef SW_ROUTING
 	if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
-		u16 port_map = priv->port_map;
+		u16 port_map = priv->d_port_map;
 
 		if (priv->sw && skb->protocol == htons(ETH_P_8021Q)) {
 			u16 vid;
 			int i;
 
-			port_map = 0;
+ 			port_map = 0;
 			if (!__vlan_get_tag(skb, &vid)) {
 				for (i = 0; i < XRX200_MAX_VLAN; i++) {
-					if (priv->hw->vlan_vid[i] != vid)
-						continue;
-					port_map = priv->hw->vlan_port_map[i];
-					break;
+					if (priv->vlan_vid[i] == vid) {
+						port_map = priv->vlan_port_map[i];
+						break;
+					}
 				}
 			}
 		}
@@ -1089,108 +1260,219 @@
 		special_tag |= (port_map << PORT_MAP_SHIFT) |
 			       PORT_MAP_SEL | PORT_MAP_EN;
 	}
-	if(priv->wan)
+
+	if (priv->wan)
 		special_tag |= (1 << DPID_SHIFT);
-	if(skb_headroom(skb) < 4) {
-		struct sk_buff *tmp = skb_realloc_headroom(skb, 4);
+
+	if (skb_headroom(skb) < XRX200_HEADROOM) {
+		struct sk_buff *tmp = skb_realloc_headroom(skb, XRX200_HEADROOM);
 		dev_kfree_skb_any(skb);
 		skb = tmp;
 	}
-	skb_push(skb, 4);
+
+	skb_push(skb, XRX200_HEADROOM);
 	memcpy(skb->data, &special_tag, sizeof(u32));
-	len += 4;
 #endif
 
-	/* dma needs to start on a 16 byte aligned address */
-	byte_offset = CPHYSADDR(skb->data) % 16;
+	skb_idx = txq->dma.desc;
+
+	if (TX_BUFFS_AVAIL(txq->tx_free, skb_idx) <= (MAX_SKB_FRAGS + 1)) {
+		netif_tx_stop_queue(netq);
+		netdev_err(dev, "not enough TX ring space on queue %i\n", queue_id);
+		return NETDEV_TX_BUSY;
+	}
+
+	/* Send first fragment */
+	desc = &txq->dma.desc_base[skb_idx];
+
+	if (skb_shinfo(skb)->nr_frags == 0) {
+		len = skb->len;
+	} else {
+		len = skb_headlen(skb);
+	}
+
+	mapping = dma_map_single(priv->dev, skb->data, len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(priv->dev, mapping))) {
+		dev_kfree_skb(skb);
+		netdev_err(dev, "DMA mapping failed\n");
+
+		u64_stats_update_begin(&txq->syncp);
+		txq->tx_dropped++;
+		txq->tx_errors++;
+		u64_stats_update_end(&txq->syncp);
 
-	spin_lock_bh(&ch->lock);
-	if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) || ch->skb[ch->dma.desc]) {
-		netdev_err(dev, "tx ring full\n");
-		netif_stop_queue(dev);
-		ret = NETDEV_TX_BUSY;
+		ret = NETDEV_TX_OK;
 		goto out;
 	}
 
-	ch->skb[ch->dma.desc] = skb;
+	txq->dma_skb[skb_idx].skb = skb;
+	txq->dma_skb[skb_idx].addr = mapping;
+	txq->dma_skb[skb_idx].size = len;
+
+	desc->addr = (mapping & 0x1fffffe0) | (1<<31);
+
+	/* Don't set LTQ_DMA_OWN before filling all fragment descriptors */
+	desc->ctl = LTQ_DMA_SOP | LTQ_DMA_TX_OFFSET(mapping & XRX200_DMA_TX_ALIGN)
+			| (len & LTQ_DMA_SIZE_MASK);
+
+	if (!skb_shinfo(skb)->nr_frags)
+		desc->ctl |= LTQ_DMA_EOP;
+
+	/* Send rest of fragments */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		unsigned int frag_idx = (skb_idx + i + 1) % LTQ_DESC_NUM;
+		struct xrx200_skb *dma_skb = &txq->dma_skb[frag_idx];
+		struct ltq_dma_desc *frag_desc = &txq->dma.desc_base[frag_idx];
+
+		len = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+
+// TODO weird, etop uses virt_to_phys, why is it working there?
+		mapping = dma_map_single(priv->dev,
+					 skb_frag_address(&skb_shinfo(skb)->frags[i]),
+					 len, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(priv->dev, mapping))) {
+
+			xrx200_unwind_mapped_tx_skb(txq, skb_idx, frag_idx);
+
+			netdev_err(dev, "DMA mapping for fragment failed\n");
+			dev_kfree_skb(skb);
 
-	netif_trans_update(dev);
+			u64_stats_update_begin(&txq->syncp);
+			txq->tx_dropped++;
+			txq->tx_errors++;
+			u64_stats_update_end(&txq->syncp);
+
+			ret = NETDEV_TX_OK;
+			goto out;
+		}
+
+		dma_skb->skb = skb;
+		dma_skb->addr = mapping;
+		dma_skb->size = len;
+
+		frag_desc = &txq->dma.desc_base[frag_idx];
+
+		frag_desc->addr = (mapping & 0x1fffffe0) | (1<<31);
+
+		frag_desc->ctl = LTQ_DMA_OWN |
+			LTQ_DMA_TX_OFFSET(mapping & XRX200_DMA_TX_ALIGN) | (len & LTQ_DMA_SIZE_MASK);
+
+		if (i == (skb_shinfo(skb)->nr_frags - 1))
+			frag_desc->ctl |= LTQ_DMA_EOP;
+	}
+
+	/* Increment TX ring index */
+	txq->dma.desc = (skb_idx + skb_shinfo(skb)->nr_frags + 1) % LTQ_DESC_NUM;
 
-	desc->addr = ((unsigned int) dma_map_single(NULL, skb->data, len,
-						DMA_TO_DEVICE)) - byte_offset;
 	wmb();
-	desc->ctl = LTQ_DMA_OWN | LTQ_DMA_SOP | LTQ_DMA_EOP |
-		LTQ_DMA_TX_OFFSET(byte_offset) | (len & LTQ_DMA_SIZE_MASK);
-	ch->dma.desc++;
-	ch->dma.desc %= LTQ_DESC_NUM;
-	if (ch->dma.desc == ch->tx_free)
-		netif_stop_queue(dev);
 
+	/* Start TX DMA */
+	desc->ctl |= LTQ_DMA_OWN;
 
-	priv->stats.tx_packets++;
-	priv->stats.tx_bytes+=len;
+	if (unlikely(TX_BUFFS_AVAIL(txq->tx_free, txq->dma.desc) <= (MAX_SKB_FRAGS + 1))) {
+		netif_tx_stop_queue(netq);
+	}
 
-out:
-	spin_unlock_bh(&ch->lock);
+	skb_tx_timestamp(skb);
 
+out:
 	return ret;
 }
 
-static irqreturn_t xrx200_dma_irq(int irq, void *priv)
+static irqreturn_t xrx200_tx_dma_irq(int irq, void *ptr)
 {
-	struct xrx200_hw *hw = priv;
-	int chnr = irq - XRX200_DMA_IRQ;
-	struct xrx200_chan *ch = &hw->chan[chnr];
+	struct xrx200_tx_queue *txq = ptr;
 
-	ltq_dma_disable_irq(&ch->dma);
-	ltq_dma_ack_irq(&ch->dma);
+	ltq_dma_disable_irq(&txq->dma);
+	ltq_dma_ack_irq(&txq->dma);
+	napi_schedule_irqoff(&txq->napi);
 
-	if (chnr % 2)
-		tasklet_schedule(&ch->tasklet);
-	else
-		napi_schedule(&ch->napi);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t xrx200_rx_dma_irq(int irq, void *ptr)
+{
+	struct xrx200_rx_queue *rxq = ptr;
+
+	ltq_dma_disable_irq(&rxq->dma);
+	ltq_dma_ack_irq(&rxq->dma);
+	napi_schedule_irqoff(&rxq->napi);
 
 	return IRQ_HANDLED;
 }
 
-static int xrx200_dma_init(struct xrx200_hw *hw)
+static int xrx200_dma_init(struct xrx200_priv *priv)
 {
-	int i, err = 0;
+	int i;
+	struct xrx200_rx_queue *rxq = &priv->rxq;
+	int ret;
 
 	ltq_dma_init_port(DMA_PORT_ETOP);
 
-	for (i = 0; i < 8 && !err; i++) {
-		int irq = XRX200_DMA_IRQ + i;
-		struct xrx200_chan *ch = &hw->chan[i];
-
-		spin_lock_init(&ch->lock);
-
-		ch->idx = ch->dma.nr = i;
-
-		if (i == XRX200_DMA_TX) {
-			ltq_dma_alloc_tx(&ch->dma);
-			err = request_irq(irq, xrx200_dma_irq, 0, "vrx200_tx", hw);
-		} else if (i == XRX200_DMA_TX_2) {
-			ltq_dma_alloc_tx(&ch->dma);
-			err = request_irq(irq, xrx200_dma_irq, 0, "vrx200_tx_2", hw);
-		} else if (i == XRX200_DMA_RX) {
-			ltq_dma_alloc_rx(&ch->dma);
-			for (ch->dma.desc = 0; ch->dma.desc < LTQ_DESC_NUM;
-					ch->dma.desc++)
-				if (xrx200_alloc_skb(ch))
-					err = -ENOMEM;
-			ch->dma.desc = 0;
-			err = request_irq(irq, xrx200_dma_irq, 0, "vrx200_rx", hw);
-		} else
-			continue;
+	rxq->dma.nr = XRX200_DMA_RX;
+	rxq->priv = priv;
 
-		if (!err)
-			ch->dma.irq = irq;
-		else
-			pr_err("net-xrx200: failed to request irq %d\n", irq);
+	ltq_dma_alloc_rx(&rxq->dma);
+	//TODO careful about desc incrementing in original alloc_skb
+	rxq->dma.desc = 0;
+
+	for (i = 0; i < LTQ_DESC_NUM; i++) {
+		ret = xrx200_alloc_skb(priv, &rxq->dma,
+				       &rxq->dma_skb[rxq->dma.desc]);
+		if (ret)
+			goto rx_free;
 	}
+	rxq->dma.desc = 0;
 
-	return err;
+	ret = devm_request_irq(priv->dev, rxq->dma.irq, xrx200_rx_dma_irq, 0,
+			       "xrx200-net rx", &priv->rxq);
+	if (ret) {
+		dev_err(priv->dev, "failed to request RX irq %d\n",
+			rxq->dma.irq);
+		goto rx_ring_free;
+	}
+
+	//TODO this is a HACK, devicetree? or at least an array
+	priv->txq[0].dma.nr = XRX200_DMA_TX;
+#if (NUM_TX_QUEUES > 1)
+	priv->txq[1].dma.nr = XRX200_DMA_TX_2;
+#endif
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		char *irq_name;
+
+		priv->txq[i].priv = priv;
+
+		ltq_dma_alloc_tx(&priv->txq[i].dma);
+		irq_name = devm_kasprintf(priv->dev, GFP_KERNEL, "xrx200-net tx%d", i);
+
+		ret = devm_request_irq(priv->dev, priv->txq[i].dma.irq,
+				       xrx200_tx_dma_irq, 0, irq_name,
+				       &priv->txq[i]);
+
+		if (ret) {
+			dev_err(priv->dev, "failed to request TX irq %d\n",
+				priv->txq[i].dma.irq);
+
+			for (; i >= 0; i--) {
+				ltq_dma_free(&priv->txq[i].dma);
+			}
+
+			goto rx_ring_free;
+		}
+	}
+
+	return ret;
+
+rx_ring_free:
+	/* free the allocated RX ring */
+	for (i = 0; i < LTQ_DESC_NUM; i++) {
+		if (rxq->dma_skb[i].skb)
+			dev_kfree_skb_any(rxq->dma_skb[i].skb);
+	}
+
+rx_free:
+	ltq_dma_free(&rxq->dma);
+	return ret;
 }
 
 #ifdef SW_POLLING
@@ -1310,8 +1592,8 @@
 
 static int xrx200_phy_has_link(struct net_device *dev)
 {
-	struct xrx200_priv *priv = netdev_priv(dev);
 	int i;
+	struct xrx200_priv *priv = netdev_priv(dev);
 
 	for (i = 0; i < priv->num_port; i++) {
 		if (!priv->port[i].phydev)
@@ -1328,11 +1610,12 @@
 {
 	struct net_device *netdev = phydev->attached_dev;
 
-	if (do_carrier)
+	if (do_carrier) {
 		if (up)
 			netif_carrier_on(netdev);
 		else if (!xrx200_phy_has_link(netdev))
 			netif_carrier_off(netdev);
+	}
 
 	phydev->adjust_link(netdev);
 }
@@ -1343,7 +1626,7 @@
 	struct phy_device *phydev = NULL;
 	unsigned val;
 
-	phydev = mdiobus_get_phy(priv->hw->mii_bus, port->phy_addr);
+	phydev = mdiobus_get_phy(priv->mii_bus, port->phy_addr);
 
 	if (!phydev) {
 		netdev_err(dev, "no PHY found\n");
@@ -1376,10 +1659,10 @@
 #ifdef SW_POLLING
 	phy_read_status(phydev);
 
-	val = xrx200_mdio_rd(priv->hw->mii_bus, MDIO_DEVAD_NONE, MII_CTRL1000);
+	val = xrx200_mdio_rd(priv->mii_bus, MDIO_DEVAD_NONE, MII_CTRL1000);
 	val |= ADVERTIZE_MPD;
-	xrx200_mdio_wr(priv->hw->mii_bus, MDIO_DEVAD_NONE, MII_CTRL1000, val);
-	xrx200_mdio_wr(priv->hw->mii_bus, 0, 0, 0x1040);
+	xrx200_mdio_wr(priv->mii_bus, MDIO_DEVAD_NONE, MII_CTRL1000, val);
+	xrx200_mdio_wr(priv->mii_bus, 0, 0, 0x1040);
 
 	phy_start_aneg(phydev);
 #endif
@@ -1476,7 +1759,7 @@
 
 	memcpy(&mac.sa_data, priv->mac, ETH_ALEN);
 	if (!is_valid_ether_addr(mac.sa_data)) {
-		pr_warn("net-xrx200: invalid MAC, using random\n");
+		netdev_warn(dev, "net-xrx200: invalid MAC, using random\n");
 		eth_random_addr(mac.sa_data);
 		dev->addr_assign_type |= NET_ADDR_RANDOM;
 	}
@@ -1487,7 +1770,7 @@
 
 	for (i = 0; i < priv->num_port; i++)
 		if (xrx200_mdio_probe(dev, &priv->port[i]))
-			pr_warn("xrx200-mdio: probing phy of port %d failed\n",
+			netdev_warn(dev, "xrx200-mdio: probing phy of port %d failed\n",
 					 priv->port[i].num);
 
 	return 0;
@@ -1522,19 +1805,20 @@
 	ltq_switch_w32_mask(0, BIT(3), PCE_GCTRL_REG(0));
 }
 
-static void xrx200_hw_init(struct xrx200_hw *hw)
+static void xrx200_hw_init(struct xrx200_priv *priv)
 {
 	int i;
 
 	/* enable clock gate */
-	clk_enable(hw->clk);
+	clk_enable(priv->clk);
 
 	ltq_switch_w32(1, 0);
 	mdelay(100);
 	ltq_switch_w32(0, 0);
+
 	/*
-	 * TODO: we should really disbale all phys/miis here and explicitly
-	 * enable them in the device secific init function
+	 * TODO: we should really disable all phys/miis here and explicitly
+	 * enable them in the device specific init function
 	 */
 
 	/* disable port fetch/store dma */
@@ -1554,16 +1838,18 @@
 	ltq_switch_w32(0x40, PCE_PMAP2);
 	ltq_switch_w32(0x40, PCE_PMAP3);
 
+//TODO search XRX200_BM_GCTRL_FR_RBC
+
 	/* RMON Counter Enable for all physical ports */
-	for (i = 0; i < 7; i++)
-		ltq_switch_w32(0x1, BM_PCFG(i));
+//	for (i = 0; i < 7; i++)
+//		ltq_switch_w32(0x1, BM_PCFG(i));
 
 	/* disable auto polling */
 	ltq_mdio_w32(0x0, MDIO_CLK_CFG0);
 
 	/* enable port statistic counters */
-	for (i = 0; i < 7; i++)
-		ltq_switch_w32(0x1, BM_PCFGx(i));
+//	for (i = 0; i < 7; i++)
+//		ltq_switch_w32(0x1, BM_PCFGx(i));
 
 	/* set IPG to 12 */
 	ltq_pmac_w32_mask(PMAC_IPG_MASK, 0xb, PMAC_RX_IPG);
@@ -1595,49 +1881,48 @@
 	xrx200sw_write_x(1, XRX200_BM_QUEUE_GCTRL_GL_MOD, 0);
 
 	for (i = 0; i < XRX200_MAX_VLAN; i++)
-		hw->vlan_vid[i] = i;
+		priv->vlan_vid[i] = i;
 }
 
-static void xrx200_hw_cleanup(struct xrx200_hw *hw)
+static void xrx200_hw_cleanup(struct xrx200_priv *priv)
 {
 	int i;
 
 	/* disable the switch */
 	ltq_mdio_w32_mask(MDIO_GLOB_ENABLE, 0, MDIO_GLOB);
 
-	/* free the channels and IRQs */
-	for (i = 0; i < 2; i++) {
-		ltq_dma_free(&hw->chan[i].dma);
-		if (hw->chan[i].dma.irq)
-			free_irq(hw->chan[i].dma.irq, hw);
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		ltq_dma_free(&priv->txq[i].dma);
 	}
 
+	ltq_dma_free(&priv->rxq.dma);
+
 	/* free the allocated RX ring */
 	for (i = 0; i < LTQ_DESC_NUM; i++)
-		dev_kfree_skb_any(hw->chan[XRX200_DMA_RX].skb[i]);
+		dev_kfree_skb_any(priv->rxq.dma_skb[i].skb);
 
 	/* clear the mdio bus */
-	mdiobus_unregister(hw->mii_bus);
-	mdiobus_free(hw->mii_bus);
+	mdiobus_unregister(priv->mii_bus);
+	mdiobus_free(priv->mii_bus);
 
 	/* release the clock */
-	clk_disable(hw->clk);
-	clk_put(hw->clk);
+	clk_disable(priv->clk);
+	clk_put(priv->clk);
 }
 
-static int xrx200_of_mdio(struct xrx200_hw *hw, struct device_node *np)
+static int xrx200_of_mdio(struct xrx200_priv *priv, struct device_node *np)
 {
-	hw->mii_bus = mdiobus_alloc();
-	if (!hw->mii_bus)
+	priv->mii_bus = mdiobus_alloc();
+	if (!priv->mii_bus)
 		return -ENOMEM;
 
-	hw->mii_bus->read = xrx200_mdio_rd;
-	hw->mii_bus->write = xrx200_mdio_wr;
-	hw->mii_bus->name = "lantiq,xrx200-mdio";
-	snprintf(hw->mii_bus->id, MII_BUS_ID_SIZE, "%x", 0);
+	priv->mii_bus->read = xrx200_mdio_rd;
+	priv->mii_bus->write = xrx200_mdio_wr;
+	priv->mii_bus->name = "lantiq,xrx200-mdio";
+	snprintf(priv->mii_bus->id, MII_BUS_ID_SIZE, "%x", 0);
 
-	if (of_mdiobus_register(hw->mii_bus, np)) {
-		mdiobus_free(hw->mii_bus);
+	if (of_mdiobus_register(priv->mii_bus, np)) {
+		mdiobus_free(priv->mii_bus);
 		return -ENXIO;
 	}
 
@@ -1655,6 +1940,7 @@
 	memset(p, 0, sizeof(struct xrx200_port));
 	p->phy_node = of_parse_phandle(port, "phy-handle", 0);
 	addr = of_get_property(p->phy_node, "reg", NULL);
+
 	if (!addr)
 		return;
 
@@ -1665,6 +1951,7 @@
 		p->flags = XRX200_PORT_TYPE_MAC;
 	else
 		p->flags = XRX200_PORT_TYPE_PHY;
+
 	priv->num_port++;
 
 	p->gpio = of_get_gpio_flags(port, 0, &p->gpio_flags);
@@ -1677,14 +1964,95 @@
 		}
 	/* is this port a wan port ? */
 	if (priv->wan)
-		priv->hw->wan_map |= BIT(p->num);
+		priv->wan_map |= BIT(p->num);
 
-	priv->port_map |= BIT(p->num);
+	priv->d_port_map |= BIT(p->num);
 
 	/* store the port id in the hw struct so we can map ports -> devices */
-	priv->hw->port_map[p->num] = priv->hw->num_devs;
+	priv->port_map[p->num] = 0;
+}
+
+static void xrx200_get_stats64(struct net_device *dev,
+			       struct rtnl_link_stats64 *storage)
+{
+	struct xrx200_priv *priv = netdev_priv(dev);
+	unsigned int start;
+	int i;
+
+//TODO are there HW registers?
+	do {
+		start = u64_stats_fetch_begin_irq(&priv->rxq.syncp);
+		storage->rx_packets = priv->rxq.rx_packets;
+		storage->rx_bytes = priv->rxq.rx_bytes;
+	} while (u64_stats_fetch_retry_irq(&priv->rxq.syncp, start));
+
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		do {
+			start = u64_stats_fetch_begin_irq(&priv->txq[i].syncp);
+			storage->tx_packets += priv->txq[i].tx_packets;
+			storage->tx_bytes += priv->txq[i].tx_bytes;
+			storage->tx_errors += priv->txq[i].tx_errors;
+			storage->tx_dropped += priv->txq[i].tx_dropped;
+		} while (u64_stats_fetch_retry_irq(&priv->txq[i].syncp, start));
+	}
+
+	do {
+		start = u64_stats_fetch_begin_irq(&priv->syncp);
+		storage->tx_errors += priv->tx_errors;
+	} while (u64_stats_fetch_retry_irq(&priv->syncp, start));
+}
+
+//TODO this too?
+// * int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
+// *	Called when a user wants to change the Maximum Transfer Unit
+// *	of a device.
+
+u16 glqid=0;
+
+static u16 xrx200_select_queue(struct net_device *dev, struct sk_buff *skb,
+			    void *accel_priv, select_queue_fallback_t fallback)
+{
+	u16 qid;
+
+	/*
+	 * The SoC seems to be slowed down by TX housekeeping, so for the
+	 * best network speed the TX housekeeping interrupt should be
+	 * scheduled on the other VPE.
+	 *
+	 * The default netdev queue selection causes TX speed drops because
+	 * userspace is sometimes scheduled on the same VPE that is doing
+	 * the housekeeping.
+	 *
+	 * The TX DMA IRQs should each be constrained to a single VPE, as
+	 * cycling through the VPEs would put the housekeeping on the same
+	 * VPE 50% of the time.
+	 */
+
+	//TODO corner cases: single queue, single core, constrained affinity
+
+
+#if 0
+	if (skb_rx_queue_recorded(skb))
+		qid = skb_get_rx_queue(skb);
+	else
+		qid = fallback(dev, skb);
+//#else
+// 	qid = glqid?1:0;
+
+// 	glqid = !glqid;
+#endif
+
+	//HACK only two VPEs max
+	if (smp_processor_id()) {
+		qid = 0;
+	} else {
+		qid = 1;
+	}
+
+	return qid;
 }
 
+
 static const struct net_device_ops xrx200_netdev_ops = {
 	.ndo_init		= xrx200_init,
 	.ndo_open		= xrx200_open,
@@ -1692,33 +2060,23 @@
 	.ndo_start_xmit		= xrx200_start_xmit,
 	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_validate_addr	= eth_validate_addr,
-	.ndo_get_stats		= xrx200_get_stats,
 	.ndo_tx_timeout		= xrx200_tx_timeout,
+	.ndo_get_stats64	= xrx200_get_stats64,
+	.ndo_select_queue	= xrx200_select_queue,
 };
 
-static void xrx200_of_iface(struct xrx200_hw *hw, struct device_node *iface, struct device *dev)
+static void xrx200_of_iface(struct xrx200_priv *priv, struct device_node *iface, struct device *dev)
 {
-	struct xrx200_priv *priv;
 	struct device_node *port;
 	const __be32 *wan;
 	const u8 *mac;
 
-	/* alloc the network device */
-	hw->devs[hw->num_devs] = alloc_etherdev(sizeof(struct xrx200_priv));
-	if (!hw->devs[hw->num_devs])
-		return;
-
 	/* setup the network device */
-	strcpy(hw->devs[hw->num_devs]->name, "eth%d");
-	hw->devs[hw->num_devs]->netdev_ops = &xrx200_netdev_ops;
-	hw->devs[hw->num_devs]->watchdog_timeo = XRX200_TX_TIMEOUT;
-	hw->devs[hw->num_devs]->needed_headroom = XRX200_HEADROOM;
-	SET_NETDEV_DEV(hw->devs[hw->num_devs], dev);
-
-	/* setup our private data */
-	priv = netdev_priv(hw->devs[hw->num_devs]);
-	priv->hw = hw;
-	priv->id = hw->num_devs;
+	strcpy(priv->net_dev->name, "eth%d");
+	priv->net_dev->netdev_ops = &xrx200_netdev_ops;
+	priv->net_dev->watchdog_timeo = XRX200_TX_TIMEOUT;
+	priv->net_dev->needed_headroom = XRX200_HEADROOM;
+	SET_NETDEV_DEV(priv->net_dev, dev);
 
 	mac = of_get_mac_address(iface);
 	if (mac)
@@ -1738,20 +2096,34 @@
 		if (of_device_is_compatible(port, "lantiq,xrx200-pdi-port"))
 			xrx200_of_port(priv, port);
 
-	/* register the actual device */
-	if (!register_netdev(hw->devs[hw->num_devs]))
-		hw->num_devs++;
 }
 
-static struct xrx200_hw xrx200_hw;
-
 static int xrx200_probe(struct platform_device *pdev)
 {
+	struct device *dev = &pdev->dev;
 	struct resource *res[4];
 	struct device_node *mdio_np, *iface_np, *phy_np;
 	struct of_phandle_iterator it;
 	int err;
 	int i;
+	struct xrx200_priv *priv;
+	struct net_device *net_dev;
+
+	/* alloc the network device */
+	net_dev = devm_alloc_etherdev_mqs(dev, sizeof(struct xrx200_priv),
+					  NUM_TX_QUEUES, 1);
+
+	if (!net_dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(net_dev);
+	priv->net_dev = net_dev;
+	priv->dev = dev;
+
+	net_dev->netdev_ops = &xrx200_netdev_ops;
+	SET_NETDEV_DEV(net_dev, dev);
+	net_dev->min_mtu = ETH_ZLEN;
+	net_dev->max_mtu = XRX200_DMA_DATA_LEN;
 
 	/* load the memory ranges */
 	for (i = 0; i < 4; i++) {
@@ -1761,10 +2133,12 @@
 			return -ENOENT;
 		}
 	}
+
 	xrx200_switch_membase = devm_ioremap_resource(&pdev->dev, res[0]);
 	xrx200_mdio_membase = devm_ioremap_resource(&pdev->dev, res[1]);
 	xrx200_mii_membase = devm_ioremap_resource(&pdev->dev, res[2]);
 	xrx200_pmac_membase = devm_ioremap_resource(&pdev->dev, res[3]);
+
 	if (!xrx200_switch_membase || !xrx200_mdio_membase ||
 			!xrx200_mii_membase || !xrx200_pmac_membase) {
 		dev_err(&pdev->dev, "failed to request and remap io ranges \n");
@@ -1775,91 +2149,117 @@
 		phy_np = it.node;
 		if (phy_np) {
 			struct platform_device *phy = of_find_device_by_node(phy_np);
-	
+
 			of_node_put(phy_np);
 			if (!platform_get_drvdata(phy))
 				return -EPROBE_DEFER;
 		}
 	}
 
+	priv->rxq.dma.irq = XRX200_DMA_IRQ + XRX200_DMA_RX;
+	priv->rxq.priv = priv;
+
+	//TODO this is a HACK, devicetree? or at least an array
+	priv->txq[0].dma.irq = XRX200_DMA_IRQ + XRX200_DMA_TX;
+#if (NUM_TX_QUEUES > 1)
+	priv->txq[1].dma.irq = XRX200_DMA_IRQ + XRX200_DMA_TX_2;
+#endif
+
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		priv->txq[i].priv = priv;
+		priv->txq[i].queue_id = i;
+	}
+
 	/* get the clock */
-	xrx200_hw.clk = clk_get(&pdev->dev, NULL);
-	if (IS_ERR(xrx200_hw.clk)) {
+	priv->clk = clk_get(&pdev->dev, NULL);
+	if (IS_ERR(priv->clk)) {
 		dev_err(&pdev->dev, "failed to get clock\n");
-		return PTR_ERR(xrx200_hw.clk);
+		return PTR_ERR(priv->clk);
 	}
 
 	/* bring up the dma engine and IP core */
-	xrx200_dma_init(&xrx200_hw);
-	xrx200_hw_init(&xrx200_hw);
-	tasklet_init(&xrx200_hw.chan[XRX200_DMA_TX].tasklet, xrx200_tx_housekeeping, (u32) &xrx200_hw.chan[XRX200_DMA_TX]);
-	tasklet_init(&xrx200_hw.chan[XRX200_DMA_TX_2].tasklet, xrx200_tx_housekeeping, (u32) &xrx200_hw.chan[XRX200_DMA_TX_2]);
+	err = xrx200_dma_init(priv);
+	if (err)
+		return err;
+
+	/* enable clock gate */
+	err = clk_prepare_enable(priv->clk);
+	if (err)
+		goto err_uninit_dma;
+
+	xrx200_hw_init(priv);
 
 	/* bring up the mdio bus */
 	mdio_np = of_find_compatible_node(pdev->dev.of_node, NULL,
 				"lantiq,xrx200-mdio");
 	if (mdio_np)
-		if (xrx200_of_mdio(&xrx200_hw, mdio_np))
+		if (xrx200_of_mdio(priv, mdio_np))
 			dev_err(&pdev->dev, "mdio probe failed\n");
 
 	/* load the interfaces */
 	for_each_child_of_node(pdev->dev.of_node, iface_np)
-		if (of_device_is_compatible(iface_np, "lantiq,xrx200-pdi")) {
-			if (xrx200_hw.num_devs < XRX200_MAX_DEV)
-				xrx200_of_iface(&xrx200_hw, iface_np, &pdev->dev);
-			else
-				dev_err(&pdev->dev,
-					"only %d interfaces allowed\n",
-					XRX200_MAX_DEV);
-		}
-
-	if (!xrx200_hw.num_devs) {
-		xrx200_hw_cleanup(&xrx200_hw);
-		dev_err(&pdev->dev, "failed to load interfaces\n");
-		return -ENOENT;
-	}
+			if (of_device_is_compatible(iface_np, "lantiq,xrx200-pdi")) {
+				xrx200_of_iface(priv, iface_np, &pdev->dev);
+				break;	/* hack: only the first interface is loaded */
+			}
 
-	xrx200sw_init(&xrx200_hw);
+	xrx200sw_init(priv);
 
 	/* set wan port mask */
-	ltq_pmac_w32(xrx200_hw.wan_map, PMAC_EWAN);
+	ltq_pmac_w32(priv->wan_map, PMAC_EWAN);
 
-	for (i = 0; i < xrx200_hw.num_devs; i++) {
-		xrx200_hw.chan[XRX200_DMA_RX].devs[i] = xrx200_hw.devs[i];
-		xrx200_hw.chan[XRX200_DMA_TX].devs[i] = xrx200_hw.devs[i];
-		xrx200_hw.chan[XRX200_DMA_TX_2].devs[i] = xrx200_hw.devs[i];
+	/* setup NAPI */
+	netif_napi_add(net_dev, &priv->rxq.napi, xrx200_poll_rx, 64);	/* was 32 */
+
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		netif_tx_napi_add(net_dev, &priv->txq[i].napi, xrx200_tx_housekeeping, 48);
 	}
 
-	/* setup NAPI */
-	init_dummy_netdev(&xrx200_hw.chan[XRX200_DMA_RX].dummy_dev);
-	netif_napi_add(&xrx200_hw.chan[XRX200_DMA_RX].dummy_dev,
-			&xrx200_hw.chan[XRX200_DMA_RX].napi, xrx200_poll_rx, 32);
+	net_dev->features |= NETIF_F_SG;
+	net_dev->hw_features |= NETIF_F_SG;
+	net_dev->vlan_features |= NETIF_F_SG;
 
-	platform_set_drvdata(pdev, &xrx200_hw);
+	platform_set_drvdata(pdev, priv);
+
+	err = register_netdev(net_dev);
+	if (err)
+		goto err_unprepare_clk;
 
 	return 0;
+
+err_unprepare_clk:
+	clk_disable_unprepare(priv->clk);
+
+err_uninit_dma:
+	xrx200_hw_cleanup(priv);
+
+	return err;
 }
 
 static int xrx200_remove(struct platform_device *pdev)
 {
-	struct net_device *dev = platform_get_drvdata(pdev);
-	struct xrx200_priv *priv;
+	int i;
+	struct xrx200_priv *priv = platform_get_drvdata(pdev);
+	struct net_device *net_dev = priv->net_dev;
 
-	if (!dev)
-		return 0;
+	/* free stack related instances */
 
-	priv = netdev_priv(dev);
+	netif_tx_stop_all_queues(net_dev);
 
-	/* free stack related instances */
-	netif_stop_queue(dev);
-	netif_napi_del(&xrx200_hw.chan[XRX200_DMA_RX].napi);
+	for (i = 0; i < NUM_TX_QUEUES; i++) {
+		netif_napi_del(&priv->txq[i].napi);
+	}
 
-	/* shut down hardware */
-	xrx200_hw_cleanup(&xrx200_hw);
+	netif_napi_del(&priv->rxq.napi);
 
 	/* remove the actual device */
-	unregister_netdev(dev);
-	free_netdev(dev);
+	unregister_netdev(net_dev);
+
+	/* release the clock */
+	clk_disable_unprepare(priv->clk);
+
+	/* shut down hardware */
+	xrx200_hw_cleanup(priv);
 
 	return 0;
 }
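A note on the error path added to the probe function above: the two labels release resources in the reverse order of setup. Below is a rough user-space model of that ordering only. The names `sketch_probe`, `step`, `log_step` and `probe_log` are made up for illustration and are not part of the driver, and the real probe has more steps than the three modelled here.

```c
#include <string.h>

/* Hypothetical stand-ins for the driver steps; each appends a tag to a log. */
static char probe_log[128];

static void log_step(const char *tag)
{
	strncat(probe_log, tag, sizeof(probe_log) - strlen(probe_log) - 1);
}

/* Record the step and fail it when its id matches fail_at (0 = never fail). */
static int step(const char *tag, int id, int fail_at)
{
	log_step(tag);
	return id == fail_at ? -1 : 0;
}

/* Same label ordering as the patched probe: clock first, then DMA/hardware. */
static int sketch_probe(int fail_at)
{
	int err;

	probe_log[0] = '\0';

	err = step("dma_init;", 1, fail_at);
	if (err)
		return err;

	err = step("clk_enable;", 2, fail_at);
	if (err)
		goto err_uninit_dma;

	err = step("register_netdev;", 3, fail_at);
	if (err)
		goto err_unprepare_clk;

	return 0;

err_unprepare_clk:
	log_step("clk_disable;");
err_uninit_dma:
	log_step("hw_cleanup;");
	return err;
}
```

Calling `sketch_probe(3)` simulates a failing register_netdev() and leaves `dma_init;clk_enable;register_netdev;clk_disable;hw_cleanup;` in `probe_log`, while `sketch_probe(2)` skips the clock release and goes straight to the cleanup step.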
--- a/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h	2019-03-05 17:58:03.000000000 +0100
+++ b/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h	2019-05-19 03:05:57.299963234 +0200
@@ -19,7 +19,7 @@
 #define LTQ_DMA_H__
 
 #define LTQ_DESC_SIZE		0x08	/* each descriptor is 64bit */
-#define LTQ_DESC_NUM		0x40	/* 64 descriptors / channel */
+#define LTQ_DESC_NUM		0x80	/* 128 descriptors / channel */
 
 #define LTQ_DMA_OWN		BIT(31) /* owner bit */
 #define LTQ_DMA_C		BIT(30) /* complete bit */
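For a sense of what doubling LTQ_DESC_NUM costs, here is a small arithmetic sketch. The helper names `ring_bytes` and `fits_cdlen` and the `OLD_`/`NEW_` macros are hypothetical. The 8-bit limit is only my reading of the "0xff mask" note next to the LTQ_DMA_CDLEN write in dma.c, so treat it as an assumption and not a datasheet fact.

```c
#include <stddef.h>

#define LTQ_DESC_SIZE	0x08	/* bytes per descriptor, from xway_dma.h */
#define OLD_DESC_NUM	0x40	/* descriptors per channel before the patch */
#define NEW_DESC_NUM	0x80	/* descriptors per channel after the patch */

/* Bytes of descriptor memory needed for one channel's ring. */
static size_t ring_bytes(unsigned int desc_num)
{
	return (size_t)desc_num * LTQ_DESC_SIZE;
}

/* Non-zero if desc_num fits an 8-bit length field (assumed 0xff mask). */
static int fits_cdlen(unsigned int desc_num)
{
	return (desc_num & ~0xffu) == 0;
}
```

Under these assumptions the ring grows from 512 to 1024 bytes per channel, and 0x80 still fits the length field, whereas 0x100 would not.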
--- a/arch/mips/lantiq/xway/dma.c	2019-03-05 17:58:03.000000000 +0100
+++ b/arch/mips/lantiq/xway/dma.c	2019-05-19 03:05:57.301963209 +0200
@@ -49,7 +49,10 @@
 #define DMA_IRQ_ACK		0x7e		/* IRQ status register */
 #define DMA_POLL		BIT(31)		/* turn on channel polling */
 #define DMA_CLK_DIV4		BIT(6)		/* polling clock divider */
-#define DMA_2W_BURST		BIT(1)		/* 2 word burst length */
+#define DMA_1W_BURST		0x0		/* 1 word burst length/no burst */
+#define DMA_2W_BURST		0x1		/* 2 word burst length */
+#define DMA_4W_BURST		0x2		/* 4 word burst length */
+#define DMA_8W_BURST		0x3		/* 8 word burst length */
 #define DMA_MAX_CHANNEL		20		/* the soc has 20 channels */
 #define DMA_ETOP_ENDIANNESS	(0xf << 8) /* endianness swap etop channels */
 #define DMA_WEIGHT	(BIT(17) | BIT(16))	/* default channel wheight */
@@ -138,7 +141,7 @@
 	spin_lock_irqsave(&ltq_dma_lock, flags);
 	ltq_dma_w32(ch->nr, LTQ_DMA_CS);
 	ltq_dma_w32(ch->phys, LTQ_DMA_CDBA);
-	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);
+	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);	//0xff mask
 	ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	wmb();
 	ltq_dma_w32_mask(0, DMA_CHAN_RST, LTQ_DMA_CCTRL);
@@ -155,7 +158,13 @@
 	ltq_dma_alloc(ch);
 
 	spin_lock_irqsave(&ltq_dma_lock, flags);
-	ltq_dma_w32(DMA_DESCPT, LTQ_DMA_CIE);
+
+	/*
+	 * Interrupt on end of packet, BIT(1), rather than on end of
+	 * descriptor, DMA_DESCPT (BIT(3)).
+	 */
+	ltq_dma_w32(BIT(1), LTQ_DMA_CIE);
+
 	ltq_dma_w32_mask(0, 1 << ch->nr, LTQ_DMA_IRNEN);
 	ltq_dma_w32(DMA_WEIGHT | DMA_TX, LTQ_DMA_CCTRL);
 	spin_unlock_irqrestore(&ltq_dma_lock, flags);
@@ -194,6 +203,12 @@
 	ltq_dma_w32(p, LTQ_DMA_PS);
 	switch (p) {
 	case DMA_PORT_ETOP:
+
+		/* 8-word bursts; data must be aligned to 4*N bytes or the DMA freezes */
+		/* TODO: use different burst lengths for TX and RX? (RX is fine at 1G eth) */
+		ltq_dma_w32_mask(0x3c, (DMA_8W_BURST << 4) | (DMA_8W_BURST << 2),
+			LTQ_DMA_PCTRL);
+
 		/*
 		 * Tell the DMA engine to swap the endianness of data frames and
 		 * drop packets if the channel arbitration fails.
@@ -241,10 +256,18 @@
 	for (i = 0; i < DMA_MAX_CHANNEL; i++) {
 		ltq_dma_w32(i, LTQ_DMA_CS);
 		ltq_dma_w32(DMA_CHAN_RST, LTQ_DMA_CCTRL);
-		ltq_dma_w32(DMA_POLL | DMA_CLK_DIV4, LTQ_DMA_CPOLL);
 		ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	}
 
+	/* TODO: 0x100 << 4 is the fastest for TX without fragments. */
+	/* 0x100 gives timeouts with fragments, 0x10 only under really _heavy_ load. */
+	/* TODO: not dependent on channel select (LTQ_DMA_CS); why was it in the for loop? */
+	ltq_dma_w32(DMA_POLL | (0x10 << 4), LTQ_DMA_CPOLL);
+
+	/* TODO: packet arbitration? Test different values. */
+	/* 0x3ff << 16: multiple burst count, 1 << 30: multiple burst arbitration, 1 << 31: packet arbitration, 1 << 0: reset (!) */
+	/* ltq_dma_w32((1 << 31) | 0x40000, LTQ_DMA_CTRL); */
+
 	id = ltq_dma_r32(LTQ_DMA_ID);
 	dev_info(&pdev->dev,
 		"Init done - hw rev: %X, ports: %d, channels: %d\n",

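To make the two new register writes in the dma.c patch easier to check, here is a stand-alone sketch of the values they produce. It assumes that `ltq_dma_w32_mask(clear, set, reg)` is a plain read-modify-write, and that the `<< 4` in the LTQ_DMA_CPOLL write places a poll counter at bit 4 and up. The helpers `reg_mask`, `pctrl_set_bursts` and `cpoll_value` are hypothetical names, not kernel functions, and I deliberately do not say which of the two burst fields is RX and which is TX.

```c
#include <stdint.h>

#define DMA_POLL	(1u << 31)	/* turn on channel polling, as in dma.c */
#define DMA_8W_BURST	0x3u		/* 8-word burst encoding from the patch */

/* Read-modify-write helper: drop the `clear` bits, then OR in `set`. */
static uint32_t reg_mask(uint32_t reg, uint32_t clear, uint32_t set)
{
	return (reg & ~clear) | set;
}

/* Program the two 2-bit burst fields at bits 5:4 and 3:2 of a PCTRL value. */
static uint32_t pctrl_set_bursts(uint32_t pctrl, uint32_t hi, uint32_t lo)
{
	return reg_mask(pctrl, 0x3c, ((hi & 0x3u) << 4) | ((lo & 0x3u) << 2));
}

/* Build a CPOLL value from a poll counter placed at bit 4 and up (assumed). */
static uint32_t cpoll_value(uint32_t counter)
{
	return DMA_POLL | (counter << 4);
}
```

With these definitions `pctrl_set_bursts(0, DMA_8W_BURST, DMA_8W_BURST)` is 0x3c and `cpoll_value(0x10)` is 0x80000100. The removed `DMA_POLL | DMA_CLK_DIV4` write (0x80000040) would correspond to a counter of 0x4 under the same assumption.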
Patch

--- a/arch/mips/lantiq/xway/dma.c	2019-02-12 19:46:14.000000000 +0100
+++ b/arch/mips/lantiq/xway/dma.c	2019-02-15 12:51:56.781495450 +0100
@@ -49,7 +49,10 @@ 
 #define DMA_IRQ_ACK		0x7e		/* IRQ status register */
 #define DMA_POLL		BIT(31)		/* turn on channel polling */
 #define DMA_CLK_DIV4		BIT(6)		/* polling clock divider */
-#define DMA_2W_BURST		BIT(1)		/* 2 word burst length */
+#define DMA_1W_BURST		0x0		/* 1 word burst length/no burst */
+#define DMA_2W_BURST		0x1		/* 2 word burst length */
+#define DMA_4W_BURST		0x2		/* 4 word burst length */
+#define DMA_8W_BURST		0x3		/* 8 word burst length */
 #define DMA_MAX_CHANNEL		20		/* the soc has 20 channels */
 #define DMA_ETOP_ENDIANNESS	(0xf << 8) /* endianness swap etop channels */
 #define DMA_WEIGHT	(BIT(17) | BIT(16))	/* default channel wheight */
@@ -138,7 +141,7 @@ 
 	spin_lock_irqsave(&ltq_dma_lock, flags);
 	ltq_dma_w32(ch->nr, LTQ_DMA_CS);
 	ltq_dma_w32(ch->phys, LTQ_DMA_CDBA);
-	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);
+	ltq_dma_w32(LTQ_DESC_NUM, LTQ_DMA_CDLEN);	//0xff mask
 	ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	wmb();
 	ltq_dma_w32_mask(0, DMA_CHAN_RST, LTQ_DMA_CCTRL);
@@ -155,7 +158,13 @@ 
 	ltq_dma_alloc(ch);
 
 	spin_lock_irqsave(&ltq_dma_lock, flags);
-	ltq_dma_w32(DMA_DESCPT, LTQ_DMA_CIE);
+
+	/*
+	 * Interrupt on end of packet, BIT(1), rather than on end of
+	 * descriptor, DMA_DESCPT (BIT(3)).
+	 */
+	ltq_dma_w32(BIT(1), LTQ_DMA_CIE);
+
 	ltq_dma_w32_mask(0, 1 << ch->nr, LTQ_DMA_IRNEN);
 	ltq_dma_w32(DMA_WEIGHT | DMA_TX, LTQ_DMA_CCTRL);
 	spin_unlock_irqrestore(&ltq_dma_lock, flags);
@@ -194,6 +203,12 @@ 
 	ltq_dma_w32(p, LTQ_DMA_PS);
 	switch (p) {
 	case DMA_PORT_ETOP:
+
+		/* 8-word bursts; data must be aligned to 4*N bytes or the DMA freezes */
+		/* TODO: use different burst lengths for TX and RX? (RX is fine at 1G eth) */
+		ltq_dma_w32_mask(0x3c, (DMA_8W_BURST << 4) | (DMA_8W_BURST << 2),
+			LTQ_DMA_PCTRL);
+
 		/*
 		 * Tell the DMA engine to swap the endianness of data frames and
 		 * drop packets if the channel arbitration fails.
@@ -241,10 +256,18 @@ 
 	for (i = 0; i < DMA_MAX_CHANNEL; i++) {
 		ltq_dma_w32(i, LTQ_DMA_CS);
 		ltq_dma_w32(DMA_CHAN_RST, LTQ_DMA_CCTRL);
-		ltq_dma_w32(DMA_POLL | DMA_CLK_DIV4, LTQ_DMA_CPOLL);
 		ltq_dma_w32_mask(DMA_CHAN_ON, 0, LTQ_DMA_CCTRL);
 	}
 
+	/* TODO: 0x100 << 4 is the fastest for TX without fragments. */
+	/* 0x100 gives timeouts with fragments, 0x10 only under really _heavy_ load. */
+	/* TODO: not dependent on channel select (LTQ_DMA_CS); why was it in the for loop? */
+	ltq_dma_w32(DMA_POLL | (0x10 << 4), LTQ_DMA_CPOLL);
+
+	/* TODO: packet arbitration? Test different values. */
+	/* 0x3ff << 16: multiple burst count, 1 << 30: multiple burst arbitration, 1 << 31: packet arbitration, 1 << 0: reset (!) */
+	/* ltq_dma_w32((1 << 31) | 0x40000, LTQ_DMA_CTRL); */
+
 	id = ltq_dma_r32(LTQ_DMA_ID);
 	dev_info(&pdev->dev,
 		"Init done - hw rev: %X, ports: %d, channels: %d\n",