[OpenWrt-Devel,RFC] kernel: drop -fno-reorder-blocks

Message ID 20190409093046.13401-1-zajec5@gmail.com
State RFC
Delegated to: Rafał Miłecki

Commit Message

Rafał Miłecki April 9, 2019, 9:30 a.m. UTC
From: Rafał Miłecki <rafal@milecki.pl>

Dropping this option significantly improves NAT performance on BCM5301X
(bcm53xx) for LAN to WAN traffic with GRO disabled (+14%). The kernel image
grows slightly, by about 1,7% to 3,4% depending on the target.

Unfortunately, this change may decrease NAT performance on some other
targets. For example, it seems to hurt mt7621 with GRO enabled.

https://www.gnu.org/software/gcc/news/reorder.html

The following test results come from OpenWrt with kernel 4.14.109 and
OpenWrt's default fq_codel.
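
For reference, rps_cpus takes a hexadecimal CPU bitmask, so the values noted
for two of the test setups below (2 on bcm53xx, 8 on mt7621) each select a
single CPU. A minimal Python sketch of that decoding follows. The helper name
cpus_from_rps_mask is made up for illustration and is not part of any tool
used in these tests.

```python
def cpus_from_rps_mask(mask_hex):
    """Return the CPU indices selected by a hexadecimal rps_cpus bitmask."""
    # Large masks are written as comma-separated groups; drop the commas.
    mask = int(mask_hex.replace(",", ""), 16)
    # Bit N set means CPU N may process received packets for that queue.
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(cpus_from_rps_mask("2"))  # [1]
print(cpus_from_rps_mask("8"))  # [3]
```

In other words, RPS steers receive processing to CPU 1 in the bcm53xx setup
and to CPU 3 in the mt7621 setup.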

**********

1) bcm53xx: BCM47094 SoC (echo 2 > rps_cpus)

zImage size: 1840424 → 1871328 (+1,68%)

a) gro off
LAN to WAN: 824 Mb/s → 940 Mb/s (+14,08%)
WAN to LAN: 935 Mb/s → 940 Mb/s (+0,53%)

b) gro on
LAN to WAN: 512 Mb/s → 534 Mb/s (+4,30%)
WAN to LAN: 539 Mb/s → 549 Mb/s (+1,85%)

**********

2) brcm47xx: BCM4706 SoC

vmlinux.lzma: 1536486 → 1588082 (+3,36%)

a) gro off
LAN to WAN: 152 Mb/s → 157 Mb/s (+3,29%)
WAN to LAN: 191 Mb/s → 182 Mb/s (-4,71%)

b) gro on
LAN to WAN: 223 Mb/s → 226 Mb/s (+1,35%)
WAN to LAN: 214 Mb/s → 214 Mb/s (+0,00%)

**********

3) ramips/mt7621: Ubiquiti ER-X (echo 8 > rps_cpus)

vmlinux size: 6084176 → 6248016 (+2,69%)

a) gro off
LAN to WAN: 415 Mb/s → 418 Mb/s (+0,72%)
WAN to LAN: 509 Mb/s → 543 Mb/s (+6,68%)

b) gro on
LAN to WAN: 640 Mb/s → 537 Mb/s (-16,09%)
WAN to LAN: 748 Mb/s → 683 Mb/s (-8,69%)

c) gro on [another run]
LAN to WAN: 648 Mb/s → 530 Mb/s (-18,21%)
WAN to LAN: 782 Mb/s → 691 Mb/s (-11,64%)

**********

4) ar71xx: Netgear WNR2200 (AR7241)

vmlinux size: 5106084 → 5244996 (+2,72%)

a) gro off
LAN to WAN: 94.1 Mb/s → 94.2 Mb/s (~ +0%)
WAN to LAN: 94.2 Mb/s → 94.2 Mb/s (~ +0%)

b) gro on
LAN to WAN: 94.2 Mb/s → 94.2 Mb/s (~ +0%)
WAN to LAN: 94.1 Mb/s → 94.2 Mb/s (~ +0%)

**********

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
---
 target/linux/generic/pending-4.14/201-extra_optimization.patch | 2 +-
 target/linux/generic/pending-4.19/201-extra_optimization.patch | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
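
The percentages in the results above are plain relative changes,
(after - before) / before. A small Python sketch for recomputing them from the
raw numbers follows. The function names are illustrative only, and the
comma decimal separator is kept to match the formatting used in this message.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def fmt(value):
    """Format with an explicit sign, two decimals and a comma separator."""
    return f"{value:+.2f}%".replace(".", ",")


# bcm53xx, gro off, LAN to WAN
print(fmt(percent_change(824, 940)))          # +14,08%
# mt7621, gro on, LAN to WAN
print(fmt(percent_change(640, 537)))          # -16,09%
# bcm53xx zImage size
print(fmt(percent_change(1840424, 1871328)))  # +1,68%
```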

Comments

Rafał Miłecki April 9, 2019, 9:33 a.m. UTC | #1
On 09.04.2019 11:30, Rafał Miłecki wrote:
> 1) bcm53xx: BCM47094 SoC (echo 2 > rps_cpus)
> 
> zImage size: 1840424 → 1871328 (+1,68%)
> 
> a) gro off
> LAN to WAN: 824 Mb/s → 940 Mb/s (+14,08%)
> WAN to LAN: 935 Mb/s → 940 Mb/s (+0,53%)
> 
> b) gro on
> LAN to WAN: 512 Mb/s → 534 Mb/s (+4,30%)
> WAN to LAN: 539 Mb/s → 549 Mb/s (+1,85%)

I was naturally curious why this change affects bcm53xx, so I used perf to
profile the kernel before and after the change.

I'm not sure how to interpret the results. One thing I noticed is lower CPU
usage in __softirqentry_text_start. Could that be it?

You can see the FlameGraphs at files.zajec.net/openwrt/fno-reorder-blocks/

P.S.
I used FlameGraph's difffolded.pl to compare the LAN to WAN profiles before and
after the change. It seems to highlight the same thing:
__softirqentry_text_start. I'm still unsure what it means and whether the same
improvement can be achieved some other way.

**********

LAN to WAN

1) Before the patch (824 Mb/s):
+    9,61%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,22%  swapper      [kernel.kallsyms]       [k] __softirqentry_text_start
+    5,14%  swapper      [kernel.kallsyms]       [k] l2c210_inv_range
+    4,88%  ksoftirqd/1  [kernel.kallsyms]       [k] v7_dma_clean_range
+    3,93%  swapper      [kernel.kallsyms]       [k] bcma_host_soc_read32
+    3,43%  ksoftirqd/1  [kernel.kallsyms]       [k] __netif_receive_skb_core
+    3,01%  swapper      [kernel.kallsyms]       [k] arch_cpu_idle
+    2,81%  ksoftirqd/1  [kernel.kallsyms]       [k] l2c210_clean_range
+    2,15%  ksoftirqd/1  [kernel.kallsyms]       [k] bgmac_start_xmit
+    2,02%  swapper      [kernel.kallsyms]       [k] bgmac_poll
+    1,90%  ksoftirqd/1  [kernel.kallsyms]       [k] __dev_queue_xmit
+    1,73%  ksoftirqd/1  [kernel.kallsyms]       [k] nf_hook_slow
+    1,34%  ksoftirqd/1  [kernel.kallsyms]       [k] __local_bh_enable_ip
+    1,05%  ksoftirqd/1  [kernel.kallsyms]       [k] skb_pull_rcsum

2) After the patch (940 Mb/s):

+   11,07%  swapper       [kernel.kallsyms]       [k] v7_dma_inv_range
+    5,76%  swapper       [kernel.kallsyms]       [k] __softirqentry_text_start
+    5,72%  ksoftirqd/1   [kernel.kallsyms]       [k] v7_dma_clean_range
+    5,37%  swapper       [kernel.kallsyms]       [k] l2c210_inv_range
+    4,34%  swapper       [kernel.kallsyms]       [k] bcma_host_soc_read32
+    3,65%  ksoftirqd/1   [kernel.kallsyms]       [k] __netif_receive_skb_core
+    3,18%  ksoftirqd/1   [kernel.kallsyms]       [k] l2c210_clean_range
+    2,71%  swapper       [kernel.kallsyms]       [k] bgmac_poll
+    2,59%  swapper       [kernel.kallsyms]       [k] arch_cpu_idle
+    1,97%  ksoftirqd/1   [kernel.kallsyms]       [k] bgmac_start_xmit
+    1,67%  ksoftirqd/1   [kernel.kallsyms]       [k] __dev_queue_xmit
+    1,54%  ksoftirqd/1   [kernel.kallsyms]       [k] nf_hook_slow
+    1,16%  ksoftirqd/1   [kernel.kallsyms]       [k] ip_rcv
+    1,08%  ksoftirqd/1   [kernel.kallsyms]       [k] skb_pull_rcsum
+    1,07%  ksoftirqd/1   [kernel.kallsyms]       [k] netif_skb_features
+    1,04%  ksoftirqd/1   [kernel.kallsyms]       [k] __local_bh_enable_ip

**********

WAN to LAN

1) Before the patch (935 Mb/s):
+   10,55%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,01%  swapper      [kernel.kallsyms]       [k] __softirqentry_text_start
+    5,56%  swapper      [kernel.kallsyms]       [k] l2c210_inv_range
+    5,55%  ksoftirqd/1  [kernel.kallsyms]       [k] v7_dma_clean_range
+    4,36%  swapper      [kernel.kallsyms]       [k] bcma_host_soc_read32
+    2,70%  ksoftirqd/1  [kernel.kallsyms]       [k] l2c210_clean_range
+    2,65%  swapper      [kernel.kallsyms]       [k] arch_cpu_idle
+    2,43%  ksoftirqd/1  [kernel.kallsyms]       [k] __netif_receive_skb_core
+    2,34%  ksoftirqd/1  [kernel.kallsyms]       [k] __dev_queue_xmit
+    2,19%  swapper      [kernel.kallsyms]       [k] bgmac_poll
+    2,08%  ksoftirqd/1  [kernel.kallsyms]       [k] bgmac_start_xmit
+    1,73%  ksoftirqd/1  [kernel.kallsyms]       [k] nf_hook_slow
+    1,52%  ksoftirqd/1  [kernel.kallsyms]       [k] __local_bh_enable_ip
+    1,45%  ksoftirqd/1  [kernel.kallsyms]       [k] ip_rcv
+    1,13%  ksoftirqd/1  [kernel.kallsyms]       [k] skb_pull_rcsum
+    1,11%  ksoftirqd/1  [kernel.kallsyms]       [k] ip_finish_output2
+    1,06%  ksoftirqd/1  [kernel.kallsyms]       [k] netif_skb_features

2) After the patch (940 Mb/s):

+   11,73%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,05%  ksoftirqd/1  [kernel.kallsyms]       [k] v7_dma_clean_range
+    5,94%  swapper      [kernel.kallsyms]       [k] l2c210_inv_range
+    4,79%  swapper      [kernel.kallsyms]       [k] __softirqentry_text_start
+    4,08%  swapper      [kernel.kallsyms]       [k] bcma_host_soc_read32
+    3,05%  ksoftirqd/1  [kernel.kallsyms]       [k] __netif_receive_skb_core
+    2,98%  ksoftirqd/1  [kernel.kallsyms]       [k] l2c210_clean_range
+    2,53%  swapper      [kernel.kallsyms]       [k] bgmac_poll
+    2,36%  ksoftirqd/1  [kernel.kallsyms]       [k] __dev_queue_xmit
+    2,15%  ksoftirqd/1  [kernel.kallsyms]       [k] bgmac_start_xmit
+    2,10%  swapper      [kernel.kallsyms]       [k] arch_cpu_idle
+    1,64%  ksoftirqd/1  [kernel.kallsyms]       [k] nf_hook_slow
+    1,33%  ksoftirqd/1  [kernel.kallsyms]       [k] ip_rcv
+    1,28%  ksoftirqd/1  [kernel.kallsyms]       [k] netif_skb_features
+    1,27%  ksoftirqd/1  [kernel.kallsyms]       [k] __local_bh_enable_ip
+    1,02%  swapper      [kernel.kallsyms]       [k] __skb_flow_dissect
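
The before/after listings above can also be compared mechanically. The
following is only a rough Python sketch, not what difffolded.pl does: it parses
perf report lines in the format shown above and prints the per-symbol change
in overhead. The function names and the two-line excerpts (taken from the
WAN to LAN listings) are chosen for illustration.

```python
import re

# Matches lines such as:
# +   10,55%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
LINE_RE = re.compile(
    r"^\+?\s*(\d+[.,]\d+)%\s+(\S+)\s+\[[^\]]+\]\s+\[\w\]\s+(\S+)\s*$"
)


def parse_perf_report(text):
    """Map (comm, symbol) to overhead in percent."""
    overheads = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # The listings use a comma as the decimal separator.
            value = float(match.group(1).replace(",", "."))
            overheads[(match.group(2), match.group(3))] = value
    return overheads


def overhead_deltas(before, after):
    """Rows (key, before, after, delta) for entries in both profiles."""
    rows = [
        (key, before[key], after[key], round(after[key] - before[key], 2))
        for key in before.keys() & after.keys()
    ]
    return sorted(rows, key=lambda row: row[3])


BEFORE = """
+   10,55%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
+    6,01%  swapper      [kernel.kallsyms]       [k] __softirqentry_text_start
"""
AFTER = """
+   11,73%  swapper      [kernel.kallsyms]       [k] v7_dma_inv_range
+    4,79%  swapper      [kernel.kallsyms]       [k] __softirqentry_text_start
"""

for (comm, symbol), old, new, delta in overhead_deltas(
    parse_perf_report(BEFORE), parse_perf_report(AFTER)
):
    print(f"{delta:+.2f}  {comm:12s} {symbol}")
```

Note that these values are shares of samples within each run, so a lower share
for one symbol does not by itself show that it costs fewer cycles per packet,
especially when throughput differs between the runs.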

Patch

diff --git a/target/linux/generic/pending-4.14/201-extra_optimization.patch b/target/linux/generic/pending-4.14/201-extra_optimization.patch
index c7790657fd..3f7613d3dd 100644
--- a/target/linux/generic/pending-4.14/201-extra_optimization.patch
+++ b/target/linux/generic/pending-4.14/201-extra_optimization.patch
@@ -26,7 +26,7 @@  Signed-off-by: Felix Fietkau <nbd@nbd.name>
 +KBUILD_CFLAGS	+= -O2 $(call cc-disable-warning,maybe-uninitialized,) $(EXTRA_OPTIMIZATION)
  else
 -KBUILD_CFLAGS   += -O2
-+KBUILD_CFLAGS   += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
++KBUILD_CFLAGS   += -O2 -fno-tree-ch $(EXTRA_OPTIMIZATION)
  endif
  endif
  
diff --git a/target/linux/generic/pending-4.19/201-extra_optimization.patch b/target/linux/generic/pending-4.19/201-extra_optimization.patch
index d86e29fc75..f002c49676 100644
--- a/target/linux/generic/pending-4.19/201-extra_optimization.patch
+++ b/target/linux/generic/pending-4.19/201-extra_optimization.patch
@@ -26,7 +26,7 @@  Signed-off-by: Felix Fietkau <nbd@nbd.name>
 +KBUILD_CFLAGS	+= -O2 $(call cc-disable-warning,maybe-uninitialized,) $(EXTRA_OPTIMIZATION)
  else
 -KBUILD_CFLAGS   += -O2
-+KBUILD_CFLAGS   += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
++KBUILD_CFLAGS   += -O2 -fno-tree-ch $(EXTRA_OPTIMIZATION)
  endif
  endif