
Strange latency spikes / TX network stalls on Sun Fire X4150 (x86) and e1000e

Message ID: 88c43001441945e1431609db252b69e7@visp.net.lb
State: RFC, archived
Delegated to: David Miller

Commit Message

Denys Fedoryshchenko May 17, 2012, 1:42 p.m. UTC
Found the commit that causes the problem:

author	Tom Herbert <therbert@google.com>
Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
committer	David S. Miller <davem@davemloft.net>
Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
commit	3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
tree	d6670a4f94b2b9dedacc38edb6f0e1306b889f6b	tree | snapshot
parent	114cf5802165ee93e3ab461c9c505cd94a08b800	commit | diff
e1000e: Support for byte queue limits

Changes to e1000e to use byte queue limits.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

If I revert it, the problem disappears.

How I reproduce it:
In two consoles, run a "fast" ping to a nearby host:
ping 194.146.XXX.XXX -s1472 -i0.0001
ping 194.146.XXX.XXX -s1472 -i0.1

In a third console, ssh to the "problem" host, open mcedit, and just
scroll down a large text file.
After a few seconds some "stalls" occur, and in the ping history I can
see:
1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms


If I apply the small patch below, the problem disappears. It is certainly
not a solution, but let me know how I can help debug the problem further.

--- netdev.c    2012-05-12 20:08:37.000000000 +0300
+++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
@@ -1135,7 +1135,7 @@

        tx_ring->next_to_clean = i;

-       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
+//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);

 #define TX_WAKE_THRESHOLD 32
        if (count && netif_carrier_ok(netdev) &&
@@ -2263,7 +2263,7 @@
                e1000_put_txbuf(adapter, buffer_info);
        }

-       netdev_reset_queue(adapter->netdev);
+//     netdev_reset_queue(adapter->netdev);
        size = sizeof(struct e1000_buffer) * tx_ring->count;
        memset(tx_ring->buffer_info, 0, size);

@@ -5056,7 +5056,7 @@
        /* if count is 0 then mapping error has occurred */
        count = e1000_tx_map(adapter, skb, first, max_per_txd,
nr_frags, mss);
        if (count) {
-               netdev_sent_queue(netdev, skb->len);
+//             netdev_sent_queue(netdev, skb->len);
                e1000_tx_queue(adapter, tx_flags, count);
                /* Make sure there is space in the ring for the next
send. */
                e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);



On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
> Hi
>
> I have two identical servers, Sun Fire X4150, running different
> flavors of Linux: x86_64 and i386.
> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> The interface I am currently using:
> #ethtool -i eth0
> driver: e1000e
> version: 1.9.5-k
> firmware-version: 2.1-11
> bus-info: 0000:04:00.0
> There are 2 CPUs: Intel(R) Xeon(R) CPU E5440 @ 2.83GHz.
>
> The i386 box was acting as NAT and shaper, and as soon as I removed the
> shaper from it, I started to experience strange lockups: traffic is
> normal for 5-30 seconds, then there is a short lockup of 500-3000 ms
> (usually around 1000 ms) with the dropped-packets counter increasing. I
> suspected it was due to load, but it seems I was wrong.
> Recently, on another server, the x86_64 one I use for development, I
> upgraded the kernel (it was old, from the 2.6 series) and started to see
> the same latency spikes on a completely idle machine: while just running
> mc and, for example, typing in a text editor, I notice "stalls".
> After investigating a little more, I also noticed a small number of
> drops on the interface. No tcpdump is running, and the machine is
> otherwise idle; the only traffic is some small broadcasts from the
> network, my ssh session, and ping.
>
> Dropped packets in ifconfig:
>           RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
> The counter sometimes increases when this stall happens.
>
> ethtool -S is clean; there are no dropped packets there.
>
> I tried to check the load (mpstat and perf); there is nothing
> suspicious, and latencytop doesn't show anything suspicious either.
> dropwatch reports a lot of drops, but mostly because of broadcasts and
> the like; tcpdump at the moment of such drops doesn't show anything
> suspicious.
> I changed the qdisc from the default pfifo_fast to bfifo, without any
> result.
> Tried: ethtool -K eth0 tso off gso off gro off sg off, no result.
> The problem occurs on 3.3.6 - 3.4.0-rc7, most probably on 3.3.0 as well,
> but I don't remember for sure. I think it probably does not occur on
> some kernels such as 3.1; I will check that soon, because it is not
> always reliable to reproduce. All tests were done on 3.4.0-rc7.
>
> I also ran tcpdump in the background, plus iptables logging with
> timestamps, and at the time a stall occurred it seems I was still
> receiving packets properly; with iperf UDP (from another host to this
> Sun Fire) no packets were missing at those moments either. But I am
> sure the RX interface error counters are increasing.
> If I run iperf from the Sun Fire to a test host, there is packet loss
> at the moments when the stall occurs.
>
> I suspect that for some reason the network card stops transmitting, but
> I am unable to pinpoint the issue. All other hosts in this network are
> fine and don't have such problems.
> Can you help me with that, please? I can provide more debug
> information, compile with patches, etc. I will also try to fall back
> to the 3.1 and 3.0 kernels.
>
> Here is how it occurs and how I reproduce it:
> I just open a file and start scrolling it in mc, then in another
> console I run ping:
> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
> ttl=64 time=0.485 ms
> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
> ttl=64 time=0.470 ms
> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
> ttl=64 time=424 ms
> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
> ttl=64 time=324 ms
> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
> ttl=64 time=214 ms
> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
> ttl=64 time=104 ms
> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
> ttl=64 time=1.15 ms
> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
> ttl=64 time=0.477 ms
> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
> ttl=64 time=0.667 ms
>
> Remote host tcpdump:
> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 163, length 1480
> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 164, length 1480
> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 164, length 1480
> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 165, length 1480
> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 165, length 1480
> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 166, length 1480
> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 166, length 1480
> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 167, length 1480
> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 167, length 1480
> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 168, length 1480
> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 168, length 1480
>
> Local host(SunFire) tcpdump:
> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 162, length 1480
> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 163, length 1480
> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 163, length 1480
> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 164, length 1480
> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 165, length 1480
> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 166, length 1480
> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 167, length 1480
> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 164, length 1480
> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 168, length 1480
> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 165, length 1480
> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 166, length 1480
> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 167, length 1480
> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 168, length 1480
> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
> request, id 3486, seq 169, length 1480
> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
> reply, id 3486, seq 169, length 1480
>
> lspci -t
> centaur src # lspci -t
> -[0000:00]-+-00.0
>            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>            |               |               \-02.0-[04]--+-00.0
>            |               |                            \-00.1
>            |               \-00.3-[05]--
>            +-03.0-[06]--
>            +-04.0-[07]----00.0
>            +-05.0-[08]--
>            +-06.0-[09]--
>            +-07.0-[0a]--
>            +-08.0
>            +-10.0
>            +-10.1
>            +-10.2
>            +-11.0
>            +-13.0
>            +-15.0
>            +-16.0
>            +-1c.0-[0b]--+-00.0
>            |            \-00.1
>            +-1d.0
>            +-1d.1
>            +-1d.2
>            +-1d.3
>            +-1d.7
>            +-1e.0-[0c]----05.0
>            +-1f.0
>            +-1f.1
>            +-1f.2
>            \-1f.3
> lspci
> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
> Controller Hub (rev b1)
> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 2 (rev b1)
> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 3 (rev b1)
> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x8 Port 4-5 (rev b1)
> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 5 (rev b1)
> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x8 Port 6-7 (rev b1)
> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
> x4 Port 7 (rev b1)
> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
> Engine (rev b1)
> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
> Registers (rev b1)
> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
> Registers (rev b1)
> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
> Registers (rev b1)
> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
> Registers (rev b1)
> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
> Registers (rev b1)
> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
> PCI Express Root Port 1 (rev 09)
> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #1 (rev 09)
> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #2 (rev 09)
> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #3 (rev 09)
> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset UHCI USB Controller #4 (rev 09)
> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
> Chipset EHCI USB2 Controller (rev 09)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
> LPC Interface Controller (rev 09)
> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
> Controller (rev 09)
> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI
> Controller (rev 09)
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus
> Controller (rev 09)
> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Upstream Port (rev 01)
> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to
> PCI-X Bridge (rev 01)
> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Downstream Port E1 (rev 01)
> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
> Downstream Port E3 (rev 01)
> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
> Ethernet Controller (rev 06)
> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
> Graphics Family
>
>
> dmesg:
> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
> 7.3.21-k8-NAPI
> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:f8
> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network
> Connection
> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: 
> FFFFFF-0FF
> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:f9
> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network
> Connection
> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: 
> FFFFFF-0FF
> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:fa
> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network
> Connection
> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: 
> C83246-002
> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width
> x4) 00:1e:68:04:99:fb
> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network
> Connection
> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: 
> C83246-002
> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
> Flow Control: None
>

---
Network engineer
Denys Fedoryshchenko

Dora Highway - Center Cebaco - 2nd Floor
Beirut, Lebanon
Tel:	+961 1 247373
E-Mail: denys@visp.net.lb

Comments

Denys Fedoryshchenko May 17, 2012, 4:54 p.m. UTC | #1
I also notice that the limit constantly changes over time (even though I
am not touching anything).

centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0

Is it supposed to be like this?
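
A minimal polling sketch for watching these fluctuations continuously; it
assumes the same eth0/tx-0 sysfs path shown in the grep output above and a
100 ms sampling interval (illustrative only, not part of the driver or the
original report):

/*
 * Sketch: poll the BQL "limit" attribute for eth0/tx-0 and report changes.
 * Build with: cc -O2 -o bql_watch bql_watch.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define LIMIT_PATH "/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit"

int main(void)
{
        long prev = -1;

        for (;;) {
                FILE *f = fopen(LIMIT_PATH, "r");
                char buf[64];

                if (!f) {
                        perror(LIMIT_PATH);
                        return EXIT_FAILURE;
                }
                if (fgets(buf, sizeof(buf), f)) {
                        long cur = strtol(buf, NULL, 10);

                        if (cur != prev) {
                                printf("limit changed: %ld -> %ld\n",
                                       prev, cur);
                                prev = cur;
                        }
                }
                fclose(f);
                usleep(100000);         /* sample every 100 ms */
        }
}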

On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
> [full quote of the previous messages trimmed; see the complete text above]

---
Network engineer
Denys Fedoryshchenko

Dora Highway - Center Cebaco - 2nd Floor
Beirut, Lebanon
Tel:	+961 1 247373
E-Mail: denys@visp.net.lb
Denys Fedoryshchenko May 18, 2012, 2:04 p.m. UTC | #2
It seems the logic in BQL has serious issues. The worst part is that if
someone doesn't want limits (especially ones as low as this), there is no
way to disable BQL in the kernel configuration; the only option is tuning
each interface through the sysfs values.

I just added a quick debug printk:
if (limit != dql->limit) {
+                printk("New limit %d\n", dql->limit);
                 dql->limit = limit;
                 ovlimit = 0;
}

And got these numbers:
[   18.696839] New limit 0
[   19.622967] New limit 42
[   20.037810] New limit 165
[   35.473666] New limit 386
[   37.418591] New limit 1374
[   37.420064] New limit 6432
[   39.209480] New limit 16548
[   39.214773] New limit 1704
[   40.696065] New limit 6762
[   40.696390] New limit 15564
[   41.921120] New limit 25788
[   41.921165] New limit 388
[   42.696286] New limit 534
[   42.696539] New limit 1096
[   42.696719] New limit 2304
[   53.360394] New limit 24334
[   54.696072] New limit 484
[   54.696135] New limit 934

This means the limit sometimes drops below the MTU, and until the queue
limit is increased again, traffic appears "stalled" if there is a large
packet in the queue. BQL probably miscalculates the queue as full because
of some specific handling of sent packets in e1000e on this particular
hardware. It should not be full: it is a 1 Gbps wire and it is empty. So
as a result, instead of eliminating latency, BQL is adding it.

I can make a patch that keeps the minimum BQL value from going below MTU +
overhead; is that acceptable? It would probably resolve the issue, but it
is more of a workaround and safety fuse than a real solution.
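
A rough illustration of the clamp being proposed (a sketch only:
bql_clamp_limit() and the 18-byte Ethernet header + FCS overhead are
illustrative assumptions, not the real lib/dynamic_queue_limits.c code):

#include <stdio.h>

static unsigned int bql_clamp_limit(unsigned int new_limit, unsigned int mtu)
{
        unsigned int min_frame = mtu + 18;      /* header + FCS, no VLAN tag */

        return new_limit < min_frame ? min_frame : new_limit;
}

int main(void)
{
        /* With a 1500-byte MTU, the 388-byte limit seen in the log above
         * would be raised to 1518; larger limits pass through unchanged. */
        printf("%u\n", bql_clamp_limit(388, 1500));     /* -> 1518  */
        printf("%u\n", bql_clamp_limit(25788, 1500));   /* -> 25788 */
        return 0;
}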

On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
> [full quote of the previous messages trimmed; see the complete text above]

---
Network engineer
Denys Fedoryshchenko

Dora Highway - Center Cebaco - 2nd Floor
Beirut, Lebanon
Tel:	+961 1 247373
E-Mail: denys@visp.net.lb
Tom Herbert May 19, 2012, 2:07 a.m. UTC | #3
On Fri, May 18, 2012 at 7:04 AM, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
> It seems the logic in BQL has serious issues. The worst thing is that if
> someone doesn't want limits (especially ones as low as this), there is no
> way to disable BQL in the kernel configuration, only by tuning each
> interface's sysfs values.
>
echo max > /sys/class/net/ethX/queues/tx-Y/byte_queue_limits/limit_min
should disable BQL for a queue.
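
For example, to do this on every TX queue of eth0 and then verify it, something
like the following should work (a sketch assuming the interface name; the paths
match the sysfs listings quoted below):

for q in /sys/class/net/eth0/queues/tx-*/byte_queue_limits; do
    echo max > "$q/limit_min"   # raise the floor so the dynamic limit can never throttle the queue
done
grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*   # confirm the new limit_min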

> I just did short debug:
> if (limit != dql->limit) {
> +                printk("New limit %d\n", dql->limit);
>                dql->limit = limit;
>                ovlimit = 0;
> }
>
> And got this numbers:
> [   18.696839] New limit 0
> [   19.622967] New limit 42
> [   20.037810] New limit 165
> [   35.473666] New limit 386
> [   37.418591] New limit 1374
> [   37.420064] New limit 6432
> [   39.209480] New limit 16548
> [   39.214773] New limit 1704
> [   40.696065] New limit 6762
> [   40.696390] New limit 15564
> [   41.921120] New limit 25788
> [   41.921165] New limit 388
> [   42.696286] New limit 534
> [   42.696539] New limit 1096
> [   42.696719] New limit 2304
> [   53.360394] New limit 24334
> [   54.696072] New limit 484
> [   54.696135] New limit 934
>
> This means the limit sometimes goes below the MTU, and until the queue limit
> is increased again I will see traffic "stalled" if there is a large packet
> in the queue. Probably BQL miscalculates the queue as full because of some
> specific handling of sent packets by e1000e on this specific hardware. It
> should not be full - it is a 1Gbps wire, and it is empty. So as a result,
> instead of eliminating latency, it is adding it.
>

Not expected; we've been running e1000e with BQL for a while.  A few questions:

1) What kernel are you running?
2) What sort of traffic load are you using?
3) Have you tried a different interrupt mode?

Per #3, I am wondering if dynamic conservative mode interrupt
throttling is not producing interrupts deterministically for BQL.
I'll try to reproduce the problem in this mode.
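
In the meantime, a quick way to check the theory quoted above is to sample the
BQL state during a stall and see whether the limit really drops below one
full-size frame while inflight stays non-zero (a rough sketch; eth0/tx-0 and
the sysfs paths are taken from the earlier reports):

cd /sys/class/net/eth0/queues/tx-0/byte_queue_limits
while :; do
    # timestamp, bytes currently in flight, and the current dynamic limit
    echo "$(date +%s.%N) inflight=$(cat inflight) limit=$(cat limit)"
    sleep 0.05
done > /tmp/bql-trace.log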

> I can make a patch that keeps the minimum BQL value at no less than MTU +
> overhead, is that ok?
> It would probably solve the issue, but it is more of a workaround and a
> safety fuse than a solution.
>
That would just be a band-aid and could turn this into a latent issue
for the future.

Tom

>
> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>>
>> Also i notice, limit constantly changing over time (even i am not
>> touching it).
>>
>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>
>> Is it supposed to be like this?
>>
>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
>>>
>>> Found commit that cause problem:
>>>
>>> author  Tom Herbert <therbert@google.com>
>>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
>>> committer       David S. Miller <davem@davemloft.net>
>>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
>>> commit  3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>>> tree    d6670a4f94b2b9dedacc38edb6f0e1306b889f6b        tree | snapshot
>>> parent  114cf5802165ee93e3ab461c9c505cd94a08b800        commit | diff
>>> e1000e: Support for byte queue limits
>>>
>>> Changes to e1000e to use byte queue limits.
>>>
>>> Signed-off-by: Tom Herbert <therbert@google.com>
>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>> If i reverse it, problem disappearing.
>>>
>>> How i reproduce it:
>>> In two consoles do "fast" ping to nearby host
>>> ping 194.146.XXX.XXX -s1472 -i0.0001
>>> ping 194.146.XXX.XXX -s1472 -i0.1
>>>
>>> For third open ssh to host with "problem", open mcedit, and just
>>> scroll down large text file.
>>> After few seconds some "stalls" will occur, and in ping history i can
>>> see:
>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>>>
>>>
>>> If i apply small patch - problem will disappear. Sure it is not a
>>> solution, but
>>> let me know how i can help to debug problem more.
>>>
>>> --- netdev.c    2012-05-12 20:08:37.000000000 +0300
>>> +++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
>>> @@ -1135,7 +1135,7 @@
>>>
>>>        tx_ring->next_to_clean = i;
>>>
>>> -       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>> +//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>
>>>  #define TX_WAKE_THRESHOLD 32
>>>        if (count && netif_carrier_ok(netdev) &&
>>> @@ -2263,7 +2263,7 @@
>>>                e1000_put_txbuf(adapter, buffer_info);
>>>        }
>>>
>>> -       netdev_reset_queue(adapter->netdev);
>>> +//     netdev_reset_queue(adapter->netdev);
>>>        size = sizeof(struct e1000_buffer) * tx_ring->count;
>>>        memset(tx_ring->buffer_info, 0, size);
>>>
>>> @@ -5056,7 +5056,7 @@
>>>        /* if count is 0 then mapping error has occurred */
>>>        count = e1000_tx_map(adapter, skb, first, max_per_txd,
>>> nr_frags, mss);
>>>        if (count) {
>>> -               netdev_sent_queue(netdev, skb->len);
>>> +//             netdev_sent_queue(netdev, skb->len);
>>>                e1000_tx_queue(adapter, tx_flags, count);
>>>                /* Make sure there is space in the ring for the next send.
>>> */
>>>                e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>>
>>>
>>>
>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>>>
>>>> Hi
>>>>
>>>> I have two identical servers, Sun Fire X4150, both has different
>>>> flavors of Linux, x86_64 and i386.
>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>> Ethernet Controller (Copper) (rev 01)
>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>> Ethernet Controller (Copper) (rev 01)
>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>> Ethernet Controller (rev 06)
>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>> Ethernet Controller (rev 06)
>>>> I am using now interface:
>>>> #ethtool -i eth0
>>>> driver: e1000e
>>>> version: 1.9.5-k
>>>> firmware-version: 2.1-11
>>>> bus-info: 0000:04:00.0
>>>> There is 2 CPU , Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz .
>>>>
>>>> i386 was acting as NAT and shaper, and as soon as i removed shaper
>>>> from it, i started to experience strange lockups, e.g. traffic is
>>>> normal for 5-30 seconds, then short lockup for 500-3000ms (usually
>>>> around 1000ms) with dropped packets counter increasing. I was
>>>> suspecting it is due load, but it seems was wrong.
>>>> Recently, on another server, x86_64 i am using as development, i
>>>> upgrade kernel (it was old, from 2.6 series) and on completely idle
>>>> machine started to experience same latency spikes, while i am just
>>>> running mc and for example typing in text editor - i notice "stalls".
>>>> After i investigate it a little more, i notice also small amount of
>>>> drops on interface. No tcpdump running. Also this machine is idle, and
>>>> the only traffic there - some small broadcasts from network, my ssh,
>>>> and ping.
>>>>
>>>> Dropped packets in ifconfig
>>>>          RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
>>>> Counter is increasing sometimes, when this stall happening.
>>>>
>>>> ethtool -S is clean, there is no dropped packets.
>>>>
>>>> I did tried to check load (mpstat and perf), there is nothing
>>>> suspicious, latencytop also doesn't show anything suspicious.
>>>> dropwatch report a lot of drops, but mostly because there is some
>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't show
>>>> anything suspicious.
>>>> Changed qdisc from default fifo_fast to bfifo, without any result.
>>>> Tried:  ethtool -K eth0 tso off gso off gro off sg off , no result
>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, but i
>>>> don't remember for sure. I thik on some kernels like 3.1 probably it
>>>> doesn't occur, i will check it soon, because it is not always reliable
>>>> to reproduce it. All tests i did on 3.4.0-rc7.
>>>>
>>>> I did run also in background tcpdump, additionally iptables with
>>>> timestamps, and at time when stall occured, seems i am still receiving
>>>> packets properly, also on iperf udp  (from some host to this SunFire)
>>>> at this moments no packets missing. But i am sure RX interface errors
>>>> are increasing.
>>>> If i do iperf from SunFire to test host - there is packetloss at
>>>> moments when stall occured.
>>>>
>>>> I suspect that by some reason network card stop to transmit, but
>>>> unable to pinpoint issue. All other hosts in this network are fine and
>>>> don't have such problems.
>>>> Can you help me with that please? Maybe i can provide more debug
>>>> information, compile with patches and etc. Also i will try to fallback
>>>> to 3.1 and 3.0 kernels.
>>>>
>>>> Here it is how it occurs and i am reproducing it:
>>>> I'm just opening file, and start to scroll it in mc, then in another
>>>> console i run ping
>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
>>>> ttl=64 time=0.485 ms
>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
>>>> ttl=64 time=0.470 ms
>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
>>>> ttl=64 time=424 ms
>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
>>>> ttl=64 time=324 ms
>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
>>>> ttl=64 time=214 ms
>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
>>>> ttl=64 time=104 ms
>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
>>>> ttl=64 time=1.15 ms
>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
>>>> ttl=64 time=0.477 ms
>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
>>>> ttl=64 time=0.667 ms
>>>>
>>>> Remote host tcpdump:
>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 163, length 1480
>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 164, length 1480
>>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 164, length 1480
>>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 165, length 1480
>>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 165, length 1480
>>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 166, length 1480
>>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 166, length 1480
>>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 167, length 1480
>>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 167, length 1480
>>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 168, length 1480
>>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 168, length 1480
>>>>
>>>> Local host(SunFire) tcpdump:
>>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 162, length 1480
>>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 163, length 1480
>>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 163, length 1480
>>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 164, length 1480
>>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 165, length 1480
>>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 166, length 1480
>>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 167, length 1480
>>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 164, length 1480
>>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 168, length 1480
>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 165, length 1480
>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 166, length 1480
>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 167, length 1480
>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 168, length 1480
>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>> request, id 3486, seq 169, length 1480
>>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>> reply, id 3486, seq 169, length 1480
>>>>
>>>> lspci -t
>>>> centaur src # lspci -t
>>>> -[0000:00]-+-00.0
>>>>           +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>>>>           |               |               \-02.0-[04]--+-00.0
>>>>           |               |                            \-00.1
>>>>           |               \-00.3-[05]--
>>>>           +-03.0-[06]--
>>>>           +-04.0-[07]----00.0
>>>>           +-05.0-[08]--
>>>>           +-06.0-[09]--
>>>>           +-07.0-[0a]--
>>>>           +-08.0
>>>>           +-10.0
>>>>           +-10.1
>>>>           +-10.2
>>>>           +-11.0
>>>>           +-13.0
>>>>           +-15.0
>>>>           +-16.0
>>>>           +-1c.0-[0b]--+-00.0
>>>>           |            \-00.1
>>>>           +-1d.0
>>>>           +-1d.1
>>>>           +-1d.2
>>>>           +-1d.3
>>>>           +-1d.7
>>>>           +-1e.0-[0c]----05.0
>>>>           +-1f.0
>>>>           +-1f.1
>>>>           +-1f.2
>>>>           \-1f.3
>>>> lspci
>>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
>>>> Controller Hub (rev b1)
>>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x4 Port 2 (rev b1)
>>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x4 Port 3 (rev b1)
>>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x8 Port 4-5 (rev b1)
>>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x4 Port 5 (rev b1)
>>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x8 Port 6-7 (rev b1)
>>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>> x4 Port 7 (rev b1)
>>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
>>>> Engine (rev b1)
>>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>> Registers (rev b1)
>>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>> Registers (rev b1)
>>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>> Registers (rev b1)
>>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>>> Registers (rev b1)
>>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>>> Registers (rev b1)
>>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>> Registers (rev b1)
>>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>> Registers (rev b1)
>>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>>> PCI Express Root Port 1 (rev 09)
>>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>> Chipset UHCI USB Controller #1 (rev 09)
>>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>> Chipset UHCI USB Controller #2 (rev 09)
>>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>> Chipset UHCI USB Controller #3 (rev 09)
>>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>> Chipset UHCI USB Controller #4 (rev 09)
>>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>> Chipset EHCI USB2 Controller (rev 09)
>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>>> LPC Interface Controller (rev 09)
>>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
>>>> Controller (rev 09)
>>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI
>>>> Controller (rev 09)
>>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus
>>>> Controller (rev 09)
>>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>> Upstream Port (rev 01)
>>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to
>>>> PCI-X Bridge (rev 01)
>>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>> Downstream Port E1 (rev 01)
>>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>> Downstream Port E3 (rev 01)
>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>> Ethernet Controller (Copper) (rev 01)
>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>> Ethernet Controller (Copper) (rev 01)
>>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>> Ethernet Controller (rev 06)
>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>> Ethernet Controller (rev 06)
>>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
>>>> Graphics Family
>>>>
>>>>
>>>> dmesg:
>>>> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
>>>> 7.3.21-k8-NAPI
>>>> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
>>>> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
>>>> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>>>> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width
>>>> x4) 00:1e:68:04:99:f8
>>>> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network
>>>> Connection
>>>> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No:
>>>> FFFFFF-0FF
>>>> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
>>>> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width
>>>> x4) 00:1e:68:04:99:f9
>>>> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network
>>>> Connection
>>>> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No:
>>>> FFFFFF-0FF
>>>> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
>>>> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>>>> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width
>>>> x4) 00:1e:68:04:99:fa
>>>> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network
>>>> Connection
>>>> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No:
>>>> C83246-002
>>>> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
>>>> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width
>>>> x4) 00:1e:68:04:99:fb
>>>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network
>>>> Connection
>>>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No:
>>>> C83246-002
>>>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
>>>> Flow Control: None
>>>>
>>>
>>>
>>> ---
>>> Network engineer
>>> Denys Fedoryshchenko
>>>
>>> Dora Highway - Center Cebaco - 2nd Floor
>>> Beirut, Lebanon
>>> Tel:    +961 1 247373
>>> E-Mail: denys@visp.net.lb
>>
>>
>> ---
>> Network engineer
>> Denys Fedoryshchenko
>>
>> Dora Highway - Center Cebaco - 2nd Floor
>> Beirut, Lebanon
>> Tel:    +961 1 247373
>> E-Mail: denys@visp.net.lb
>
>
> ---
> Network engineer
> Denys Fedoryshchenko
>
> Dora Highway - Center Cebaco - 2nd Floor
> Beirut, Lebanon
> Tel:    +961 1 247373
> E-Mail: denys@visp.net.lb
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko May 19, 2012, 2:29 a.m. UTC | #4
>
> Not expected, we've be running e1000e with BQL for a while.  A few 
> questions:
>
> 1) What kernel are you running?
> 2) What sort of traffic load are you using?
> 3) Have you tried a different interrupt mode?
>
1) Tested 3.3, 3.3.6, 3.4-rc7.
2) Easy to reproduce (a helper script is sketched below):

- Open 3 ssh consoles to the server with the problem.
- In the first console run "ping -s1472 -i0.0001 nearby.ip" (it helps to
trigger the problem quicker).
- In the second console run "ping -s1472 -i0.1 nearby.ip" (it is easy to see
when the network stalls - ping rises to 1000-2000ms).
- Open some large text file and scroll it down. After a while you will feel it
"stall"; just switch to the second console to check whether ping rose recently.

It also occurred on a loaded NAT server after I removed the shaper that was
limiting traffic to 350Mbps, but there the traffic pattern is much more
complicated.

3) Not yet, I ran it with the default parameters. I will try now.
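
A rough helper for the steps above (assumptions: iputils ping with -D support,
run as root because of the sub-millisecond interval, and the nearby IP passed
as the first argument; the mcedit scrolling over ssh still has to be done by
hand):

#!/bin/sh
# Usage: ./bql-repro.sh <nearby.ip>
TARGET=${1:?usage: $0 <nearby.ip>}
# Background "fast" ping that generates the TX load.
ping -q -s1472 -i0.0001 "$TARGET" >/dev/null &
FAST=$!
trap 'kill $FAST 2>/dev/null' EXIT INT
# Slow ping with kernel timestamps; print only replies slower than 100 ms.
ping -D -s1472 -i0.1 "$TARGET" | \
    awk -F'time=' 'NF > 1 { split($2, t, " "); if (t[1] + 0 > 100) print "STALL:", $0 }'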

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko May 19, 2012, 2:40 a.m. UTC | #5
On 2012-05-19 05:07, Tom Herbert wrote:
> 3) Have you tried a different interrupt mode?

Now tested:
e1000e.TxIntDelay=0,0,0,0
[    4.930570] e1000e 0000:04:00.0: (unregistered net_device): Transmit 
Interrupt Delay set to 0
[    4.930573] e1000e 0000:04:00.0: (unregistered net_device): 
Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
The problem still occurs.

e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0
[    4.971476] e1000e 0000:04:00.0: (unregistered net_device): Transmit 
Interrupt Delay set to 0
[    4.971478] e1000e 0000:04:00.0: (unregistered net_device): 
Interrupt Throttling Rate (ints/sec) turned off
The problem also occurs.
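
For reference, the same parameters can also be set persistently through
modprobe configuration instead of the kernel command line (a sketch, assuming
e1000e is built as a module; the file name is arbitrary):

cat > /etc/modprobe.d/e1000e-test.conf <<'EOF'
options e1000e TxIntDelay=0,0,0,0 InterruptThrottleRate=0,0,0,0
EOF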

Probably it is something specific to this machine; the problem happened on
the same hardware (Sun Fire X4150).
The only "exotic" thing here is the set of cards:
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
Ethernet Controller (Copper) (rev 01)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)

I'm using the first card.

>
> Per #3, I am wondering if dynamic conservative mode interrupt
> throttling is not producing interrupts deterministically for BQL.
> I'll try to reproduce the problem in this mode.
>
>> I can make a patch that will make minimum BQL value not less than 
>> MTU +
>> overhead, is it ok like this?
>> Probably it will solve issue, but it is more workaround and safety 
>> fuse,
>> than a solution.
>>
> That would just be a bandaid and could just make this a latent issue
> for the future.
>
> Tom
>
>>
>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>>>
>>> Also i notice, limit constantly changing over time (even i am not
>>> touching it).
>>>
>>> centaur ~ # grep "" 
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>
>>> 
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>> centaur ~ # grep "" 
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>
>>> 
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>>
>>> Is it supposed to be like this?
>>>
>>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
>>>>
>>>> Found commit that cause problem:
>>>>
>>>> author  Tom Herbert <therbert@google.com>
>>>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
>>>> committer       David S. Miller <davem@davemloft.net>
>>>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
>>>> commit  3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>>>> tree    d6670a4f94b2b9dedacc38edb6f0e1306b889f6b        tree | 
>>>> snapshot
>>>> parent  114cf5802165ee93e3ab461c9c505cd94a08b800        commit | 
>>>> diff
>>>> e1000e: Support for byte queue limits
>>>>
>>>> Changes to e1000e to use byte queue limits.
>>>>
>>>> Signed-off-by: Tom Herbert <therbert@google.com>
>>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>
>>>> If i reverse it, problem disappearing.
>>>>
>>>> How i reproduce it:
>>>> In two consoles do "fast" ping to nearby host
>>>> ping 194.146.XXX.XXX -s1472 -i0.0001
>>>> ping 194.146.XXX.XXX -s1472 -i0.1
>>>>
>>>> For third open ssh to host with "problem", open mcedit, and just
>>>> scroll down large text file.
>>>> After few seconds some "stalls" will occur, and in ping history i 
>>>> can
>>>> see:
>>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
>>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>>>>
>>>>
>>>> If i apply small patch - problem will disappear. Sure it is not a
>>>> solution, but
>>>> let me know how i can help to debug problem more.
>>>>
>>>> --- netdev.c    2012-05-12 20:08:37.000000000 +0300
>>>> +++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
>>>> @@ -1135,7 +1135,7 @@
>>>>
>>>>        tx_ring->next_to_clean = i;
>>>>
>>>> -       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>> +//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>>
>>>>  #define TX_WAKE_THRESHOLD 32
>>>>        if (count && netif_carrier_ok(netdev) &&
>>>> @@ -2263,7 +2263,7 @@
>>>>                e1000_put_txbuf(adapter, buffer_info);
>>>>        }
>>>>
>>>> -       netdev_reset_queue(adapter->netdev);
>>>> +//     netdev_reset_queue(adapter->netdev);
>>>>        size = sizeof(struct e1000_buffer) * tx_ring->count;
>>>>        memset(tx_ring->buffer_info, 0, size);
>>>>
>>>> @@ -5056,7 +5056,7 @@
>>>>        /* if count is 0 then mapping error has occurred */
>>>>        count = e1000_tx_map(adapter, skb, first, max_per_txd,
>>>> nr_frags, mss);
>>>>        if (count) {
>>>> -               netdev_sent_queue(netdev, skb->len);
>>>> +//             netdev_sent_queue(netdev, skb->len);
>>>>                e1000_tx_queue(adapter, tx_flags, count);
>>>>                /* Make sure there is space in the ring for the 
>>>> next send.
>>>> */
>>>>                e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>>>
>>>>
>>>>
>>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> I have two identical servers, Sun Fire X4150, both has different
>>>>> flavors of Linux, x86_64 and i386.
>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN 
>>>>> Gigabit
>>>>> Ethernet Controller (Copper) (rev 01)
>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN 
>>>>> Gigabit
>>>>> Ethernet Controller (Copper) (rev 01)
>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>> Ethernet Controller (rev 06)
>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>> Ethernet Controller (rev 06)
>>>>> I am using now interface:
>>>>> #ethtool -i eth0
>>>>> driver: e1000e
>>>>> version: 1.9.5-k
>>>>> firmware-version: 2.1-11
>>>>> bus-info: 0000:04:00.0
>>>>> There is 2 CPU , Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz 
>>>>> .
>>>>>
>>>>> i386 was acting as NAT and shaper, and as soon as i removed 
>>>>> shaper
>>>>> from it, i started to experience strange lockups, e.g. traffic is
>>>>> normal for 5-30 seconds, then short lockup for 500-3000ms 
>>>>> (usually
>>>>> around 1000ms) with dropped packets counter increasing. I was
>>>>> suspecting it is due load, but it seems was wrong.
>>>>> Recently, on another server, x86_64 i am using as development, i
>>>>> upgrade kernel (it was old, from 2.6 series) and on completely 
>>>>> idle
>>>>> machine started to experience same latency spikes, while i am 
>>>>> just
>>>>> running mc and for example typing in text editor - i notice 
>>>>> "stalls".
>>>>> After i investigate it a little more, i notice also small amount 
>>>>> of
>>>>> drops on interface. No tcpdump running. Also this machine is 
>>>>> idle, and
>>>>> the only traffic there - some small broadcasts from network, my 
>>>>> ssh,
>>>>> and ping.
>>>>>
>>>>> Dropped packets in ifconfig
>>>>>          RX packets:3752868 errors:0 dropped:5350 overruns:0 
>>>>> frame:0
>>>>> Counter is increasing sometimes, when this stall happening.
>>>>>
>>>>> ethtool -S is clean, there is no dropped packets.
>>>>>
>>>>> I did tried to check load (mpstat and perf), there is nothing
>>>>> suspicious, latencytop also doesn't show anything suspicious.
>>>>> dropwatch report a lot of drops, but mostly because there is some
>>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't 
>>>>> show
>>>>> anything suspicious.
>>>>> Changed qdisc from default fifo_fast to bfifo, without any 
>>>>> result.
>>>>> Tried:  ethtool -K eth0 tso off gso off gro off sg off , no 
>>>>> result
>>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, 
>>>>> but i
>>>>> don't remember for sure. I thik on some kernels like 3.1 probably 
>>>>> it
>>>>> doesn't occur, i will check it soon, because it is not always 
>>>>> reliable
>>>>> to reproduce it. All tests i did on 3.4.0-rc7.
>>>>>
>>>>> I did run also in background tcpdump, additionally iptables with
>>>>> timestamps, and at time when stall occured, seems i am still 
>>>>> receiving
>>>>> packets properly, also on iperf udp  (from some host to this 
>>>>> SunFire)
>>>>> at this moments no packets missing. But i am sure RX interface 
>>>>> errors
>>>>> are increasing.
>>>>> If i do iperf from SunFire to test host - there is packetloss at
>>>>> moments when stall occured.
>>>>>
>>>>> I suspect that by some reason network card stop to transmit, but
>>>>> unable to pinpoint issue. All other hosts in this network are 
>>>>> fine and
>>>>> don't have such problems.
>>>>> Can you help me with that please? Maybe i can provide more debug
>>>>> information, compile with patches and etc. Also i will try to 
>>>>> fallback
>>>>> to 3.1 and 3.0 kernels.
>>>>>
>>>>> Here it is how it occurs and i am reproducing it:
>>>>> I'm just opening file, and start to scroll it in mc, then in 
>>>>> another
>>>>> console i run ping
>>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
>>>>> ttl=64 time=0.485 ms
>>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
>>>>> ttl=64 time=0.470 ms
>>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
>>>>> ttl=64 time=424 ms
>>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
>>>>> ttl=64 time=324 ms
>>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
>>>>> ttl=64 time=214 ms
>>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
>>>>> ttl=64 time=104 ms
>>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
>>>>> ttl=64 time=1.15 ms
>>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
>>>>> ttl=64 time=0.477 ms
>>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
>>>>> ttl=64 time=0.667 ms
>>>>>
>>>>> Remote host tcpdump:
>>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 163, length 1480
>>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 164, length 1480
>>>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 164, length 1480
>>>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 165, length 1480
>>>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 165, length 1480
>>>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 166, length 1480
>>>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 166, length 1480
>>>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 167, length 1480
>>>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 167, length 1480
>>>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 168, length 1480
>>>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 168, length 1480
>>>>>
>>>>> Local host(SunFire) tcpdump:
>>>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 162, length 1480
>>>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 163, length 1480
>>>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 163, length 1480
>>>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 164, length 1480
>>>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 165, length 1480
>>>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 166, length 1480
>>>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 167, length 1480
>>>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 164, length 1480
>>>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 168, length 1480
>>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 165, length 1480
>>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 166, length 1480
>>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 167, length 1480
>>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 168, length 1480
>>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>> request, id 3486, seq 169, length 1480
>>>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>> reply, id 3486, seq 169, length 1480
>>>>>
>>>>> lspci -t
>>>>> centaur src # lspci -t
>>>>> -[0000:00]-+-00.0
>>>>>           +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>>>>>           |               |               \-02.0-[04]--+-00.0
>>>>>           |               |                            \-00.1
>>>>>           |               \-00.3-[05]--
>>>>>           +-03.0-[06]--
>>>>>           +-04.0-[07]----00.0
>>>>>           +-05.0-[08]--
>>>>>           +-06.0-[09]--
>>>>>           +-07.0-[0a]--
>>>>>           +-08.0
>>>>>           +-10.0
>>>>>           +-10.1
>>>>>           +-10.2
>>>>>           +-11.0
>>>>>           +-13.0
>>>>>           +-15.0
>>>>>           +-16.0
>>>>>           +-1c.0-[0b]--+-00.0
>>>>>           |            \-00.1
>>>>>           +-1d.0
>>>>>           +-1d.1
>>>>>           +-1d.2
>>>>>           +-1d.3
>>>>>           +-1d.7
>>>>>           +-1e.0-[0c]----05.0
>>>>>           +-1f.0
>>>>>           +-1f.1
>>>>>           +-1f.2
>>>>>           \-1f.3
>>>>> lspci
>>>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
>>>>> Controller Hub (rev b1)
>>>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x4 Port 2 (rev b1)
>>>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x4 Port 3 (rev b1)
>>>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x8 Port 4-5 (rev b1)
>>>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x4 Port 5 (rev b1)
>>>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x8 Port 6-7 (rev b1)
>>>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>>>> Express
>>>>> x4 Port 7 (rev b1)
>>>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset 
>>>>> DMA
>>>>> Engine (rev b1)
>>>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>> Registers (rev b1)
>>>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>> Registers (rev b1)
>>>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>> Registers (rev b1)
>>>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset 
>>>>> Reserved
>>>>> Registers (rev b1)
>>>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset 
>>>>> Reserved
>>>>> Registers (rev b1)
>>>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>>> Registers (rev b1)
>>>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>>> Registers (rev b1)
>>>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 
>>>>> Chipset
>>>>> PCI Express Root Port 1 (rev 09)
>>>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>> Chipset UHCI USB Controller #1 (rev 09)
>>>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>> Chipset UHCI USB Controller #2 (rev 09)
>>>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>> Chipset UHCI USB Controller #3 (rev 09)
>>>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>> Chipset UHCI USB Controller #4 (rev 09)
>>>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>> Chipset EHCI USB2 Controller (rev 09)
>>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>>>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 
>>>>> Chipset
>>>>> LPC Interface Controller (rev 09)
>>>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
>>>>> Controller (rev 09)
>>>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA 
>>>>> AHCI
>>>>> Controller (rev 09)
>>>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset 
>>>>> SMBus
>>>>> Controller (rev 09)
>>>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>> Upstream Port (rev 01)
>>>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express 
>>>>> to
>>>>> PCI-X Bridge (rev 01)
>>>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>> Downstream Port E1 (rev 01)
>>>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>> Downstream Port E3 (rev 01)
>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN 
>>>>> Gigabit
>>>>> Ethernet Controller (Copper) (rev 01)
>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN 
>>>>> Gigabit
>>>>> Ethernet Controller (Copper) (rev 01)
>>>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>> Ethernet Controller (rev 06)
>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>> Ethernet Controller (rev 06)
>>>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
>>>>> Graphics Family
>>>>>
>>>>>
>>>>> dmesg:
>>>>> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
>>>>> 7.3.21-k8-NAPI
>>>>> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
>>>>> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
>>>>> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel 
>>>>> Corporation.
>>>>> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>>>> mode
>>>>> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI 
>>>>> Express:2.5GT/s:Width
>>>>> x4) 00:1e:68:04:99:f8
>>>>> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 
>>>>> Network
>>>>> Connection
>>>>> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No:
>>>>> FFFFFF-0FF
>>>>> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>>>> mode
>>>>> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
>>>>> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI 
>>>>> Express:2.5GT/s:Width
>>>>> x4) 00:1e:68:04:99:f9
>>>>> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 
>>>>> Network
>>>>> Connection
>>>>> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No:
>>>>> FFFFFF-0FF
>>>>> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
>>>>> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>>>> mode
>>>>> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>>>>> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI 
>>>>> Express:2.5GT/s:Width
>>>>> x4) 00:1e:68:04:99:fa
>>>>> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 
>>>>> Network
>>>>> Connection
>>>>> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No:
>>>>> C83246-002
>>>>> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
>>>>> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>>>> mode
>>>>> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>>>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI 
>>>>> Express:2.5GT/s:Width
>>>>> x4) 00:1e:68:04:99:fb
>>>>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 
>>>>> Network
>>>>> Connection
>>>>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No:
>>>>> C83246-002
>>>>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
>>>>> Flow Control: None
>>>>>
>>>>
>>>>
>>>> ---
>>>> Network engineer
>>>> Denys Fedoryshchenko
>>>>
>>>> Dora Highway - Center Cebaco - 2nd Floor
>>>> Beirut, Lebanon
>>>> Tel:    +961 1 247373
>>>> E-Mail: denys@visp.net.lb
>>>
>>>
>>> ---
>>> Network engineer
>>> Denys Fedoryshchenko
>>>
>>> Dora Highway - Center Cebaco - 2nd Floor
>>> Beirut, Lebanon
>>> Tel:    +961 1 247373
>>> E-Mail: denys@visp.net.lb
>>
>>
>> ---
>> Network engineer
>> Denys Fedoryshchenko
>>
>> Dora Highway - Center Cebaco - 2nd Floor
>> Beirut, Lebanon
>> Tel:    +961 1 247373
>> E-Mail: denys@visp.net.lb

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert May 20, 2012, 5:40 p.m. UTC | #6
Tried to reproduce:

May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0:
(unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
dynamic conservative mode
May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1:
(unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
dynamic conservative mode

06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)

Following the above instructions to reproduce gives:

1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 time=0.358 ms
1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 time=0.330 ms
1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 time=0.337 ms
1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 time=0.375 ms
1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 time=0.359 ms
1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880
ttl=64 time=0.380 ms

And I didn't see the stalls. This was on an Intel machine.  The limit
was stable; it went up to around 28K when I opened a large file and tended to
stay between 15K and 28K.

The described problem has the characteristic that transmit
interrupts are not at all periodic, and it would seem that some are
taking hundreds of milliseconds to pop.  I don't see anything that
would cause that in the NIC. Is it possible there is some activity on
the machine that periodically holds off interrupts for long periods
of time?  Are there any peculiarities in interrupt handling on the
Sun Fire?

Can you also provide the output of 'ethtool -c eth0'?
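
Something like the following, captured once during a stall and once during
normal operation, would also help correlate the coalescing settings with the
interrupt rate and the BQL state (eth0 assumed; the /proc/interrupts line name
may vary):

ethtool -c eth0                                     # interrupt coalescing settings
grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
grep eth0 /proc/interrupts; sleep 1; grep eth0 /proc/interrupts   # rough interrupts per second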

Thanks,
Tom




On Fri, May 18, 2012 at 7:40 PM, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
> On 2012-05-19 05:07, Tom Herbert wrote:
>>
>> 3) Have you tried a different interrupt mode?
>
>
> Now tested:
> e1000e.TxIntDelay=0,0,0,0
> [    4.930570] e1000e 0000:04:00.0: (unregistered net_device): Transmit
> Interrupt Delay set to 0
> [    4.930573] e1000e 0000:04:00.0: (unregistered net_device): Interrupt
> Throttling Rate (ints/sec) set to dynamic conservative mode
> Problem still occur.
>
> e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0
> [    4.971476] e1000e 0000:04:00.0: (unregistered net_device): Transmit
> Interrupt Delay set to 0
> [    4.971478] e1000e 0000:04:00.0: (unregistered net_device): Interrupt
> Throttling Rate (ints/sec) turned off
> Problem also occur.
>
> Probably it is a case of this specific machine, this problem is happened on
> same hardware (Sun Fire X4150).
> The only "exotic" thing here is set of cards:
>
> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet
> Controller (Copper) (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet
> Controller (Copper) (rev 01)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
> Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
> Controller (rev 06)
>
> I'm using first card.
>
>
>>
>> Per #3, I am wondering if dynamic conservative mode interrupt
>> throttling is not producing interrupts deterministically for BQL.
>> I'll try to reproduce the problem in this mode.
>>
>>> I can make a patch that will make minimum BQL value not less than MTU +
>>> overhead, is it ok like this?
>>> Probably it will solve issue, but it is more workaround and safety fuse,
>>> than a solution.
>>>
>> That would just be a bandaid and could just make this a latent issue
>> for the future.
>>
>> Tom
>>
>>>
>>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>>>>
>>>>
>>>> Also i notice, limit constantly changing over time (even i am not
>>>> touching it).
>>>>
>>>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>>
>>>>
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>>
>>>>
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>>>
>>>> Is it supposed to be like this?
>>>>
>>>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
>>>>>
>>>>>
>>>>> Found commit that cause problem:
>>>>>
>>>>> author  Tom Herbert <therbert@google.com>
>>>>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
>>>>> committer       David S. Miller <davem@davemloft.net>
>>>>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
>>>>> commit  3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>>>>> tree    d6670a4f94b2b9dedacc38edb6f0e1306b889f6b        tree | snapshot
>>>>> parent  114cf5802165ee93e3ab461c9c505cd94a08b800        commit | diff
>>>>> e1000e: Support for byte queue limits
>>>>>
>>>>> Changes to e1000e to use byte queue limits.
>>>>>
>>>>> Signed-off-by: Tom Herbert <therbert@google.com>
>>>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>>>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>>
>>>>> If i reverse it, problem disappearing.
>>>>>
>>>>> How i reproduce it:
>>>>> In two consoles do "fast" ping to nearby host
>>>>> ping 194.146.XXX.XXX -s1472 -i0.0001
>>>>> ping 194.146.XXX.XXX -s1472 -i0.1
>>>>>
>>>>> For third open ssh to host with "problem", open mcedit, and just
>>>>> scroll down large text file.
>>>>> After few seconds some "stalls" will occur, and in ping history i can
>>>>> see:
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
>>>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>>>>>
>>>>>
>>>>> If i apply small patch - problem will disappear. Sure it is not a
>>>>> solution, but
>>>>> let me know how i can help to debug problem more.
>>>>>
>>>>> --- netdev.c    2012-05-12 20:08:37.000000000 +0300
>>>>> +++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
>>>>> @@ -1135,7 +1135,7 @@
>>>>>
>>>>>        tx_ring->next_to_clean = i;
>>>>>
>>>>> -       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>>> +//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>>>
>>>>>  #define TX_WAKE_THRESHOLD 32
>>>>>        if (count && netif_carrier_ok(netdev) &&
>>>>> @@ -2263,7 +2263,7 @@
>>>>>                e1000_put_txbuf(adapter, buffer_info);
>>>>>        }
>>>>>
>>>>> -       netdev_reset_queue(adapter->netdev);
>>>>> +//     netdev_reset_queue(adapter->netdev);
>>>>>        size = sizeof(struct e1000_buffer) * tx_ring->count;
>>>>>        memset(tx_ring->buffer_info, 0, size);
>>>>>
>>>>> @@ -5056,7 +5056,7 @@
>>>>>        /* if count is 0 then mapping error has occurred */
>>>>>        count = e1000_tx_map(adapter, skb, first, max_per_txd,
>>>>> nr_frags, mss);
>>>>>        if (count) {
>>>>> -               netdev_sent_queue(netdev, skb->len);
>>>>> +//             netdev_sent_queue(netdev, skb->len);
>>>>>                e1000_tx_queue(adapter, tx_flags, count);
>>>>>                /* Make sure there is space in the ring for the next
>>>>> send.
>>>>> */
>>>>>                e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>>>>
>>>>>
>>>>>
>>>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I have two identical servers, Sun Fire X4150, both has different
>>>>>> flavors of Linux, x86_64 and i386.
>>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>>>> Ethernet Controller (Copper) (rev 01)
>>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>>>> Ethernet Controller (Copper) (rev 01)
>>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>>> Ethernet Controller (rev 06)
>>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>>> Ethernet Controller (rev 06)
>>>>>> I am using now interface:
>>>>>> #ethtool -i eth0
>>>>>> driver: e1000e
>>>>>> version: 1.9.5-k
>>>>>> firmware-version: 2.1-11
>>>>>> bus-info: 0000:04:00.0
>>>>>> There is 2 CPU , Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz .
>>>>>>
>>>>>> i386 was acting as NAT and shaper, and as soon as i removed shaper
>>>>>> from it, i started to experience strange lockups, e.g. traffic is
>>>>>> normal for 5-30 seconds, then short lockup for 500-3000ms (usually
>>>>>> around 1000ms) with dropped packets counter increasing. I was
>>>>>> suspecting it is due load, but it seems was wrong.
>>>>>> Recently, on another server, x86_64 i am using as development, i
>>>>>> upgrade kernel (it was old, from 2.6 series) and on completely idle
>>>>>> machine started to experience same latency spikes, while i am just
>>>>>> running mc and for example typing in text editor - i notice "stalls".
>>>>>> After i investigate it a little more, i notice also small amount of
>>>>>> drops on interface. No tcpdump running. Also this machine is idle, and
>>>>>> the only traffic there - some small broadcasts from network, my ssh,
>>>>>> and ping.
>>>>>>
>>>>>> Dropped packets in ifconfig
>>>>>>          RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
>>>>>> Counter is increasing sometimes, when this stall happening.
>>>>>>
>>>>>> ethtool -S is clean, there is no dropped packets.
>>>>>>
>>>>>> I did tried to check load (mpstat and perf), there is nothing
>>>>>> suspicious, latencytop also doesn't show anything suspicious.
>>>>>> dropwatch report a lot of drops, but mostly because there is some
>>>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't show
>>>>>> anything suspicious.
>>>>>> Changed qdisc from default fifo_fast to bfifo, without any result.
>>>>>> Tried:  ethtool -K eth0 tso off gso off gro off sg off , no result
>>>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, but i
>>>>>> don't remember for sure. I thik on some kernels like 3.1 probably it
>>>>>> doesn't occur, i will check it soon, because it is not always reliable
>>>>>> to reproduce it. All tests i did on 3.4.0-rc7.
>>>>>>
>>>>>> I did run also in background tcpdump, additionally iptables with
>>>>>> timestamps, and at time when stall occured, seems i am still receiving
>>>>>> packets properly, also on iperf udp  (from some host to this SunFire)
>>>>>> at this moments no packets missing. But i am sure RX interface errors
>>>>>> are increasing.
>>>>>> If i do iperf from SunFire to test host - there is packetloss at
>>>>>> moments when stall occured.
>>>>>>
>>>>>> I suspect that by some reason network card stop to transmit, but
>>>>>> unable to pinpoint issue. All other hosts in this network are fine and
>>>>>> don't have such problems.
>>>>>> Can you help me with that please? Maybe i can provide more debug
>>>>>> information, compile with patches and etc. Also i will try to fallback
>>>>>> to 3.1 and 3.0 kernels.
>>>>>>
>>>>>> Here it is how it occurs and i am reproducing it:
>>>>>> I'm just opening file, and start to scroll it in mc, then in another
>>>>>> console i run ping
>>>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
>>>>>> ttl=64 time=0.485 ms
>>>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
>>>>>> ttl=64 time=0.470 ms
>>>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
>>>>>> ttl=64 time=424 ms
>>>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
>>>>>> ttl=64 time=324 ms
>>>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
>>>>>> ttl=64 time=214 ms
>>>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
>>>>>> ttl=64 time=104 ms
>>>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
>>>>>> ttl=64 time=1.15 ms
>>>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
>>>>>> ttl=64 time=0.477 ms
>>>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
>>>>>> ttl=64 time=0.667 ms
>>>>>>
>>>>>> Remote host tcpdump:
>>>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 163, length 1480
>>>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 164, length 1480
>>>>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 164, length 1480
>>>>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 165, length 1480
>>>>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 165, length 1480
>>>>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 166, length 1480
>>>>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 166, length 1480
>>>>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 167, length 1480
>>>>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 167, length 1480
>>>>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 168, length 1480
>>>>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 168, length 1480
>>>>>>
>>>>>> Local host(SunFire) tcpdump:
>>>>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 162, length 1480
>>>>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 163, length 1480
>>>>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 163, length 1480
>>>>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 164, length 1480
>>>>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 165, length 1480
>>>>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 166, length 1480
>>>>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 167, length 1480
>>>>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 164, length 1480
>>>>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 168, length 1480
>>>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 165, length 1480
>>>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 166, length 1480
>>>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 167, length 1480
>>>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 168, length 1480
>>>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>>>>> request, id 3486, seq 169, length 1480
>>>>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>>>>> reply, id 3486, seq 169, length 1480
>>>>>>
>>>>>> lspci -t
>>>>>> centaur src # lspci -t
>>>>>> -[0000:00]-+-00.0
>>>>>>           +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>>>>>>           |               |               \-02.0-[04]--+-00.0
>>>>>>           |               |                            \-00.1
>>>>>>           |               \-00.3-[05]--
>>>>>>           +-03.0-[06]--
>>>>>>           +-04.0-[07]----00.0
>>>>>>           +-05.0-[08]--
>>>>>>           +-06.0-[09]--
>>>>>>           +-07.0-[0a]--
>>>>>>           +-08.0
>>>>>>           +-10.0
>>>>>>           +-10.1
>>>>>>           +-10.2
>>>>>>           +-11.0
>>>>>>           +-13.0
>>>>>>           +-15.0
>>>>>>           +-16.0
>>>>>>           +-1c.0-[0b]--+-00.0
>>>>>>           |            \-00.1
>>>>>>           +-1d.0
>>>>>>           +-1d.1
>>>>>>           +-1d.2
>>>>>>           +-1d.3
>>>>>>           +-1d.7
>>>>>>           +-1e.0-[0c]----05.0
>>>>>>           +-1f.0
>>>>>>           +-1f.1
>>>>>>           +-1f.2
>>>>>>           \-1f.3
>>>>>> lspci
>>>>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
>>>>>> Controller Hub (rev b1)
>>>>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x4 Port 2 (rev b1)
>>>>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x4 Port 3 (rev b1)
>>>>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x8 Port 4-5 (rev b1)
>>>>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x4 Port 5 (rev b1)
>>>>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x8 Port 6-7 (rev b1)
>>>>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express
>>>>>> x4 Port 7 (rev b1)
>>>>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
>>>>>> Engine (rev b1)
>>>>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>>> Registers (rev b1)
>>>>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>>> Registers (rev b1)
>>>>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>>>>> Registers (rev b1)
>>>>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>>>>> Registers (rev b1)
>>>>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>>>>> Registers (rev b1)
>>>>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>>>> Registers (rev b1)
>>>>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>>>>> Registers (rev b1)
>>>>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>>>>> PCI Express Root Port 1 (rev 09)
>>>>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>>> Chipset UHCI USB Controller #1 (rev 09)
>>>>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>>> Chipset UHCI USB Controller #2 (rev 09)
>>>>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>>> Chipset UHCI USB Controller #3 (rev 09)
>>>>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>>> Chipset UHCI USB Controller #4 (rev 09)
>>>>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
>>>>>> Chipset EHCI USB2 Controller (rev 09)
>>>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>>>>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>>>>> LPC Interface Controller (rev 09)
>>>>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
>>>>>> Controller (rev 09)
>>>>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI
>>>>>> Controller (rev 09)
>>>>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus
>>>>>> Controller (rev 09)
>>>>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>>> Upstream Port (rev 01)
>>>>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to
>>>>>> PCI-X Bridge (rev 01)
>>>>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>>> Downstream Port E1 (rev 01)
>>>>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>>>>> Downstream Port E3 (rev 01)
>>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>>>> Ethernet Controller (Copper) (rev 01)
>>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>>>>> Ethernet Controller (Copper) (rev 01)
>>>>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
>>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>>> Ethernet Controller (rev 06)
>>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>>>> Ethernet Controller (rev 06)
>>>>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
>>>>>> Graphics Family
>>>>>>
>>>>>>
>>>>>> dmesg:
>>>>>> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
>>>>>> 7.3.21-k8-NAPI
>>>>>> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
>>>>>> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
>>>>>> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>>>>>> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
>>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>>>> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>>> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width
>>>>>> x4) 00:1e:68:04:99:f8
>>>>>> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network
>>>>>> Connection
>>>>>> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No:
>>>>>> FFFFFF-0FF
>>>>>> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
>>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>>>> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
>>>>>> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width
>>>>>> x4) 00:1e:68:04:99:f9
>>>>>> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network
>>>>>> Connection
>>>>>> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No:
>>>>>> FFFFFF-0FF
>>>>>> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
>>>>>> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
>>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>>>> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>>>>>> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width
>>>>>> x4) 00:1e:68:04:99:fa
>>>>>> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network
>>>>>> Connection
>>>>>> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No:
>>>>>> C83246-002
>>>>>> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
>>>>>> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
>>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>>>> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>>>>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width
>>>>>> x4) 00:1e:68:04:99:fb
>>>>>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network
>>>>>> Connection
>>>>>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No:
>>>>>> C83246-002
>>>>>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
>>>>>> Flow Control: None
>>>>>>
>
>
> ---
> Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Eric Dumazet May 20, 2012, 6:13 p.m. UTC | #7
On Fri, 2012-05-18 at 17:04 +0300, Denys Fedoryshchenko wrote:
> It seems logic in BQL has serious issues. The most bad thing, if 

> and it is empty. So in result, instead of eliminating latency, it is 
> adding it.

There is maybe a misunderstanding here. BQL by itself only reduces the amount
of buffering in the TX ring; the Qdisc layer still sits on top of it.
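
(If it helps to see where the queueing moves, two standard commands, assuming
eth0 is the interface under test, show both layers at once: the qdisc backlog
and drop counters on one side, and the BQL inflight/limit values on the other.
With BQL doing its job you would expect the backlog to build in the qdisc
rather than in the TX ring.)

# tc -s qdisc show dev eth0
# grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*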

On my laptop (e1000e NIC), with the default pfifo_fast qdisc, I can
immediately see the difference in ping results if I launch a single netperf
TCP_STREAM in another window:

1480 bytes from 172.30.42.1: icmp_seq=5047 ttl=64 time=0.185 ms
1480 bytes from 172.30.42.1: icmp_seq=5048 ttl=64 time=0.198 ms
1480 bytes from 172.30.42.1: icmp_seq=5049 ttl=64 time=0.255 ms
1480 bytes from 172.30.42.1: icmp_seq=5050 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=5051 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=5052 ttl=64 time=0.226 ms
1480 bytes from 172.30.42.1: icmp_seq=5053 ttl=64 time=0.186 ms
1480 bytes from 172.30.42.1: icmp_seq=5054 ttl=64 time=0.190 ms
1480 bytes from 172.30.42.1: icmp_seq=5055 ttl=64 time=0.223 ms
1480 bytes from 172.30.42.1: icmp_seq=5056 ttl=64 time=0.178 ms
1480 bytes from 172.30.42.1: icmp_seq=5057 ttl=64 time=0.191 ms
1480 bytes from 172.30.42.1: icmp_seq=5058 ttl=64 time=0.229 ms
1480 bytes from 172.30.42.1: icmp_seq=5059 ttl=64 time=0.229 ms
1480 bytes from 172.30.42.1: icmp_seq=5060 ttl=64 time=0.213 ms
1480 bytes from 172.30.42.1: icmp_seq=5061 ttl=64 time=0.187 ms
1480 bytes from 172.30.42.1: icmp_seq=5062 ttl=64 time=0.944 ms
1480 bytes from 172.30.42.1: icmp_seq=5063 ttl=64 time=4.67 ms
1480 bytes from 172.30.42.1: icmp_seq=5064 ttl=64 time=8.55 ms
1480 bytes from 172.30.42.1: icmp_seq=5065 ttl=64 time=12.8 ms
1480 bytes from 172.30.42.1: icmp_seq=5066 ttl=64 time=16.9 ms
1480 bytes from 172.30.42.1: icmp_seq=5067 ttl=64 time=20.5 ms
1480 bytes from 172.30.42.1: icmp_seq=5068 ttl=64 time=24.3 ms
1480 bytes from 172.30.42.1: icmp_seq=5069 ttl=64 time=27.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5070 ttl=64 time=26.8 ms
1480 bytes from 172.30.42.1: icmp_seq=5071 ttl=64 time=26.7 ms
1480 bytes from 172.30.42.1: icmp_seq=5072 ttl=64 time=26.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5073 ttl=64 time=27.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5074 ttl=64 time=25.6 ms
1480 bytes from 172.30.42.1: icmp_seq=5075 ttl=64 time=26.7 ms
1480 bytes from 172.30.42.1: icmp_seq=5076 ttl=64 time=26.9 ms
1480 bytes from 172.30.42.1: icmp_seq=5077 ttl=64 time=25.0 ms


Now if I replace pfifo_fast with fq_codel (TSO and GSO still on):
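
(For reference, a sketch of how the swap can be done, assuming eth0 and a
kernel recent enough to have the fq_codel qdisc:)

# tc qdisc replace dev eth0 root fq_codel
# tc -s qdisc show dev eth0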

1480 bytes from 172.30.42.1: icmp_seq=37 ttl=64 time=0.197 ms
1480 bytes from 172.30.42.1: icmp_seq=38 ttl=64 time=0.283 ms
1480 bytes from 172.30.42.1: icmp_seq=39 ttl=64 time=0.183 ms
1480 bytes from 172.30.42.1: icmp_seq=40 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=41 ttl=64 time=0.207 ms
1480 bytes from 172.30.42.1: icmp_seq=42 ttl=64 time=0.202 ms
1480 bytes from 172.30.42.1: icmp_seq=43 ttl=64 time=0.244 ms
1480 bytes from 172.30.42.1: icmp_seq=44 ttl=64 time=0.200 ms
1480 bytes from 172.30.42.1: icmp_seq=45 ttl=64 time=0.212 ms
1480 bytes from 172.30.42.1: icmp_seq=46 ttl=64 time=0.215 ms
1480 bytes from 172.30.42.1: icmp_seq=47 ttl=64 time=0.178 ms
1480 bytes from 172.30.42.1: icmp_seq=48 ttl=64 time=0.215 ms
1480 bytes from 172.30.42.1: icmp_seq=49 ttl=64 time=0.238 ms
1480 bytes from 172.30.42.1: icmp_seq=50 ttl=64 time=0.210 ms
1480 bytes from 172.30.42.1: icmp_seq=51 ttl=64 time=0.974 ms   
1480 bytes from 172.30.42.1: icmp_seq=52 ttl=64 time=1.93 ms
1480 bytes from 172.30.42.1: icmp_seq=53 ttl=64 time=1.71 ms
1480 bytes from 172.30.42.1: icmp_seq=54 ttl=64 time=3.07 ms
1480 bytes from 172.30.42.1: icmp_seq=55 ttl=64 time=3.47 ms
1480 bytes from 172.30.42.1: icmp_seq=56 ttl=64 time=2.53 ms
1480 bytes from 172.30.42.1: icmp_seq=57 ttl=64 time=3.24 ms
1480 bytes from 172.30.42.1: icmp_seq=58 ttl=64 time=3.11 ms
1480 bytes from 172.30.42.1: icmp_seq=59 ttl=64 time=3.03 ms
1480 bytes from 172.30.42.1: icmp_seq=60 ttl=64 time=2.96 ms
1480 bytes from 172.30.42.1: icmp_seq=61 ttl=64 time=2.90 ms
1480 bytes from 172.30.42.1: icmp_seq=62 ttl=64 time=2.97 ms


Denys Fedoryshchenko May 20, 2012, 6:53 p.m. UTC | #8
>
> And I didn't see the stalls. This was on an Intel machine. The limit
> was stable, went up to around 28K when I opened a large file, and tended
> to stay between 15-28K.
>
> The described problem seems to have the characteristic that transmit
> interrupts are not at all periodic, and it would seem that some are
> taking hundreds of milliseconds to pop. I don't see anything that
> would cause that in the NIC; is it possible there is some activity on
> the machines periodically and often holding down interrupts for long
> periods of time? Are there any peculiarities on Sun Fire in interrupt
> handling?
>
> Can you also provide an 'ethtool -c eth0'?
>
> Thanks,
> Tom
>
Technically I can install a fresh Gentoo with the latest vanilla kernel there
and provide remote access, but it will take some time. I will try playing
with IRQ pinning now, but I am not sure it will get me anywhere.
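
(What I have in mind is only the usual /proc interface, for example pinning
eth0's vector to CPU0; the irq 65 below is an assumption taken from the dmesg
quoted earlier and should be re-checked in /proc/interrupts first:)

# grep eth0 /proc/interrupts
# echo 1 > /proc/irq/65/smp_affinity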

My kernel config http://www.nuclearcat.com/config_stall.txt

It is a completely idle machine. I didn't notice any anomalies with
interrupts compared with other servers (not Sun Fire). If there were
interrupt anomalies (for example SMI), I should probably notice them even
without BQL, but this machine works very well and I didn't notice any latency
without BQL.

centaur ~ # pidstat 1
Linux 3.4.0-rc7-centaur (centaur)       05/20/12        _x86_64_        
(8 CPU)

21:41:25          PID    %usr %system  %guest    %CPU   CPU  Command

21:41:26          PID    %usr %system  %guest    %CPU   CPU  Command
21:41:27         2617    0.00    1.00    0.00    1.00     2  pidstat

21:41:27          PID    %usr %system  %guest    %CPU   CPU  Command

21:41:28          PID    %usr %system  %guest    %CPU   CPU  Command

21:41:29          PID    %usr %system  %guest    %CPU   CPU  Command
21:41:30         2617    0.00    1.00    0.00    1.00     2  pidstat

From powertop:
Wakeups-from-idle per second :  7.5     interval: 10.0s


Sure, I can provide ethtool -c eth0; these are the default values:
centaur ~ # ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0



>
>
>
> On Fri, May 18, 2012 at 7:40 PM, Denys Fedoryshchenko
> <denys@visp.net.lb> wrote:
>> On 2012-05-19 05:07, Tom Herbert wrote:
>>>
>>> 3) Have you tried a different interrupt mode?
>>
>>
>> Now tested:
>> e1000e.TxIntDelay=0,0,0,0
>> [    4.930570] e1000e 0000:04:00.0: (unregistered net_device): 
>> Transmit
>> Interrupt Delay set to 0
>> [    4.930573] e1000e 0000:04:00.0: (unregistered net_device): 
>> Interrupt
>> Throttling Rate (ints/sec) set to dynamic conservative mode
>> Problem still occur.
>>
>> e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0
>> [    4.971476] e1000e 0000:04:00.0: (unregistered net_device): 
>> Transmit
>> Interrupt Delay set to 0
>> [    4.971478] e1000e 0000:04:00.0: (unregistered net_device): 
>> Interrupt
>> Throttling Rate (ints/sec) turned off
>> Problem also occur.
>>
>> Probably it is specific to this machine; this problem happened on the
>> same hardware (Sun Fire X4150).
>> The only "exotic" thing here is the set of cards:
>>
>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
>> Ethernet
>> Controller (Copper) (rev 01)
>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
>> Ethernet
>> Controller (Copper) (rev 01)
>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit 
>> Ethernet
>> Controller (rev 06)
>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit 
>> Ethernet
>> Controller (rev 06)
>>
>> I'm using first card.
>>
>>
>>>
>>> Per #3, I am wondering if dynamic conservative mode interrupt
>>> throttling is not producing interrupts deterministically for BQL.
>>> I'll try to reproduce the problem in this mode.
>>>
>>>> I can make a patch that will make minimum BQL value not less than 
>>>> MTU +
>>>> overhead, is it ok like this?
>>>> Probably it will solve issue, but it is more workaround and safety 
>>>> fuse,
>>>> than a solution.
>>>>
>>> That would just be a bandaid and could just make this a latent 
>>> issue
>>> for the future.
>>>
>>> Tom
>>>
>>>>
>>>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>>>>>
>>>>>
>>>>> Also i notice, limit constantly changing over time (even i am not
>>>>> touching it).
>>>>>
>>>>> centaur ~ # grep "" 
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>>>
>>>>>
>>>>> 
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>>>> centaur ~ # grep "" 
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>>>>>
>>>>>
>>>>> 
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>>>>
>>>>> Is it supposed to be like this?
>>>>>
>> ---
>> Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Eric Dumazet May 20, 2012, 7:07 p.m. UTC | #9
On Sun, 2012-05-20 at 21:53 +0300, Denys Fedoryshchenko wrote:
> >

> My kernel config http://www.nuclearcat.com/config_stall.txt
> 
> It is completely idle machine. I didn't notice any anomalies with 
> interrutps to
> compare with other servers (not Sun Fire). If there is anomalies with 
> interrupts
> (for example SMI), probably i should notice that even without BQL, but 
> this machine
> works very well and i didn't notice any latency without BQL.

You could try latencytop, I am not sure if some obvious things will
popup.



Denys Fedoryshchenko May 20, 2012, 7:18 p.m. UTC | #10
On 2012-05-20 22:07, Eric Dumazet wrote:
>
> You could try latencytop, I am not sure if some obvious things will
> popup.
For sure I did. Nothing unusual here, max 5 ms latency:
Cause                                                Maximum    Percentage
[__skb_recv_datagram]                                4.1 msec       45.4 %
Waiting for event (select)                           3.9 msec       54.6 %
Page fault                                           0.7 msec        0.0 %
Waiting for a process to die                         0.3 msec        0.0 %
Writing data to TTY                                  0.1 msec        0.0 %

I will also try to find an SMI interrupt detector (if I can find the source),
but I don't think it is an issue.



---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Eric Dumazet May 21, 2012, 3:56 a.m. UTC | #11
On Sun, 2012-05-20 at 22:18 +0300, Denys Fedoryshchenko wrote:
> On 2012-05-20 22:07, Eric Dumazet wrote:
> >
> > You could try latencytop, I am not sure if some obvious things will
> > popup.
> For sure i did. Nothing unusual here, max 5ms latency
> Cause                                                Maximum     
> Percentage
> [__skb_recv_datagram]                               4.1 msec   

Interesting

So your workload is a mix of pings and receive traffic.

The problem is that the softirq handler might use a lot of time to complete
the receives, because of TCP stack complexity. And BQL uses a softirq to
restart the transmits on the same CPU.

tcp_data_queue() can copy the received data directly to user space
(taking page faults...).

Could you check if net-next behaves the same?
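
(In case it saves a step: by net-next I mean the usual tree, and assuming it
is still at its usual location, fetching it would be roughly:)

# git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git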



Denys Fedoryshchenko May 21, 2012, 8:06 a.m. UTC | #12
On 2012-05-21 06:56, Eric Dumazet wrote:
> On Sun, 2012-05-20 at 22:18 +0300, Denys Fedoryshchenko wrote:
>> On 2012-05-20 22:07, Eric Dumazet wrote:
>> >
>> > You could try latencytop, I am not sure if some obvious things 
>> will
>> > popup.
>> For sure i did. Nothing unusual here, max 5ms latency
>> Cause                                                Maximum
>> Percentage
>> [__skb_recv_datagram]                               4.1 msec
>
> Interesting
>
> So your workload is a mix of pings, and receive.
>
> Problem is softirq handler might use a lot of time to complete the
> receives, because of TCP stack complexity. And BQL use softirq to
> restart the transmits on the same cpu.
>
> tcp_data_queue() can copy the received data directly to user space.
> (taking page faults...)
>
> Could you check if net-next behaves the same ?
Not sure it is a lot of time; after all it is a 2 x quad-core machine, which
should be fast enough for pings. It seems it would cause stalls on small
packets even more.

Tested latest git and net-next, still the same stalls.
The hardware latency detector is silent by the way, so there is no
significant SMI.


---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Eric Dumazet May 21, 2012, 8:30 a.m. UTC | #13
On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:

> Not sure it is a lot of time, after all it is 2 x core quad machine, 
> should be enough fast for pings.
> It will cause stalls on small packets even more seems.
> 
> Tested latest git, net-next, still the same, stalls.
> hardware latency detector are silent by the way, so there is no 
> significant SMI.
> 

I am trying to reproduce your problem here with no luck yet.

I wonder if softirqs are correctly scheduled on your machine.
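
(One quick way to check that, using only standard counters, would be to watch
the per-CPU NET_TX/NET_RX lines while reproducing the stall:)

# watch -d -n1 'grep -E "NET_TX|NET_RX" /proc/softirqs'

If the NET_TX count on the CPU handling eth0's interrupt stops moving during
a stall, that would point at the softirq not being serviced in time.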


Eric Dumazet May 21, 2012, 8:40 a.m. UTC | #14
On Mon, 2012-05-21 at 10:30 +0200, Eric Dumazet wrote:
> On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:
> 
> > Not sure it is a lot of time, after all it is 2 x core quad machine, 
> > should be enough fast for pings.
> > It will cause stalls on small packets even more seems.
> > 
> > Tested latest git, net-next, still the same, stalls.
> > hardware latency detector are silent by the way, so there is no 
> > significant SMI.
> > 
> 
> I am trying to reproduce your problem here with no luck yet.
> 
> I wonder of softirq are correctly scheduled on your machine
> 

By the way, the fact that you have 8 CPUs is irrelevant.

Only one CPU has queued the NET_TX_SOFTIRQ softirq (serviced by
net_tx_action()).

If this CPU is busy servicing other stuff, no other CPU will help.
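
(So it may be worth watching that one CPU specifically, for instance with
mpstat from sysstat, assuming the NIC interrupt stays on a fixed CPU; a %soft
or %irq spike on just that CPU while the others stay idle would fit this
picture:)

# mpstat -P ALL 1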



Denys Fedoryshchenko May 21, 2012, 9:22 a.m. UTC | #15
On 2012-05-21 11:40, Eric Dumazet wrote:
> On Mon, 2012-05-21 at 10:30 +0200, Eric Dumazet wrote:
>> On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:
>>
>> > Not sure it is a lot of time, after all it is 2 x core quad 
>> machine,
>> > should be enough fast for pings.
>> > It will cause stalls on small packets even more seems.
>> >
>> > Tested latest git, net-next, still the same, stalls.
>> > hardware latency detector are silent by the way, so there is no
>> > significant SMI.
>> >
>>
>> I am trying to reproduce your problem here with no luck yet.
>>
>> I wonder of softirq are correctly scheduled on your machine
>>
>
> By the way, fact you have 8 cpus is irrelevant.
>
> Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
> net_tx_action())
>
>
> If this cpu is busy servicing other stuff, no other cpu will help.
>
I'm now installing a fresh and clean Gentoo plus tools, and will provide ssh
access to this host. Probably it will help to troubleshoot the problem.



---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Denys Fedoryshchenko May 22, 2012, 5:11 p.m. UTC | #16
On 2012-05-21 11:40, Eric Dumazet wrote:
> On Mon, 2012-05-21 at 10:30 +0200, Eric Dumazet wrote:
>> On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:
>>
>> > Not sure it is a lot of time, after all it is 2 x core quad 
>> machine,
>> > should be enough fast for pings.
>> > It will cause stalls on small packets even more seems.
>> >
>> > Tested latest git, net-next, still the same, stalls.
>> > hardware latency detector are silent by the way, so there is no
>> > significant SMI.
>> >
>>
>> I am trying to reproduce your problem here with no luck yet.
>>
>> I wonder of softirq are correctly scheduled on your machine
>>
>
> By the way, fact you have 8 cpus is irrelevant.
>
> Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
> net_tx_action())
>
>
> If this cpu is busy servicing other stuff, no other cpu will help.
>
By the way, if the BQL limit goes lower than the MTU, is that considered a
bug? If yes, I can try to upload 3.4 to some servers and add a condition to
WARN_ON if limit < 1500.
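
Or, without patching, I could probably just pin a floor from userspace
through the limit_min knob shown earlier and see whether the stalls go away;
the 3000 below is only an arbitrary value above MTU plus overhead:

# echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min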

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Eric Dumazet May 22, 2012, 5:24 p.m. UTC | #17
On Tue, 2012-05-22 at 20:11 +0300, Denys Fedoryshchenko wrote:

> By the way, if BQL limit is going lower than MTU, is it considered as a 
> bug?
> If yes, i can try to upload 3.4 to some servers and add condition to 
> WARN_ON if limit < 1500.

There is no problem with the BQL limit going lower than the max packet size
(with TSO it can be 64K).

Remember BQL allows one packet to be sent to the device, regardless of its
size.

The next packet might be blocked and stay in the Qdisc.

If your workload is mostly idle but sends bursts of 3 packets, then only one
is sent immediately.

The next packets will wait for the TX completion of the first packet.



Tom Herbert May 25, 2012, 6:01 a.m. UTC | #18
I think there is a potential problem in that netdev_tx_completed could be
called multiple times for the same interrupt, for example if the NAPI poll
routine uses up its budget, is scheduled again, and some new packets are
completed in the meantime. We're looking at a solution to this.

Denys, can you try to increase the netdev budget to see if that has an effect?
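
(That is the net.core.netdev_budget sysctl, whose default should be 300; for
a quick test something like this would do:)

# sysctl net.core.netdev_budget
# sysctl -w net.core.netdev_budget=600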

Thanks,
Tom

On Tue, May 22, 2012 at 10:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-05-22 at 20:11 +0300, Denys Fedoryshchenko wrote:
>
>> By the way, if BQL limit is going lower than MTU, is it considered as a
>> bug?
>> If yes, i can try to upload 3.4 to some servers and add condition to
>> WARN_ON if limit < 1500.
>
> There is no problem with BQL limit going lower than the max packet size.
>
> (With TSO it can be 64K)
>
> Remember BQL allows one packet to be sent to device, regardless of its
> size.
>
> Next packet might be blocked/stay in Qdisc
>
> If your workload is mostly idle, but sending bursts of 3 packets, then
> only one is immediately sent.
>
> Next packets shall wait the TX completion of first packet.
>
>
>
Eric Dumazet May 25, 2012, 6:22 a.m. UTC | #19
On Thu, 2012-05-24 at 23:01 -0700, Tom Herbert wrote:
> I think there is a potential problem in that netdev_tx_completed could
> be called multiple times for the same interrupt, for example if napi
> poll routine completes it's budget and is scheduled again and some new
> packets are completed.  We're looking at a solution to this.
> 
> Denys, can you try to increase the netdev budget to see if that has an effect?

TX completion has no budget, so I am not sure what you mean.

The e1000e driver indeed has a limit: it cannot clean more than
tx_ring->count frames per e1000_clean_tx_irq() invocation.

But with BQL, this should not happen?

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		256
RX Mini:	0
RX Jumbo:	0
TX:		256


Tom Herbert May 25, 2012, 4:59 p.m. UTC | #20
> TX completion has no budget, I am not sure what you mean.
>
Right, it's the budget on RX that can be a factor.  TX completion is
done from the NAPI poll routine, so if RX does not complete, the NAPI
is rescheduled and TX completion is done again for the same HW
interrupt.

We need to remove the constraint that netdev_completed can only be
called once per interrupt...
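
For reference, a hedged sketch of the pattern being discussed (hypothetical
names, not the real e1000e poll routine): TX cleanup runs on every NAPI
poll, so when RX work exhausts the budget and the poll is rescheduled, the
TX side (and with it netdev_completed_queue()) runs more than once for a
single hardware interrupt.

#include <linux/netdevice.h>

struct sketch_adapter {				/* stands in for e1000_adapter */
	struct napi_struct napi;
};

/* Hypothetical helpers: clean the rings and re-enable the device IRQ. */
static void sketch_clean_tx_ring(struct sketch_adapter *adapter);
static int  sketch_clean_rx_ring(struct sketch_adapter *adapter, int budget);
static void sketch_irq_enable(struct sketch_adapter *adapter);

static int sketch_poll(struct napi_struct *napi, int budget)
{
	struct sketch_adapter *adapter =
		container_of(napi, struct sketch_adapter, napi);
	int rx_done;

	/* Completes TX descriptors and calls netdev_completed_queue(). */
	sketch_clean_tx_ring(adapter);

	rx_done = sketch_clean_rx_ring(adapter, budget);
	if (rx_done < budget) {
		/* RX fully drained: stop polling, re-enable interrupts. */
		napi_complete(napi);
		sketch_irq_enable(adapter);
	}
	/* Otherwise NAPI polls again and the TX cleanup above re-runs. */
	return rx_done;
}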

Tom

> e1000e driver indeed has a limit : It cannot clean more than
> tx_ring->count frames per e1000_clean_tx_irq() invocation.
>
> But with BQL, this should not happen ?
>
> # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             256
> RX Mini:        0
> RX Jumbo:       0
> TX:             256
>
>
Eric Dumazet May 25, 2012, 5:18 p.m. UTC | #21
On Fri, 2012-05-25 at 09:59 -0700, Tom Herbert wrote:
> > TX completion has no budget, I am not sure what you mean.
> >
> Right, it's the budget on RX that can be a factor.  TX completion is
> done from the NAPI poll routine, so if RX does not complete the NAPI
> is rescheduled and TX completion is done again for the same HW
> interrupt.
> 
> We need to remove the constraint that netdev_completed can only be
> called once per interrupt...

It is not clear where this constraint is in the code.

Under heavy load, we can be in a loop situation, with one cpu serving NAPI
for a bunch of devices (no more hardware interrupts are delivered since
we don't re-enable them at all).



Hiroaki SHIMODA May 29, 2012, 2:25 p.m. UTC | #22
On Sun, 20 May 2012 10:40:41 -0700
Tom Herbert <therbert@google.com> wrote:

> Tried to reproduce:
> 
> May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0:
> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
> dynamic conservative mode
> May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1:
> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
> dynamic conservative mode
> 
> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> 
> Following above instructions to repro gives:
> 
> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 time=0.358 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 time=0.330 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 time=0.337 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 time=0.375 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 time=0.359 ms
> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880
> ttl=64 time=0.380 ms
> 
> And I didn't see the stalls. This was on an Intel machine.  The limit
> was stable, went up to around 28K when opened large file and tended to
> stay between 15-28K.
> 
> The describe problem seems to have characteristics that transmit
> interrupts are not at all periodic, and it would seem that some are
> taking hundreds of milliseconds to pop.  I don't see anything that
> would cause that in the NIC, is it possible there is some activity on
> the machines periodically and often holding down interrupts for  long
> periods of time.  Are there any peculiarities on Sun Fire in interrupt
> handling?
> 
> Can you also provide an 'ethtool -c eth0'
> 
> Thanks,
> Tom

I also observed similar behaviour in the following environment.

03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

[    2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
[    2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
[    2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1
[    2.979653] e1000e 0000:03:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X
[    2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X
[    2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X
[    3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 48:5b:39:75:91:bd
[    3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network Connection
[    3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF

I tried some coalesce options with 'ethtool -C eth0', but
none of them helped.

If I understand the code and spec correctly, TX interrupts are
generated when TXDCTL.WTHRESH descriptors have been accumulated
and written back.

I tentatively changed TXDCTL.WTHRESH to 1, and then the latency
spikes seem to disappear.

drivers/net/ethernet/intel/e1000e/e1000.h
@@ -181,7 +181,7 @@ struct e1000_info;
 #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
        (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
         E1000_TXDCTL_COUNT_DESC |                             \
-        (5 << 16) | /* wthresh must be +1 more than desired */\
+        (1 << 16) | /* wthresh must be +1 more than desired */\
         (1 << 8)  | /* hthresh */                             \
         0x1f)       /* pthresh */

(before) $ ping -i0.2 192.168.11.2
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms
64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms
64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms
64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms
64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms
64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms
64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms
64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms
64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms
64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms
64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms
64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms
64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms
64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms
64 bytes from 192.168.11.2: icmp_req=24 ttl=64 time=419 ms
64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms
64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms
64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms
64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms
64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms
64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms
64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms
64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms
64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms
64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms
64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms
64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms
64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms
64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms
64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms

I think the 1000 ms - 2000 ms delay comes from e1000_watchdog_task().

(after) $ ping -i0.2 192.168.11.2
64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms
64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms
64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms
64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms
64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms
64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms
64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms
64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms
64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms
64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms
Tom Herbert May 29, 2012, 2:54 p.m. UTC | #23
Thanks Hiroaki for this description, it looks promising.  Denys, can
you test with his patch?

Tom

On Tue, May 29, 2012 at 7:25 AM, Hiroaki SHIMODA
<shimoda.hiroaki@gmail.com> wrote:
> On Sun, 20 May 2012 10:40:41 -0700
> Tom Herbert <therbert@google.com> wrote:
>
>> Tried to reproduce:
>>
>> May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0:
>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
>> dynamic conservative mode
>> May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1:
>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to
>> dynamic conservative mode
>>
>> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>> Ethernet Controller (Copper) (rev 01)
>> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>> Ethernet Controller (Copper) (rev 01)
>>
>> Following above instructions to repro gives:
>>
>> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 time=0.358 ms
>> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 time=0.330 ms
>> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 time=0.337 ms
>> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 time=0.375 ms
>> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 time=0.359 ms
>> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880
>> ttl=64 time=0.380 ms
>>
>> And I didn't see the stalls. This was on an Intel machine.  The limit
>> was stable, went up to around 28K when opened large file and tended to
>> stay between 15-28K.
>>
>> The describe problem seems to have characteristics that transmit
>> interrupts are not at all periodic, and it would seem that some are
>> taking hundreds of milliseconds to pop.  I don't see anything that
>> would cause that in the NIC, is it possible there is some activity on
>> the machines periodically and often holding down interrupts for  long
>> periods of time.  Are there any peculiarities on Sun Fire in interrupt
>> handling?
>>
>> Can you also provide an 'ethtool -c eth0'
>>
>> Thanks,
>> Tom
>
> I also observed the similar behaviour on the following environment.
>
> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
>
> [    2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
> [    2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
> [    2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1
> [    2.979653] e1000e 0000:03:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> [    2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X
> [    2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X
> [    2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X
> [    3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 48:5b:39:75:91:bd
> [ 3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network Connection
> [ 3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
>
> I tried some coalesce options by 'ethtool -C eth0', but
> anything didn't help.
>
> If I understand the code and spec correctly, TX interrupts are
> generated when TXDCTL.WTHRESH descriptors have been accumulated
> and write backed.
>
> I tentatively changed the TXDCTL.WTHRESH to 1, then it seems
> that latency spikes are disappear.
>
> drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -181,7 +181,7 @@ struct e1000_info;
>  #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
>        (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
>         E1000_TXDCTL_COUNT_DESC |                             \
> -        (5 << 16) | /* wthresh must be +1 more than desired */\
> +        (1 << 16) | /* wthresh must be +1 more than desired */\
>         (1 << 8)  | /* hthresh */                             \
>         0x1f)       /* pthresh */
>
> (before) $ ping -i0.2 192.168.11.2
> PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms
> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms
> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms
> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms
> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms
> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms
> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms
> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms
> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms
> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms
> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms
> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms
> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms
> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms
> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms
> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms
> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms
> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms
> 64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms
> 64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms
> 64 bytes from 192.168.11.2: icmp_req=24 ttl=64 time=419 ms
> 64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms
> 64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms
> 64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms
> 64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms
> 64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms
> 64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms
> 64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms
> 64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms
> 64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms
> 64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms
> 64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms
> 64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms
> 64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms
> 64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms
> 64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms
> 64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms
> 64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms
> 64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms
> 64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms
> 64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms
> 64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms
> 64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms
> 64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms
>
> I think 1000 ms - 2000 ms delay is come from e1000_watchdog_task().
>
> (after) $ ping -i0.2 192.168.11.2
> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms
> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms
> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms
> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms
> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms
> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms
> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms
> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms
> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms
> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms
> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms
> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms
> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms
> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms
> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms
> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms
> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms
> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms
> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms
Eric Dumazet May 29, 2012, 3:11 p.m. UTC | #24
On Tue, 2012-05-29 at 07:54 -0700, Tom Herbert wrote:
> Thanks Hiroaki for this description, it looks promising.  Denys, can
> you test with his patch.
> 
> Tom

Indeed this sounds good.

Hmm, I guess my e1000e has no FLAG2_DMA_BURST in adapter->flags2


Denys Fedoryshchenko May 29, 2012, 7:52 p.m. UTC | #25
Big thanks Hiroaki, yes, this patch improves the situation a lot, and it
definitely fixes the problem.
I cannot reproduce it anymore.

On 2012-05-29 17:54, Tom Herbert wrote:
> Thanks Hiroaki for this description, it looks promising.  Denys, can
> you test with his patch.
>
> Tom
>
> On Tue, May 29, 2012 at 7:25 AM, Hiroaki SHIMODA
> <shimoda.hiroaki@gmail.com> wrote:
>> On Sun, 20 May 2012 10:40:41 -0700
>> Tom Herbert <therbert@google.com> wrote:
>>
>>> Tried to reproduce:
>>>
>>> May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0:
>>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set 
>>> to
>>> dynamic conservative mode
>>> May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1:
>>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set 
>>> to
>>> dynamic conservative mode
>>>
>>> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>>
>>> Following above instructions to repro gives:
>>>
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 
>>> time=0.358 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 
>>> time=0.330 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 
>>> time=0.337 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 
>>> time=0.375 ms
>>> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 
>>> time=0.359 ms
>>> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880
>>> ttl=64 time=0.380 ms
>>>
>>> And I didn't see the stalls. This was on an Intel machine.  The 
>>> limit
>>> was stable, went up to around 28K when opened large file and tended 
>>> to
>>> stay between 15-28K.
>>>
>>> The describe problem seems to have characteristics that transmit
>>> interrupts are not at all periodic, and it would seem that some are
>>> taking hundreds of milliseconds to pop.  I don't see anything that
>>> would cause that in the NIC, is it possible there is some activity 
>>> on
>>> the machines periodically and often holding down interrupts for 
>>>  long
>>> periods of time.  Are there any peculiarities on Sun Fire in 
>>> interrupt
>>> handling?
>>>
>>> Can you also provide an 'ethtool -c eth0'
>>>
>>> Thanks,
>>> Tom
>>
>> I also observed the similar behaviour on the following environment.
>>
>> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit 
>> Network Connection
>>
>> [    2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
>> [    2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>> [    2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1
>> [    2.979653] e1000e 0000:03:00.0: (unregistered net_device): 
>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X
>> [    2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X
>> [    2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X
>> [    3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width 
>> x1) 48:5b:39:75:91:bd
>> [ 3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network 
>> Connection
>> [ 3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: 
>> FFFFFF-0FF
>>
>> I tried some coalesce options by 'ethtool -C eth0', but
>> anything didn't help.
>>
>> If I understand the code and spec correctly, TX interrupts are
>> generated when TXDCTL.WTHRESH descriptors have been accumulated
>> and write backed.
>>
>> I tentatively changed the TXDCTL.WTHRESH to 1, then it seems
>> that latency spikes are disappear.
>>
>> drivers/net/ethernet/intel/e1000e/e1000.h
>> @@ -181,7 +181,7 @@ struct e1000_info;
>>  #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
>>        (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
>>         E1000_TXDCTL_COUNT_DESC |                             \
>> -        (5 << 16) | /* wthresh must be +1 more than desired */\
>> +        (1 << 16) | /* wthresh must be +1 more than desired */\
>>         (1 << 8)  | /* hthresh */                             \
>>         0x1f)       /* pthresh */
>>
>> (before) $ ping -i0.2 192.168.11.2
>> PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
>> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms
>> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms
>> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms
>> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms
>> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms
>> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms
>> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms
>> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms
>> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms
>> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms
>> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms
>> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms
>> 64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms
>> 64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms
>> 64 bytes from 192.168.11.2: icmp_req=24 ttl=64 time=419 ms
>> 64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms
>> 64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms
>> 64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms
>> 64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms
>> 64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms
>> 64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms
>> 64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms
>> 64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms
>> 64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms
>> 64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms
>> 64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms
>> 64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms
>> 64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms
>> 64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms
>> 64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms
>> 64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms
>>
>> I think 1000 ms - 2000 ms delay is come from e1000_watchdog_task().
>>
>> (after) $ ping -i0.2 192.168.11.2
>> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms
>> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms
>> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms
>> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms
>> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms
>> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms
>> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
>> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms
>> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms
>> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms
>> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms
>> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms
>> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms
>> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms
>> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms
>> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms
>> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Hiroaki SHIMODA May 30, 2012, 12:06 a.m. UTC | #26
While reading the bql code, I have some questions.

1) dql_completed() and dql_queued() can be called concurrently,
   so dql->num_queued could change while processing
   dql_completed().
   Is it intentional to read num_queued from "dql->" each time?

2) From the comment in the code
   *   - The queue was over-limit in the previous interval and
   *     when enqueuing it was possible that all queued data
   *     had been consumed.

   and

   * Queue was not starved, check if the limit can be decreased.
   * A decrease is only considered if the queue has been busy in
   * the whole interval (the check above). 

   should the calculation of all_prev_completed take the
   completed == dql->prev_num_queued case into account?
   With the current implementation, the limit shrinks easily and some
   NICs hit TX stalls.
   To mitigate the TX stalls, should we fix all_prev_completed rather
   than individual drivers?

3) The limit calculation fails to consider integer wrap-around in
   one place?

Here is the patch I meant.

diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
@@ -11,22 +11,27 @@
 #include <linux/dynamic_queue_limits.h>
 
 #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
+#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
+#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
 void dql_completed(struct dql *dql, unsigned int count)
 {
 	unsigned int inprogress, prev_inprogress, limit;
-	unsigned int ovlimit, all_prev_completed, completed;
+	unsigned int ovlimit, completed, num_queued;
+	bool all_prev_completed;
+
+	num_queued = dql->num_queued;
 
 	/* Can't complete more than what's in queue */
-	BUG_ON(count > dql->num_queued - dql->num_completed);
+	BUG_ON(count > num_queued - dql->num_completed);
 
 	completed = dql->num_completed + count;
 	limit = dql->limit;
-	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
-	inprogress = dql->num_queued - completed;
+	ovlimit = POSDIFF(num_queued - dql->num_completed, limit);
+	inprogress = num_queued - completed;
 	prev_inprogress = dql->prev_num_queued - dql->num_completed;
-	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
+	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
 
 	if ((ovlimit && !inprogress) ||
 	    (dql->prev_ovlimit && all_prev_completed)) {
@@ -45,7 +50,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 		 *     of bytes both sent and completed in the last interval,
 		 *     plus any previous over-limit.
 		 */
-		limit += POSDIFF(completed, dql->prev_num_queued) +
+		limit += POSDIFFI(completed, dql->prev_num_queued) +
 		     dql->prev_ovlimit;
 		dql->slack_start_time = jiffies;
 		dql->lowest_slack = UINT_MAX;
@@ -104,7 +109,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_ovlimit = ovlimit;
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
-	dql->prev_num_queued = dql->num_queued;
+	dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);
Eric Dumazet May 30, 2012, 8:40 a.m. UTC | #27
On Wed, 2012-05-30 at 09:06 +0900, Hiroaki SHIMODA wrote:
> While reading the bql code, I have some questions.
> 
> 1) dql_completed() and dql_queued() can be called concurrently,
>    so dql->num_queued could change while processing
>    dql_completed().
>    Is it intentional to refer num_queued from "dql->" each time ?
> 

I am not sure it can cause problems, but doing the read once is indeed a
good plan.

> 2) From the comment in the code
>    *   - The queue was over-limit in the previous interval and
>    *     when enqueuing it was possible that all queued data
>    *     had been consumed.
> 
>    and
> 
>    * Queue was not starved, check if the limit can be decreased.
>    * A decrease is only considered if the queue has been busy in
>    * the whole interval (the check above). 
> 
>    the calculation of all_prev_completed should take into account
>    completed == dql->prev_num_queued case ?
>    On current implementation, limit shrinks easily and some NIC
>    hit TX stalls.
>    To mitigate TX stalls, should we fix all_prev_completed rather
>    than individual driver ?
> 

Not sure what you mean

> 3) limit calculation fails to consider integer wrap around in
>    one place ?
> 

Yes

> Here is the patch what I meant.
> 
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> @@ -11,22 +11,27 @@
>  #include <linux/dynamic_queue_limits.h>
>  
>  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>  
>  /* Records completed count and recalculates the queue limit */
>  void dql_completed(struct dql *dql, unsigned int count)
>  {
>  	unsigned int inprogress, prev_inprogress, limit;
> -	unsigned int ovlimit, all_prev_completed, completed;
> +	unsigned int ovlimit, completed, num_queued;
> +	bool all_prev_completed;
> +
> +	num_queued = dql->num_queued;


I suggest :

	num_queued = ACCESS_ONCE(dql->num_queued);
	
Or else compiler is free to do whatever he wants.


Hiroaki SHIMODA May 30, 2012, 10:43 a.m. UTC | #28
On Wed, 30 May 2012 10:40:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2012-05-30 at 09:06 +0900, Hiroaki SHIMODA wrote:
> > While reading the bql code, I have some questions.
> > 
> > 1) dql_completed() and dql_queued() can be called concurrently,
> >    so dql->num_queued could change while processing
> >    dql_completed().
> >    Is it intentional to refer num_queued from "dql->" each time ?
> > 
> 
> not sure it can have problems, but doing the read once is indeed a good
> plan.
> 
> > 2) From the comment in the code
> >    *   - The queue was over-limit in the previous interval and
> >    *     when enqueuing it was possible that all queued data
> >    *     had been consumed.
> > 
> >    and
> > 
> >    * Queue was not starved, check if the limit can be decreased.
> >    * A decrease is only considered if the queue has been busy in
> >    * the whole interval (the check above). 
> > 
> >    the calculation of all_prev_completed should take into account
> >    completed == dql->prev_num_queued case ?
> >    On current implementation, limit shrinks easily and some NIC
> >    hit TX stalls.
> >    To mitigate TX stalls, should we fix all_prev_completed rather
> >    than individual driver ?
> > 
> 
> Not sure what you mean

While examining the ping problem, the pattern below is often observed.

                                               TIME
       dql_queued()         dql_completed()     |
      a) initial state                          |
                                                |
      b) X bytes queued                         V

      c) Y bytes queued
                           d) X bytes completed
      e) Z bytes queued
                           f) Y bytes completed

a) dql->limit has already some value and there is no in-flight packet.
b) X bytes queued.
c) Y bytes queued and excess limit.
d) X bytes completed and dql->prev_ovlimit is set and also
   dql->prev_num_queued is set Y.
e) Z bytes queued.
f) Y bytes completed. inprogress and prev_inprogress are true.

At f), if I read the comment correctly, all_prev_completed becomes
true and limit should be increased. But POSDIFF() ignores
(A == B) case, so limit is decreased.

I think this excessive limit decrement induces the TX stalls.
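
A tiny userspace illustration of that boundary case, using POSDIFF() as it
stands in lib/dynamic_queue_limits.c and the AFTER_EQ() proposed earlier in
this thread:

#include <stdio.h>

#define POSDIFF(A, B)  ((A) > (B) ? (A) - (B) : 0)
#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)

int main(void)
{
	unsigned int completed = 3000, prev_num_queued = 3000;

	/* 0: treated as "not all previously queued data completed",
	 * so the limit-decrease path is taken. */
	printf("POSDIFF  = %u\n", POSDIFF(completed, prev_num_queued));

	/* 1: "all previously queued data completed", which lets the
	 * limit grow instead. */
	printf("AFTER_EQ = %d\n", AFTER_EQ(completed, prev_num_queued));

	return 0;
}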

> 
> > 3) limit calculation fails to consider integer wrap around in
> >    one place ?
> > 
> 
> Yes
> 
> > Here is the patch what I meant.
> > 
> > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> > @@ -11,22 +11,27 @@
> >  #include <linux/dynamic_queue_limits.h>
> >  
> >  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> > +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
> >  
> >  /* Records completed count and recalculates the queue limit */
> >  void dql_completed(struct dql *dql, unsigned int count)
> >  {
> >  	unsigned int inprogress, prev_inprogress, limit;
> > -	unsigned int ovlimit, all_prev_completed, completed;
> > +	unsigned int ovlimit, completed, num_queued;
> > +	bool all_prev_completed;
> > +
> > +	num_queued = dql->num_queued;
> 
> 
> I suggest :
> 
> 	num_queued = ACCESS_ONCE(dql->num_queued);
> 	
> Or else compiler is free to do whatever he wants.

Thank you for your suggestion.
David Laight May 30, 2012, 10:52 a.m. UTC | #29
> > +	num_queued = dql->num_queued;
> 
> 
> I suggest :
> 
> 	num_queued = ACCESS_ONCE(dql->num_queued);
> 	
> Or else compiler is free to do whatever he wants.

Or make the structure member volatile; then the
compiler can only read it once.
Probably worthwhile if the value is expected to
be read like that.

	David


Eric Dumazet May 30, 2012, 11:04 a.m. UTC | #30
On Wed, 2012-05-30 at 11:52 +0100, David Laight wrote:
> > > +	num_queued = dql->num_queued;
> > 
> > 
> > I suggest :
> > 
> > 	num_queued = ACCESS_ONCE(dql->num_queued);
> > 	
> > Or else compiler is free to do whatever he wants.
> 
> Or make the structure member volatile, then the
> compiler can only read it once.

No. Compiler can read it several times. Really.

> Probably worth while if the value is expected to
> be read like that.

Please don't use lazy volatile.

Unless you really want Linus flames (and ours)

ACCESS_ONCE() is much cleaner.
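
For reference, ACCESS_ONCE() in kernels of this era (include/linux/compiler.h)
boils down to a single forced load, which is exactly the snapshot
dql_completed() needs:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	/* in dql_completed(): take one snapshot of the producer-side counter */
	num_queued = ACCESS_ONCE(dql->num_queued);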


Eric Dumazet May 30, 2012, 11:08 a.m. UTC | #31
On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:

> While examining ping problem, below pattern is often observed.
> 
>                                                TIME
>        dql_queued()         dql_completed()     |
>       a) initial state                          |
>                                                 |
>       b) X bytes queued                         V
> 
>       c) Y bytes queued
>                            d) X bytes completed
>       e) Z bytes queued
>                            f) Y bytes completed
> 
> a) dql->limit has already some value and there is no in-flight packet.
> b) X bytes queued.
> c) Y bytes queued and excess limit.
> d) X bytes completed and dql->prev_ovlimit is set and also
>    dql->prev_num_queued is set Y.
> e) Z bytes queued.
> f) Y bytes completed. inprogress and prev_inprogress are true.
> 
> At f), if I read the comment correctly, all_prev_completed becomes
> true and limit should be increased. But POSDIFF() ignores
> (A == B) case, so limit is decreased.

Which POSDIFF(), because there are many ;)

By the way, given the complexity of this, I suggest you split your ideas
into independent patches.

Maybe we should change all POSDIFF(), not selected ones.

#define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)



Eric Dumazet May 30, 2012, 11:12 a.m. UTC | #32
On Wed, 2012-05-30 at 13:04 +0200, Eric Dumazet wrote:

> Please don't use lazy volatile.
> 
> Unless you really want Linus flames (and ours)
> 
> ACCESS_ONCE() is much cleaner.
> 

http://yarchive.net/comp/linux/ACCESS_ONCE.html


Joe Perches May 30, 2012, 11:20 a.m. UTC | #33
On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote:
> Maybe we should change all POSDIFF(), not selected ones.
> #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)

maybe use an eval-once statement expression macro
({
	typeof (A) _a = (A);
	typeof (B) _b = (B);
	(int)(_a - _b) > 0 ? _a - _b : 0;
})


Hiroaki SHIMODA May 30, 2012, 11:29 a.m. UTC | #34
On Wed, 30 May 2012 13:08:27 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
> 
> > While examining ping problem, below pattern is often observed.
> > 
> >                                                TIME
> >        dql_queued()         dql_completed()     |
> >       a) initial state                          |
> >                                                 |
> >       b) X bytes queued                         V
> > 
> >       c) Y bytes queued
> >                            d) X bytes completed
> >       e) Z bytes queued
> >                            f) Y bytes completed
> > 
> > a) dql->limit has already some value and there is no in-flight packet.
> > b) X bytes queued.
> > c) Y bytes queued and excess limit.
> > d) X bytes completed and dql->prev_ovlimit is set and also
> >    dql->prev_num_queued is set Y.
> > e) Z bytes queued.
> > f) Y bytes completed. inprogress and prev_inprogress are true.
> > 
> > At f), if I read the comment correctly, all_prev_completed becomes
> > true and limit should be increased. But POSDIFF() ignores
> > (A == B) case, so limit is decreased.
> 
> Which POSDIFF(), because there are many ;)

I mean,
	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);

> By the way, given complexity of this I suggest you split your ideas in
> independent patches.

In this case, here is the patch I am thinking of.

diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
@@ -11,12 +11,14 @@
 #include <linux/dynamic_queue_limits.h>
 
 #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
+#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
 void dql_completed(struct dql *dql, unsigned int count)
 {
 	unsigned int inprogress, prev_inprogress, limit;
-	unsigned int ovlimit, all_prev_completed, completed;
+	unsigned int ovlimit, completed;
+	bool all_prev_completed;
 
 	/* Can't complete more than what's in queue */
 	BUG_ON(count > dql->num_queued - dql->num_completed);
@@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
 	inprogress = dql->num_queued - completed;
 	prev_inprogress = dql->prev_num_queued - dql->num_completed;
-	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
+	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
 
 	if ((ovlimit && !inprogress) ||
 	    (dql->prev_ovlimit && all_prev_completed)) {

> Mabe we should change all POSDIFF(), not selected ones.
> 
> #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> 
Eric Dumazet May 30, 2012, 11:59 a.m. UTC | #35
On Wed, 2012-05-30 at 04:20 -0700, Joe Perches wrote:
> On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote:
> > Maybe we should change all POSDIFF(), not selected ones.
> > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> 
> maybe use an eval once statement expression macro
> ({
> 	typeof (A) _a = (A);
> 	typeof (B) _b = (B);
> 	((int)(_a - _b) > 0 ? _a - _b : 0;
> })
> 
> 

Well, many choices are possible, including

#define POSDIFF(A, B) max_t(int, (A) - (B), 0)


Joe Perches May 30, 2012, 2:09 p.m. UTC | #36
On Wed, 2012-05-30 at 13:59 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 04:20 -0700, Joe Perches wrote:
> > On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote:
> > > Maybe we should change all POSDIFF(), not selected ones.
> > > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> > 
> > maybe use an eval once statement expression macro
> > ({
> > 	typeof (A) _a = (A);
> > 	typeof (B) _b = (B);
> > 	((int)(_a - _b) > 0 ? _a - _b : 0;
> > })
> 
> Well, many choices are possible, including
> #define POSDIFF(A, B) max_t(int, (A) - (B), 0);

Maybe, if the intent is to always return an int;
the current macro doesn't.

Whatever evaluates A and B once would be better.
cheers, Joe

Eric Dumazet May 30, 2012, 2:42 p.m. UTC | #37
On Wed, 2012-05-30 at 07:09 -0700, Joe Perches wrote:

> Whatever evals A and B once would be better.

Why do you believe they could be evaluated several times?

#define POSDIFF(A, B) max_t(int, (A) - (B), 0)

(A) - (B) is done once, or it's a huge max_t() bug.

By the way, we handle 32-bit values here. There is no way BQL can overflow
a 4GB limit.
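
For context, max_t() in kernels of this era (include/linux/kernel.h) is
essentially a statement expression that assigns each argument to a
temporary, so (A) - (B) is indeed evaluated exactly once:

#define max_t(type, x, y) ({			\
	type __max1 = (x);			\
	type __max2 = (y);			\
	__max1 > __max2 ? __max1 : __max2; })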



Eric Dumazet May 30, 2012, 2:49 p.m. UTC | #38
On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote:
> On Wed, 30 May 2012 13:08:27 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
> > 
> > > While examining ping problem, below pattern is often observed.
> > > 
> > >                                                TIME
> > >        dql_queued()         dql_completed()     |
> > >       a) initial state                          |
> > >                                                 |
> > >       b) X bytes queued                         V
> > > 
> > >       c) Y bytes queued
> > >                            d) X bytes completed
> > >       e) Z bytes queued
> > >                            f) Y bytes completed
> > > 
> > > a) dql->limit has already some value and there is no in-flight packet.
> > > b) X bytes queued.
> > > c) Y bytes queued and excess limit.
> > > d) X bytes completed and dql->prev_ovlimit is set and also
> > >    dql->prev_num_queued is set Y.
> > > e) Z bytes queued.
> > > f) Y bytes completed. inprogress and prev_inprogress are true.
> > > 
> > > At f), if I read the comment correctly, all_prev_completed becomes
> > > true and limit should be increased. But POSDIFF() ignores
> > > (A == B) case, so limit is decreased.
> > 
> > Which POSDIFF(), because there are many ;)
> 
> I mean,
> 	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> 
> > By the way, given complexity of this I suggest you split your ideas in
> > independent patches.
> 
> In this case, here is the patch what I thinking.
> 
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> @@ -11,12 +11,14 @@
>  #include <linux/dynamic_queue_limits.h>
>  
>  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>  
>  /* Records completed count and recalculates the queue limit */
>  void dql_completed(struct dql *dql, unsigned int count)
>  {
>  	unsigned int inprogress, prev_inprogress, limit;
> -	unsigned int ovlimit, all_prev_completed, completed;
> +	unsigned int ovlimit, completed;
> +	bool all_prev_completed;
>  
>  	/* Can't complete more than what's in queue */
>  	BUG_ON(count > dql->num_queued - dql->num_completed);
> @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>  	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
>  	inprogress = dql->num_queued - completed;
>  	prev_inprogress = dql->prev_num_queued - dql->num_completed;
> -	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> +	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
>  
>  	if ((ovlimit && !inprogress) ||
>  	    (dql->prev_ovlimit && all_prev_completed)) {

I am fine with this one.

Can you send official patches, please?



Joe Perches May 30, 2012, 2:50 p.m. UTC | #39
On Wed, 2012-05-30 at 16:42 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 07:09 -0700, Joe Perches wrote:
> 
> > Whatever evals A and B once would be better.
> 
> Why do you believe they could be evaluated several time ?
> 
> #define POSDIFF(A, B) max_t(int, (A) - (B), 0)

I don't and didn't say I did.

> By the way, we handle 32bit values here. No way BQL can overflow a 4GB
> limit.

swell.

Dave Taht May 30, 2012, 3 p.m. UTC | #40
On 05/30/2012 07:49 AM, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote:
>> On Wed, 30 May 2012 13:08:27 +0200
>> Eric Dumazet<eric.dumazet@gmail.com>  wrote:
>>
>>> On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
>>>
>>>> While examining ping problem, below pattern is often observed.
>>>>
>>>>                                                 TIME
>>>>         dql_queued()         dql_completed()     |
>>>>        a) initial state                          |
>>>>                                                  |
>>>>        b) X bytes queued                         V
>>>>
>>>>        c) Y bytes queued
>>>>                             d) X bytes completed
>>>>        e) Z bytes queued
>>>>                             f) Y bytes completed
>>>>
>>>> a) dql->limit has already some value and there is no in-flight packet.
>>>> b) X bytes queued.
>>>> c) Y bytes queued and excess limit.
>>>> d) X bytes completed and dql->prev_ovlimit is set and also
>>>>     dql->prev_num_queued is set Y.
>>>> e) Z bytes queued.
>>>> f) Y bytes completed. inprogress and prev_inprogress are true.
>>>>
>>>> At f), if I read the comment correctly, all_prev_completed becomes
>>>> true and limit should be increased. But POSDIFF() ignores
>>>> (A == B) case, so limit is decreased.
>>> Which POSDIFF(), because there are many ;)
>> I mean,
>> 	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
>>
>>> By the way, given complexity of this I suggest you split your ideas in
>>> independent patches.
>> In this case, here is the patch what I thinking.
>>
>> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
>> @@ -11,12 +11,14 @@
>>   #include <linux/dynamic_queue_limits.h>
>>
>>   #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
>> +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>>
>>   /* Records completed count and recalculates the queue limit */
>>   void dql_completed(struct dql *dql, unsigned int count)
>>   {
>>   	unsigned int inprogress, prev_inprogress, limit;
>> -	unsigned int ovlimit, all_prev_completed, completed;
>> +	unsigned int ovlimit, completed;
>> +	bool all_prev_completed;
>>
>>   	/* Can't complete more than what's in queue */
>>   	BUG_ON(count > dql->num_queued - dql->num_completed);
>> @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>>   	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
>>   	inprogress = dql->num_queued - completed;
>>   	prev_inprogress = dql->prev_num_queued - dql->num_completed;
>> -	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
>> +	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
>>
>>   	if ((ovlimit && !inprogress) ||
>>   	    (dql->prev_ovlimit && all_prev_completed)) {
> I am fine with this one.
>
> Can you send official patches please ?

While this looks encouraging, BQL presently overbuffers by about a
factor of 2 in low-rate (sub-100Mbit) scenarios
on the hardware I have available to me.

I look forward to re-running benchmarks with this patch, however.

>
>
Hiroaki SHIMODA May 30, 2012, 10:19 p.m. UTC | #41
On Wed, 30 May 2012 16:49:02 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote:
> > On Wed, 30 May 2012 13:08:27 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
> > > 
> > > > While examining ping problem, below pattern is often observed.
> > > > 
> > > >                                                TIME
> > > >        dql_queued()         dql_completed()     |
> > > >       a) initial state                          |
> > > >                                                 |
> > > >       b) X bytes queued                         V
> > > > 
> > > >       c) Y bytes queued
> > > >                            d) X bytes completed
> > > >       e) Z bytes queued
> > > >                            f) Y bytes completed
> > > > 
> > > > a) dql->limit has already some value and there is no in-flight packet.
> > > > b) X bytes queued.
> > > > c) Y bytes queued and excess limit.
> > > > d) X bytes completed and dql->prev_ovlimit is set and also
> > > >    dql->prev_num_queued is set Y.
> > > > e) Z bytes queued.
> > > > f) Y bytes completed. inprogress and prev_inprogress are true.
> > > > 
> > > > At f), if I read the comment correctly, all_prev_completed becomes
> > > > true and the limit should be increased. But POSDIFF() ignores the
> > > > (A == B) case, so the limit is decreased instead.
> > > 
> > > Which POSDIFF(), because there are many ;)
> > 
> > I mean,
> > 	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> > 
> > > By the way, given the complexity of this, I suggest you split your
> > > ideas into independent patches.
> > 
> > In this case, here is the patch I am thinking of.
> > 
> > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> > @@ -11,12 +11,14 @@
> >  #include <linux/dynamic_queue_limits.h>
> >  
> >  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
> >  
> >  /* Records completed count and recalculates the queue limit */
> >  void dql_completed(struct dql *dql, unsigned int count)
> >  {
> >  	unsigned int inprogress, prev_inprogress, limit;
> > -	unsigned int ovlimit, all_prev_completed, completed;
> > +	unsigned int ovlimit, completed;
> > +	bool all_prev_completed;
> >  
> >  	/* Can't complete more than what's in queue */
> >  	BUG_ON(count > dql->num_queued - dql->num_completed);
> > @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
> >  	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
> >  	inprogress = dql->num_queued - completed;
> >  	prev_inprogress = dql->prev_num_queued - dql->num_completed;
> > -	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> > +	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
> >  
> >  	if ((ovlimit && !inprogress) ||
> >  	    (dql->prev_ovlimit && all_prev_completed)) {
> 
> I am fine with this one.
> 
> Can you send official patches please ?

Thanks. I'll post.
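
For readers following along, here is a minimal standalone sketch (plain C,
outside the kernel; the counter values are made up) of the boundary case
discussed above: POSDIFF() treats completed == prev_num_queued as "not all
completed", while the proposed AFTER_EQ() counts it.

/* Standalone demonstration only; not kernel code.  Macros copied from
 * the thread (POSDIFF from lib/dynamic_queue_limits.c, AFTER_EQ from
 * Hiroaki's patch). */
#include <stdio.h>

#define POSDIFF(A, B)  ((A) > (B) ? (A) - (B) : 0)
#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)

int main(void)
{
	unsigned int completed = 3000, prev_num_queued = 3000;

	/* Step f): everything queued before the previous dql_completed()
	 * call has now completed, so the limit should be allowed to grow. */
	printf("POSDIFF  = %u  (zero, so 'all_prev_completed' is false)\n",
	       POSDIFF(completed, prev_num_queued));
	printf("AFTER_EQ = %d  (non-zero, so 'all_prev_completed' is true)\n",
	       AFTER_EQ(completed, prev_num_queued));
	return 0;
}

Compiled with gcc, this prints POSDIFF = 0 and AFTER_EQ = 1 for equal
counters, matching the case at step f).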
Eric Dumazet June 6, 2012, 5:10 a.m. UTC | #42
On Tue, 2012-05-29 at 23:25 +0900, Hiroaki SHIMODA wrote:

> If I understand the code and spec correctly, TX interrupts are
> generated when TXDCTL.WTHRESH descriptors have been accumulated
> and written back.
> 
> I tentatively changed TXDCTL.WTHRESH to 1, and then the latency
> spikes seem to disappear.
> 
> drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -181,7 +181,7 @@ struct e1000_info;
>  #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
>         (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
>          E1000_TXDCTL_COUNT_DESC |                             \
> -        (5 << 16) | /* wthresh must be +1 more than desired */\
> +        (1 << 16) | /* wthresh must be +1 more than desired */\
>          (1 << 8)  | /* hthresh */                             \
>          0x1f)       /* pthresh */
> 


Was this patch officially submitted ?

Thanks !


Hiroaki SHIMODA June 6, 2012, 8:43 a.m. UTC | #43
On Wed, 06 Jun 2012 07:10:13 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2012-05-29 at 23:25 +0900, Hiroaki SHIMODA wrote:
> 
> > If I understand the code and spec correctly, TX interrupts are
> > generated when TXDCTL.WTHRESH descriptors have been accumulated
> > and written back.
> > 
> > I tentatively changed TXDCTL.WTHRESH to 1, and then the latency
> > spikes seem to disappear.
> > 
> > drivers/net/ethernet/intel/e1000e/e1000.h
> > @@ -181,7 +181,7 @@ struct e1000_info;
> >  #define E1000_TXDCTL_DMA_BURST_ENABLE                          \
> >         (E1000_TXDCTL_GRAN | /* set descriptor granularity */  \
> >          E1000_TXDCTL_COUNT_DESC |                             \
> > -        (5 << 16) | /* wthresh must be +1 more than desired */\
> > +        (1 << 16) | /* wthresh must be +1 more than desired */\
> >          (1 << 8)  | /* hthresh */                             \
> >          0x1f)       /* pthresh */
> > 
> 
> 
> Was this patch officially submitted ?
> 
> Thanks !

Sorry for the long delay. I'll post.
(I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5.)
Jesse Brandeburg June 6, 2012, 4:26 p.m. UTC | #44
On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
> Sorry for the long delay. I'll post.
> (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5.)

I don't like changing WTHRESH wholesale: making a global change to
WTHRESH in e1000e just to fix this one bug (likely specific to a
particular chip/hardware) will adversely affect performance on the many
models supported by e1000e that do not demonstrate any problem.  We could
possibly check something in for just ESB2LAN (S5000 chipset).

Other people (Tom Herbert) with this same chipset have been unable to
reproduce this issue, right?
Eric Dumazet June 6, 2012, 5:05 p.m. UTC | #45
On Wed, 2012-06-06 at 09:26 -0700, Jesse Brandeburg wrote:
> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
> > Sorry for the long delay. I'll post.
> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5.)
> 
> I don't like changing WTHRESH wholesale: making a global change to
> WTHRESH in e1000e just to fix this one bug (likely specific to a
> particular chip/hardware) will adversely affect performance on the many
> models supported by e1000e that do not demonstrate any problem.  We could
> possibly check something in for just ESB2LAN (S5000 chipset).
> 
> Other people (Tom Herbert) with this same chipset have been unable to
> reproduce this issue, right?

That's because Tom Herbert (and I, on my T420s laptop)
don't have FLAG2_DMA_BURST in adapter->flags2.

So it depends on the particular e1000e variant, not all of them...

You can't hold a TX completion indefinitely; this breaks BQL, but also
other things.



Hiroaki SHIMODA June 6, 2012, 5:19 p.m. UTC | #46
On Wed, 6 Jun 2012 09:26:35 -0700
Jesse Brandeburg <jesse.brandeburg@intel.com> wrote:

> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
> > Sorry for the long delay. I'll post.
> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5.)
> 
> I don't like changing WTHRESH wholesale: making a global change to
> WTHRESH in e1000e just to fix this one bug (likely specific to a
> particular chip/hardware) will adversely affect performance on the many
> models supported by e1000e that do not demonstrate any problem.  We could
> possibly check something in for just ESB2LAN (S5000 chipset).
> 
> Other people (Tom Herbert) with this same chipset have been unable to
> reproduce this issue, right?

I understand your performance concern.

The affected chips would be e1000_82571, e1000_82572, e1000_82574
and e1000_80003es2lan, which have the FLAG2_DMA_BURST bit set in
adapter->flags2.
Anyway, I have no objection if the Intel guys NACK my patch and
provide the right fix. But in that case please also consider the
82574L chip, on which I observed similar behaviour.

Thanks.
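
As an aside, here is a toy model (standalone C, not driver code; the helper
name and the simplified flag/enum definitions are invented for illustration)
of the kind of per-MAC gating Jesse suggests: apply the WTHRESH=1 workaround
only to the DMA-burst-capable parts listed above, and leave everything else
at the current default.

#include <stdio.h>

/* Simplified stand-ins for the driver's types; invented for this sketch. */
enum mac_type { MAC_82571, MAC_82572, MAC_82574, MAC_80003ES2LAN, MAC_OTHER };
#define FLAG2_DMA_BURST (1u << 0)

/* Pick a WTHRESH value: 1 for the parts the thread identifies as affected,
 * 5 (the current driver default) otherwise. */
static unsigned int pick_wthresh(enum mac_type mac, unsigned int flags2)
{
	if (!(flags2 & FLAG2_DMA_BURST))
		return 5;
	switch (mac) {
	case MAC_82571:
	case MAC_82572:
	case MAC_82574:
	case MAC_80003ES2LAN:
		return 1;
	default:
		return 5;
	}
}

int main(void)
{
	printf("80003ES2LAN: wthresh=%u\n",
	       pick_wthresh(MAC_80003ES2LAN, FLAG2_DMA_BURST));
	printf("other MAC:   wthresh=%u\n",
	       pick_wthresh(MAC_OTHER, FLAG2_DMA_BURST));
	return 0;
}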
David Miller June 6, 2012, 5:26 p.m. UTC | #47
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 06 Jun 2012 19:05:04 +0200

> You can't hold a TX completion indefinitely; this breaks BQL, but also
> other things.

True.
Tom Herbert June 6, 2012, 6:21 p.m. UTC | #48
I'm not sure what the exact effect of WTHRESH is here.  Does
the device coalesce 5 completions regardless of size?  Would the
problem be avoided if the BQL limit_min were the MTU, or could the
same issue be hit with packets larger than 64 bytes?

Tom

On Wed, Jun 6, 2012 at 10:19 AM, Hiroaki SHIMODA
<shimoda.hiroaki@gmail.com> wrote:
> On Wed, 6 Jun 2012 09:26:35 -0700
> Jesse Brandeburg <jesse.brandeburg@intel.com> wrote:
>
>> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
>> > Sorry for the long delay. I'll post.
>> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5.)
>>
>> I don't like changing WTHRESH wholesale: making a global change to
>> WTHRESH in e1000e just to fix this one bug (likely specific to a
>> particular chip/hardware) will adversely affect performance on the many
>> models supported by e1000e that do not demonstrate any problem.  We could
>> possibly check something in for just ESB2LAN (S5000 chipset).
>>
>> Other people (Tom Herbert) with this same chipset have been unable to
>> reproduce this issue, right?
>
> I understand your performance concern.
>
> The affected chip would be e1000_82571, e1000_82572, e1000_82574
> and e1000_80003es2lan which have FLAG2_DMA_BURST bit in
> adapter->flags2.
> Anyway, I have no objection intel guys NACK to my patch and
> provide right fix. But in that case please consider 82574L chip too
> which I observed similar behaviour.
>
> Thanks.
David Miller June 6, 2012, 6:23 p.m. UTC | #49
From: Tom Herbert <therbert@google.com>
Date: Wed, 6 Jun 2012 11:21:40 -0700

> I'm not sure what the exact effect of WTHRESH is here.  Does
> the device coalesce 5 completions regardless of size?  Would the
> problem be avoided if the BQL limit_min were the MTU, or could the
> same issue be hit with packets larger than 64 bytes?

The problem is that no TX completions are signalled until at
least WTHRESH descriptors are pending.

BQL is the least of the problems generated by this kind of behavior.

All drivers must complete TX in a small, finite amount of time, so
it is absolutely illegal to have the behavior that WTHRESH > 1
gives.
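
For context, the BQL hooks this thread keeps referring to (and which the
workaround patch at the bottom of this page comments out) are normally
paired roughly as in the following sketch; this is not the actual e1000e
code, and the function names are placeholders.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: placeholder functions showing where the BQL calls sit. */

static void example_xmit(struct net_device *netdev, struct sk_buff *skb)
{
	/* ...post the skb to the hardware TX ring here... */

	/* Tell BQL how many bytes are now in flight. */
	netdev_sent_queue(netdev, skb->len);
}

static void example_tx_clean(struct net_device *netdev,
			     unsigned int pkts_compl,
			     unsigned int bytes_compl)
{
	/* ...reap descriptors the hardware has written back here... */

	/* Report completions; BQL adjusts the queue limit from this. */
	netdev_completed_queue(netdev, pkts_compl, bytes_compl);
}

static void example_ring_reset(struct net_device *netdev)
{
	/* Forget any outstanding byte count when the ring is torn down. */
	netdev_reset_queue(netdev);
}

When write-backs are withheld, bytes_compl stops flowing, so BQL keeps the
queue throttled; that appears to be the stall pattern reported in this
thread.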
stephen hemminger June 6, 2012, 6:46 p.m. UTC | #50
On Wed, 06 Jun 2012 11:23:32 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Wed, 6 Jun 2012 11:21:40 -0700
> 
> > I'm not sure what the exact effect of WTHRESH is here.  Does
> > the device coalesce 5 completions regardless of size?  Would the
> > problem be avoided if the BQL limit_min were the MTU, or could the
> > same issue be hit with packets larger than 64 bytes?
> 
> The problem is that no TX completions are signalled until at
> least WTHRESH descriptors are pending.
> 
> BQL is the least of the problems generated by this kind of behavior.
> 
> All drivers must complete TX in a small, finite amount of time, so
> it is absolutely illegal to have the behavior that WTHRESH > 1
> gives.

The TX completion is also controlled by the programming of the corresponding
interrupt moderation register (EITR).  It makes sense to hold off a little
bit to try and reduce the TX completion interrupt load. 

From the Intel manual:

Descriptors are written back in one of three cases:
• TXDCTL[n].WTHRESH = 0b and a descriptor which has RS set is ready to be written back
• The corresponding EITR counter has reached zero
• TXDCTL[n].WTHRESH > 0b and TXDCTL[n].WTHRESH descriptors have accumulated

For the first condition, write-backs are immediate. This is the default operation and is backward
compatible with previous device implementations.
The other two conditions are only valid if descriptor bursting is enabled (Section 8.12.13). In the
second condition, the EITR counter is used to force timely write-back of descriptors. The first packet
after timer initialization starts the timer. Timer expiration flushes any accumulated descriptors and sets
an interrupt event (TXDW).
For the final condition, if TXDCTL[n].WTHRESH descriptors are ready for write-back, the write-back is
performed.
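
To make the thresholds concrete, here is a small standalone decode of the
value composed by the E1000_TXDCTL_DMA_BURST_ENABLE define quoted earlier,
assuming the 6-bit PTHRESH/HTHRESH/WTHRESH fields sit at bits 5:0, 13:8 and
21:16 as in the 82571-class datasheets (the GRAN and COUNT_DESC bits are
omitted here).

#include <stdio.h>

/* Field layout assumed from the 82571-class datasheets: 6-bit PTHRESH at
 * bits 5:0, HTHRESH at bits 13:8, WTHRESH at bits 21:16. */
#define TXDCTL_FIELD(val, shift) (((val) >> (shift)) & 0x3f)

int main(void)
{
	/* Threshold part of the E1000_TXDCTL_DMA_BURST_ENABLE define
	 * quoted in the thread. */
	unsigned int txdctl = (5 << 16) | (1 << 8) | 0x1f;

	printf("pthresh=%u hthresh=%u wthresh=%u\n",
	       TXDCTL_FIELD(txdctl, 0),
	       TXDCTL_FIELD(txdctl, 8),
	       TXDCTL_FIELD(txdctl, 16));
	/* Per the manual text above, with wthresh > 0 the device waits
	 * until wthresh descriptors have accumulated before writing them
	 * back, unless the EITR timer fires first. */
	return 0;
}

The value in dispute is the wthresh field: 5 lets the device sit on
completed descriptors until enough accumulate or EITR fires, while
Hiroaki's change to 1 effectively requests immediate write-back.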
diff mbox

Patch

--- netdev.c    2012-05-12 20:08:37.000000000 +0300
+++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
@@ -1135,7 +1135,7 @@ 

         tx_ring->next_to_clean = i;

-       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
+//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);

  #define TX_WAKE_THRESHOLD 32
         if (count && netif_carrier_ok(netdev) &&
@@ -2263,7 +2263,7 @@ 
                 e1000_put_txbuf(adapter, buffer_info);
         }

-       netdev_reset_queue(adapter->netdev);
+//     netdev_reset_queue(adapter->netdev);
         size = sizeof(struct e1000_buffer) * tx_ring->count;
         memset(tx_ring->buffer_info, 0, size);

@@ -5056,7 +5056,7 @@ 
         /* if count is 0 then mapping error has occurred */
         count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
         if (count) {
-               netdev_sent_queue(netdev, skb->len);
+//             netdev_sent_queue(netdev, skb->len);
                 e1000_tx_queue(adapter, tx_flags, count);
                 /* Make sure there is space in the ring for the next