Message ID: 88c43001441945e1431609db252b69e7@visp.net.lb
State: RFC, archived
Delegated to: David Miller
Also I notice the limit constantly changing over time (even though I am not touching it).

centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
/sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0

Is it supposed to be like this?

On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
> Found the commit that causes the problem:
>
> author	Tom Herbert <therbert@google.com>
> 	Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
> committer	David S. Miller <davem@davemloft.net>
> 	Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
> commit	3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
> tree	d6670a4f94b2b9dedacc38edb6f0e1306b889f6b
> parent	114cf5802165ee93e3ab461c9c505cd94a08b800
>
> e1000e: Support for byte queue limits
>
> Changes to e1000e to use byte queue limits.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> If I revert it, the problem disappears.
>
> How I reproduce it:
> In two consoles, run a "fast" ping to a nearby host:
> ping 194.146.XXX.XXX -s1472 -i0.0001
> ping 194.146.XXX.XXX -s1472 -i0.1
>
> In a third, open ssh to the host with the "problem", open mcedit, and just
> scroll down a large text file.
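For reference, the byte_queue_limits entries shown above are per-queue sysfs knobs, and limit_min is writable at runtime; pinning it is one way to keep the computed limit from collapsing below a full frame. The paths assume eth0/tx-0 as in the output above, and the 1514-byte floor is an illustrative choice, not a recommendation from the thread:

```shell
# Inspect the current BQL state for eth0, tx queue 0
grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*

# Keep the limit from dropping below one full Ethernet frame (MTU + header)
echo 1514 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min

# Or effectively neutralize BQL on this queue by pinning a large floor
echo 1000000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min
```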
> After a few seconds some "stalls" will occur, and in the ping history I can see:
>
> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>
> If I apply this small patch, the problem disappears. Sure, it is not a
> solution, but let me know how I can help debug the problem further.
>
> --- netdev.c	2012-05-12 20:08:37.000000000 +0300
> +++ netdev.c.patched	2012-05-17 16:32:28.895760472 +0300
> @@ -1135,7 +1135,7 @@
>
>  	tx_ring->next_to_clean = i;
>
> -	netdev_completed_queue(netdev, pkts_compl, bytes_compl);
> +//	netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>
>  #define TX_WAKE_THRESHOLD 32
>  	if (count && netif_carrier_ok(netdev) &&
> @@ -2263,7 +2263,7 @@
>  		e1000_put_txbuf(adapter, buffer_info);
>  	}
>
> -	netdev_reset_queue(adapter->netdev);
> +//	netdev_reset_queue(adapter->netdev);
>  	size = sizeof(struct e1000_buffer) * tx_ring->count;
>  	memset(tx_ring->buffer_info, 0, size);
>
> @@ -5056,7 +5056,7 @@
>  	/* if count is 0 then mapping error has occurred */
>  	count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
>  	if (count) {
> -		netdev_sent_queue(netdev, skb->len);
> +//		netdev_sent_queue(netdev, skb->len);
>  		e1000_tx_queue(adapter, tx_flags, count);
>  		/* Make sure there is space in the ring for the next send. */
>  		e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>
>
> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>> Hi
>>
>> I have two identical servers, Sun Fire X4150; they run different flavors of Linux, x86_64 and i386.
>>
>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>>
>> The interface I am using now:
>> # ethtool -i eth0
>> driver: e1000e
>> version: 1.9.5-k
>> firmware-version: 2.1-11
>> bus-info: 0000:04:00.0
>>
>> There are 2 CPUs, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz.
>>
>> The i386 box was acting as NAT and shaper, and as soon as I removed the shaper from it, I started to experience strange lockups: traffic is normal for 5-30 seconds, then there is a short lockup for 500-3000 ms (usually around 1000 ms) with the dropped-packets counter increasing. I suspected it was due to load, but it seems I was wrong.
>>
>> Recently, on another server, the x86_64 one I use for development, I upgraded the kernel (it was old, from the 2.6 series), and on a completely idle machine I started to experience the same latency spikes; while just running mc and, for example, typing in a text editor, I notice "stalls".
>>
>> After investigating a little more, I also noticed a small number of drops on the interface. No tcpdump is running. The machine is idle, and the only traffic there is some small broadcasts from the network, my ssh, and ping.
>>
>> Dropped packets in ifconfig:
>> RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
>> The counter increases sometimes, when this stall happens.
>>
>> ethtool -S is clean; there are no dropped packets.
>>
>> I tried to check load (mpstat and perf); there is nothing suspicious, and latencytop doesn't show anything suspicious either. dropwatch reports a lot of drops, but mostly because there are some broadcasts and the like. tcpdump at the moment of such drops doesn't show anything suspicious.
>>
>> I changed the qdisc from the default pfifo_fast to bfifo, without any result.
>> I tried: ethtool -K eth0 tso off gso off gro off sg off , no result.
>>
>> The problem occurs on 3.3.6 - 3.4.0-rc7, most probably on 3.3.0 as well, but I don't remember for sure. I think on some kernels like 3.1 it probably doesn't occur; I will check that soon, because it is not always reliable to reproduce. All tests were done on 3.4.0-rc7.
>>
>> I also ran tcpdump in the background, plus iptables with timestamps, and at the time a stall occurred it seems I was still receiving packets properly; with iperf udp (from some host to this SunFire) no packets were missing at those moments either. But I am sure the RX interface errors are increasing.
>> If I run iperf from the SunFire to a test host, there is packet loss at the moments a stall occurs.
>>
>> I suspect that for some reason the network card stops transmitting, but I am unable to pinpoint the issue. All other hosts in this network are fine and don't have such problems.
>> Can you help me with this, please? Maybe I can provide more debug information, compile with patches, and so on. I will also try to fall back to the 3.1 and 3.0 kernels.
>>
>> Here is how it occurs and how I reproduce it:
>> I just open a file and start scrolling it in mc, then in another console I run ping:
>>
>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 ttl=64 time=0.485 ms
>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 ttl=64 time=0.470 ms
>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 ttl=64 time=424 ms
>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 ttl=64 time=324 ms
>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 ttl=64 time=214 ms
>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 ttl=64 time=104 ms
>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 ttl=64 time=1.15 ms
>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 ttl=64 time=0.477 ms
>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 ttl=64 time=0.667 ms
>>
>> Remote host tcpdump:
>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480
>>
>> Local host (SunFire) tcpdump:
>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 162, length 1480
>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 163, length 1480
>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480
>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 169, length 1480
>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 169, length 1480
>>
>> lspci -t
>> centaur src # lspci -t
>> -[0000:00]-+-00.0
>>            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>>            |               |               \-02.0-[04]--+-00.0
>>            |               |                            \-00.1
>>            |               \-00.3-[05]--
>>            +-03.0-[06]--
>>            +-04.0-[07]----00.0
>>            +-05.0-[08]--
>>            +-06.0-[09]--
>>            +-07.0-[0a]--
>>            +-08.0
>>            +-10.0
>>            +-10.1
>>            +-10.2
>>            +-11.0
>>            +-13.0
>>            +-15.0
>>            +-16.0
>>            +-1c.0-[0b]--+-00.0
>>            |            \-00.1
>>            +-1d.0
>>            +-1d.1
>>            +-1d.2
>>            +-1d.3
>>            +-1d.7
>>            +-1e.0-[0c]----05.0
>>            +-1f.0
>>            +-1f.1
>>            +-1f.2
>>            \-1f.3
>>
>> lspci
>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev b1)
>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev b1)
>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1)
>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev b1)
>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1)
>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev b1)
>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1)
>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family
>>
>> dmesg:
>> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
>> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
>> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
>> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f8
>> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network Connection
>> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
>> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
>> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f9
>> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network Connection
>> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
>> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM L1
>> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fa
>> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network Connection
>> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: C83246-002
>> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM L1
>> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fb
>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network Connection
>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: C83246-002
>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> ---
> Network engineer
> Denys Fedoryshchenko
>
> Dora Highway - Center Cebaco - 2nd Floor
> Beirut, Lebanon
> Tel: +961 1 247373
> E-Mail: denys@visp.net.lb
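The stall pattern reported in this thread matches how BQL accounting gates a tx queue. A minimal userspace sketch of that bookkeeping follows; the struct and helper names echo include/linux/dynamic_queue_limits.h but are illustrative, not the kernel implementation:

```c
#include <assert.h>

/* Simplified model of dynamic queue limits (DQL) bookkeeping, as driven
 * by netdev_sent_queue()/netdev_completed_queue() in a driver. */
struct dql_model {
    unsigned int limit;         /* current byte limit */
    unsigned int num_queued;    /* bytes handed to the NIC */
    unsigned int num_completed; /* bytes the NIC reported back as sent */
};

/* Bytes still admissible; a negative value stops the tx queue. */
static int dql_model_avail(const struct dql_model *d)
{
    return (int)(d->limit - (d->num_queued - d->num_completed));
}

/* Driver tx path: the netdev_sent_queue(netdev, skb->len) side. */
static void dql_model_queued(struct dql_model *d, unsigned int bytes)
{
    d->num_queued += bytes;
}

/* Tx completion path: the netdev_completed_queue() side. */
static void dql_model_completed(struct dql_model *d, unsigned int bytes)
{
    d->num_completed += bytes;
}
```

With a limit of 534 (one of the values in the printk trace below), queuing a single 1514-byte frame drives avail negative, so the stack stops the queue until a completion arrives, which is consistent with the stalls described above on an otherwise idle gigabit link.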
It seems the logic in BQL has serious issues. Worst of all, if someone does not want limits (especially ones as low as this), there is no way to disable BQL in the kernel configuration; the only option is tuning each interface through the sysfs values.

I just did a short debug:

 	if (limit != dql->limit) {
+		printk("New limit %d\n", limit);
 		dql->limit = limit;
 		ovlimit = 0;
 	}

And got these numbers:

[   18.696839] New limit 0
[   19.622967] New limit 42
[   20.037810] New limit 165
[   35.473666] New limit 386
[   37.418591] New limit 1374
[   37.420064] New limit 6432
[   39.209480] New limit 16548
[   39.214773] New limit 1704
[   40.696065] New limit 6762
[   40.696390] New limit 15564
[   41.921120] New limit 25788
[   41.921165] New limit 388
[   42.696286] New limit 534
[   42.696539] New limit 1096
[   42.696719] New limit 2304
[   53.360394] New limit 24334
[   54.696072] New limit 484
[   54.696135] New limit 934

This means the limit sometimes drops below the MTU, and until the queue limit grows again, traffic appears "stalled" whenever a large packet is in the queue. Probably BQL miscalculates the queue as full because of some specific handling of sent packets in e1000e on this particular hardware; it should not be full, since this is a 1 Gbps wire and it is idle. So as a result, instead of eliminating latency, BQL is adding it.

I can make a patch that keeps the minimum BQL value at no less than MTU + overhead; is that OK? It would probably solve the issue, but it is more of a workaround and safety fuse than a real solution.

On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
> Also I notice the limit constantly changing over time (even though I am
> not touching it).
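The proposed "minimum not less than MTU + overhead" safety fuse amounts to a floor on the recomputed limit. A sketch of that clamp follows; dql_limit_floor is a hypothetical helper, not an existing kernel function, and in practice the clamp would sit in the limit update in lib/dynamic_queue_limits.c:

```c
#include <assert.h>

/* Never let the recomputed BQL limit drop below one full frame
 * (MTU plus link-layer overhead), so a single max-size packet can
 * always be admitted to the tx queue. */
enum { FRAME_MTU = 1500, ETH_HDR_LEN = 14 };

static unsigned int dql_limit_floor(unsigned int computed_limit)
{
    unsigned int floor = FRAME_MTU + ETH_HDR_LEN; /* 1514 bytes */
    return computed_limit < floor ? floor : computed_limit;
}
```

Under this clamp the sub-MTU values seen in the trace (e.g. 534, 484) would be raised to 1514, while larger computed limits pass through unchanged.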
e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fb
>>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network Connection
>>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: C83246-002
>>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>> ---
>> Network engineer
>> Denys Fedoryshchenko
>>
>> Dora Highway - Center Cebaco - 2nd Floor
>> Beirut, Lebanon
>> Tel: +961 1 247373
>> E-Mail: denys@visp.net.lb
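For readers following along, reverting the offending commit in a local kernel tree can be sketched as below. This assumes a git checkout that contains commit 3f0cfa3bc11e; the build and install steps are the generic ones, not anything specific to this thread:

```shell
# Revert "e1000e: Support for byte queue limits" in a kernel source tree.
git revert 3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c

# Rebuild and install the kernel and modules in the usual way.
make -j"$(nproc)"
make modules_install
make install
```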
On Fri, May 18, 2012 at 7:04 AM, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
> It seems the logic in BQL has serious issues. Worst of all, if someone
> doesn't want limits (especially ones as low as this), there is no way to
> disable BQL in the kernel configuration, only by tuning each interface's
> sysfs values.
>
echo max > /sys/class/net/ethX/queues/tx-Y/byte_queue_limits/limit_min should disable BQL for a queue

> I just did a short debug:
> if (limit != dql->limit) {
> +       printk("New limit %d\n", limit);
>         dql->limit = limit;
>         ovlimit = 0;
> }
>
> And got these numbers:
> [   18.696839] New limit 0
> [   19.622967] New limit 42
> [   20.037810] New limit 165
> [   35.473666] New limit 386
> [   37.418591] New limit 1374
> [   37.420064] New limit 6432
> [   39.209480] New limit 16548
> [   39.214773] New limit 1704
> [   40.696065] New limit 6762
> [   40.696390] New limit 15564
> [   41.921120] New limit 25788
> [   41.921165] New limit 388
> [   42.696286] New limit 534
> [   42.696539] New limit 1096
> [   42.696719] New limit 2304
> [   53.360394] New limit 24334
> [   54.696072] New limit 484
> [   54.696135] New limit 934
>
> This means the limit sometimes goes below the MTU, and until the queue
> limit is increased I will see traffic "stalled" if there is a large
> packet in the queue. BQL probably miscalculates the queue as full because
> of some specific handling of sent packets by e1000e on this particular
> hardware. It should not be full: it is a 1 Gbps wire, and it is idle. So
> instead of eliminating latency, BQL is adding it.
>
Not expected; we've been running e1000e with BQL for a while. A few questions:

1) What kernel are you running?
2) What sort of traffic load are you using?
3) Have you tried a different interrupt mode?

Per #3, I am wondering if dynamic conservative mode interrupt throttling is not producing interrupts deterministically for BQL. I'll try to reproduce the problem in this mode.
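Tom's suggestion can be exercised per TX queue from the shell; a minimal sketch using the eth0/queues/tx-0 paths dumped in this thread (interface and queue names are examples, and the writes require root):

```shell
# Show the current BQL state for one TX queue, as dumped in the thread.
grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*

# Effectively disable BQL on that queue: pinning limit_min to "max"
# keeps the limit at its maximum, so BQL never stops the queue.
echo max > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min

# Restore the default adaptive behaviour.
echo 0 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min
```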
> I can make a patch that makes the minimum BQL value no less than MTU +
> overhead; is that OK? It would probably solve the issue, but it is more
> of a workaround and safety fuse than a solution.
>
That would just be a band-aid and could just make this a latent issue for the future.

Tom

> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>> Also I notice the limit constantly changing over time (even when I am
>> not touching it).
>>
>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>>
>> Is it supposed to be like this?
>>
>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
>>> Found the commit that causes the problem:
>>>
>>> author    Tom Herbert <therbert@google.com>
>>>           Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
>>> committer David S. Miller <davem@davemloft.net>
>>>           Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
>>> commit    3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>>> tree      d6670a4f94b2b9dedacc38edb6f0e1306b889f6b
>>> parent    114cf5802165ee93e3ab461c9c505cd94a08b800
>>>
>>> e1000e: Support for byte queue limits
>>>
>>> Changes to e1000e to use byte queue limits.
>>>
>>> Signed-off-by: Tom Herbert <therbert@google.com>
>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>> If I revert it, the problem disappears.
>>>
>>> How I reproduce it:
>>> In two consoles run a "fast" ping to a nearby host:
>>> ping 194.146.XXX.XXX -s1472 -i0.0001
>>> ping 194.146.XXX.XXX -s1472 -i0.1
>>>
>>> In a third, open ssh to the host with the "problem", open mcedit, and
>>> just scroll down a large text file.
>>> After a few seconds some "stalls" will occur, and in the ping history
>>> I can see:
>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>>>
>>> If I apply the small patch below, the problem disappears. It is
>>> certainly not a solution, but let me know how I can help to debug the
>>> problem further.
>>>
>>> --- netdev.c          2012-05-12 20:08:37.000000000 +0300
>>> +++ netdev.c.patched  2012-05-17 16:32:28.895760472 +0300
>>> @@ -1135,7 +1135,7 @@
>>>
>>>      tx_ring->next_to_clean = i;
>>>
>>> -    netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>> +//  netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>
>>>  #define TX_WAKE_THRESHOLD 32
>>>      if (count && netif_carrier_ok(netdev) &&
>>> @@ -2263,7 +2263,7 @@
>>>          e1000_put_txbuf(adapter, buffer_info);
>>>      }
>>>
>>> -    netdev_reset_queue(adapter->netdev);
>>> +//  netdev_reset_queue(adapter->netdev);
>>>      size = sizeof(struct e1000_buffer) * tx_ring->count;
>>>      memset(tx_ring->buffer_info, 0, size);
>>> @@ -5056,7 +5056,7 @@
>>>      /* if count is 0 then mapping error has occurred */
>>>      count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
>>>      if (count) {
>>> -        netdev_sent_queue(netdev, skb->len);
>>> +//      netdev_sent_queue(netdev, skb->len);
>>>          e1000_tx_queue(adapter, tx_flags, count);
>>>          /* Make sure there is space in the ring for the next send. */
>>>          e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>>
>>>
>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>>> Hi,
>>>>
>>>> I have two identical servers, Sun Fire X4150, running different
>>>> flavors of Linux, x86_64 and i386.
>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>>>> The interface I am using now:
>>>> # ethtool -i eth0
>>>> driver: e1000e
>>>> version: 1.9.5-k
>>>> firmware-version: 2.1-11
>>>> bus-info: 0000:04:00.0
>>>> There are 2 CPUs, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz.
>>>>
>>>> The i386 box was acting as NAT and shaper, and as soon as I removed
>>>> the shaper from it, I started to experience strange lockups: traffic
>>>> is normal for 5-30 seconds, then there is a short lockup for
>>>> 500-3000 ms (usually around 1000 ms) with the dropped-packets counter
>>>> increasing. I suspected it was due to load, but that seems to have
>>>> been wrong.
>>>> Recently, on another server (the x86_64 one I use for development), I
>>>> upgraded the kernel (it was old, from the 2.6 series) and started to
>>>> experience the same latency spikes on a completely idle machine:
>>>> while just running mc or, for example, typing in a text editor, I
>>>> notice "stalls". After investigating a little more, I also noticed a
>>>> small number of drops on the interface. No tcpdump was running. This
>>>> machine is idle, and the only traffic there is some small broadcasts
>>>> from the network, my ssh, and ping.
>>>>
>>>> Dropped packets in ifconfig:
>>>> RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
>>>> The counter sometimes increases when this stall happens.
>>>>
>>>> ethtool -S is clean; there are no dropped packets there.
>>>>
>>>> I tried to check the load (mpstat and perf): nothing suspicious.
>>>> latencytop also doesn't show anything suspicious. dropwatch reports a
>>>> lot of drops, but mostly because of the broadcasts and so on. tcpdump
>>>> at the moment of such drops doesn't show anything suspicious either.
>>>> I changed the qdisc from the default pfifo_fast to bfifo, without any
>>>> result. I also tried "ethtool -K eth0 tso off gso off gro off sg off";
>>>> no result.
>>>> The problem occurs on 3.3.6 - 3.4.0-rc7, most probably on 3.3.0 too,
>>>> but I don't remember for sure. I think on some kernels like 3.1 it
>>>> probably doesn't occur; I will check soon, because it is not always
>>>> easy to reproduce. All tests were done on 3.4.0-rc7.
>>>>
>>>> I also ran tcpdump in the background, plus iptables with timestamps,
>>>> and at the time the stall occurred it seems I was still receiving
>>>> packets properly; with iperf UDP (from some host to this SunFire) no
>>>> packets were missing at those moments either. But I am sure the RX
>>>> interface errors are increasing. If I run iperf from the SunFire to a
>>>> test host, there is packet loss at the moments the stall occurs.
>>>>
>>>> I suspect that for some reason the network card stops transmitting,
>>>> but I am unable to pinpoint the issue. All other hosts in this
>>>> network are fine and don't have such problems.
>>>> Can you help me with this, please? I can provide more debug
>>>> information, compile with patches, etc. I will also try to fall back
>>>> to the 3.1 and 3.0 kernels.
>>>>
>>>> Here is how it occurs and how I reproduce it:
>>>> I just open a file and start to scroll it in mc; in another console I
>>>> run ping:
>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 ttl=64 time=0.485 ms
>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 ttl=64 time=0.470 ms
>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 ttl=64 time=424 ms
>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 ttl=64 time=324 ms
>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 ttl=64 time=214 ms
>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 ttl=64 time=104 ms
>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 ttl=64 time=1.15 ms
>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 ttl=64 time=0.477 ms
>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 ttl=64 time=0.667 ms
>>>>
>>>> Remote host tcpdump:
>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP
echo >>>> request, id 3486, seq 164, length 1480 >>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 164, length 1480 >>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 165, length 1480 >>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 165, length 1480 >>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 166, length 1480 >>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 166, length 1480 >>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 167, length 1480 >>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 167, length 1480 >>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 168, length 1480 >>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 168, length 1480 >>>> >>>> Local host(SunFire) tcpdump: >>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 162, length 1480 >>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 163, length 1480 >>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 163, length 1480 >>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 164, length 1480 >>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 165, length 1480 >>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 166, length 1480 >>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>> request, id 3486, seq 167, length 1480 >>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>> reply, id 3486, seq 164, length 1480 >>>> 
1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480
>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 169, length 1480
>>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 169, length 1480
>>>>
>>>> [snip: lspci -t, lspci, and dmesg output identical to the listings quoted earlier in this thread]
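The "New limit" trace earlier in the thread shows the dynamic limit occasionally dropping below one full-MTU frame, which is exactly when a 1500-byte packet can sit blocked until a completion event grows the limit again. A toy shell model of the byte-queue-limit idea makes that failure mode easy to see. The variable names and the grow rule below are invented for this sketch; they are not the kernel's lib/dynamic_queue_limits.c algorithm:

```shell
#!/bin/sh
# Toy model of a byte queue limit: the stack may only queue bytes while
# (queued - completed) stays under "limit"; completions adapt the limit.

MTU=1500
limit=0                     # current byte limit; starts at limit_min,
limit_min=0                 # matching the sysfs dump in this thread
limit_max=$((1 << 30))
queued=0                    # bytes handed to the NIC so far
completed=0                 # bytes the NIC has reported as sent

avail() {                   # bytes still allowed before the queue is stopped
    echo $(( limit - (queued - completed) ))
}

complete_tx() {             # $1 = bytes completed, $2 = 1 if ring now empty
    completed=$(( completed + $1 ))
    if [ "$2" -eq 1 ] && [ "$(avail)" -le 0 ]; then
        # Ring drained while the queue was stopped: the limit was too
        # small, so grow it by the amount just completed (capped).
        limit=$(( limit + $1 ))
        if [ "$limit" -gt "$limit_max" ]; then limit=$limit_max; fi
    fi
}

echo "avail at start: $(avail)"      # below MTU: a full frame is blocked
queued=$(( queued + MTU ))           # one full-MTU frame gets queued anyway
complete_tx "$MTU" 1                 # ...and only its completion grows the limit
echo "avail after completion: $(avail)"
```

With limit_min pinned to at least one MTU (the workaround discussed in the thread), avail() can never start below a full frame, so this particular stall cannot happen.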
>
> Not expected; we've been running e1000e with BQL for a while. A few
> questions:
>
> 1) What kernel are you running?
> 2) What sort of traffic load are you using?
> 3) Have you tried a different interrupt mode?
>
1) Tested 3.3, 3.3.6, 3.4-rc7.
2) Easy to reproduce:
   - Open 3 ssh consoles to the server with the problem.
   - First console: "ping -s1472 -i0.0001 nearby.ip" (this helps trigger the problem more quickly).
   - Second console: "ping -s1472 -i0.1 nearby.ip" (easy to see when the network stalls; ping rises to 1000-2000 ms).
   - Open some large text file and scroll it down. After a while you will feel it "stall"; just switch to the second console to check whether the ping has risen recently.
   It also occurred on a loaded NAT server after I removed the shaper that was limiting traffic to 350 Mbps, but the traffic pattern there is much more complicated.
3) Not yet; I ran it with the default parameters. I will try now.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
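The reproduction steps above, collected into a script sketch. Here 192.0.2.1 is a placeholder for the nearby host, and sub-millisecond ping intervals generally require root:

```shell
#!/bin/sh
TARGET=192.0.2.1   # placeholder: a host on the same LAN as the e1000e box

# Console 1: full-MTU ping flood, to trigger the problem more quickly.
ping -s 1472 -i 0.0001 "$TARGET" > /dev/null &
FLOOD_PID=$!

# Console 2: slow probe ping; a stall shows up as the RTT jumping from
# well under 1 ms to 500-2000 ms for a few replies in a row.
ping -s 1472 -i 0.1 "$TARGET" &
PROBE_PID=$!
trap 'kill $FLOOD_PID $PROBE_PID 2>/dev/null' EXIT

# Console 3 (a separate ssh session to the affected host): open a large
# text file in an editor and scroll it, to generate bursty interactive
# TX traffic on top of the flood.
wait
```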
On 2012-05-19 05:07, Tom Herbert wrote:
> 3) Have you tried a different interrupt mode?
Now tested:
e1000e.TxIntDelay=0,0,0,0
[    4.930570] e1000e 0000:04:00.0: (unregistered net_device): Transmit Interrupt Delay set to 0
[    4.930573] e1000e 0000:04:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
The problem still occurs.

e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0
[    4.971476] e1000e 0000:04:00.0: (unregistered net_device): Transmit Interrupt Delay set to 0
[    4.971478] e1000e 0000:04:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) turned off
The problem also occurs.

It is probably specific to this machine; the problem happens on the same hardware (Sun Fire X4150). The only "exotic" thing here is the set of cards:
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
I'm using the first card.

> Per #3, I am wondering if dynamic conservative mode interrupt
> throttling is not producing interrupts deterministically for BQL.
> I'll try to reproduce the problem in this mode.
>
>> I can make a patch that will make minimum BQL value not less than
>> MTU + overhead, is it ok like this?
>> Probably it will solve issue, but it is more workaround and safety
>> fuse, than a solution.
>>
> That would just be a bandaid and could just make this a latent issue
> for the future.
>
> Tom
>
>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
>>> Also i notice, limit constantly changing over time (even i am not
>>> touching it).
>>>
>>> [snip: the rest of this quote repeats, verbatim, the BQL sysfs output, the offending commit, the reproduction steps and ping traces, the debug patch, and the original report already shown earlier in this thread]
>>>>> >>>>> Dropped packets in ifconfig >>>>> RX packets:3752868 errors:0 dropped:5350 overruns:0 >>>>> frame:0 >>>>> Counter is increasing sometimes, when this stall happening. >>>>> >>>>> ethtool -S is clean, there is no dropped packets. >>>>> >>>>> I did tried to check load (mpstat and perf), there is nothing >>>>> suspicious, latencytop also doesn't show anything suspicious. >>>>> dropwatch report a lot of drops, but mostly because there is some >>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't >>>>> show >>>>> anything suspicious. >>>>> Changed qdisc from default fifo_fast to bfifo, without any >>>>> result. >>>>> Tried: ethtool -K eth0 tso off gso off gro off sg off , no >>>>> result >>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, >>>>> but i >>>>> don't remember for sure. I thik on some kernels like 3.1 probably >>>>> it >>>>> doesn't occur, i will check it soon, because it is not always >>>>> reliable >>>>> to reproduce it. All tests i did on 3.4.0-rc7. >>>>> >>>>> I did run also in background tcpdump, additionally iptables with >>>>> timestamps, and at time when stall occured, seems i am still >>>>> receiving >>>>> packets properly, also on iperf udp (from some host to this >>>>> SunFire) >>>>> at this moments no packets missing. But i am sure RX interface >>>>> errors >>>>> are increasing. >>>>> If i do iperf from SunFire to test host - there is packetloss at >>>>> moments when stall occured. >>>>> >>>>> I suspect that by some reason network card stop to transmit, but >>>>> unable to pinpoint issue. All other hosts in this network are >>>>> fine and >>>>> don't have such problems. >>>>> Can you help me with that please? Maybe i can provide more debug >>>>> information, compile with patches and etc. Also i will try to >>>>> fallback >>>>> to 3.1 and 3.0 kernels. 
>>>>> >>>>> Here it is how it occurs and i am reproducing it: >>>>> I'm just opening file, and start to scroll it in mc, then in >>>>> another >>>>> console i run ping >>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 >>>>> ttl=64 time=0.485 ms >>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 >>>>> ttl=64 time=0.470 ms >>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 >>>>> ttl=64 time=424 ms >>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 >>>>> ttl=64 time=324 ms >>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 >>>>> ttl=64 time=214 ms >>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 >>>>> ttl=64 time=104 ms >>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 >>>>> ttl=64 time=1.15 ms >>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 >>>>> ttl=64 time=0.477 ms >>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 >>>>> ttl=64 time=0.667 ms >>>>> >>>>> Remote host tcpdump: >>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 163, length 1480 >>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 164, length 1480 >>>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 164, length 1480 >>>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 165, length 1480 >>>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 165, length 1480 >>>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 166, length 1480 >>>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 166, length 1480 >>>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 167, length 1480 >>>>> 
1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 167, length 1480 >>>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 168, length 1480 >>>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 168, length 1480 >>>>> >>>>> Local host(SunFire) tcpdump: >>>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 162, length 1480 >>>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 163, length 1480 >>>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 163, length 1480 >>>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 164, length 1480 >>>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 165, length 1480 >>>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 166, length 1480 >>>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 167, length 1480 >>>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 164, length 1480 >>>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 168, length 1480 >>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 165, length 1480 >>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 166, length 1480 >>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 167, length 1480 >>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 168, length 1480 >>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>> request, id 3486, seq 169, length 1480 >>>>> 1337089062.568578 IP 
194.146.153.20 > 194.146.153.22: ICMP echo >>>>> reply, id 3486, seq 169, length 1480 >>>>> >>>>> lspci -t >>>>> centaur src # lspci -t >>>>> -[0000:00]-+-00.0 >>>>> +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]-- >>>>> | | \-02.0-[04]--+-00.0 >>>>> | | \-00.1 >>>>> | \-00.3-[05]-- >>>>> +-03.0-[06]-- >>>>> +-04.0-[07]----00.0 >>>>> +-05.0-[08]-- >>>>> +-06.0-[09]-- >>>>> +-07.0-[0a]-- >>>>> +-08.0 >>>>> +-10.0 >>>>> +-10.1 >>>>> +-10.2 >>>>> +-11.0 >>>>> +-13.0 >>>>> +-15.0 >>>>> +-16.0 >>>>> +-1c.0-[0b]--+-00.0 >>>>> | \-00.1 >>>>> +-1d.0 >>>>> +-1d.1 >>>>> +-1d.2 >>>>> +-1d.3 >>>>> +-1d.7 >>>>> +-1e.0-[0c]----05.0 >>>>> +-1f.0 >>>>> +-1f.1 >>>>> +-1f.2 >>>>> \-1f.3 >>>>> lspci >>>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory >>>>> Controller Hub (rev b1) >>>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x4 Port 2 (rev b1) >>>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x4 Port 3 (rev b1) >>>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x8 Port 4-5 (rev b1) >>>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x4 Port 5 (rev b1) >>>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x8 Port 6-7 (rev b1) >>>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI >>>>> Express >>>>> x4 Port 7 (rev b1) >>>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset >>>>> DMA >>>>> Engine (rev b1) >>>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>> Registers (rev b1) >>>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>> Registers (rev b1) >>>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>> Registers (rev b1) >>>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset >>>>> Reserved >>>>> Registers (rev b1) >>>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset >>>>> Reserved 
>>>>> Registers (rev b1) >>>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD >>>>> Registers (rev b1) >>>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD >>>>> Registers (rev b1) >>>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset >>>>> PCI Express Root Port 1 (rev 09) >>>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset UHCI USB Controller #1 (rev 09) >>>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset UHCI USB Controller #2 (rev 09) >>>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset UHCI USB Controller #3 (rev 09) >>>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset UHCI USB Controller #4 (rev 09) >>>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset EHCI USB2 Controller (rev 09) >>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9) >>>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 >>>>> Chipset >>>>> LPC Interface Controller (rev 09) >>>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE >>>>> Controller (rev 09) >>>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA >>>>> AHCI >>>>> Controller (rev 09) >>>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset >>>>> SMBus >>>>> Controller (rev 09) >>>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>> Upstream Port (rev 01) >>>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>> to >>>>> PCI-X Bridge (rev 01) >>>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>> Downstream Port E1 (rev 01) >>>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>> Downstream Port E3 (rev 01) >>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN >>>>> Gigabit >>>>> Ethernet Controller (Copper) (rev 01) >>>>> 04:00.1 Ethernet controller: Intel Corporation 
80003ES2LAN >>>>> Gigabit >>>>> Ethernet Controller (Copper) (rev 01) >>>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09) >>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>> Ethernet Controller (rev 06) >>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>> Ethernet Controller (rev 06) >>>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED >>>>> Graphics Family >>>>> >>>>> >>>>> dmesg: >>>>> [ 4.936885] e1000: Intel(R) PRO/1000 Network Driver - version >>>>> 7.3.21-k8-NAPI >>>>> [ 4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation. >>>>> [ 4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k >>>>> [ 4.936967] e1000e: Copyright(c) 1999 - 2012 Intel >>>>> Corporation. >>>>> [ 4.938529] e1000e 0000:04:00.0: (unregistered net_device): >>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative >>>>> mode >>>>> [ 4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>> [ 4.992246] e1000e 0000:04:00.0: eth0: (PCI >>>>> Express:2.5GT/s:Width >>>>> x4) 00:1e:68:04:99:f8 >>>>> [ 4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 >>>>> Network >>>>> Connection >>>>> [ 4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: >>>>> FFFFFF-0FF >>>>> [ 4.994745] e1000e 0000:04:00.1: (unregistered net_device): >>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative >>>>> mode >>>>> [ 4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X >>>>> [ 5.050901] e1000e 0000:04:00.1: eth1: (PCI >>>>> Express:2.5GT/s:Width >>>>> x4) 00:1e:68:04:99:f9 >>>>> [ 5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 >>>>> Network >>>>> Connection >>>>> [ 5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: >>>>> FFFFFF-0FF >>>>> [ 5.051857] e1000e 0000:0b:00.0: Disabling ASPM L1 >>>>> [ 5.052168] e1000e 0000:0b:00.0: (unregistered net_device): >>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative >>>>> mode >>>>> [ 5.052611] e1000e 0000:0b:00.0: irq 67 
for MSI/MSI-X >>>>> [ 5.223454] e1000e 0000:0b:00.0: eth2: (PCI >>>>> Express:2.5GT/s:Width >>>>> x4) 00:1e:68:04:99:fa >>>>> [ 5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 >>>>> Network >>>>> Connection >>>>> [ 5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: >>>>> C83246-002 >>>>> [ 5.224412] e1000e 0000:0b:00.1: Disabling ASPM L1 >>>>> [ 5.224709] e1000e 0000:0b:00.1: (unregistered net_device): >>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative >>>>> mode >>>>> [ 5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X >>>>> [ 5.397603] e1000e 0000:0b:00.1: eth3: (PCI >>>>> Express:2.5GT/s:Width >>>>> x4) 00:1e:68:04:99:fb >>>>> [ 5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 >>>>> Network >>>>> Connection >>>>> [ 5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: >>>>> C83246-002 >>>>> [ 13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>> [ 13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>> [ 17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, >>>>> Flow Control: None >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe netdev" >>>>> in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at >>>>> http://vger.kernel.org/majordomo-info.html >>>> >>>> >>>> --- >>>> Network engineer >>>> Denys Fedoryshchenko >>>> >>>> Dora Highway - Center Cebaco - 2nd Floor >>>> Beirut, Lebanon >>>> Tel: +961 1 247373 >>>> E-Mail: denys@visp.net.lb --- Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tried to reproduce: May 20 10:08:30 test kernel: [ 6.168240] e1000e 0000:06:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode May 20 10:08:30 test kernel: [ 6.221591] e1000e 0000:06:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01) 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01) Following the above instructions to repro gives: 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 time=0.358 ms 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 time=0.330 ms 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 time=0.337 ms 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 time=0.375 ms 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 time=0.359 ms 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880 ttl=64 time=0.380 ms And I didn't see the stalls. This was on an Intel machine. The limit was stable; it went up to around 28K when I opened a large file and tended to stay between 15K and 28K. The described problem has the characteristics of transmit interrupts that are not at all periodic; it would seem that some are taking hundreds of milliseconds to pop. I don't see anything in the NIC that would cause that. Is it possible there is some periodic activity on the machines that holds off interrupts for long periods of time? Are there any peculiarities in interrupt handling on the Sun Fire? Can you also provide 'ethtool -c eth0' output? Thanks, Tom On Fri, May 18, 2012 at 7:40 PM, Denys Fedoryshchenko <denys@visp.net.lb> wrote: > On 2012-05-19 05:07, Tom Herbert wrote: >> >> 3) Have you tried a different interrupt mode? 
> > > Now tested: > e1000e.TxIntDelay=0,0,0,0 > [ 4.930570] e1000e 0000:04:00.0: (unregistered net_device): Transmit > Interrupt Delay set to 0 > [ 4.930573] e1000e 0000:04:00.0: (unregistered net_device): Interrupt > Throttling Rate (ints/sec) set to dynamic conservative mode > Problem still occur. > > e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0 > [ 4.971476] e1000e 0000:04:00.0: (unregistered net_device): Transmit > Interrupt Delay set to 0 > [ 4.971478] e1000e 0000:04:00.0: (unregistered net_device): Interrupt > Throttling Rate (ints/sec) turned off > Problem also occur. > > Probably it is a case of this specific machine, this problem is happened on > same hardware (Sun Fire X4150). > The only "exotic" thing here is set of cards: > > 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet > Controller (Copper) (rev 01) > 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet > Controller (Copper) (rev 01) > 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet > Controller (rev 06) > 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet > Controller (rev 06) > > I'm using first card. > > >> >> Per #3, I am wondering if dynamic conservative mode interrupt >> throttling is not producing interrupts deterministically for BQL. >> I'll try to reproduce the problem in this mode. >> >>> I can make a patch that will make minimum BQL value not less than MTU + >>> overhead, is it ok like this? >>> Probably it will solve issue, but it is more workaround and safety fuse, >>> than a solution. >>> >> That would just be a bandaid and could just make this a latent issue >> for the future. >> >> Tom >> >>> >>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote: >>>> >>>> >>>> Also i notice, limit constantly changing over time (even i am not >>>> touching it). 
>>>> >>>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/* >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018 >>>> >>>> >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0 >>>> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/* >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018 >>>> >>>> >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192 >>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0 >>>> >>>> Is it supposed to be like this? >>>> >>>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote: >>>>> >>>>> >>>>> Found commit that cause problem: >>>>> >>>>> author Tom Herbert <therbert@google.com> >>>>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000) >>>>> committer David S. Miller <davem@davemloft.net> >>>>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500) >>>>> commit 3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c >>>>> tree d6670a4f94b2b9dedacc38edb6f0e1306b889f6b tree | snapshot >>>>> parent 114cf5802165ee93e3ab461c9c505cd94a08b800 commit | diff >>>>> e1000e: Support for byte queue limits >>>>> >>>>> Changes to e1000e to use byte queue limits. >>>>> >>>>> Signed-off-by: Tom Herbert <therbert@google.com> >>>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> >>>>> Signed-off-by: David S. Miller <davem@davemloft.net> >>>>> >>>>> If i reverse it, problem disappearing. 
>>>>> >>>>> How i reproduce it: >>>>> In two consoles do "fast" ping to nearby host >>>>> ping 194.146.XXX.XXX -s1472 -i0.0001 >>>>> ping 194.146.XXX.XXX -s1472 -i0.1 >>>>> >>>>> For third open ssh to host with "problem", open mcedit, and just >>>>> scroll down large text file. >>>>> After few seconds some "stalls" will occur, and in ping history i can >>>>> see: >>>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms >>>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms >>>>> >>>>> >>>>> If i apply small patch - problem will disappear. Sure it is not a >>>>> solution, but >>>>> let me know how i can help to debug problem more. 
>>>>> >>>>> --- netdev.c 2012-05-12 20:08:37.000000000 +0300 >>>>> +++ netdev.c.patched 2012-05-17 16:32:28.895760472 +0300 >>>>> @@ -1135,7 +1135,7 @@ >>>>> >>>>> tx_ring->next_to_clean = i; >>>>> >>>>> - netdev_completed_queue(netdev, pkts_compl, bytes_compl); >>>>> +// netdev_completed_queue(netdev, pkts_compl, bytes_compl); >>>>> >>>>> #define TX_WAKE_THRESHOLD 32 >>>>> if (count && netif_carrier_ok(netdev) && >>>>> @@ -2263,7 +2263,7 @@ >>>>> e1000_put_txbuf(adapter, buffer_info); >>>>> } >>>>> >>>>> - netdev_reset_queue(adapter->netdev); >>>>> +// netdev_reset_queue(adapter->netdev); >>>>> size = sizeof(struct e1000_buffer) * tx_ring->count; >>>>> memset(tx_ring->buffer_info, 0, size); >>>>> >>>>> @@ -5056,7 +5056,7 @@ >>>>> /* if count is 0 then mapping error has occurred */ >>>>> count = e1000_tx_map(adapter, skb, first, max_per_txd, >>>>> nr_frags, mss); >>>>> if (count) { >>>>> - netdev_sent_queue(netdev, skb->len); >>>>> +// netdev_sent_queue(netdev, skb->len); >>>>> e1000_tx_queue(adapter, tx_flags, count); >>>>> /* Make sure there is space in the ring for the next >>>>> send. >>>>> */ >>>>> e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2); >>>>> >>>>> >>>>> >>>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote: >>>>>> >>>>>> >>>>>> Hi >>>>>> >>>>>> I have two identical servers, Sun Fire X4150, both has different >>>>>> flavors of Linux, x86_64 and i386. 
>>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>>>>> Ethernet Controller (Copper) (rev 01) >>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>>>>> Ethernet Controller (Copper) (rev 01) >>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>> Ethernet Controller (rev 06) >>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>> Ethernet Controller (rev 06) >>>>>> I am using now interface: >>>>>> #ethtool -i eth0 >>>>>> driver: e1000e >>>>>> version: 1.9.5-k >>>>>> firmware-version: 2.1-11 >>>>>> bus-info: 0000:04:00.0 >>>>>> There is 2 CPU , Intel(R) Xeon(R) CPU E5440 @ 2.83GHz . >>>>>> >>>>>> i386 was acting as NAT and shaper, and as soon as i removed shaper >>>>>> from it, i started to experience strange lockups, e.g. traffic is >>>>>> normal for 5-30 seconds, then short lockup for 500-3000ms (usually >>>>>> around 1000ms) with dropped packets counter increasing. I was >>>>>> suspecting it is due load, but it seems was wrong. >>>>>> Recently, on another server, x86_64 i am using as development, i >>>>>> upgrade kernel (it was old, from 2.6 series) and on completely idle >>>>>> machine started to experience same latency spikes, while i am just >>>>>> running mc and for example typing in text editor - i notice "stalls". >>>>>> After i investigate it a little more, i notice also small amount of >>>>>> drops on interface. No tcpdump running. Also this machine is idle, and >>>>>> the only traffic there - some small broadcasts from network, my ssh, >>>>>> and ping. >>>>>> >>>>>> Dropped packets in ifconfig >>>>>> RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0 >>>>>> Counter is increasing sometimes, when this stall happening. >>>>>> >>>>>> ethtool -S is clean, there is no dropped packets. >>>>>> >>>>>> I did tried to check load (mpstat and perf), there is nothing >>>>>> suspicious, latencytop also doesn't show anything suspicious. 
>>>>>> dropwatch report a lot of drops, but mostly because there is some >>>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't show >>>>>> anything suspicious. >>>>>> Changed qdisc from default fifo_fast to bfifo, without any result. >>>>>> Tried: ethtool -K eth0 tso off gso off gro off sg off , no result >>>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, but i >>>>>> don't remember for sure. I thik on some kernels like 3.1 probably it >>>>>> doesn't occur, i will check it soon, because it is not always reliable >>>>>> to reproduce it. All tests i did on 3.4.0-rc7. >>>>>> >>>>>> I did run also in background tcpdump, additionally iptables with >>>>>> timestamps, and at time when stall occured, seems i am still receiving >>>>>> packets properly, also on iperf udp (from some host to this SunFire) >>>>>> at this moments no packets missing. But i am sure RX interface errors >>>>>> are increasing. >>>>>> If i do iperf from SunFire to test host - there is packetloss at >>>>>> moments when stall occured. >>>>>> >>>>>> I suspect that by some reason network card stop to transmit, but >>>>>> unable to pinpoint issue. All other hosts in this network are fine and >>>>>> don't have such problems. >>>>>> Can you help me with that please? Maybe i can provide more debug >>>>>> information, compile with patches and etc. Also i will try to fallback >>>>>> to 3.1 and 3.0 kernels. 
>>>>>> >>>>>> Here it is how it occurs and i am reproducing it: >>>>>> I'm just opening file, and start to scroll it in mc, then in another >>>>>> console i run ping >>>>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 >>>>>> ttl=64 time=0.485 ms >>>>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 >>>>>> ttl=64 time=0.470 ms >>>>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 >>>>>> ttl=64 time=424 ms >>>>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 >>>>>> ttl=64 time=324 ms >>>>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 >>>>>> ttl=64 time=214 ms >>>>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 >>>>>> ttl=64 time=104 ms >>>>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 >>>>>> ttl=64 time=1.15 ms >>>>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 >>>>>> ttl=64 time=0.477 ms >>>>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 >>>>>> ttl=64 time=0.667 ms >>>>>> >>>>>> Remote host tcpdump: >>>>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 163, length 1480 >>>>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 164, length 1480 >>>>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 164, length 1480 >>>>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 165, length 1480 >>>>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 165, length 1480 >>>>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 166, length 1480 >>>>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 166, length 1480 >>>>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, 
seq 167, length 1480 >>>>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 167, length 1480 >>>>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 168, length 1480 >>>>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 168, length 1480 >>>>>> >>>>>> Local host(SunFire) tcpdump: >>>>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 162, length 1480 >>>>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 163, length 1480 >>>>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 163, length 1480 >>>>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 164, length 1480 >>>>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 165, length 1480 >>>>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 166, length 1480 >>>>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 167, length 1480 >>>>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 164, length 1480 >>>>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> request, id 3486, seq 168, length 1480 >>>>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 165, length 1480 >>>>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 166, length 1480 >>>>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 167, length 1480 >>>>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 168, length 1480 >>>>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo >>>>>> 
request, id 3486, seq 169, length 1480 >>>>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo >>>>>> reply, id 3486, seq 169, length 1480 >>>>>> >>>>>> lspci -t >>>>>> centaur src # lspci -t >>>>>> -[0000:00]-+-00.0 >>>>>> +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]-- >>>>>> | | \-02.0-[04]--+-00.0 >>>>>> | | \-00.1 >>>>>> | \-00.3-[05]-- >>>>>> +-03.0-[06]-- >>>>>> +-04.0-[07]----00.0 >>>>>> +-05.0-[08]-- >>>>>> +-06.0-[09]-- >>>>>> +-07.0-[0a]-- >>>>>> +-08.0 >>>>>> +-10.0 >>>>>> +-10.1 >>>>>> +-10.2 >>>>>> +-11.0 >>>>>> +-13.0 >>>>>> +-15.0 >>>>>> +-16.0 >>>>>> +-1c.0-[0b]--+-00.0 >>>>>> | \-00.1 >>>>>> +-1d.0 >>>>>> +-1d.1 >>>>>> +-1d.2 >>>>>> +-1d.3 >>>>>> +-1d.7 >>>>>> +-1e.0-[0c]----05.0 >>>>>> +-1f.0 >>>>>> +-1f.1 >>>>>> +-1f.2 >>>>>> \-1f.3 >>>>>> lspci >>>>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory >>>>>> Controller Hub (rev b1) >>>>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x4 Port 2 (rev b1) >>>>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x4 Port 3 (rev b1) >>>>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x8 Port 4-5 (rev b1) >>>>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x4 Port 5 (rev b1) >>>>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x8 Port 6-7 (rev b1) >>>>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express >>>>>> x4 Port 7 (rev b1) >>>>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA >>>>>> Engine (rev b1) >>>>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>>> Registers (rev b1) >>>>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>>> Registers (rev b1) >>>>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB >>>>>> Registers (rev b1) >>>>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved >>>>>> Registers (rev b1) 
>>>>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved >>>>>> Registers (rev b1) >>>>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD >>>>>> Registers (rev b1) >>>>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD >>>>>> Registers (rev b1) >>>>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset >>>>>> PCI Express Root Port 1 (rev 09) >>>>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>>> Chipset UHCI USB Controller #1 (rev 09) >>>>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>>> Chipset UHCI USB Controller #2 (rev 09) >>>>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>>> Chipset UHCI USB Controller #3 (rev 09) >>>>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>>> Chipset UHCI USB Controller #4 (rev 09) >>>>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 >>>>>> Chipset EHCI USB2 Controller (rev 09) >>>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9) >>>>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset >>>>>> LPC Interface Controller (rev 09) >>>>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE >>>>>> Controller (rev 09) >>>>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI >>>>>> Controller (rev 09) >>>>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus >>>>>> Controller (rev 09) >>>>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>>> Upstream Port (rev 01) >>>>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to >>>>>> PCI-X Bridge (rev 01) >>>>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>>> Downstream Port E1 (rev 01) >>>>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express >>>>>> Downstream Port E3 (rev 01) >>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>>>>> Ethernet 
Controller (Copper) (rev 01) >>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>>>>> Ethernet Controller (Copper) (rev 01) >>>>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09) >>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>> Ethernet Controller (rev 06) >>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>> Ethernet Controller (rev 06) >>>>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED >>>>>> Graphics Family >>>>>> >>>>>> >>>>>> dmesg: >>>>>> [ 4.936885] e1000: Intel(R) PRO/1000 Network Driver - version >>>>>> 7.3.21-k8-NAPI >>>>>> [ 4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation. >>>>>> [ 4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k >>>>>> [ 4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation. >>>>>> [ 4.938529] e1000e 0000:04:00.0: (unregistered net_device): >>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode >>>>>> [ 4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>>> [ 4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width >>>>>> x4) 00:1e:68:04:99:f8 >>>>>> [ 4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network >>>>>> Connection >>>>>> [ 4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: >>>>>> FFFFFF-0FF >>>>>> [ 4.994745] e1000e 0000:04:00.1: (unregistered net_device): >>>>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode >>>>>> [ 4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X >>>>>> [ 5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width >>>>>> x4) 00:1e:68:04:99:f9 >>>>>> [ 5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network >>>>>> Connection >>>>>> [ 5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: >>>>>> FFFFFF-0FF >>>>>> [ 5.051857] e1000e 0000:0b:00.0: Disabling ASPM L1 >>>>>> [ 5.052168] e1000e 0000:0b:00.0: (unregistered net_device): >>>>>> Interrupt Throttling Rate (ints/sec) set to 
dynamic conservative mode
>>>>>> [ 5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>>>>>> [ 5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fa
>>>>>> [ 5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network Connection
>>>>>> [ 5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: C83246-002
>>>>>> [ 5.224412] e1000e 0000:0b:00.1: Disabling ASPM L1
>>>>>> [ 5.224709] e1000e 0000:0b:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>>>>>> [ 5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>>>>> [ 5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fb
>>>>>> [ 5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network Connection
>>>>>> [ 5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: C83246-002
>>>>>> [ 13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>>> [ 13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>>>>> [ 17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>>>>>
>>>>> ---
>>>>> Network engineer
>>>>> Denys Fedoryshchenko
>>>>>
>>>>> Dora Highway - Center Cebaco - 2nd Floor
>>>>> Beirut, Lebanon
>>>>> Tel: +961 1 247373
>>>>> E-Mail: denys@visp.net.lb
>
> ---
> Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2012-05-18 at 17:04 +0300, Denys Fedoryshchenko wrote:
> It seems logic in BQL has serious issues. The most bad thing, if
> and it is empty. So in result, instead of eliminating latency, it is
> adding it.

There is maybe a misunderstanding here. BQL by itself only reduces the
amount of buffering in the TX ring. Now there's still the Qdisc layer.

On my laptop (e1000e NIC) with the default pfifo_fast, I can immediately
see the difference in ping results if I launch a single netperf
TCP_STREAM in another window:

1480 bytes from 172.30.42.1: icmp_seq=5047 ttl=64 time=0.185 ms
1480 bytes from 172.30.42.1: icmp_seq=5048 ttl=64 time=0.198 ms
1480 bytes from 172.30.42.1: icmp_seq=5049 ttl=64 time=0.255 ms
1480 bytes from 172.30.42.1: icmp_seq=5050 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=5051 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=5052 ttl=64 time=0.226 ms
1480 bytes from 172.30.42.1: icmp_seq=5053 ttl=64 time=0.186 ms
1480 bytes from 172.30.42.1: icmp_seq=5054 ttl=64 time=0.190 ms
1480 bytes from 172.30.42.1: icmp_seq=5055 ttl=64 time=0.223 ms
1480 bytes from 172.30.42.1: icmp_seq=5056 ttl=64 time=0.178 ms
1480 bytes from 172.30.42.1: icmp_seq=5057 ttl=64 time=0.191 ms
1480 bytes from 172.30.42.1: icmp_seq=5058 ttl=64 time=0.229 ms
1480 bytes from 172.30.42.1: icmp_seq=5059 ttl=64 time=0.229 ms
1480 bytes from 172.30.42.1: icmp_seq=5060 ttl=64 time=0.213 ms
1480 bytes from 172.30.42.1: icmp_seq=5061 ttl=64 time=0.187 ms
1480 bytes from 172.30.42.1: icmp_seq=5062 ttl=64 time=0.944 ms
1480 bytes from 172.30.42.1: icmp_seq=5063 ttl=64 time=4.67 ms
1480 bytes from 172.30.42.1: icmp_seq=5064 ttl=64 time=8.55 ms
1480 bytes from 172.30.42.1: icmp_seq=5065 ttl=64 time=12.8 ms
1480 bytes from 172.30.42.1: icmp_seq=5066 ttl=64 time=16.9 ms
1480 bytes from 172.30.42.1: icmp_seq=5067 ttl=64 time=20.5 ms
1480 bytes from 172.30.42.1: icmp_seq=5068 ttl=64 time=24.3 ms
1480 bytes from 172.30.42.1: icmp_seq=5069 ttl=64 time=27.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5070 ttl=64 time=26.8 ms
1480 bytes from 172.30.42.1: icmp_seq=5071 ttl=64 time=26.7 ms
1480 bytes from 172.30.42.1: icmp_seq=5072 ttl=64 time=26.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5073 ttl=64 time=27.0 ms
1480 bytes from 172.30.42.1: icmp_seq=5074 ttl=64 time=25.6 ms
1480 bytes from 172.30.42.1: icmp_seq=5075 ttl=64 time=26.7 ms
1480 bytes from 172.30.42.1: icmp_seq=5076 ttl=64 time=26.9 ms
1480 bytes from 172.30.42.1: icmp_seq=5077 ttl=64 time=25.0 ms

Now if I replace pfifo_fast by fq_codel (TSO and GSO are still on):

1480 bytes from 172.30.42.1: icmp_seq=37 ttl=64 time=0.197 ms
1480 bytes from 172.30.42.1: icmp_seq=38 ttl=64 time=0.283 ms
1480 bytes from 172.30.42.1: icmp_seq=39 ttl=64 time=0.183 ms
1480 bytes from 172.30.42.1: icmp_seq=40 ttl=64 time=0.182 ms
1480 bytes from 172.30.42.1: icmp_seq=41 ttl=64 time=0.207 ms
1480 bytes from 172.30.42.1: icmp_seq=42 ttl=64 time=0.202 ms
1480 bytes from 172.30.42.1: icmp_seq=43 ttl=64 time=0.244 ms
1480 bytes from 172.30.42.1: icmp_seq=44 ttl=64 time=0.200 ms
1480 bytes from 172.30.42.1: icmp_seq=45 ttl=64 time=0.212 ms
1480 bytes from 172.30.42.1: icmp_seq=46 ttl=64 time=0.215 ms
1480 bytes from 172.30.42.1: icmp_seq=47 ttl=64 time=0.178 ms
1480 bytes from 172.30.42.1: icmp_seq=48 ttl=64 time=0.215 ms
1480 bytes from 172.30.42.1: icmp_seq=49 ttl=64 time=0.238 ms
1480 bytes from 172.30.42.1: icmp_seq=50 ttl=64 time=0.210 ms
1480 bytes from 172.30.42.1: icmp_seq=51 ttl=64 time=0.974 ms
1480 bytes from 172.30.42.1: icmp_seq=52 ttl=64 time=1.93 ms
1480 bytes from 172.30.42.1: icmp_seq=53 ttl=64 time=1.71 ms
1480 bytes from 172.30.42.1: icmp_seq=54 ttl=64 time=3.07 ms
1480 bytes from 172.30.42.1: icmp_seq=55 ttl=64 time=3.47 ms
1480 bytes from 172.30.42.1: icmp_seq=56 ttl=64 time=2.53 ms
1480 bytes from 172.30.42.1: icmp_seq=57 ttl=64 time=3.24 ms
1480 bytes from 172.30.42.1: icmp_seq=58 ttl=64 time=3.11 ms
1480 bytes from 172.30.42.1: icmp_seq=59 ttl=64 time=3.03 ms
1480 bytes from 172.30.42.1: icmp_seq=60 ttl=64 time=2.96 ms
1480 bytes from 172.30.42.1: icmp_seq=61 ttl=64 time=2.90 ms
1480 bytes from 172.30.42.1: icmp_seq=62 ttl=64 time=2.97 ms
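Eric's numbers illustrate a scheduling difference rather than anything BQL-specific: with a single FIFO the ping packet waits behind the entire netperf backlog, while fq_codel gives each flow its own queue and serves the flows round-robin. A toy model of just that scheduling half (plain Python, not the kernel algorithm; fq_codel additionally applies CoDel dropping per flow, which this sketch omits):

```python
from collections import OrderedDict, deque

def fifo_dequeue_order(packets):
    # Single queue (pfifo_fast, one band): strict arrival order, so a
    # ping that arrives behind a TCP burst waits for all of it.
    return [flow for flow, _size in packets]

def per_flow_dequeue_order(packets):
    # Per-flow round-robin: each flow gets its own sub-queue and the
    # scheduler visits the flows in turn.
    queues = OrderedDict()
    for flow, size in packets:
        queues.setdefault(flow, deque()).append((flow, size))
    order = []
    while queues:
        for flow in list(queues):
            order.append(queues[flow].popleft()[0])
            if not queues[flow]:
                del queues[flow]
    return order

# Five bulk segments arrive ahead of one ping packet.
arrivals = [("bulk", 1514)] * 5 + [("ping", 1514)]
print(fifo_dequeue_order(arrivals))      # ping dequeues last
print(per_flow_dequeue_order(arrivals))  # ping dequeues second
```

In the FIFO case the ping sits behind five full-size segments; with per-flow scheduling it goes out on the first round-robin pass, which is why the fq_codel pings above stay in the low milliseconds instead of climbing to ~27 ms.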
> And I didn't see the stalls. This was on an Intel machine. The limit
> was stable, went up to around 28K when opened large file and tended to
> stay between 15-28K.
>
> The described problem seems to have characteristics that transmit
> interrupts are not at all periodic, and it would seem that some are
> taking hundreds of milliseconds to pop. I don't see anything that
> would cause that in the NIC; is it possible there is some activity on
> the machines that periodically holds down interrupts for long periods
> of time? Are there any peculiarities on Sun Fire in interrupt
> handling?
>
> Can you also provide an 'ethtool -c eth0'
>
> Thanks,
> Tom

Technically I can install a fresh Gentoo there with the latest vanilla
kernel and provide remote access, but it will take some time. I will
try now to play with IRQ pinning, but I am not sure I will reach
anything.

My kernel config: http://www.nuclearcat.com/config_stall.txt

It is a completely idle machine. I didn't notice any anomalies with
interrupts compared to other servers (not Sun Fire). If there were
anomalies with interrupts (for example SMI), I should probably notice
them even without BQL, but this machine works very well and I didn't
notice any latency without BQL.
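Tom's observation that the limit "went up to around 28K ... and tended to stay between 15-28K" is the expected behavior, not drift: the BQL limit is a dynamic value recomputed at TX-completion time rather than a fixed threshold. A much-simplified sketch of the grow rule, loosely modeled on the kernel's dynamic queue limits (lib/dynamic_queue_limits.c); the name `dql_adjust` and the exact arithmetic here are illustrative only, and the real code also shrinks the limit after hold_time of sustained slack and damps growth:

```python
def dql_adjust(limit, bytes_completed, inflight_after):
    """One completion-interrupt step of a toy byte-queue-limit controller."""
    if inflight_after == 0:
        # The ring ran completely dry between completion interrupts:
        # the limit was too low to keep the NIC busy, so grow it.
        return limit + bytes_completed
    # Otherwise hold steady; the kernel version also decays the limit
    # when the queue stays over-provisioned for a full hold_time.
    return limit
```

With numbers like those in the sysfs dump (limit 13018, a completion of 4542 bytes), the limit grows only if the queue drained to empty; otherwise it stays put, which matches the stable-but-moving values both Tom and Denys observed.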
centaur ~ # pidstat 1
Linux 3.4.0-rc7-centaur (centaur)  05/20/12  _x86_64_  (8 CPU)

21:41:25  PID  %usr  %system  %guest  %CPU  CPU  Command
21:41:26  PID  %usr  %system  %guest  %CPU  CPU  Command
21:41:27  2617  0.00  1.00  0.00  1.00  2  pidstat
21:41:27  PID  %usr  %system  %guest  %CPU  CPU  Command
21:41:28  PID  %usr  %system  %guest  %CPU  CPU  Command
21:41:29  PID  %usr  %system  %guest  %CPU  CPU  Command
21:41:30  2617  0.00  1.00  0.00  1.00  2  pidstat

From powertop:
Wakeups-from-idle per second : 7.5    interval: 10.0s

ethtool -c eth0 — sure, I can provide it; these are the default values:
centaur ~ # ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

> On Fri, May 18, 2012 at 7:40 PM, Denys Fedoryshchenko
> <denys@visp.net.lb> wrote:
>> On 2012-05-19 05:07, Tom Herbert wrote:
>>>
>>> 3) Have you tried a different interrupt mode?
>>
>> Now tested:
>> e1000e.TxIntDelay=0,0,0,0
>> [ 4.930570] e1000e 0000:04:00.0: (unregistered net_device): Transmit
>> Interrupt Delay set to 0
>> [ 4.930573] e1000e 0000:04:00.0: (unregistered net_device): Interrupt
>> Throttling Rate (ints/sec) set to dynamic conservative mode
>> Problem still occurs.
>>
>> e1000e.TxIntDelay=0,0,0,0 e1000e.InterruptThrottleRate=0,0,0,0
>> [ 4.971476] e1000e 0000:04:00.0: (unregistered net_device): Transmit
>> Interrupt Delay set to 0
>> [ 4.971478] e1000e 0000:04:00.0: (unregistered net_device): Interrupt
>> Throttling Rate (ints/sec) turned off
>> Problem also occurs.
>>
>> Probably it is specific to this machine; the problem happens on the
>> same hardware (Sun Fire X4150).
>> The only "exotic" thing here is set of cards: >> >> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >> Ethernet >> Controller (Copper) (rev 01) >> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >> Ethernet >> Controller (Copper) (rev 01) >> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit >> Ethernet >> Controller (rev 06) >> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit >> Ethernet >> Controller (rev 06) >> >> I'm using first card. >> >> >>> >>> Per #3, I am wondering if dynamic conservative mode interrupt >>> throttling is not producing interrupts deterministically for BQL. >>> I'll try to reproduce the problem in this mode. >>> >>>> I can make a patch that will make minimum BQL value not less than >>>> MTU + >>>> overhead, is it ok like this? >>>> Probably it will solve issue, but it is more workaround and safety >>>> fuse, >>>> than a solution. >>>> >>> That would just be a bandaid and could just make this a latent >>> issue >>> for the future. >>> >>> Tom >>> >>>> >>>> On 2012-05-17 19:54, Denys Fedoryshchenko wrote: >>>>> >>>>> >>>>> Also i notice, limit constantly changing over time (even i am not >>>>> touching it). 
>>>>> >>>>> centaur ~ # grep "" >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/* >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018 >>>>> >>>>> >>>>> >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0 >>>>> centaur ~ # grep "" >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/* >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018 >>>>> >>>>> >>>>> >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192 >>>>> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0 >>>>> >>>>> Is it supposed to be like this? >>>>> >>>>> On 2012-05-17 16:42, Denys Fedoryshchenko wrote: >>>>>> >>>>>> >>>>>> Found commit that cause problem: >>>>>> >>>>>> author Tom Herbert <therbert@google.com> >>>>>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000) >>>>>> committer David S. Miller <davem@davemloft.net> >>>>>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500) >>>>>> commit 3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c >>>>>> tree d6670a4f94b2b9dedacc38edb6f0e1306b889f6b tree | >>>>>> snapshot >>>>>> parent 114cf5802165ee93e3ab461c9c505cd94a08b800 commit | >>>>>> diff >>>>>> e1000e: Support for byte queue limits >>>>>> >>>>>> Changes to e1000e to use byte queue limits. >>>>>> >>>>>> Signed-off-by: Tom Herbert <therbert@google.com> >>>>>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> >>>>>> Signed-off-by: David S. Miller <davem@davemloft.net> >>>>>> >>>>>> If i reverse it, problem disappearing. 
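The grep one-liners Denys quotes above can be wrapped in a small poller to watch inflight against limit around a stall. A sketch; the attribute names are the standard byte_queue_limits sysfs layout, while `read_bql` and `is_throttling` are hypothetical helpers, with the directory taken as a parameter so the code is not tied to one NIC:

```python
from pathlib import Path

def read_bql(queue_dir):
    """Read every byte_queue_limits attribute for one TX queue into a
    dict of ints (hold_time, inflight, limit, limit_max, limit_min)."""
    return {attr.name: int(attr.read_text())
            for attr in Path(queue_dir).iterdir()}

def is_throttling(vals):
    # The stack stops feeding the driver once the bytes still in flight
    # reach the current limit, so this flags the moment BQL is clamping.
    return vals["inflight"] >= vals["limit"]
```

Calling `read_bql("/sys/class/net/eth0/queues/tx-0/byte_queue_limits")` in a loop during the reproduction would show whether inflight pins at limit for the duration of the stall windows.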
>>>>>> >>>>>> How i reproduce it: >>>>>> In two consoles do "fast" ping to nearby host >>>>>> ping 194.146.XXX.XXX -s1472 -i0.0001 >>>>>> ping 194.146.XXX.XXX -s1472 -i0.1 >>>>>> >>>>>> For third open ssh to host with "problem", open mcedit, and just >>>>>> scroll down large text file. >>>>>> After few seconds some "stalls" will occur, and in ping history >>>>>> i can >>>>>> see: >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 >>>>>> ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 >>>>>> ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 >>>>>> ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 >>>>>> ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 >>>>>> ms >>>>>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 >>>>>> ms >>>>>> >>>>>> >>>>>> If i apply small patch - problem will disappear. Sure it is not >>>>>> a >>>>>> solution, but >>>>>> let me know how i can help to debug problem more. 
>>>>>>
>>>>>> --- netdev.c 2012-05-12 20:08:37.000000000 +0300
>>>>>> +++ netdev.c.patched 2012-05-17 16:32:28.895760472 +0300
>>>>>> @@ -1135,7 +1135,7 @@
>>>>>>
>>>>>>          tx_ring->next_to_clean = i;
>>>>>>
>>>>>> -        netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>>>> +//      netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>>>>>
>>>>>>  #define TX_WAKE_THRESHOLD 32
>>>>>>          if (count && netif_carrier_ok(netdev) &&
>>>>>> @@ -2263,7 +2263,7 @@
>>>>>>                  e1000_put_txbuf(adapter, buffer_info);
>>>>>>          }
>>>>>>
>>>>>> -        netdev_reset_queue(adapter->netdev);
>>>>>> +//      netdev_reset_queue(adapter->netdev);
>>>>>>          size = sizeof(struct e1000_buffer) * tx_ring->count;
>>>>>>          memset(tx_ring->buffer_info, 0, size);
>>>>>>
>>>>>> @@ -5056,7 +5056,7 @@
>>>>>>          /* if count is 0 then mapping error has occurred */
>>>>>>          count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
>>>>>>          if (count) {
>>>>>> -                netdev_sent_queue(netdev, skb->len);
>>>>>> +//              netdev_sent_queue(netdev, skb->len);
>>>>>>                  e1000_tx_queue(adapter, tx_flags, count);
>>>>>>                  /* Make sure there is space in the ring for the next send. */
>>>>>>                  e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>>>>>
>>>>>>
>>>>>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I have two identical servers, Sun Fire X4150, both have different
>>>>>>> flavors of Linux, x86_64 and i386.
>>>>>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN >>>>>>> Gigabit >>>>>>> Ethernet Controller (Copper) (rev 01) >>>>>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN >>>>>>> Gigabit >>>>>>> Ethernet Controller (Copper) (rev 01) >>>>>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>>> Ethernet Controller (rev 06) >>>>>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit >>>>>>> Ethernet Controller (rev 06) >>>>>>> I am using now interface: >>>>>>> #ethtool -i eth0 >>>>>>> driver: e1000e >>>>>>> version: 1.9.5-k >>>>>>> firmware-version: 2.1-11 >>>>>>> bus-info: 0000:04:00.0 >>>>>>> There is 2 CPU , Intel(R) Xeon(R) CPU E5440 @ >>>>>>> 2.83GHz . >>>>>>> >>>>>>> i386 was acting as NAT and shaper, and as soon as i removed >>>>>>> shaper >>>>>>> from it, i started to experience strange lockups, e.g. traffic >>>>>>> is >>>>>>> normal for 5-30 seconds, then short lockup for 500-3000ms >>>>>>> (usually >>>>>>> around 1000ms) with dropped packets counter increasing. I was >>>>>>> suspecting it is due load, but it seems was wrong. >>>>>>> Recently, on another server, x86_64 i am using as development, >>>>>>> i >>>>>>> upgrade kernel (it was old, from 2.6 series) and on completely >>>>>>> idle >>>>>>> machine started to experience same latency spikes, while i am >>>>>>> just >>>>>>> running mc and for example typing in text editor - i notice >>>>>>> "stalls". >>>>>>> After i investigate it a little more, i notice also small >>>>>>> amount of >>>>>>> drops on interface. No tcpdump running. Also this machine is >>>>>>> idle, and >>>>>>> the only traffic there - some small broadcasts from network, my >>>>>>> ssh, >>>>>>> and ping. >>>>>>> >>>>>>> Dropped packets in ifconfig >>>>>>> RX packets:3752868 errors:0 dropped:5350 overruns:0 >>>>>>> frame:0 >>>>>>> Counter is increasing sometimes, when this stall happening. >>>>>>> >>>>>>> ethtool -S is clean, there is no dropped packets. 
>>>>>>> >>>>>>> I did tried to check load (mpstat and perf), there is nothing >>>>>>> suspicious, latencytop also doesn't show anything suspicious. >>>>>>> dropwatch report a lot of drops, but mostly because there is >>>>>>> some >>>>>>> broadcasts and etc. tcpdump at the moment of such drops doesn't >>>>>>> show >>>>>>> anything suspicious. >>>>>>> Changed qdisc from default fifo_fast to bfifo, without any >>>>>>> result. >>>>>>> Tried: ethtool -K eth0 tso off gso off gro off sg off , no >>>>>>> result >>>>>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, >>>>>>> but i >>>>>>> don't remember for sure. I thik on some kernels like 3.1 >>>>>>> probably it >>>>>>> doesn't occur, i will check it soon, because it is not always >>>>>>> reliable >>>>>>> to reproduce it. All tests i did on 3.4.0-rc7. >>>>>>> >>>>>>> I did run also in background tcpdump, additionally iptables >>>>>>> with >>>>>>> timestamps, and at time when stall occured, seems i am still >>>>>>> receiving >>>>>>> packets properly, also on iperf udp (from some host to this >>>>>>> SunFire) >>>>>>> at this moments no packets missing. But i am sure RX interface >>>>>>> errors >>>>>>> are increasing. >>>>>>> If i do iperf from SunFire to test host - there is packetloss >>>>>>> at >>>>>>> moments when stall occured. >>>>>>> >>>>>>> I suspect that by some reason network card stop to transmit, >>>>>>> but >>>>>>> unable to pinpoint issue. All other hosts in this network are >>>>>>> fine and >>>>>>> don't have such problems. >>>>>>> Can you help me with that please? Maybe i can provide more >>>>>>> debug >>>>>>> information, compile with patches and etc. Also i will try to >>>>>>> fallback >>>>>>> to 3.1 and 3.0 kernels. 
>>>>>>> [snip: ping timings, tcpdump traces, lspci and dmesg output
>>>>>>> identical to the logs quoted earlier in this thread]
>>>>>>> x4) 00:1e:68:04:99:fb >>>>>>> [ 5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 >>>>>>> Network >>>>>>> Connection >>>>>>> [ 5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA >>>>>>> No: >>>>>>> C83246-002 >>>>>>> [ 13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>>>> [ 13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X >>>>>>> [ 17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full >>>>>>> Duplex, >>>>>>> Flow Control: None >>>>>>> >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>>> netdev" in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at >>>>>>> http://vger.kernel.org/majordomo-info.html >>>>>> >>>>>> >>>>>> >>>>>> --- >>>>>> Network engineer >>>>>> Denys Fedoryshchenko >>>>>> >>>>>> Dora Highway - Center Cebaco - 2nd Floor >>>>>> Beirut, Lebanon >>>>>> Tel: +961 1 247373 >>>>>> E-Mail: denys@visp.net.lb >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>> netdev" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at >>>>>> http://vger.kernel.org/majordomo-info.html >>>>> >>>>> >>>>> >>>>> --- >>>>> Network engineer >>>>> Denys Fedoryshchenko >>>>> >>>>> Dora Highway - Center Cebaco - 2nd Floor >>>>> Beirut, Lebanon >>>>> Tel: +961 1 247373 >>>>> E-Mail: denys@visp.net.lb >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe netdev" >>>>> in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at >>>>> http://vger.kernel.org/majordomo-info.html >>>> >>>> >>>> >>>> --- >>>> Network engineer >>>> Denys Fedoryshchenko >>>> >>>> Dora Highway - Center Cebaco - 2nd Floor >>>> Beirut, Lebanon >>>> Tel: +961 1 247373 >>>> E-Mail: denys@visp.net.lb >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe netdev" >>> in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at 
http://vger.kernel.org/majordomo-info.html >> >> >> --- >> Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L. --- Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2012-05-20 at 21:53 +0300, Denys Fedoryshchenko wrote:
>
> My kernel config http://www.nuclearcat.com/config_stall.txt
>
> It is a completely idle machine. I didn't notice any anomalies with
> interrupts compared with other servers (not Sun Fire). If there were
> anomalies with interrupts (for example SMI), I should probably notice
> them even without BQL, but this machine works very well and I didn't
> notice any latency without BQL.

You could try latencytop; I am not sure if some obvious things will
pop up.
On 2012-05-20 22:07, Eric Dumazet wrote:
>
> You could try latencytop; I am not sure if some obvious things will
> pop up.

For sure I did. Nothing unusual here, max 5 ms latency:

Cause                            Maximum     Percentage
[__skb_recv_datagram]            4.1 msec    45.4 %
Waiting for event (select)       3.9 msec    54.6 %
Page fault                       0.7 msec     0.0 %
Waiting for a process to die     0.3 msec     0.0 %
Writing data to TTY              0.1 msec     0.0 %

I will also try to find an SMI interrupt detector (if I can find the
source), but I don't think it is the issue.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
On Sun, 2012-05-20 at 22:18 +0300, Denys Fedoryshchenko wrote:
> For sure I did. Nothing unusual here, max 5 ms latency:
>
> Cause                            Maximum     Percentage
> [__skb_recv_datagram]            4.1 msec    45.4 %

Interesting.

So your workload is a mix of pings and receive.

The problem is that the softirq handler might use a lot of time to
complete the receives, because of TCP stack complexity. And BQL uses a
softirq to restart the transmits on the same cpu.

tcp_data_queue() can copy the received data directly to user space
(taking page faults...).

Could you check if net-next behaves the same?
On 2012-05-21 06:56, Eric Dumazet wrote:
> The problem is that the softirq handler might use a lot of time to
> complete the receives, because of TCP stack complexity. And BQL uses
> a softirq to restart the transmits on the same cpu.
>
> Could you check if net-next behaves the same?

Not sure it is a lot of time; after all it is a 2 x quad-core machine,
which should be fast enough for pings. It seems this would cause stalls
even more on small packets.

Tested latest git, net-next: still the same, stalls.
The hardware latency detector is silent, by the way, so there is no
significant SMI.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
On Mon, 2012-05-21 at 11:06 +0300, Denys Fedoryshchenko wrote:
> Not sure it is a lot of time; after all it is a 2 x quad-core machine,
> which should be fast enough for pings.
>
> Tested latest git, net-next: still the same, stalls.
> The hardware latency detector is silent, by the way, so there is no
> significant SMI.

I am trying to reproduce your problem here, with no luck yet.

I wonder if softirqs are correctly scheduled on your machine.
On Mon, 2012-05-21 at 10:30 +0200, Eric Dumazet wrote:
> I am trying to reproduce your problem here, with no luck yet.
>
> I wonder if softirqs are correctly scheduled on your machine.

By the way, the fact that you have 8 cpus is irrelevant.

Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
net_tx_action()).

If this cpu is busy servicing other stuff, no other cpu will help.
On 2012-05-21 11:40, Eric Dumazet wrote:
> By the way, the fact that you have 8 cpus is irrelevant.
>
> Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
> net_tx_action()).
>
> If this cpu is busy servicing other stuff, no other cpu will help.

I'm installing a fresh and clean Gentoo and tools now, and will provide
ssh to this host. Probably it will help to troubleshoot the problem.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
On 2012-05-21 11:40, Eric Dumazet wrote:
> Only one cpu has queued the NET_TX_SOFTIRQ softirq (serviced by
> net_tx_action()).
>
> If this cpu is busy servicing other stuff, no other cpu will help.

By the way, if the BQL limit goes lower than the MTU, is that considered
a bug? If yes, I can try to upload 3.4 to some servers and add a
condition to WARN_ON if limit < 1500.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
On Tue, 2012-05-22 at 20:11 +0300, Denys Fedoryshchenko wrote:
> By the way, if the BQL limit goes lower than the MTU, is that
> considered a bug? If yes, I can try to upload 3.4 to some servers and
> add a condition to WARN_ON if limit < 1500.

There is no problem with the BQL limit going lower than the max packet
size. (With TSO it can be 64K.)

Remember, BQL allows one packet to be sent to the device regardless of
its size. The next packet might be blocked and stay in the qdisc.

If your workload is mostly idle but sends bursts of 3 packets, then
only one is immediately sent. The next packets must wait for the TX
completion of the first packet.
I think there is a potential problem in that netdev_tx_completed could
be called multiple times for the same interrupt, for example if the
napi poll routine exhausts its budget and is scheduled again while some
new packets are completed. We're looking at a solution to this.

Denys, can you try to increase the netdev budget to see if that has an
effect?

Thanks,
Tom

On Tue, May 22, 2012 at 10:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> There is no problem with the BQL limit going lower than the max packet
> size. (With TSO it can be 64K.)
>
> Remember, BQL allows one packet to be sent to the device regardless of
> its size. The next packet might be blocked and stay in the qdisc.
>
> If your workload is mostly idle but sends bursts of 3 packets, then
> only one is immediately sent. The next packets must wait for the TX
> completion of the first packet.
On Thu, 2012-05-24 at 23:01 -0700, Tom Herbert wrote:
> I think there is a potential problem in that netdev_tx_completed could
> be called multiple times for the same interrupt [...]
>
> Denys, can you try to increase the netdev budget to see if that has an
> effect?

TX completion has no budget; I am not sure what you mean.

The e1000e driver indeed has a limit: it cannot clean more than
tx_ring->count frames per e1000_clean_tx_irq() invocation.

But with BQL, this should not happen?

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
> TX completion has no budget; I am not sure what you mean.

Right, it's the budget on RX that can be a factor. TX completion is
done from the NAPI poll routine, so if RX does not complete, the NAPI
is rescheduled and TX completion is done again for the same HW
interrupt.

We need to remove the constraint that netdev_completed can only be
called once per interrupt...

Tom

> The e1000e driver indeed has a limit: it cannot clean more than
> tx_ring->count frames per e1000_clean_tx_irq() invocation.
>
> But with BQL, this should not happen?
On Fri, 2012-05-25 at 09:59 -0700, Tom Herbert wrote:
> Right, it's the budget on RX that can be a factor. TX completion is
> done from the NAPI poll routine, so if RX does not complete, the NAPI
> is rescheduled and TX completion is done again for the same HW
> interrupt.
>
> We need to remove the constraint that netdev_completed can only be
> called once per interrupt...

It is not clear where this constraint is in the code.

Under heavy load we can be in a loop situation, one cpu serving NAPI
for a bunch of devices (no more hardware interrupts are delivered since
we don't re-enable them at all).
On Sun, 20 May 2012 10:40:41 -0700
Tom Herbert <therbert@google.com> wrote:

> Tried to reproduce:
>
> May 20 10:08:30 test kernel: [    6.168240] e1000e 0000:06:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
> May 20 10:08:30 test kernel: [    6.221591] e1000e 0000:06:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
>
> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
>
> Following the above instructions to repro gives:
>
> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 time=0.358 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 time=0.330 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 time=0.337 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 time=0.375 ms
> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 time=0.359 ms
> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880 ttl=64 time=0.380 ms
>
> And I didn't see the stalls. This was on an Intel machine. The limit
> was stable, went up to around 28K when I opened a large file, and
> tended to stay between 15-28K.
>
> The described problem seems to have the characteristic that transmit
> interrupts are not at all periodic, and it would seem that some are
> taking hundreds of milliseconds to pop. I don't see anything that
> would cause that in the NIC; is it possible there is some activity on
> the machines periodically holding down interrupts for long periods of
> time? Are there any peculiarities on Sun Fire in interrupt handling?
>
> Can you also provide an 'ethtool -c eth0'?
>
> Thanks,
> Tom

I also observed similar behaviour in the following environment.

03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

[    2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
[    2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
[    2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1
[    2.979653] e1000e 0000:03:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X
[    2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X
[    2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X
[    3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 48:5b:39:75:91:bd
[    3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network Connection
[    3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF

I tried some coalesce options with 'ethtool -C eth0', but none of them
helped.

If I understand the code and spec correctly, TX interrupts are
generated when TXDCTL.WTHRESH descriptors have been accumulated and
written back.

I tentatively changed TXDCTL.WTHRESH to 1, and then it seems that the
latency spikes disappear.

drivers/net/ethernet/intel/e1000e/e1000.h
@@ -181,7 +181,7 @@ struct e1000_info;
 #define E1000_TXDCTL_DMA_BURST_ENABLE \
 	(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
 	 E1000_TXDCTL_COUNT_DESC | \
-	 (5 << 16) | /* wthresh must be +1 more than desired */\
+	 (1 << 16) | /* wthresh must be +1 more than desired */\
 	 (1 << 8) | /* hthresh */ \
 	 0x1f) /* pthresh */

(before) $ ping -i0.2 192.168.11.2
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms
64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms
64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms
64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms
64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms
64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms
64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms
64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms
64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms
64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms
64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms
64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms
64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms
64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms
64 bytes from 192.168.11.2: icmp_req=24 ttl=64 time=419 ms
64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms
64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms
64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms
64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms
64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms
64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms
64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms
64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms
64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms
64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms
64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms
64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms
64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms
64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms
64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms
64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms

I think the 1000 ms - 2000 ms delay comes from e1000_watchdog_task().

(after) $ ping -i0.2 192.168.11.2
64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms
64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms
64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms
64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms
64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms
64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms
64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms
64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms
64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms
64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms
64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms
64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms
64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms
64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms
64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms
64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms
64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms
Thanks Hiroaki for this description, it looks promising.

Denys, can you test with his patch?

Tom

On Tue, May 29, 2012 at 7:25 AM, Hiroaki SHIMODA
<shimoda.hiroaki@gmail.com> wrote:
> If I understand the code and spec correctly, TX interrupts are
> generated when TXDCTL.WTHRESH descriptors have been accumulated
> and written back.
>
> I tentatively changed TXDCTL.WTHRESH to 1, and then it seems that
> the latency spikes disappear.
>
> drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -181,7 +181,7 @@ struct e1000_info;
>  #define E1000_TXDCTL_DMA_BURST_ENABLE \
>  	(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
>  	 E1000_TXDCTL_COUNT_DESC | \
> -	 (5 << 16) | /* wthresh must be +1 more than desired */\
> +	 (1 << 16) | /* wthresh must be +1 more than desired */\
>  	 (1 << 8) | /* hthresh */ \
>  	 0x1f) /* pthresh */
On Tue, 2012-05-29 at 07:54 -0700, Tom Herbert wrote: > Thanks Hiroaki for this description, it looks promising. Denys, can > you test with his patch. > > Tom Indeed this sounds good. Hmm, I guess my e1000e has no FLAG2_DMA_BURST in adapter->flags2
Big Thanks Hiroaki, yes this patch improving situation a lot, and for sure it fixes the problem. I cannot reproduce it anymore. On 2012-05-29 17:54, Tom Herbert wrote: > Thanks Hiroaki for this description, it looks promising. Denys, can > you test with his patch. > > Tom > > On Tue, May 29, 2012 at 7:25 AM, Hiroaki SHIMODA > <shimoda.hiroaki@gmail.com> wrote: >> On Sun, 20 May 2012 10:40:41 -0700 >> Tom Herbert <therbert@google.com> wrote: >> >>> Tried to reproduce: >>> >>> May 20 10:08:30 test kernel: [ 6.168240] e1000e 0000:06:00.0: >>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set >>> to >>> dynamic conservative mode >>> May 20 10:08:30 test kernel: [ 6.221591] e1000e 0000:06:00.1: >>> (unregistered net_device): Interrupt Throttling Rate (ints/sec) set >>> to >>> dynamic conservative mode >>> >>> 06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>> Ethernet Controller (Copper) (rev 01) >>> 06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit >>> Ethernet Controller (Copper) (rev 01) >>> >>> Following above instructions to repro gives: >>> >>> 1480 bytes from test2 (192.168.2.49): icmp_req=5875 ttl=64 >>> time=0.358 ms >>> 1480 bytes from test2 (192.168.2.49): icmp_req=5876 ttl=64 >>> time=0.330 ms >>> 1480 bytes from test2 (192.168.2.49): icmp_req=5877 ttl=64 >>> time=0.337 ms >>> 1480 bytes from test2 (192.168.2.49): icmp_req=5878 ttl=64 >>> time=0.375 ms >>> 1480 bytes from test2 (192.168.2.49): icmp_req=5879 ttl=64 >>> time=0.359 ms >>> 1480 bytes from lpb49.prod.google.com (192.168.2.49): icmp_req=5880 >>> ttl=64 time=0.380 ms >>> >>> And I didn't see the stalls. This was on an Intel machine. The >>> limit >>> was stable, went up to around 28K when opened large file and tended >>> to >>> stay between 15-28K. >>> >>> The describe problem seems to have characteristics that transmit >>> interrupts are not at all periodic, and it would seem that some are >>> taking hundreds of milliseconds to pop. 
I don't see anything that >>> would cause that in the NIC, is it possible there is some activity >>> on >>> the machines periodically and often holding down interrupts for >>> long >>> periods of time. Are there any peculiarities on Sun Fire in >>> interrupt >>> handling? >>> >>> Can you also provide an 'ethtool -c eth0' >>> >>> Thanks, >>> Tom >> >> I also observed the similar behaviour on the following environment. >> >> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit >> Network Connection >> >> [ 2.962119] e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k >> [ 2.968095] e1000e: Copyright(c) 1999 - 2012 Intel Corporation. >> [ 2.974251] e1000e 0000:03:00.0: Disabling ASPM L0s L1 >> [ 2.979653] e1000e 0000:03:00.0: (unregistered net_device): >> Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode >> [ 2.991599] e1000e 0000:03:00.0: irq 72 for MSI/MSI-X >> [ 2.991606] e1000e 0000:03:00.0: irq 73 for MSI/MSI-X >> [ 2.991611] e1000e 0000:03:00.0: irq 74 for MSI/MSI-X >> [ 3.092768] e1000e 0000:03:00.0: eth0: (PCI Express:2.5GT/s:Width >> x1) 48:5b:39:75:91:bd >> [ 3.100992] e1000e 0000:03:00.0: eth0: Intel(R) PRO/1000 Network >> Connection >> [ 3.108173] e1000e 0000:03:00.0: eth0: MAC: 3, PHY: 8, PBA No: >> FFFFFF-0FF >> >> I tried some coalesce options by 'ethtool -C eth0', but >> anything didn't help. >> >> If I understand the code and spec correctly, TX interrupts are >> generated when TXDCTL.WTHRESH descriptors have been accumulated >> and write backed. >> >> I tentatively changed the TXDCTL.WTHRESH to 1, then it seems >> that latency spikes are disappear. 
>> >> drivers/net/ethernet/intel/e1000e/e1000.h >> @@ -181,7 +181,7 @@ struct e1000_info; >> #define E1000_TXDCTL_DMA_BURST_ENABLE \ >> (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \ >> E1000_TXDCTL_COUNT_DESC | \ >> - (5 << 16) | /* wthresh must be +1 more than desired */\ >> + (1 << 16) | /* wthresh must be +1 more than desired */\ >> (1 << 8) | /* hthresh */ \ >> 0x1f) /* pthresh */ >> >> (before) $ ping -i0.2 192.168.11.2 >> PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data. >> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.191 ms >> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.179 ms >> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.199 ms >> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.143 ms >> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.193 ms >> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.150 ms >> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.186 ms >> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.198 ms >> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.195 ms >> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms >> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.200 ms >> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=651 ms >> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=451 ms >> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=241 ms >> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=31.3 ms >> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.184 ms >> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.199 ms >> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.197 ms >> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=21 ttl=64 time=0.192 ms >> 64 bytes from 192.168.11.2: icmp_req=22 ttl=64 time=0.205 ms >> 64 bytes from 192.168.11.2: icmp_req=23 ttl=64 time=629 ms >> 64 bytes from 192.168.11.2: icmp_req=24 
ttl=64 time=419 ms >> 64 bytes from 192.168.11.2: icmp_req=25 ttl=64 time=209 ms >> 64 bytes from 192.168.11.2: icmp_req=26 ttl=64 time=0.280 ms >> 64 bytes from 192.168.11.2: icmp_req=27 ttl=64 time=0.193 ms >> 64 bytes from 192.168.11.2: icmp_req=28 ttl=64 time=0.194 ms >> 64 bytes from 192.168.11.2: icmp_req=29 ttl=64 time=0.143 ms >> 64 bytes from 192.168.11.2: icmp_req=30 ttl=64 time=0.191 ms >> 64 bytes from 192.168.11.2: icmp_req=31 ttl=64 time=0.144 ms >> 64 bytes from 192.168.11.2: icmp_req=32 ttl=64 time=0.192 ms >> 64 bytes from 192.168.11.2: icmp_req=33 ttl=64 time=0.199 ms >> 64 bytes from 192.168.11.2: icmp_req=34 ttl=64 time=0.193 ms >> 64 bytes from 192.168.11.2: icmp_req=35 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=36 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=37 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=38 ttl=64 time=1600 ms >> 64 bytes from 192.168.11.2: icmp_req=39 ttl=64 time=1390 ms >> 64 bytes from 192.168.11.2: icmp_req=40 ttl=64 time=1180 ms >> 64 bytes from 192.168.11.2: icmp_req=41 ttl=64 time=980 ms >> 64 bytes from 192.168.11.2: icmp_req=42 ttl=64 time=780 ms >> 64 bytes from 192.168.11.2: icmp_req=43 ttl=64 time=570 ms >> 64 bytes from 192.168.11.2: icmp_req=44 ttl=64 time=0.151 ms >> 64 bytes from 192.168.11.2: icmp_req=45 ttl=64 time=0.189 ms >> 64 bytes from 192.168.11.2: icmp_req=46 ttl=64 time=0.203 ms >> 64 bytes from 192.168.11.2: icmp_req=47 ttl=64 time=0.185 ms >> 64 bytes from 192.168.11.2: icmp_req=48 ttl=64 time=0.189 ms >> 64 bytes from 192.168.11.2: icmp_req=49 ttl=64 time=0.204 ms >> 64 bytes from 192.168.11.2: icmp_req=50 ttl=64 time=0.198 ms >> >> I think 1000 ms - 2000 ms delay is come from e1000_watchdog_task(). 
>> >> (after) $ ping -i0.2 192.168.11.2 >> 64 bytes from 192.168.11.2: icmp_req=1 ttl=64 time=0.175 ms >> 64 bytes from 192.168.11.2: icmp_req=2 ttl=64 time=0.203 ms >> 64 bytes from 192.168.11.2: icmp_req=3 ttl=64 time=0.196 ms >> 64 bytes from 192.168.11.2: icmp_req=4 ttl=64 time=0.197 ms >> 64 bytes from 192.168.11.2: icmp_req=5 ttl=64 time=0.186 ms >> 64 bytes from 192.168.11.2: icmp_req=6 ttl=64 time=0.197 ms >> 64 bytes from 192.168.11.2: icmp_req=7 ttl=64 time=0.189 ms >> 64 bytes from 192.168.11.2: icmp_req=8 ttl=64 time=0.146 ms >> 64 bytes from 192.168.11.2: icmp_req=9 ttl=64 time=0.193 ms >> 64 bytes from 192.168.11.2: icmp_req=10 ttl=64 time=0.194 ms >> 64 bytes from 192.168.11.2: icmp_req=11 ttl=64 time=0.195 ms >> 64 bytes from 192.168.11.2: icmp_req=12 ttl=64 time=0.190 ms >> 64 bytes from 192.168.11.2: icmp_req=13 ttl=64 time=0.204 ms >> 64 bytes from 192.168.11.2: icmp_req=14 ttl=64 time=0.201 ms >> 64 bytes from 192.168.11.2: icmp_req=15 ttl=64 time=0.189 ms >> 64 bytes from 192.168.11.2: icmp_req=16 ttl=64 time=0.193 ms >> 64 bytes from 192.168.11.2: icmp_req=17 ttl=64 time=0.190 ms >> 64 bytes from 192.168.11.2: icmp_req=18 ttl=64 time=0.143 ms >> 64 bytes from 192.168.11.2: icmp_req=19 ttl=64 time=0.191 ms >> 64 bytes from 192.168.11.2: icmp_req=20 ttl=64 time=0.190 ms --- Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
While reading the bql code, I have some questions. 1) dql_completed() and dql_queued() can be called concurrently, so dql->num_queued could change while processing dql_completed(). Is it intentional to refer num_queued from "dql->" each time ? 2) From the comment in the code * - The queue was over-limit in the previous interval and * when enqueuing it was possible that all queued data * had been consumed. and * Queue was not starved, check if the limit can be decreased. * A decrease is only considered if the queue has been busy in * the whole interval (the check above). the calculation of all_prev_completed should take into account completed == dql->prev_num_queued case ? On current implementation, limit shrinks easily and some NIC hit TX stalls. To mitigate TX stalls, should we fix all_prev_completed rather than individual driver ? 3) limit calculation fails to consider integer wrap around in one place ? Here is the patch what I meant. diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c @@ -11,22 +11,27 @@ #include <linux/dynamic_queue_limits.h> #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0) +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? 
(A) - (B) : 0) +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) /* Records completed count and recalculates the queue limit */ void dql_completed(struct dql *dql, unsigned int count) { unsigned int inprogress, prev_inprogress, limit; - unsigned int ovlimit, all_prev_completed, completed; + unsigned int ovlimit, completed, num_queued; + bool all_prev_completed; + + num_queued = dql->num_queued; /* Can't complete more than what's in queue */ - BUG_ON(count > dql->num_queued - dql->num_completed); + BUG_ON(count > num_queued - dql->num_completed); completed = dql->num_completed + count; limit = dql->limit; - ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit); - inprogress = dql->num_queued - completed; + ovlimit = POSDIFF(num_queued - dql->num_completed, limit); + inprogress = num_queued - completed; prev_inprogress = dql->prev_num_queued - dql->num_completed; - all_prev_completed = POSDIFF(completed, dql->prev_num_queued); + all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued); if ((ovlimit && !inprogress) || (dql->prev_ovlimit && all_prev_completed)) { @@ -45,7 +50,7 @@ void dql_completed(struct dql *dql, unsigned int count) * of bytes both sent and completed in the last interval, * plus any previous over-limit. */ - limit += POSDIFF(completed, dql->prev_num_queued) + + limit += POSDIFFI(completed, dql->prev_num_queued) + dql->prev_ovlimit; dql->slack_start_time = jiffies; dql->lowest_slack = UINT_MAX; @@ -104,7 +109,7 @@ void dql_completed(struct dql *dql, unsigned int count) dql->prev_ovlimit = ovlimit; dql->prev_last_obj_cnt = dql->last_obj_cnt; dql->num_completed = completed; - dql->prev_num_queued = dql->num_queued; + dql->prev_num_queued = num_queued; } EXPORT_SYMBOL(dql_completed); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2012-05-30 at 09:06 +0900, Hiroaki SHIMODA wrote: > While reading the bql code, I have some questions. > > 1) dql_completed() and dql_queued() can be called concurrently, > so dql->num_queued could change while processing > dql_completed(). > Is it intentional to refer num_queued from "dql->" each time ? > not sure it can have problems, but doing the read once is indeed a good plan. > 2) From the comment in the code > * - The queue was over-limit in the previous interval and > * when enqueuing it was possible that all queued data > * had been consumed. > > and > > * Queue was not starved, check if the limit can be decreased. > * A decrease is only considered if the queue has been busy in > * the whole interval (the check above). > > the calculation of all_prev_completed should take into account > completed == dql->prev_num_queued case ? > On current implementation, limit shrinks easily and some NIC > hit TX stalls. > To mitigate TX stalls, should we fix all_prev_completed rather > than individual driver ? > Not sure what you mean > 3) limit calculation fails to consider integer wrap around in > one place ? > Yes > Here is the patch what I meant. > > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c > @@ -11,22 +11,27 @@ > #include <linux/dynamic_queue_limits.h> > > #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0) > +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) > > /* Records completed count and recalculates the queue limit */ > void dql_completed(struct dql *dql, unsigned int count) > { > unsigned int inprogress, prev_inprogress, limit; > - unsigned int ovlimit, all_prev_completed, completed; > + unsigned int ovlimit, completed, num_queued; > + bool all_prev_completed; > + > + num_queued = dql->num_queued; I suggest : num_queued = ACCESS_ONCE(dql->num_queued); Or else compiler is free to do whatever he wants. 
On Wed, 30 May 2012 10:40:31 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2012-05-30 at 09:06 +0900, Hiroaki SHIMODA wrote: > > While reading the bql code, I have some questions. > > > > 1) dql_completed() and dql_queued() can be called concurrently, > > so dql->num_queued could change while processing > > dql_completed(). > > Is it intentional to refer num_queued from "dql->" each time ? > > > > not sure it can have problems, but doing the read once is indeed a good > plan. > > > 2) From the comment in the code > > * - The queue was over-limit in the previous interval and > > * when enqueuing it was possible that all queued data > > * had been consumed. > > > > and > > > > * Queue was not starved, check if the limit can be decreased. > > * A decrease is only considered if the queue has been busy in > > * the whole interval (the check above). > > > > the calculation of all_prev_completed should take into account > > completed == dql->prev_num_queued case ? > > On current implementation, limit shrinks easily and some NIC > > hit TX stalls. > > To mitigate TX stalls, should we fix all_prev_completed rather > > than individual driver ? > > > > Not sure what you mean While examining ping problem, below pattern is often observed. TIME dql_queued() dql_completed() | a) initial state | | b) X bytes queued V c) Y bytes queued d) X bytes completed e) Z bytes queued f) Y bytes completed a) dql->limit has already some value and there is no in-flight packet. b) X bytes queued. c) Y bytes queued and excess limit. d) X bytes completed and dql->prev_ovlimit is set and also dql->prev_num_queued is set Y. e) Z bytes queued. f) Y bytes completed. inprogress and prev_inprogress are true. At f), if I read the comment correctly, all_prev_completed becomes true and limit should be increased. But POSDIFF() ignores (A == B) case, so limit is decreased. I thought excess limit decrement induces the TX stalls. 
> > > 3) limit calculation fails to consider integer wrap around in > > one place ? > > > > Yes > > > Here is the patch what I meant. > > > > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c > > @@ -11,22 +11,27 @@ > > #include <linux/dynamic_queue_limits.h> > > > > #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0) > > +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) > > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) > > > > /* Records completed count and recalculates the queue limit */ > > void dql_completed(struct dql *dql, unsigned int count) > > { > > unsigned int inprogress, prev_inprogress, limit; > > - unsigned int ovlimit, all_prev_completed, completed; > > + unsigned int ovlimit, completed, num_queued; > > + bool all_prev_completed; > > + > > + num_queued = dql->num_queued; > > > I suggest : > > num_queued = ACCESS_ONCE(dql->num_queued); > > Or else compiler is free to do whatever he wants. Thank you for your suggestion.
> > + num_queued = dql->num_queued; > > > I suggest : > > num_queued = ACCESS_ONCE(dql->num_queued); > > Or else compiler is free to do whatever he wants. Or make the structure member volatile, then the compiler can only read it once. Probably worth while if the value is expected to be read like that. David
On Wed, 2012-05-30 at 11:52 +0100, David Laight wrote: > > > + num_queued = dql->num_queued; > > > > > > I suggest : > > > > num_queued = ACCESS_ONCE(dql->num_queued); > > > > Or else compiler is free to do whatever he wants. > > Or make the structure member volatile, then the > compiler can only read it once. No. Compiler can read it several times. Really. > Probably worth while if the value is expected to > be read like that. Please don't use lazy volatile. Unless you really want Linus flames (and ours) ACCESS_ONCE() is much cleaner.
On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote: > While examining ping problem, below pattern is often observed. > > TIME > dql_queued() dql_completed() | > a) initial state | > | > b) X bytes queued V > > c) Y bytes queued > d) X bytes completed > e) Z bytes queued > f) Y bytes completed > > a) dql->limit has already some value and there is no in-flight packet. > b) X bytes queued. > c) Y bytes queued and excess limit. > d) X bytes completed and dql->prev_ovlimit is set and also > dql->prev_num_queued is set Y. > e) Z bytes queued. > f) Y bytes completed. inprogress and prev_inprogress are true. > > At f), if I read the comment correctly, all_prev_completed becomes > true and limit should be increased. But POSDIFF() ignores > (A == B) case, so limit is decreased. Which POSDIFF(), because there are many ;) By the way, given complexity of this I suggest you split your ideas in independent patches. Maybe we should change all POSDIFF(), not selected ones. #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
On Wed, 2012-05-30 at 13:04 +0200, Eric Dumazet wrote: > Please don't use lazy volatile. > > Unless you really want Linus flames (and ours) > > ACCESS_ONCE() is much cleaner. > http://yarchive.net/comp/linux/ACCESS_ONCE.html
On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote: > Maybe we should change all POSDIFF(), not selected ones. > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) maybe use an eval once statement expression macro ({ typeof (A) _a = (A); typeof (B) _b = (B); ((int)(_a - _b) > 0 ? _a - _b : 0; })
On Wed, 30 May 2012 13:08:27 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote: > > > While examining ping problem, below pattern is often observed. > > > > TIME > > dql_queued() dql_completed() | > > a) initial state | > > | > > b) X bytes queued V > > > > c) Y bytes queued > > d) X bytes completed > > e) Z bytes queued > > f) Y bytes completed > > > > a) dql->limit has already some value and there is no in-flight packet. > > b) X bytes queued. > > c) Y bytes queued and excess limit. > > d) X bytes completed and dql->prev_ovlimit is set and also > > dql->prev_num_queued is set Y. > > e) Z bytes queued. > > f) Y bytes completed. inprogress and prev_inprogress are true. > > > > At f), if I read the comment correctly, all_prev_completed becomes > > true and limit should be increased. But POSDIFF() ignores > > (A == B) case, so limit is decreased. > > Which POSDIFF(), because there are many ;) I mean, all_prev_completed = POSDIFF(completed, dql->prev_num_queued); > By the way, given complexity of this I suggest you split your ideas in > independent patches. In this case, here is the patch what I thinking. diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c @@ -11,12 +11,14 @@ #include <linux/dynamic_queue_limits.h> #define POSDIFF(A, B) ((A) > (B) ? 
(A) - (B) : 0) +#define #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) /* Records completed count and recalculates the queue limit */ void dql_completed(struct dql *dql, unsigned int count) { unsigned int inprogress, prev_inprogress, limit; - unsigned int ovlimit, all_prev_completed, completed; + unsigned int ovlimit, completed; + bool all_prev_completed; /* Can't complete more than what's in queue */ BUG_ON(count > dql->num_queued - dql->num_completed); @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count) ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit); inprogress = dql->num_queued - completed; prev_inprogress = dql->prev_num_queued - dql->num_completed; - all_prev_completed = POSDIFF(completed, dql->prev_num_queued); + all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued); if ((ovlimit && !inprogress) || (dql->prev_ovlimit && all_prev_completed)) { > Mabe we should change all POSDIFF(), not selected ones. > > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2012-05-30 at 04:20 -0700, Joe Perches wrote: > On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote: > > Maybe we should change all POSDIFF(), not selected ones. > > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) > > maybe use an eval once statement expression macro > ({ > typeof (A) _a = (A); > typeof (B) _b = (B); > ((int)(_a - _b) > 0 ? _a - _b : 0; > }) > > Well, many choices are possible, including #define POSDIFF(A, B) max_t(int, (A) - (B), 0);
On Wed, 2012-05-30 at 13:59 +0200, Eric Dumazet wrote: > On Wed, 2012-05-30 at 04:20 -0700, Joe Perches wrote: > > On Wed, 2012-05-30 at 13:08 +0200, Eric Dumazet wrote: > > > Maybe we should change all POSDIFF(), not selected ones. > > > #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0) > > > > maybe use an eval once statement expression macro > > ({ > > typeof (A) _a = (A); > > typeof (B) _b = (B); > > ((int)(_a - _b) > 0 ? _a - _b : 0; > > }) > > Well, many choices are possible, including > #define POSDIFF(A, B) max_t(int, (A) - (B), 0); Maybe. If the intent is to always return an int. The current macro doesn't. Whatever evals A and B once would be better. cheers, Joe
On Wed, 2012-05-30 at 07:09 -0700, Joe Perches wrote:
> Whatever evals A and B once would be better.
Why do you believe they could be evaluated several times?
#define POSDIFF(A, B) max_t(int, (A) - (B), 0)
(A) - (B) is done once, or it's a huge max_t() bug.
By the way, we handle 32bit values here. No way BQL can overflow a 4GB
limit.
On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote: > On Wed, 30 May 2012 13:08:27 +0200 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote: > > > > > While examining ping problem, below pattern is often observed. > > > > > > TIME > > > dql_queued() dql_completed() | > > > a) initial state | > > > | > > > b) X bytes queued V > > > > > > c) Y bytes queued > > > d) X bytes completed > > > e) Z bytes queued > > > f) Y bytes completed > > > > > > a) dql->limit has already some value and there is no in-flight packet. > > > b) X bytes queued. > > > c) Y bytes queued and excess limit. > > > d) X bytes completed and dql->prev_ovlimit is set and also > > > dql->prev_num_queued is set Y. > > > e) Z bytes queued. > > > f) Y bytes completed. inprogress and prev_inprogress are true. > > > > > > At f), if I read the comment correctly, all_prev_completed becomes > > > true and limit should be increased. But POSDIFF() ignores > > > (A == B) case, so limit is decreased. > > > > Which POSDIFF(), because there are many ;) > > I mean, > all_prev_completed = POSDIFF(completed, dql->prev_num_queued); > > > By the way, given complexity of this I suggest you split your ideas in > > independent patches. > > In this case, here is the patch what I thinking. > > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c > @@ -11,12 +11,14 @@ > #include <linux/dynamic_queue_limits.h> > > #define POSDIFF(A, B) ((A) > (B) ? 
(A) - (B) : 0) > +#define #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0) > > /* Records completed count and recalculates the queue limit */ > void dql_completed(struct dql *dql, unsigned int count) > { > unsigned int inprogress, prev_inprogress, limit; > - unsigned int ovlimit, all_prev_completed, completed; > + unsigned int ovlimit, completed; > + bool all_prev_completed; > > /* Can't complete more than what's in queue */ > BUG_ON(count > dql->num_queued - dql->num_completed); > @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count) > ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit); > inprogress = dql->num_queued - completed; > prev_inprogress = dql->prev_num_queued - dql->num_completed; > - all_prev_completed = POSDIFF(completed, dql->prev_num_queued); > + all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued); > > if ((ovlimit && !inprogress) || > (dql->prev_ovlimit && all_prev_completed)) { I am fine with this one. Can you send official patches please ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2012-05-30 at 16:42 +0200, Eric Dumazet wrote: > On Wed, 2012-05-30 at 07:09 -0700, Joe Perches wrote: > > > Whatever evals A and B once would be better. > > Why do you believe they could be evaluated several time ? > > #define POSDIFF(A, B) max_t(int, (A) - (B), 0) I don't and didn't say I did. > By the way, we handle 32bit values here. No way BQL can overflow a 4GB limit. swell.
On 05/30/2012 07:49 AM, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote:
>> On Wed, 30 May 2012 13:08:27 +0200
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
>>>
>>>> While examining the ping problem, the pattern below is often observed.
>>>>
>>>>                                                  TIME
>>>>    dql_queued()            dql_completed()         |
>>>> a) initial state                                   |
>>>>                                                    |
>>>> b) X bytes queued                                  V
>>>>
>>>> c) Y bytes queued
>>>>                            d) X bytes completed
>>>> e) Z bytes queued
>>>>                            f) Y bytes completed
>>>>
>>>> a) dql->limit has already some value and there is no in-flight packet.
>>>> b) X bytes queued.
>>>> c) Y bytes queued and excess limit.
>>>> d) X bytes completed and dql->prev_ovlimit is set and also
>>>>    dql->prev_num_queued is set to Y.
>>>> e) Z bytes queued.
>>>> f) Y bytes completed. inprogress and prev_inprogress are true.
>>>>
>>>> At f), if I read the comment correctly, all_prev_completed becomes
>>>> true and limit should be increased. But POSDIFF() ignores the
>>>> (A == B) case, so limit is decreased.
>>>
>>> Which POSDIFF(), because there are many ;)
>>
>> I mean,
>> all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
>>
>>> By the way, given the complexity of this I suggest you split your
>>> ideas into independent patches.
>>
>> In this case, here is the patch I am thinking of.
>>
>> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
>> @@ -11,12 +11,14 @@
>>  #include <linux/dynamic_queue_limits.h>
>>
>>  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
>> +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>>
>>  /* Records completed count and recalculates the queue limit */
>>  void dql_completed(struct dql *dql, unsigned int count)
>>  {
>>  	unsigned int inprogress, prev_inprogress, limit;
>> -	unsigned int ovlimit, all_prev_completed, completed;
>> +	unsigned int ovlimit, completed;
>> +	bool all_prev_completed;
>>
>>  	/* Can't complete more than what's in queue */
>>  	BUG_ON(count > dql->num_queued - dql->num_completed);
>> @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>>  	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
>>  	inprogress = dql->num_queued - completed;
>>  	prev_inprogress = dql->prev_num_queued - dql->num_completed;
>> -	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
>> +	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
>>
>>  	if ((ovlimit && !inprogress) ||
>>  	    (dql->prev_ovlimit && all_prev_completed)) {
> I am fine with this one.
>
> Can you send official patches please ?

While this looks encouraging, BQL presently overbuffers by about a
factor of 2 in low (sub 100Mbit) scenarios on the hardware I have
available to me. I look forward to re-running benchmarks with this
patch, however.
On Wed, 30 May 2012 16:49:02 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2012-05-30 at 20:29 +0900, Hiroaki SHIMODA wrote:
> > On Wed, 30 May 2012 13:08:27 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > On Wed, 2012-05-30 at 19:43 +0900, Hiroaki SHIMODA wrote:
> > >
> > > > While examining the ping problem, the pattern below is often observed.
> > > >
> > > >                                                  TIME
> > > >    dql_queued()            dql_completed()         |
> > > > a) initial state                                   |
> > > >                                                    |
> > > > b) X bytes queued                                  V
> > > >
> > > > c) Y bytes queued
> > > >                            d) X bytes completed
> > > > e) Z bytes queued
> > > >                            f) Y bytes completed
> > > >
> > > > a) dql->limit has already some value and there is no in-flight packet.
> > > > b) X bytes queued.
> > > > c) Y bytes queued and excess limit.
> > > > d) X bytes completed and dql->prev_ovlimit is set and also
> > > >    dql->prev_num_queued is set to Y.
> > > > e) Z bytes queued.
> > > > f) Y bytes completed. inprogress and prev_inprogress are true.
> > > >
> > > > At f), if I read the comment correctly, all_prev_completed becomes
> > > > true and limit should be increased. But POSDIFF() ignores the
> > > > (A == B) case, so limit is decreased.
> > >
> > > Which POSDIFF(), because there are many ;)
> >
> > I mean,
> > all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> >
> > > By the way, given the complexity of this I suggest you split your
> > > ideas into independent patches.
> >
> > In this case, here is the patch I am thinking of.
> >
> > diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> > @@ -11,12 +11,14 @@
> >  #include <linux/dynamic_queue_limits.h>
> >
> >  #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> > +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
> >
> >  /* Records completed count and recalculates the queue limit */
> >  void dql_completed(struct dql *dql, unsigned int count)
> >  {
> >  	unsigned int inprogress, prev_inprogress, limit;
> > -	unsigned int ovlimit, all_prev_completed, completed;
> > +	unsigned int ovlimit, completed;
> > +	bool all_prev_completed;
> >
> >  	/* Can't complete more than what's in queue */
> >  	BUG_ON(count > dql->num_queued - dql->num_completed);
> > @@ -26,7 +28,7 @@ void dql_completed(struct dql *dql, unsigned int count)
> >  	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
> >  	inprogress = dql->num_queued - completed;
> >  	prev_inprogress = dql->prev_num_queued - dql->num_completed;
> > -	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
> > +	all_prev_completed = AFTER_EQ(completed, dql->prev_num_queued);
> >
> >  	if ((ovlimit && !inprogress) ||
> >  	    (dql->prev_ovlimit && all_prev_completed)) {
>
> I am fine with this one.
>
> Can you send official patches please ?

Thanks. I'll post.
On Tue, 2012-05-29 at 23:25 +0900, Hiroaki SHIMODA wrote:

> If I understand the code and spec correctly, TX interrupts are
> generated when TXDCTL.WTHRESH descriptors have been accumulated
> and written back.
>
> I tentatively changed TXDCTL.WTHRESH to 1, and the latency spikes
> seem to disappear.
>
> drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -181,7 +181,7 @@ struct e1000_info;
>  #define E1000_TXDCTL_DMA_BURST_ENABLE \
>  	(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
>  	 E1000_TXDCTL_COUNT_DESC | \
> -	 (5 << 16) | /* wthresh must be +1 more than desired */\
> +	 (1 << 16) | /* wthresh must be +1 more than desired */\
>  	 (1 << 8)  | /* hthresh */ \
>  	 0x1f)       /* pthresh */

Was this patch officially submitted ?

Thanks !
On Wed, 06 Jun 2012 07:10:13 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2012-05-29 at 23:25 +0900, Hiroaki SHIMODA wrote:
>
> > If I understand the code and spec correctly, TX interrupts are
> > generated when TXDCTL.WTHRESH descriptors have been accumulated
> > and written back.
> >
> > I tentatively changed TXDCTL.WTHRESH to 1, and the latency spikes
> > seem to disappear.
> >
> > drivers/net/ethernet/intel/e1000e/e1000.h
> > @@ -181,7 +181,7 @@ struct e1000_info;
> >  #define E1000_TXDCTL_DMA_BURST_ENABLE \
> >  	(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
> >  	 E1000_TXDCTL_COUNT_DESC | \
> > -	 (5 << 16) | /* wthresh must be +1 more than desired */\
> > +	 (1 << 16) | /* wthresh must be +1 more than desired */\
> >  	 (1 << 8)  | /* hthresh */ \
> >  	 0x1f)       /* pthresh */
>
> Was this patch officially submitted ?
>
> Thanks !

Sorry for the long delay. I'll post.
(I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5)
On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:

> Sorry for the long delay. I'll post.
> (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5)

I don't like changing WTHRESH wholesale, because making the global
change to WTHRESH on e1000e just to fix this one bug (likely specific
to a particular chip/hardware) will adversely affect performance on
many models supported by e1000e that do not demonstrate any problem.
We could possibly check something in for just ESB2LAN (S5000 chipset).

Other people (Tom Herbert) with this same chipset have been unable to
reproduce this issue, right?
On Wed, 2012-06-06 at 09:26 -0700, Jesse Brandeburg wrote:
> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
> > Sorry for the long delay. I'll post.
> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5)
>
> I don't like changing WTHRESH wholesale, because making the global
> change to WTHRESH on e1000e just to fix this one bug (likely specific
> to a particular chip/hardware) will adversely affect performance on
> many models supported by e1000e that do not demonstrate any problem.
> We could possibly check something in for just ESB2LAN (S5000 chipset).
>
> Other people (Tom Herbert) with this same chipset have been unable to
> reproduce this issue, right?

That's because Tom Herbert (and me on my T420s laptop) don't have
FLAG2_DMA_BURST in adapter->flags2.

So it depends on some e1000e, not all...

You can't hold a TX completion indefinitely; this breaks BQL but also
other stuff.
On Wed, 6 Jun 2012 09:26:35 -0700
Jesse Brandeburg <jesse.brandeburg@intel.com> wrote:

> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
> > Sorry for the long delay. I'll post.
> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5)
>
> I don't like changing WTHRESH wholesale, because making the global
> change to WTHRESH on e1000e just to fix this one bug (likely specific
> to a particular chip/hardware) will adversely affect performance on
> many models supported by e1000e that do not demonstrate any problem.
> We could possibly check something in for just ESB2LAN (S5000 chipset).
>
> Other people (Tom Herbert) with this same chipset have been unable to
> reproduce this issue, right?

I understand your performance concern.

The affected chips would be e1000_82571, e1000_82572, e1000_82574
and e1000_80003es2lan, which have the FLAG2_DMA_BURST bit in
adapter->flags2.
Anyway, I have no objection if the Intel guys NACK my patch and
provide the right fix. But in that case please consider the 82574L
chip too, where I observed similar behaviour.

Thanks.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 06 Jun 2012 19:05:04 +0200

> You can't hold a TX completion indefinitely; this breaks BQL but also
> other stuff.

True.
I'm not exactly sure what the exact effect of WTHRESH is here. Does
the device coalesce 5 completions regardless of size? Would the
problem be avoided if the BQL limit_min were MTU, or could the same
issue be hit with larger-than-64-byte packets?

Tom

On Wed, Jun 6, 2012 at 10:19 AM, Hiroaki SHIMODA
<shimoda.hiroaki@gmail.com> wrote:
> On Wed, 6 Jun 2012 09:26:35 -0700
> Jesse Brandeburg <jesse.brandeburg@intel.com> wrote:
>
>> On Wed, 6 Jun 2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> wrote:
>> > Sorry for the long delay. I'll post.
>> > (I have no idea how to fix this problem while keeping TXDCTL.WTHRESH at 5)
>>
>> I don't like changing WTHRESH wholesale, because making the global
>> change to WTHRESH on e1000e just to fix this one bug (likely specific
>> to a particular chip/hardware) will adversely affect performance on
>> many models supported by e1000e that do not demonstrate any problem.
>> We could possibly check something in for just ESB2LAN (S5000 chipset).
>>
>> Other people (Tom Herbert) with this same chipset have been unable to
>> reproduce this issue, right?
>
> I understand your performance concern.
>
> The affected chips would be e1000_82571, e1000_82572, e1000_82574
> and e1000_80003es2lan, which have the FLAG2_DMA_BURST bit in
> adapter->flags2.
> Anyway, I have no objection if the Intel guys NACK my patch and
> provide the right fix. But in that case please consider the 82574L
> chip too, where I observed similar behaviour.
>
> Thanks.
From: Tom Herbert <therbert@google.com>
Date: Wed, 6 Jun 2012 11:21:40 -0700

> I'm not exactly sure what the exact effect of WTHRESH is here. Does
> the device coalesce 5 completions regardless of size? Would the
> problem be avoided if the BQL limit_min were MTU, or could the same
> issue be hit with larger-than-64-byte packets?

The problem is that no TX completions are signalled until at least
WTHRESH are pending.

BQL is the least of the problems generated by this kind of behavior.

All drivers must TX complete in a small, finite amount of time, so it
is absolutely illegal to have the behavior that WTHRESH > 1 gives.
On Wed, 06 Jun 2012 11:23:32 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Wed, 6 Jun 2012 11:21:40 -0700
>
> > I'm not exactly sure what the exact effect of WTHRESH is here. Does
> > the device coalesce 5 completions regardless of size? Would the
> > problem be avoided if the BQL limit_min were MTU, or could the same
> > issue be hit with larger-than-64-byte packets?
>
> The problem is that no TX completions are signalled until at least
> WTHRESH are pending.
>
> BQL is the least of the problems generated by this kind of behavior.
>
> All drivers must TX complete in a small, finite amount of time, so it
> is absolutely illegal to have the behavior that WTHRESH > 1 gives.

The TX completion is also controlled by the programming of the
corresponding interrupt moderation register (EITR). It makes sense to
hold off a little bit to try and reduce the TX completion interrupt
load.

From the Intel manual:

  Descriptors are written back in one of three cases:
  • TXDCTL[n].WTHRESH = 0b and a descriptor which has RS set is ready
    to be written back
  • The corresponding EITR counter has reached zero
  • TXDCTL[n].WTHRESH > 0b and TXDCTL[n].WTHRESH descriptors have
    accumulated

  For the first condition, write-backs are immediate. This is the
  default operation and is backward compatible with previous device
  implementations. The other two conditions are only valid if
  descriptor bursting is enabled (Section 8.12.13). In the second
  condition, the EITR counter is used to force timely write-back of
  descriptors. The first packet after timer initialization starts the
  timer. Timer expiration flushes any accumulated descriptors and sets
  an interrupt event (TXDW). For the final condition, if
  TXDCTL[n].WTHRESH descriptors are ready for write-back, the
  write-back is performed.
--- netdev.c	2012-05-12 20:08:37.000000000 +0300
+++ netdev.c.patched	2012-05-17 16:32:28.895760472 +0300
@@ -1135,7 +1135,7 @@
 
 	tx_ring->next_to_clean = i;
 
-	netdev_completed_queue(netdev, pkts_compl, bytes_compl);
+//	netdev_completed_queue(netdev, pkts_compl, bytes_compl);
 
 #define TX_WAKE_THRESHOLD 32
 	if (count && netif_carrier_ok(netdev) &&
@@ -2263,7 +2263,7 @@
 		e1000_put_txbuf(adapter, buffer_info);
 	}
 
-	netdev_reset_queue(adapter->netdev);
+//	netdev_reset_queue(adapter->netdev);
 
 	size = sizeof(struct e1000_buffer) * tx_ring->count;
 	memset(tx_ring->buffer_info, 0, size);
@@ -5056,7 +5056,7 @@
 	/* if count is 0 then mapping error has occurred */
 	count = e1000_tx_map(adapter, skb, first, max_per_txd, nr_frags, mss);
 	if (count) {
-		netdev_sent_queue(netdev, skb->len);
+//		netdev_sent_queue(netdev, skb->len);
 		e1000_tx_queue(adapter, tx_flags, count);
 
 		/* Make sure there is space in the ring for the next