
Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

Message ID 29250f87b1d84aacb8aa312935582291@FE-MBX1012.de.bosch.com
State Not Applicable

Commit Message

Koehrer Mathias (ETAS/ESW5) Oct. 7, 2016, 8:58 a.m. UTC
Hi Mitch,
> > >
> > > Although, to be clear, it isn't the fact that there exists 8 threads,
> > > it's that the device is firing all 8 interrupts at the same time.  The
> > > time spent in hardirq context just waking up all 8 of those threads
> > > (and the cyclictest wakeup) is enough to cause your regression.
> > >
> > > netdev/igb folks-
> > >
> > > Under what conditions should it be expected that the i350 trigger all
> > > of the TxRx interrupts simultaneously?  Any ideas here?
> 
> I can answer that! I wrote that code.
> 
> We trigger the interrupts once a second because MSI and MSI-X interrupts are NOT
> guaranteed to be delivered. If a delivery is missed, the queues being serviced by that
> "lost" interrupt are completely stuck.
> 
> The device automatically masks each interrupt vector after it fires, expecting the
> ISR to re-enable the vector after processing is complete. If the interrupt is lost, the
> ISR doesn't run, so the vector ends up permanently masked. At this point, any
> queues associated with that vector are stuck. The only recovery is through the
> netdev watchdog, which initiates a reset.
> 
> During development of igb, we had several platforms with chipsets that routinely
> dropped MSI messages under stress. Things would be running fine and then, pow,
> all the traffic on a queue would stop.
> 
> So, I added code to fire each vector once per second. Just unmasking the interrupt
> isn't enough - we need to trigger the ISR to get the queues cleaned up so the device
> can work again.
> 
> Is this workaround still needed? I don't know. Modern chipsets don't break a sweat
> handling gigabit-speed traffic, and they *probably* don't drop interrupts. But I'd still
> rather have that insurance.
> 
> You could try to remove the write to the EICS registers in the watchdog task to see if
> that takes care of your problem. But I wouldn't want to remove that code
> permanently, because we have seen lost interrupts in the past.
> 
> You also could try staggering the writes so that not all vectors fire each second. But
> then you'll potentially incur a much longer delay if an interrupt does get lost, which
> means you could trigger netdev watchdog events.
> 
> -Mitch

Thanks for the explanation. I have now switched over to kernel 4.8-rt1;
it shows the very same effect.
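
(For reference, the once-per-second kick Mitch describes is this block in
igb_watchdog_task(); in the 4.8 sources it looks roughly as follows - a
sketch that also shows the queue-vector loop the diff context below elides:)

        /* Cause software interrupt to ensure Rx ring is cleaned:
         * fire every MSI-X vector once per second so that a lost
         * interrupt cannot leave a queue permanently stuck. */
        int i;  /* declared at function scope in the driver */

        if (adapter->flags & IGB_FLAG_HAS_MSIX) {
                u32 eics = 0;

                for (i = 0; i < adapter->num_q_vectors; i++)
                        eics |= adapter->q_vector[i]->eims_value;
                wr32(E1000_EICS, eics);         /* kick all vectors */
        } else {
                wr32(E1000_ICS, E1000_ICS_RXDMT0);
        }
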
I modified the in-kernel igb driver's igb_main.c (function igb_watchdog_task) to comment
out the access to the EICS registers:

--- igb_main.c.orig     2016-10-07 10:43:37.855873754 +0200
+++ igb_main.c  2016-10-07 10:31:51.451346292 +0200
@@ -4613,6 +4613,7 @@ no_wait:
        }

        /* Cause software interrupt to ensure Rx ring is cleaned */
+#if 0
        if (adapter->flags & IGB_FLAG_HAS_MSIX) {
                u32 eics = 0;

@@ -4622,6 +4623,7 @@ no_wait:
        } else {
                wr32(E1000_ICS, E1000_ICS_RXDMT0);
        }
+#endif

        igb_spoof_check(adapter);
        igb_ptp_rx_hang(adapter);

The result is now slightly better; however, the max latency measured by
cyclictest is still much higher than with kernel 3.18.27-rt27.
I have now enabled all trace events; the execution of
# cyclictest -a -i 105 -m -n -p 80 -t 1  -b 30 -C
delivers the following trace:


cyclicte-10062   0....... 10025813402us : sys_exit: NR 230 = 0
cyclicte-10062   0....... 10025813402us : sys_enter: NR 230 (1, 1, 7ffff73ff930, 0, 0, 2bd7e12e)
cyclicte-10062   0....... 10025813402us : hrtimer_init: hrtimer=ffff88040a017e18 clockid=CLOCK_MONOTONIC mode=HRTIMER_MODE_ABS
cyclicte-10062   0d...1.. 10025813403us : hrtimer_start: hrtimer=ffff88040a017e18 function=hrtimer_wakeup expires=10024735653388 softexpires=10024735653388
cyclicte-10062   0d...1.. 10025813403us : write_msr: 6e0, value 20ca630b9aef
cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: Start context switch
cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: End context switch
cyclicte-10062   0d...2.. 10025813404us : sched_switch: prev_comm=cyclictest prev_pid=10062 prev_prio=19 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120
cyclicte-10062   0d...2.. 10025813404us+: x86_fpu_regs_deactivated: x86/fpu: ffff88040c603ec0 fpregs_active: 0 fpstate_active: 1 counter: 69 xfeatures: 2 xcomp_bv: 0
kworker/-1388    0d..h... 10025813468us : irq_handler_entry: irq=48 name=eth2-tx-0
kworker/-1388    0d..h... 10025813468us : irq_handler_exit: irq=48 ret=handled
kworker/-1388    0d..h1.. 10025813469us : sched_waking: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
kworker/-1388    0dN.h2.. 10025813469us : sched_wakeup: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
kworker/-1388    0dN.h1.. 10025813470us : irq_handler_entry: irq=47 name=eth2-rx-0
kworker/-1388    0dN.h1.. 10025813470us : irq_handler_exit: irq=47 ret=handled
kworker/-1388    0dN.h2.. 10025813471us : sched_waking: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000
kworker/-1388    0dN.h3.. 10025813471us : sched_wakeup: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000
kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: Start context switch
kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: End context switch
kworker/-1388    0dN..2.. 10025813472us : sched_stat_runtime: comm=kworker/0:3 pid=1388 runtime=67566 [ns] vruntime=101216288332 [ns]
kworker/-1388    0d...2.. 10025813472us : sched_switch: prev_comm=kworker/0:3 prev_pid=1388 prev_prio=120 prev_state=R+ ==> next_comm=irq/48-eth2-tx- next_pid=10057 next_prio=49
irq/48-e-10057   0d....11 10025813473us : softirq_raise: vec=3 [action=NET_RX]
irq/48-e-10057   0.....12 10025813474us : softirq_entry: vec=3 [action=NET_RX]
irq/48-e-10057   0.....12 10025813475us : napi_poll: napi poll on napi struct ffff88040a582850 for device eth2 work 0 budget 64
irq/48-e-10057   0.....12 10025813475us : softirq_exit: vec=3 [action=NET_RX]
irq/48-e-10057   0d...1.. 10025813475us : rcu_utilization: Start context switch
irq/48-e-10057   0d...1.. 10025813476us : rcu_utilization: End context switch
irq/48-e-10057   0d...2.. 10025813476us : sched_switch: prev_comm=irq/48-eth2-tx- prev_pid=10057 prev_prio=49 prev_state=S ==> next_comm=irq/47-eth2-rx- next_pid=10056 next_prio=49
irq/47-e-10056   0d....11 10025813477us : softirq_raise: vec=3 [action=NET_RX]
irq/47-e-10056   0.....12 10025813477us : softirq_entry: vec=3 [action=NET_RX]
irq/47-e-10056   0.....12 10025813478us : napi_poll: napi poll on napi struct ffff88040a580850 for device eth2 work 0 budget 64
irq/47-e-10056   0.....12 10025813478us : softirq_exit: vec=3 [action=NET_RX]
irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: Start context switch
irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: End context switch
irq/47-e-10056   0d...2.. 10025813479us+: sched_switch: prev_comm=irq/47-eth2-rx- prev_pid=10056 prev_prio=49 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120
kworker/-1388    0d..h... 10025813516us : local_timer_entry: vector=239
kworker/-1388    0d..h1.. 10025813516us : hrtimer_interrupt: cpu=0 offset=-28999 curr=kworker/0:3[120] thread=cyclictest[19]
kworker/-1388    0d..h1.. 10025813517us : hrtimer_cancel: hrtimer=ffff88040a017e18
kworker/-1388    0d..h... 10025813517us : hrtimer_expire_entry: hrtimer=ffff88040a017e18 function=hrtimer_wakeup now=10024735682387
kworker/-1388    0d..h1.. 10025813517us : sched_waking: comm=cyclictest pid=10062 prio=19 target_cpu=000
kworker/-1388    0dN.h2.. 10025813517us : sched_wakeup: comm=cyclictest pid=10062 prio=19 target_cpu=000
kworker/-1388    0dN.h... 10025813517us : hrtimer_expire_exit: hrtimer=ffff88040a017e18
kworker/-1388    0dN.h1.. 10025813518us : hrtimer_interrupt: cpu=0 offset=317613 curr=kworker/0:3[120] thread=<none>[-1]
kworker/-1388    0dN.h... 10025813518us : write_msr: 6e0, value 20ca631e99f8
kworker/-1388    0dN.h... 10025813518us : local_timer_exit: vector=239
kworker/-1388    0dN..1.. 10025813518us : rcu_utilization: Start context switch
kworker/-1388    0dN..1.. 10025813518us : rcu_utilization: End context switch
kworker/-1388    0dN..2.. 10025813518us : sched_stat_runtime: comm=kworker/0:3 pid=1388 runtime=38173 [ns] vruntime=101216326505 [ns]
kworker/-1388    0d...2.. 10025813519us : sched_switch: prev_comm=kworker/0:3 prev_pid=1388 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=10062 next_prio=19
kworker/-1388    0d...2.. 10025813519us : x86_fpu_regs_activated: x86/fpu: ffff88040c603ec0 fpregs_active: 1 fpstate_active: 1 counter: 70 xfeatures: 2 xcomp_bv: 0
kworker/-1388    0d...2.. 10025813519us : write_msr: c0000100, value 7ffff7400700
cyclicte-10062   0....... 10025813519us : sys_exit: NR 230 = 0
cyclicte-10062   0....... 10025813524us : sys_enter: NR 1 (5, 7ffff7400300, 1f, 7ffff77a5460, 2, 7ffff744c99a)
cyclicte-10062   0.....11 10025813526us : tracing_mark_write: hit latency threshold (32 > 30)
cyclicte-10062   0....... 10025813526us : sys_exit: NR 1 = 31
cyclicte-10062   0....... 10025813527us : sys_enter: NR 1 (4, 4076b0, 2, 7ffff77a5460, 2, 7ffff744c99a)

Also, I notice that there is still roughly one irq every 2 seconds on irq/47 (eth2-rx-0) and irq/48 (eth2-tx-0).

Any idea or feedback on this?

Thanks

Mathias

Comments

Julia Cartwright Oct. 10, 2016, 7:39 p.m. UTC | #1
Hello Mathias-

On Fri, Oct 07, 2016 at 08:58:08AM +0000, Koehrer Mathias (ETAS/ESW5) wrote:
[..]
> I modified the in-kernel's igb_main.c (function igb_watchdog_task) to comment out
> the access to the EICS registers:
> 
> --- igb_main.c.orig     2016-10-07 10:43:37.855873754 +0200
> +++ igb_main.c  2016-10-07 10:31:51.451346292 +0200
> @@ -4613,6 +4613,7 @@ no_wait:
>         }
> 
>         /* Cause software interrupt to ensure Rx ring is cleaned */
> +#if 0
>         if (adapter->flags & IGB_FLAG_HAS_MSIX) {
>                 u32 eics = 0;
> 
> @@ -4622,6 +4623,7 @@ no_wait:
>         } else {
>                 wr32(E1000_ICS, E1000_ICS_RXDMT0);
>         }
> +#endif
> 
>         igb_spoof_check(adapter);
>         igb_ptp_rx_hang(adapter);
> 
> 
> The result is now slighty better, however the max latency that has been measured by
> cyclictest is still much higher compared to the values of kernel 3.18.27-rt27.
> I have now enabled all events, the execution of 
> # cyclictest -a -i 105 -m -n -p 80 -t 1  -b 30 -C
> delivers the following trace values

There is something still fishy with these traces...

> cyclicte-10062   0....... 10025813402us : sys_exit: NR 230 = 0
> cyclicte-10062   0....... 10025813402us : sys_enter: NR 230 (1, 1, 7ffff73ff930, 0, 0, 2bd7e12e)
> cyclicte-10062   0....... 10025813402us : hrtimer_init: hrtimer=ffff88040a017e18 clockid=CLOCK_MONOTONIC mode=HRTIMER_MODE_ABS
> cyclicte-10062   0d...1.. 10025813403us : hrtimer_start: hrtimer=ffff88040a017e18 function=hrtimer_wakeup expires=10024735653388 softexpires=10024735653388

cyclictest thread sleeps waiting for wakeup at 10024735653388.

> cyclicte-10062   0d...1.. 10025813403us : write_msr: 6e0, value 20ca630b9aef
> cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: Start context switch
> cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: End context switch
> cyclicte-10062   0d...2.. 10025813404us : sched_switch: prev_comm=cyclictest prev_pid=10062 prev_prio=19 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120
> cyclicte-10062   0d...2.. 10025813404us+: x86_fpu_regs_deactivated: x86/fpu: ffff88040c603ec0 fpregs_active: 0 fpstate_active: 1 counter: 69 xfeatures: 2 xcomp_bv: 0
> kworker/-1388    0d..h... 10025813468us : irq_handler_entry: irq=48 name=eth2-tx-0
> kworker/-1388    0d..h... 10025813468us : irq_handler_exit: irq=48 ret=handled
> kworker/-1388    0d..h1.. 10025813469us : sched_waking: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
> kworker/-1388    0dN.h2.. 10025813469us : sched_wakeup: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
> kworker/-1388    0dN.h1.. 10025813470us : irq_handler_entry: irq=47 name=eth2-rx-0
> kworker/-1388    0dN.h1.. 10025813470us : irq_handler_exit: irq=47 ret=handled
> kworker/-1388    0dN.h2.. 10025813471us : sched_waking: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000
> kworker/-1388    0dN.h3.. 10025813471us : sched_wakeup: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000

So, kworker/0:3 was busy doing something, and we received both tx and rx
interrupts from eth2, waking up the relevant threads.

> kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: Start context switch
> kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: End context switch
> kworker/-1388    0dN..2.. 10025813472us : sched_stat_runtime: comm=kworker/0:3 pid=1388 runtime=67566 [ns] vruntime=101216288332 [ns]
> kworker/-1388    0d...2.. 10025813472us : sched_switch: prev_comm=kworker/0:3 prev_pid=1388 prev_prio=120 prev_state=R+ ==> next_comm=irq/48-eth2-tx- next_pid=10057 next_prio=49
> irq/48-e-10057   0d....11 10025813473us : softirq_raise: vec=3 [action=NET_RX]
> irq/48-e-10057   0.....12 10025813474us : softirq_entry: vec=3 [action=NET_RX]
> irq/48-e-10057   0.....12 10025813475us : napi_poll: napi poll on napi struct ffff88040a582850 for device eth2 work 0 budget 64

Hmm, the irq/48-eth2-tx- thread is raising NET_RX?  That seems...wrong.

> irq/48-e-10057   0.....12 10025813475us : softirq_exit: vec=3 [action=NET_RX]
> irq/48-e-10057   0d...1.. 10025813475us : rcu_utilization: Start context switch
> irq/48-e-10057   0d...1.. 10025813476us : rcu_utilization: End context switch
> irq/48-e-10057   0d...2.. 10025813476us : sched_switch: prev_comm=irq/48-eth2-tx- prev_pid=10057 prev_prio=49 prev_state=S ==> next_comm=irq/47-eth2-rx- next_pid=10056 next_prio=49
> irq/47-e-10056   0d....11 10025813477us : softirq_raise: vec=3 [action=NET_RX]
> irq/47-e-10056   0.....12 10025813477us : softirq_entry: vec=3 [action=NET_RX]
> irq/47-e-10056   0.....12 10025813478us : napi_poll: napi poll on napi struct ffff88040a580850 for device eth2 work 0 budget 64
> irq/47-e-10056   0.....12 10025813478us : softirq_exit: vec=3 [action=NET_RX]

Oh, then we do the same thing again?  Hrmph...why?

> irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: Start context switch
> irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: End context switch
> irq/47-e-10056   0d...2.. 10025813479us+: sched_switch: prev_comm=irq/47-eth2-rx- prev_pid=10056 prev_prio=49 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120

Done handling the two interrupts.  Back to whatever kworker/0:3 was up
to...

> kworker/-1388    0d..h... 10025813516us : local_timer_entry: vector=239
> kworker/-1388    0d..h1.. 10025813516us : hrtimer_interrupt: cpu=0 offset=-28999 curr=kworker/0:3[120] thread=cyclictest[19]
> kworker/-1388    0d..h1.. 10025813517us : hrtimer_cancel: hrtimer=ffff88040a017e18
> kworker/-1388    0d..h... 10025813517us : hrtimer_expire_entry: hrtimer=ffff88040a017e18 function=hrtimer_wakeup now=10024735682387

Okay, we finally received our wakeup event.  We were expecting to be
woken up at 10024735653388ns, but were actually woken up at 10024735682387ns.

  10024735682387 - 10024735653388 = 28999ns

Our timer fired ~29us late!  But why...?

Sorry I don't have answers, just more questions.  I do wonder what
kworker/0:3 was up to at the time the timer interrupt should have fired.
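
One way to find out would be to also enable the workqueue trace events for
the next run, e.g.:

   # echo 1 > /sys/kernel/debug/tracing/events/workqueue/workqueue_execute_start/enable
   # echo 1 > /sys/kernel/debug/tracing/events/workqueue/workqueue_execute_end/enable

The workqueue_execute_start/end records name the work function being run,
which should show what kworker/0:3 is doing when the timer tries to fire.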

   Julia
Koehrer Mathias (ETAS/ESW5) Oct. 13, 2016, 6:15 a.m. UTC | #2
Hi Julia,

thanks for the detailed analysis!
> 
> [...] 
> Okay, we finally received our wakeup event.  We were expecting to be woken up at
> 10024735653388ns, but were actually woken up at 10024735682387ns.
> 
>   10024735682387 - 10024735653388 = 28999ns
> 
> Our timer fired ~29us late!  But why...?
> 
> Sorry I don't have answers, just more questions.  I do wonder what
> kworker/0:3 was up to at the time the timer interrupt should have fired.
> 
>    Julia
I have now instrumented the igb driver with trace points that allow me to identify in
more detail what's going on.
This delivered interesting results!

Here are some places where I added traces:
In file igb_ptp.c:
void igb_ptp_rx_hang(struct igb_adapter *adapter)
{
        struct e1000_hw *hw = &adapter->hw;
        unsigned long rx_event;
        u32 tsyncrxctl;
        trace_igb(700);
        tsyncrxctl = rd32(E1000_TSYNCRXCTL);
        trace_igb(701);

        /* Other hardware uses per-packet timestamps */
        if (hw->mac.type != e1000_82576)
                return;
...

In file igb_main.c:
static void igb_check_lvmmc(struct igb_adapter *adapter)
{
        struct e1000_hw *hw = &adapter->hw;
        u32 lvmmc;

        trace_igb(600);
        lvmmc = rd32(E1000_LVMMC);
        trace_igb(601);
        if (lvmmc) {
...
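
(trace_igb() is a small ad-hoc tracepoint; a minimal TRACE_EVENT definition
along the following lines is enough to produce the "igb: val: N" records in
the traces below - shown here as a sketch, not the literal header I used:)

#undef TRACE_SYSTEM
#define TRACE_SYSTEM igb

#if !defined(_TRACE_IGB_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_IGB_H

#include <linux/tracepoint.h>

/* One event, one integer payload: trace_igb(700), trace_igb(701), ... */
TRACE_EVENT(igb,
        TP_PROTO(int val),
        TP_ARGS(val),
        TP_STRUCT__entry(
                __field(int, val)
        ),
        TP_fast_assign(
                __entry->val = val;
        ),
        TP_printk("val: %d", __entry->val)
);

#endif /* _TRACE_IGB_H */

/* This must be outside the multi-read guard */
#include <trace/define_trace.h>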

When I now run my test, I get the following trace:
[...]
kworker/-607     0....... 107315621us+: igb: val: 700
kworker/-607     0d..h... 107315640us : irq_handler_entry: irq=47 name=eth2-rx-0
kworker/-607     0d..h... 107315640us : irq_handler_exit: irq=47 ret=handled
kworker/-607     0d..h1.. 107315640us : sched_waking: comm=irq/47-eth2-rx- pid=18009 prio=49 target_cpu=000
kworker/-607     0dN.h2.. 107315641us : sched_wakeup: comm=irq/47-eth2-rx- pid=18009 prio=49 target_cpu=000
kworker/-607     0dN.h1.. 107315643us : irq_handler_entry: irq=48 name=eth2-tx-0
kworker/-607     0dN.h1.. 107315643us : irq_handler_exit: irq=48 ret=handled
kworker/-607     0dN.h2.. 107315643us : sched_waking: comm=irq/48-eth2-tx- pid=18010 prio=49 target_cpu=000
kworker/-607     0dN.h3.. 107315644us : sched_wakeup: comm=irq/48-eth2-tx- pid=18010 prio=49 target_cpu=000
kworker/-607     0dN..1.. 107315644us : rcu_utilization: Start context switch
kworker/-607     0dN..1.. 107315644us : rcu_utilization: End context switch
kworker/-607     0dN..2.. 107315644us : sched_stat_runtime: comm=kworker/0:1 pid=607 runtime=88996 [ns] vruntime=49754678074 [ns]
kworker/-607     0d...2.. 107315645us : sched_switch: prev_comm=kworker/0:1 prev_pid=607 prev_prio=120 prev_state=R+ ==> next_comm=irq/47-eth2-rx- next_pid=18009 next_prio=49
irq/47-e-18009   0d....11 107315646us : softirq_raise: vec=3 [action=NET_RX]
irq/47-e-18009   0.....12 107315646us : softirq_entry: vec=3 [action=NET_RX]
irq/47-e-18009   0.....12 107315647us : napi_poll: napi poll on napi struct ffff88040ae58c50 for device eth2 work 0 budget 64
irq/47-e-18009   0.....12 107315647us : softirq_exit: vec=3 [action=NET_RX]
irq/47-e-18009   0d...1.. 107315648us : rcu_utilization: Start context switch
irq/47-e-18009   0d...1.. 107315648us : rcu_utilization: End context switch
irq/47-e-18009   0d...2.. 107315648us : sched_switch: prev_comm=irq/47-eth2-rx- prev_pid=18009 prev_prio=49 prev_state=S ==> next_comm=irq/48-eth2-tx- next_pid=18010 next_prio=49
irq/48-e-18010   0d....11 107315649us : softirq_raise: vec=3 [action=NET_RX]
irq/48-e-18010   0.....12 107315649us : softirq_entry: vec=3 [action=NET_RX]
irq/48-e-18010   0.....12 107315650us : napi_poll: napi poll on napi struct ffff88040ae5f450 for device eth2 work 0 budget 64
irq/48-e-18010   0.....12 107315650us : softirq_exit: vec=3 [action=NET_RX]
irq/48-e-18010   0d...1.. 107315651us : rcu_utilization: Start context switch
irq/48-e-18010   0d...1.. 107315651us : rcu_utilization: End context switch
irq/48-e-18010   0d...2.. 107315651us : sched_switch: prev_comm=irq/48-eth2-tx- prev_pid=18010 prev_prio=49 prev_state=S ==> next_comm=kworker/0:1 next_pid=607 next_prio=120
kworker/-607     0....... 107315652us : igb: val: 701
kworker/-607     0....... 107315652us : igb: val: 106
kworker/-607     0....... 107315652us : igb: val: 107
kworker/-607     0....... 107315652us+: igb: val: 600
kworker/-607     0d..h... 107315689us : local_timer_entry: vector=239
kworker/-607     0d..h1.. 107315689us : hrtimer_interrupt: cpu=0 offset=-34521 curr=kworker/0:1[120] thread=cyclictest[19]
kworker/-607     0d..h1.. 107315689us : hrtimer_cancel: hrtimer=ffff8803d42efe18
kworker/-607     0d..h... 107315689us : hrtimer_expire_entry: hrtimer=ffff8803d42efe18 function=hrtimer_wakeup now=752735681960
kworker/-607     0d..h1.. 107315689us : sched_waking: comm=cyclictest pid=18015 prio=19 target_cpu=000
kworker/-607     0dN.h2.. 107315690us : sched_wakeup: comm=cyclictest pid=18015 prio=19 target_cpu=000
kworker/-607     0dN.h... 107315690us : hrtimer_expire_exit: hrtimer=ffff8803d42efe18
kworker/-607     0dN.h1.. 107315690us : hrtimer_interrupt: cpu=0 offset=318040 curr=kworker/0:1[120] thread=<none>[-1]
kworker/-607     0dN.h... 107315690us : write_msr: 6e0, value 28096cdb9ce
kworker/-607     0dN.h... 107315690us : local_timer_exit: vector=239
kworker/-607     0dN..1.. 107315690us : rcu_utilization: Start context switch
kworker/-607     0dN..1.. 107315691us : rcu_utilization: End context switch
kworker/-607     0dN..2.. 107315691us : sched_stat_runtime: comm=kworker/0:1 pid=607 runtime=38439 [ns] vruntime=49754716513 [ns]
kworker/-607     0d...2.. 107315691us : sched_switch: prev_comm=kworker/0:1 prev_pid=607 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=18015 next_prio=19
kworker/-607     0d...2.. 107315691us : x86_fpu_regs_activated: x86/fpu: ffff8803f7f55940 fpregs_active: 1 fpstate_active: 1 counter: 99 xfeatures: 2 xcomp_bv: 0
kworker/-607     0d...2.. 107315691us : write_msr: c0000100, value 7ffff7400700
cyclicte-18015   0....... 107315692us : sys_exit: NR 230 = 0
cyclicte-18015   0....... 107315697us : sys_enter: NR 1 (5, 7ffff7400300, 1f, 7ffff77a5460, 2, 7ffff744c99a)
cyclicte-18015   0.....11 107315698us : tracing_mark_write: hit latency threshold (37 > 33)
cyclicte-18015   0....... 107315699us : sys_exit: NR 1 = 31
cyclicte-18015   0....... 107315699us : sys_enter: NR 1 (4, 4076b0, 2, 7ffff77a5460, 2, 7ffff744c99a)


Also very interesting is the trace that I get on an idle system - without cyclictest running.
When I just enable my igb tracepoints, I get the following result:
[...]
kworker/-607     0....... 585779012us+: igb: val: 700
kworker/-607     0....... 585779042us : igb: val: 701
kworker/-607     0....... 585779042us : igb: val: 106
kworker/-607     0....... 585779043us : igb: val: 107
kworker/-607     0....... 585779043us+: igb: val: 600
kworker/-607     0....... 585779080us : igb: val: 601

The time between my trace points 700 and 701 is about 30us,
and the time between my trace points 600 and 601 is even 37us!!
The code in between is
     tsyncrxctl = rd32(E1000_TSYNCRXCTL);
and
     lvmmc = rd32(E1000_LVMMC);
respectively.
In both cases this is a single read from a register.
I have no idea why this single read could take that much time!
Is it possible that the igb hardware is in a state that delays the read access,
and that this is why the whole I/O system might be delayed?

I hope this provides more detail for understanding this issue.

Thanks for any feedback!

Regards

Mathias
Koehrer Mathias (ETAS/ESW5) Oct. 13, 2016, 10:57 a.m. UTC | #3
Hi all!
> 
> Hi Julia,
> 
> thanks for the detailed analysis!
> >
> > [...]
> > Okay, we finally received our wakeup event.  We were expecting to be
> > woken up at 10024735653388ns, but were actually woken up at
> 10024735682387ns.
> >
> >   10024735682387 - 10024735653388 = 28999ns
> >
> > Our timer fired ~29us late!  But why...?
> >
> > Sorry I don't have answers, just more questions.  I do wonder what
> > kworker/0:3 was up to at the time the timer interrupt should have fired.
> >
> >    Julia
> I have now instrumented the igb driver to generate trace points that allows me to
> identify in more details what's going on.
> This delivered interesting results!
> 
> Here are some places where I added traces:
> In file igb_ptp.c:
> void igb_ptp_rx_hang(struct igb_adapter *adapter) {
>         struct e1000_hw *hw = &adapter->hw;
>         unsigned long rx_event;
>         u32 tsyncrxctl;
>         trace_igb(700);
>         tsyncrxctl = rd32(E1000_TSYNCRXCTL);
>         trace_igb(701);
> 
>         /* Other hardware uses per-packet timestamps */
>         if (hw->mac.type != e1000_82576)
>                 return;
> ...
> 
> In file igb_main.c:
> static void igb_check_lvmmc(struct igb_adapter *adapter) {
>         struct e1000_hw *hw = &adapter->hw;
>         u32 lvmmc;
> 
>         trace_igb(600);
>         lvmmc = rd32(E1000_LVMMC);
>         trace_igb(601);
>         if (lvmmc) {
> ...
> 
> When I run now my test, I get the following trace:
> [...]
> 
> Very interesting is also the trace that I get in an idle system - without cyclictest
> running.
> When I just enable my igb tracepoint I got the following result:
> [...]
> kworker/-607     0....... 585779012us+: igb: val: 700
> kworker/-607     0....... 585779042us : igb: val: 701
> kworker/-607     0....... 585779042us : igb: val: 106
> kworker/-607     0....... 585779043us : igb: val: 107
> kworker/-607     0....... 585779043us+: igb: val: 600
> kworker/-607     0....... 585779080us : igb: val: 601
> 
> The time between my trace points 700 and 701 is about 30us, the time between my
> trace points 600 and 601 is even 37us!!
> The code in between is
>      tsyncrxctl = rd32(E1000_TSYNCRXCTL); resp.
>      lvmmc = rd32(E1000_LVMMC);
> 
> In both cases this is a single read from a register.
> I have no idea why this single read could take that much time!
> Is it possible that the igb hardware is in a state that delays the read access and this is
> why the whole I/O system might be delayed?
> 

To have a proper comparison, I did the same with kernel 3.18.27-rt27.
Also here, I instrumented the igb driver to get traces for the rd32 calls.
However, here everything is generally much faster!
On the idle system, the maximum I got for a read was about 6us; most times it was 1-2us.
On the 4.8 kernel this is always much slower (see above).
My question is now: is there any kernel config option introduced in the
meantime that may lead to this effect and that is not set in my 4.8 config?
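
(One way to check: the kernel's own helper script can diff the two configs
directly - the paths below are placeholders:)

   # scripts/diffconfig /boot/config-3.18.27-rt27 /boot/config-4.8.0-rt1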

Regards

Mathias
David Laight Oct. 13, 2016, 2:02 p.m. UTC | #4
From: Koehrer Mathias
> Sent: 13 October 2016 11:57
..
> > The time between my trace points 700 and 701 is about 30us, the time between my
> > trace points 600 and 601 is even 37us!!
> > The code in between is
> >      tsyncrxctl = rd32(E1000_TSYNCRXCTL); resp.
> >      lvmmc = rd32(E1000_LVMMC);
> >
> > In both cases this is a single read from a register.
> > I have no idea why this single read could take that much time!
> > Is it possible that the igb hardware is in a state that delays the read access and this is
> > why the whole I/O system might be delayed?
> >
> 
> To have a proper comparison, I did the same with kernel 3.18.27-rt27.
> Also here, I instrumented the igb driver to get traces for the rd32 calls.
> However, here everything is generally much faster!
> In the idle system the maximum I got for a read was about 6us, most times it was 1-2us.

1-2us is probably about right; PCIe is high throughput, high latency.
You should see the latencies we get talking to fpga!

> On the 4.8 kernel this is always much slower (see above).
> My question is now: Is there any kernel config option that has been introduced in the meantime
> that may lead to this effect and which is not set in my 4.8 config?

Have a look at the generated code for rd32().
Someone might have added a load of synchronisation instructions to it.
On x86 I don't think it needs any.
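
For reference, in the 4.x igb sources rd32() resolves to igb_rd32(), which
is essentially a bare readl() plus a surprise-removal check - roughly the
following (a sketch from memory, worth checking against the exact tree):

u32 igb_rd32(struct e1000_hw *hw, u32 reg)
{
        u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
        u32 value = 0;

        if (E1000_REMOVED(hw_addr))
                return ~value;

        value = readl(&hw_addr[reg]);

        /* reads should not return all F's */
        if (!(~value) && (!reg || !(~readl(hw_addr))))
                hw->hw_addr = NULL;     /* device considered gone */

        return value;
}

No extra synchronisation beyond the readl() itself, at least on x86.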

It is also possible for other PCIe accesses to slow things down
(which might be why you see 6us).

I presume you are doing these comparisons on the same hardware?
Obscure bus topologies could slow things down.
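
(e.g. comparing the tree and the negotiated link on both machines - 01:00.0
below is just a placeholder for the i350's address:)

   # lspci -tv
   # lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'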

	David
Julia Cartwright Oct. 13, 2016, 4:18 p.m. UTC | #5
Hey Mathias-

On Thu, Oct 13, 2016 at 10:57:18AM +0000, Koehrer Mathias (ETAS/ESW5) wrote:
[..]

Interesting indeed!

> > Here are some places where I added traces:
> > In file igb_ptp.c:
> > void igb_ptp_rx_hang(struct igb_adapter *adapter) {
> >         struct e1000_hw *hw = &adapter->hw;
> >         unsigned long rx_event;
> >         u32 tsyncrxctl;
> >         trace_igb(700);
> >         tsyncrxctl = rd32(E1000_TSYNCRXCTL);
> >         trace_igb(701);
> >
> >         /* Other hardware uses per-packet timestamps */
> >         if (hw->mac.type != e1000_82576)
> >                 return;
> > ...
> >
> > In file igb_main.c:
> > static void igb_check_lvmmc(struct igb_adapter *adapter) {
> >         struct e1000_hw *hw = &adapter->hw;
> >         u32 lvmmc;
> >
> >         trace_igb(600);
> >         lvmmc = rd32(E1000_LVMMC);
> >         trace_igb(601);
> >         if (lvmmc) {
> > ...
> >
[..]
> > The time between my trace points 700 and 701 is about 30us, the time between my
> > trace points 600 and 601 is even 37us!!
> >
> > The code in between is
> >      tsyncrxctl = rd32(E1000_TSYNCRXCTL); resp.
> >      lvmmc = rd32(E1000_LVMMC);
> >
> > In both cases this is a single read from a register.
> >
> > I have no idea why this single read could take that much time!

Are these the only registers for which you see this amount of delay when reading?

It's also possible that it's not these registers themselves that cause
problems, but writes issued prior to these reads.  That is, given PCI's
posted-write behavior, it could be that these reads are delayed only
because they flush previous writes to the device.
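
That would be testable: read some benign register first, so that any posted
writes are already flushed, then time the read of interest on its own.  A
hypothetical experiment, not driver code:

        ktime_t t0, t1;

        (void)rd32(E1000_STATUS);       /* flush any posted writes */

        t0 = ktime_get();
        tsyncrxctl = rd32(E1000_TSYNCRXCTL);
        t1 = ktime_get();

        /* if this is now fast, the delay belonged to the earlier writes */
        pr_info("igb: TSYNCRXCTL read took %lld ns\n",
                ktime_to_ns(ktime_sub(t1, t0)));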

> > Is it possible that the igb hardware is in a state that delays the read access and this is
> > why the whole I/O system might be delayed?

One additional hypothesis is that some register accesses trigger
accesses to off-chip resources synchronously; for example, a write to
enable timestamping needs to access an external phy on a slower bus,
etc.  I don't know enough about this device to say whether that happens
here.

> To have a proper comparison, I did the same with kernel 3.18.27-rt27.
> Also here, I instrumented the igb driver to get traces for the rd32 calls.
> However, here everything is generally much faster!

> In the idle system the maximum I got for a read was about 6us, most times it was 1-2us.
> On the 4.8 kernel this is always much slower (see above).
> My question is now: Is there any kernel config option that has been introduced in the meantime
> that may lead to this effect and which is not set in my 4.8 config?

Have you tested on a vanilla (non-RT) kernel?  I doubt there is anything
RT specific about what you are seeing, but it might be nice to get
confirmation.  Also, bisection would probably be easier if you confirm
on a vanilla kernel.

I find it unlikely that a kernel config option change is what regressed
you; more likely it was a code change to a driver.  Which driver is now
the question, and the surface area is still big (processor mapping
attributes for this region, PCI root complex configuration, PCI bridge
configuration, the igb driver itself, etc.).

Big enough that I'd recommend a bisection.  It looks like a bisection
between 3.18 and 4.8 would take you about 18 tries to narrow down,
assuming all goes well.
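
A session would look roughly like:

   # git bisect start v4.8 v3.18
   # ... build, boot, measure the rd32() latencies ...
   # git bisect good    (or "git bisect bad", then rebuild and repeat)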

   Julia

Patch

--- igb_main.c.orig     2016-10-07 10:43:37.855873754 +0200
+++ igb_main.c  2016-10-07 10:31:51.451346292 +0200
@@ -4613,6 +4613,7 @@  no_wait:
        }

        /* Cause software interrupt to ensure Rx ring is cleaned */
+#if 0
        if (adapter->flags & IGB_FLAG_HAS_MSIX) {
                u32 eics = 0;

@@ -4622,6 +4623,7 @@  no_wait:
        } else {
                wr32(E1000_ICS, E1000_ICS_RXDMT0);
        }
+#endif

        igb_spoof_check(adapter);
        igb_ptp_rx_hang(adapter);