Patchwork tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Submitter: Eric Dumazet
Date: May 1, 2009, 7:31 a.m.
Message ID: <49FAA55D.7070406@cosmosbay.com>
Permalink: /patch/26746/
State: Not Applicable
Delegated to: David Miller

Comments

Eric Dumazet - May 1, 2009, 7:31 a.m.
Andrew Dickinson wrote:
> On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Andrew Dickinson wrote:
>>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>>> Andrew Dickinson wrote:
>>>>> OK... I've got some more data on it...
>>>>>
>>>>> I passed a small number of packets through the system and added a ton
>>>>> of printks to it ;-P
>>>>>
>>>>> Here's the distribution of values as seen by
>>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>>      37 0
>>>>>      31 1
>>>>>      31 2
>>>>>      39 3
>>>>>      37 4
>>>>>      31 5
>>>>>      42 6
>>>>>      39 7
>>>>>
>>>>> That's nice and even....  Here's what's getting returned from the
>>>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>>>      31 0
>>>>>      81 1
>>>>>      37 2
>>>>>      70 3
>>>>>      37 4
>>>>>      31 6
>>>>>
>>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>>> seem to have gotten munged onto 1 and 3.
>>>>>
>>>>> I think the voodoo lies within:
>>>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>>
>>>>> David,  I made the change that you suggested:
>>>>>         //hash = skb_get_rx_queue(skb);
>>>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>
>>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>>
>>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>>>
>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>>>> ksoftirqd/1
>>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>>>> ksoftirqd/3
>>>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>>>> ksoftirqd/5
>>>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>>>> ksoftirqd/7
>>>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>>>> <snip>
>>>>>
>>>>>
>>>>> It appears that only the odd CPUs are actually handling the
>>>>> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>>>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>>   66:    2970565          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>>>>   67:         28     821122          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>>>>   68:         28          0    2943299          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>>>>   69:         28          0          0     817776          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>>>>   70:         28          0          0          0    2963924
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>>>>   71:         28          0          0          0          0
>>>>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>>>>   72:         28          0          0          0          0
>>>>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>>>>   73:         28          0          0          0          0
>>>>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>>>>   74:    4664732          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>>>>   75:         34    4679312          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>>>>   76:         28          0    4665014          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>>>>   77:         28          0          0    4681531          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>>>>   78:         28          0          0          0    4665793
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>>>>   79:         28          0          0          0          0
>>>>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>>>>   80:         28          0          0          0          0
>>>>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>>>>   81:         28          0          0          0          0
>>>>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>>>>   82:          2          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>>>>
>>>>>
>>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>>> ones to boot)?  The one commonality that's striking me is that
>>>>> all the odd CPU#'s are on the same physical processor:
>>>>>
>>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>>> processor     : 0
>>>>> physical id   : 0
>>>>> processor     : 1
>>>>> physical id   : 1
>>>>> processor     : 2
>>>>> physical id   : 0
>>>>> processor     : 3
>>>>> physical id   : 1
>>>>> processor     : 4
>>>>> physical id   : 0
>>>>> processor     : 5
>>>>> physical id   : 1
>>>>> processor     : 6
>>>>> physical id   : 0
>>>>> processor     : 7
>>>>> physical id   : 1
>>>>>
>>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>>> something there?  Any other thoughts on where I should look?
>>>>>
>>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>>> kernel?  As you can see, I'm generating quite a few interrupts.
>>>>>
>>>>> -A
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>>
>>>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>>> before I start making claims. ;-P
>>>>>> That's one possibility.
>>>>>>
>>>>>> Another is that the hashing isn't working out.  One way to
>>>>>> play with that is to simply replace the:
>>>>>>
>>>>>>                hash = skb_get_rx_queue(skb);
>>>>>>
>>>>>> in skb_tx_hash() with something like:
>>>>>>
>>>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>
>>>>>> and see if that improves the situation.
>>>>>>
>>>> Hi Andrew
>>>>
>>>> Please try the following patch (I don't have a multi-queue NIC, sorry).
>>>>
>>>> I will do the follow-up patch if this one corrects the distribution problem
>>>> you noticed.
>>>>
>>>> Thanks very much for all your findings.
>>>>
>>>> [PATCH] net: skb_tx_hash() improvements
>>>>
>>>> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution,
>>>> as the device driver told us exactly which queue was selected at RX time.
>>>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>>>
>>>> A later improvement would be to compute the reciprocal value of
>>>> real_num_tx_queues to avoid a divide here. But this computation should be
>>>> done once, when real_num_tx_queues is set. This needs a separate patch, and
>>>> a new field in struct net_device.
>>>>
>>>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 308a7d0..e2e9e4a 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>>  {
>>>>        u32 hash;
>>>>
>>>> -       if (skb_rx_queue_recorded(skb)) {
>>>> -               hash = skb_get_rx_queue(skb);
>>>> -       } else if (skb->sk && skb->sk->sk_hash) {
>>>> +       if (skb_rx_queue_recorded(skb))
>>>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>> +
>>>> +       if (skb->sk && skb->sk->sk_hash)
>>>>                hash = skb->sk->sk_hash;
>>>> -       } else
>>>> +       else
>>>>                hash = skb->protocol;
>>>>
>>>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>>>
>>>>
>>> Eric,
>>>
>>> That's exactly what I did!  It solved the problem of hot-spots on some
>>> interrupts.  However, I now have a new problem (which is documented in
>>> my previous posts).  The short of it is that I'm only seeing 4 (out of
>>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
>>> busy 4 are always on one physical package (but not always the same
>>> package (it'll change on reboot or when I change some parameters via
>>> ethtool), but never both.  This, despite /proc/interrupts showing me
>>> that all 8 interrupts are being hit evenly.  There's more details in
>>> my last mail. ;-D
>>>
>> Well, I was reacting to your 'voodoo' comment about
>>
>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>
>> That line is not the problem. The problem comes from jhash(), which shuffles
>> the input, while in your case we want to select the same output queue
>> because of cpu affinities. No shuffle required.
> 
> Agreed.  I don't want to jhash(), and I'm not.
> 
>> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>>          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
> 
> That's a correct assumption. :D
> 
>> Then /proc/interrupts shows your rx interrupts are not evenly distributed.
>>
>> Or ksoftirqd is triggered only on one physical cpu, while on the other
>> cpus softirqs are not run from ksoftirqd. It's only a matter of load.
> 
> Hrmm... more fuel for the fire...
> 
> The NIC seems to be doing a good job of hashing the incoming data and
> the kernel is now finding the right TX queue:
> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
>      rx_packets: 1286009099
>      tx_packets: 1287853570
>      tx_queue_0_packets: 162469405
>      tx_queue_1_packets: 162452446
>      tx_queue_2_packets: 162481160
>      tx_queue_3_packets: 162441839
>      tx_queue_4_packets: 162484930
>      tx_queue_5_packets: 162478402
>      tx_queue_6_packets: 162492530
>      tx_queue_7_packets: 162477162
>      rx_queue_0_packets: 162469449
>      rx_queue_1_packets: 162452440
>      rx_queue_2_packets: 162481186
>      rx_queue_3_packets: 162441885
>      rx_queue_4_packets: 162484949
>      rx_queue_5_packets: 162478427
> 
> Here's where it gets juicy.  If I reduce the rate at which I'm pushing
> traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
> as follows:
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts, I see that all of the tx and rx
> queues are handling a fairly similar number of interrupts (ballpark,
> 7-8k/sec on rx, 10k on tx).
> 
> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
> 
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts again, I see that the even-CPUs (i.e.
> 0,2,4, and 6) RX queues are receiving relatively few interrupts
> (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are
> receiving about 2-3k/sec.  What's extra strange is that the TX queues
> are still handling about 10k/sec each.
> 
> So, below some magic threshold (approx 2.3Mpps), the box is basically
> idle and happily routing all the packets (I can confirm that my
> network test device ixia is showing 0-loss).  Above the magic
> threshold, the box starts acting as described above and I'm unable to
> push it beyond that threshold.  While I understand that there are
> limits to how fast I can route packets (obviously), it seems very
> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
> "processes".
> 

The box is not idle; you hit a kernel bug that I corrected earlier this week :)

Check for "sched: account system time properly" in google.



> Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
> idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
> ksoftirqd processes at 100%.  Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
> 
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
> just routing packets back out the one NIC).
> 
> =/
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Dickinson - May 1, 2009, 7:34 a.m.
On Fri, May 1, 2009 at 12:31 AM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson wrote:
> <snip>
> The box is not idle; you hit a kernel bug that I corrected earlier this week :)
>
> Check for "sched: account system time properly" in google.
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b902e58..26efa47 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
>
>        if (user_tick)
>                account_user_time(p, one_jiffy, one_jiffy_scaled);
> -       else if (p != rq->idle)
> +       else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
>                account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
>                                    one_jiffy_scaled);
>        else
>

<whew>, I'm not crazy! ;-P

I'll apply this patch and let you know how that changes things.

-A



Patch

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@  void account_process_tick(struct task_struct *p, int user_tick)
 
 	if (user_tick)
 		account_user_time(p, one_jiffy, one_jiffy_scaled);
-	else if (p != rq->idle)
+	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
 				    one_jiffy_scaled);
 	else