Patchwork tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Submitter Eric Dumazet
Date May 1, 2009, 6:14 a.m.
Message ID <49FA932B.4030405@cosmosbay.com>
Permalink /patch/26745/
State Accepted
Delegated to: David Miller

Comments

Eric Dumazet - May 1, 2009, 6:14 a.m.
Andrew Dickinson wrote:
> OK... I've got some more data on it...
> 
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
> 
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
>      37 0
>      31 1
>      31 2
>      39 3
>      37 4
>      31 5
>      42 6
>      39 7
> 
> That's nice and even....  Here's what's getting returned from the
> skb_tx_hash().  Again, count on the left, value on the right:
>      31 0
>      81 1
>      37 2
>      70 3
>      37 4
>      31 6
> 
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
> 
> I think the voodoo lies within:
>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> 
> David,  I made the change that you suggested:
>         //hash = skb_get_rx_queue(skb);
>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> 
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
> 
> However, my problem's not solved entirely... here's what top is showing me:
> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
> ksoftirqd/1
>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
> ksoftirqd/3
>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
> ksoftirqd/5
>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
> ksoftirqd/7
>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
> <snip>
> 
> 
> It appears that only the odd CPUs are actually handling the
> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge  eth2-rx-0
>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge  eth2-rx-1
>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge  eth2-rx-2
>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge  eth2-rx-3
>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge  eth2-rx-4
>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge  eth2-rx-5
>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge  eth2-rx-6
>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge  eth2-rx-7
>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge  eth2-tx-0
>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge  eth2-tx-1
>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge  eth2-tx-2
>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge  eth2-tx-3
>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge  eth2-tx-4
>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge  eth2-tx-5
>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge  eth2-tx-6
>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge  eth2-tx-7
>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge  eth2:lsc
> 
> 
> Why would ksoftirqd only run on half of the cores (and only the odd
> ones to boot)?  The one commonality that's striking me is that all
> the odd CPU #s are on the same physical processor:
> 
> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
> processor	: 0
> physical id	: 0
> processor	: 1
> physical id	: 1
> processor	: 2
> physical id	: 0
> processor	: 3
> physical id	: 1
> processor	: 4
> physical id	: 0
> processor	: 5
> physical id	: 1
> processor	: 6
> physical id	: 0
> processor	: 7
> physical id	: 1
> 
> I did compile the kernel with NUMA support... am I being bitten by
> something there?  Other thoughts on where I should look?
> 
> Also... is there an incantation to get NAPI to work in the torvalds
> kernel?  As you can see, I'm generating quite a few interrupts.
> 
> -A
> 
> 
> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>> From: Andrew Dickinson <andrew@whydna.net>
>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>
>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>> before I start making claims. ;-P
>> That's one possibility.
>>
>> Another is that the hashing isn't working out.  One way to
>> play with that is to simply replace the:
>>
>>                hash = skb_get_rx_queue(skb);
>>
>> in skb_tx_hash() with something like:
>>
>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> and see if that improves the situation.
>>

Hi Andrew

Please try the following patch (I don't have a multi-queue NIC, sorry).

I will do the followup patch if this one corrects the distribution problem
you noticed.

Thanks very much for all your findings.

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use the jhash
distribution, as the device driver told us exactly which queue was selected
at RX time. jhash makes a statistical shuffle, but this won't work with 8
static inputs.

A later improvement would be to compute the reciprocal value of
real_num_tx_queues to avoid a divide here. But that computation should be
done once, when real_num_tx_queues is set. This needs a separate patch, and
a new field in struct net_device.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Dickinson - May 1, 2009, 6:19 a.m.
On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson wrote:
>> <snip>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 308a7d0..e2e9e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>  {
>        u32 hash;
>
> -       if (skb_rx_queue_recorded(skb)) {
> -               hash = skb_get_rx_queue(skb);
> -       } else if (skb->sk && skb->sk->sk_hash) {
> +       if (skb_rx_queue_recorded(skb))
> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +       if (skb->sk && skb->sk->sk_hash)
>                hash = skb->sk->sk_hash;
> -       } else
> +       else
>                hash = skb->protocol;
>
>        hash = jhash_1word(hash, skb_tx_hashrnd);
>
>

Eric,

That's exactly what I did!  It solved the problem of hot-spots on some
interrupts.  However, I now have a new problem (which is documented in
my previous posts).  The short of it is that I'm only seeing 4 (out of
8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
busy 4 are always on one physical package (though not always the same
package; it'll change on reboot or when I change some parameters via
ethtool), but never both.  This, despite /proc/interrupts showing me
that all 8 interrupts are being hit evenly.  There are more details in
my last mail. ;-D

-Andrew
Eric Dumazet - May 1, 2009, 6:40 a.m.
Andrew Dickinson wrote:
> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> <snip>
> 
> Eric,
> 
> That's exactly what I did!  It solved the problem of hot-spots on some
> interrupts.  However, I now have a new problem (which is documented in
> my previous posts).  The short of it is that I'm only seeing 4 (out of
> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
> busy 4 are always on one physical package (but not always the same
> package (it'll change on reboot or when I change some parameters via
> ethtool), but never both.  This, despite /proc/interrupts showing me
> that all 8 interrupts are being hit evenly.  There's more details in
> my last mail. ;-D
> 

Well, I was reacting to your 'voodoo' comment about

return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);

since this is not the problem.  The problem comes from jhash(), which
shuffles the input, while in your case we want to select the same output
queue because of CPU affinities.  No shuffle is required.

(assuming cpu0 is handling tx-queue-0 and rx-queue-0,
          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)

Then either /proc/interrupts is showing that your rx interrupts are not
evenly distributed, or ksoftirqd is triggered only on one physical cpu
while on the other cpu softirqs are handled outside of ksoftirqd.  It's
only a matter of load.
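If that RX/TX-to-CPU affinity assumption doesn't hold, one way to enforce it is to pin each queue pair's IRQ to its own CPU. A hypothetical sketch (the IRQ numbers 66-81 are taken from the /proc/interrupts listing earlier in the thread; smp_affinity takes a hex bitmask of allowed CPUs, and irqbalance should be stopped first or it may rewrite the masks):

```shell
# Pin eth2-rx-N (IRQ 66+N) and eth2-tx-N (IRQ 74+N) to CPU N.
for i in 0 1 2 3 4 5 6 7; do
    mask=$(printf '%x' $((1 << i)))
    echo "$mask" > /proc/irq/$((66 + i))/smp_affinity   # eth2-rx-$i
    echo "$mask" > /proc/irq/$((74 + i))/smp_affinity   # eth2-tx-$i
done
```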


Andrew Dickinson - May 1, 2009, 7:23 a.m.
On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson a écrit :
>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>> Andrew Dickinson a écrit :
>>>> OK... I've got some more data on it...
>>>>
>>>> I passed a small number of packets through the system and added a ton
>>>> of printks to it ;-P
>>>>
>>>> Here's the distribution of values as seen by
>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>      37 0
>>>>      31 1
>>>>      31 2
>>>>      39 3
>>>>      37 4
>>>>      31 5
>>>>      42 6
>>>>      39 7
>>>>
>>>> That's nice and even....  Here's what's getting returned from the
>>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>>      31 0
>>>>      81 1
>>>>      37 2
>>>>      70 3
>>>>      37 4
>>>>      31 6
>>>>
>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>> seem to have gotten munged onto 1 and 3.
>>>>
>>>> I think the voodoo lies within:
>>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>
>>>> David,  I made the change that you suggested:
>>>>         //hash = skb_get_rx_queue(skb);
>>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>
>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>
>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>>> ksoftirqd/1
>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>>> ksoftirqd/3
>>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>>> ksoftirqd/5
>>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>>> ksoftirqd/7
>>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>>> <snip>
>>>>
>>>>
>>>> It appears that only the odd CPUs are actually handling the
>>>> interrupts, which doesn't jive with what /proc/interrupts shows me:
>>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>>>>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>>>>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>>>>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>>>>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
>>>>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
>>>>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
>>>>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
>>>>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>>>>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>>>>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>>>>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>>>>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
>>>>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
>>>>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
>>>>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
>>>>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
>>>>
>>>>
>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>> ones to boot)?  The one commonality that's striking me is that
>>>> all the odd CPU#'s are on the same physical processor:
>>>>
>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>> processor     : 0
>>>> physical id   : 0
>>>> processor     : 1
>>>> physical id   : 1
>>>> processor     : 2
>>>> physical id   : 0
>>>> processor     : 3
>>>> physical id   : 1
>>>> processor     : 4
>>>> physical id   : 0
>>>> processor     : 5
>>>> physical id   : 1
>>>> processor     : 6
>>>> physical id   : 0
>>>> processor     : 7
>>>> physical id   : 1
>>>>
>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>> something there?  Any other thoughts on where I should look?
>>>>
>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>> kernel?  As you can see, I'm generating quite a few interrupts.
>>>>
>>>> -A
>>>>
>>>>
>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>
>>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>> before I start making claims. ;-P
>>>>> That's one possibility.
>>>>>
>>>>> Another is that the hashing isn't working out.  One way to
>>>>> play with that is to simply replace the:
>>>>>
>>>>>                hash = skb_get_rx_queue(skb);
>>>>>
>>>>> in skb_tx_hash() with something like:
>>>>>
>>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>
>>>>> and see if that improves the situation.
>>>>>
>>> Hi Andrew
>>>
>>> Please try the following patch (I don't have a multi-queue NIC, sorry)
>>>
>>> I will do the follow-up patch if this one corrects the distribution problem
>>> you noticed.
>>>
>>> Thanks very much for all your findings.
>>>
>>> [PATCH] net: skb_tx_hash() improvements
>>>
>>> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution,
>>> as the device driver told us exactly which queue was selected at RX time.
>>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>>
>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>> to avoid a divide here. But this computation should be done once,
>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>> field in struct net_device.
>>>
>>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 308a7d0..e2e9e4a 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>  {
>>>        u32 hash;
>>>
>>> -       if (skb_rx_queue_recorded(skb)) {
>>> -               hash = skb_get_rx_queue(skb);
>>> -       } else if (skb->sk && skb->sk->sk_hash) {
>>> +       if (skb_rx_queue_recorded(skb))
>>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>> +
>>> +       if (skb->sk && skb->sk->sk_hash)
>>>                hash = skb->sk->sk_hash;
>>> -       } else
>>> +       else
>>>                hash = skb->protocol;
>>>
>>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>>
>>>
>>
>> Eric,
>>
>> That's exactly what I did!  It solved the problem of hot-spots on some
>> interrupts.  However, I now have a new problem (which is documented in
>> my previous posts).  The short of it is that I'm only seeing 4 (out of
>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
>> busy 4 are always on one physical package (but not always the same
>> package (it'll change on reboot or when I change some parameters via
>> ethtool), but never both.  This, despite /proc/interrupts showing me
>> that all 8 interrupts are being hit evenly.  There's more details in
>> my last mail. ;-D
>>
>
> Well, I was reacting to your 'voodoo' comment about
>
> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>
> since that formula is not the problem.  The problem comes from jhash(),
> which shuffles the input, while in your case we want to select the same
> output queue because of cpu affinities.  No shuffle is required.

Agreed.  I don't want to jhash(), and I'm not.

> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)

That's a correct assumption. :D

> Then /proc/interrupts show your rx interrupts are not evenly distributed.
>
> Or that ksoftirqd is triggered only on one physical cpu, while on other
> cpu, softirqs are not run from ksoftirqd.  It's only a matter of load.

Hrmm... more fuel for the fire...

The NIC seems to be doing a good job of hashing the incoming data and
the kernel is now finding the right TX queue:
-bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
     rx_packets: 1286009099
     tx_packets: 1287853570
     tx_queue_0_packets: 162469405
     tx_queue_1_packets: 162452446
     tx_queue_2_packets: 162481160
     tx_queue_3_packets: 162441839
     tx_queue_4_packets: 162484930
     tx_queue_5_packets: 162478402
     tx_queue_6_packets: 162492530
     tx_queue_7_packets: 162477162
     rx_queue_0_packets: 162469449
     rx_queue_1_packets: 162452440
     rx_queue_2_packets: 162481186
     rx_queue_3_packets: 162441885
     rx_queue_4_packets: 162484949
     rx_queue_5_packets: 162478427

Here's where it gets juicy.  If I reduce the rate at which I'm pushing
traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
as follows:
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

And if I watch /proc/interrupts, I see that all of the tx and rx
queues are handling a fairly similar number of interrupts (ballpark,
7-8k/sec on rx, 10k on tx).

OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

And if I watch /proc/interrupts again, I see that the even-CPUs (i.e.
0,2,4, and 6) RX queues are receiving relatively few interrupts
(5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are
receiving about 2-3k/sec.  What's extra strange is that the TX queues
are still handling about 10k/sec each.

So, below some magic threshold (approx 2.3Mpps), the box is basically
idle and happily routing all the packets (I can confirm that my
network test device ixia is showing 0-loss).  Above the magic
threshold, the box starts acting as described above and I'm unable to
push it beyond that threshold.  While I understand that there are
limits to how fast I can route packets (obviously), it seems very
strange that I'm seeing this physical-CPU affinity on the ksoftirqd
"processes".

Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
ksoftirqd processes at 100%.  Never during this did the odd-CPU
ksoftirqd processes show any utilization at all.

These are 64-byte frames, so I shouldn't be hitting any bandwidth
issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
just routing packets back out the one NIC).

=/

-A
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - May 1, 2009, 4:08 p.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 08:14:03 +0200

> [PATCH] net: skb_tx_hash() improvements
> 
> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution,
> as the device driver told us exactly which queue was selected at RX time.
> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
> 
> Later improvements would be to compute reciprocal value of real_num_tx_queues
> to avoid a divide here. But this computation should be done once,
> when real_num_tx_queues is set. This needs a separate patch, and a new
> field in struct net_device.
> 
> Reported-by: Andrew Dickinson <andrew@whydna.net>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Applied, except that I changed the commit message header line to more
reflect that this is in fact a bug fix.

BTW, you don't need the reciprocal when num-tx-queues <= num-rx-queues
(you can just use the RX queue recording as the hash, straight) and
that's the kind of check I intended to add to net-2.6 had you not
beaten me to this patch.

Also, thanks for giving me absolutely no credit for this whole thing
in your commit message.  I know I do that to you all the time :-/ How
can you forget so quickly that I'm the one that even suggested the
exact code change for Andrew to test in the first place?
Eric Dumazet - May 1, 2009, 4:48 p.m.
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 01 May 2009 08:14:03 +0200
> 
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution,
>> as the device driver told us exactly which queue was selected at RX time.
>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>
>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>> to avoid a divide here. But this computation should be done once,
>> when real_num_tx_queues is set. This needs a separate patch, and a new
>> field in struct net_device.
>>
>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> Applied, except that I changed the commit message header line to more
> reflect that this is in fact a bug fix.
> 
> BTW, you don't need the reciprocol when num-tx-queues <= num-rx-queues
> (you can just use the RX queue recording as the hash, straight) and
> that's the kind of check what I intended to add to net-2.6 had you not
> beaten me to this patch.
> 
> Also, thanks for giving me absolutely no credit for this whole thing
> in your commit message.  I know I do that to you all the time :-/ How
> can you forget so quickly that I'm the one that even suggested the
> exact code change for Andrew to test in the first place?

Hoho, your Honor, I am totally guilty and sorry; sometimes I think I am
David Miller, silly me! :)

I am not fighting for credit or whatever, certainly not with you.


David Miller - May 1, 2009, 5:22 p.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 18:48:30 +0200

> Hoho, your Honor, I am totally guilty and sorry; sometimes I think I am
> David Miller, silly me! :)
> 
> I am not fighting for credit or whatever, certainly not with you.

Great, just making sure it wasn't intentional :-)
Jesse Brandeburg - May 1, 2009, 9:37 p.m.
I'm going to try to clarify just a few minor things in the hope of helping 
explain why things look the way they do from the ixgbe perspective.

On Fri, 1 May 2009, Andrew Dickinson wrote:
> [... quoted text duplicated from Andrew's message above trimmed ...]
>
> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
> 
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts again, I see that the even-CPUs (i.e.
> 0,2,4, and 6) RX queues are receiving relatively few interrupts
> (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are
> receiving about 2-3k/sec.  What's extra strange is that the TX queues
> are still handling about 10k/sec each.

The rx interrupts stop because those queues have switched to NAPI polling
(100% of the time).  The tx queues keep doing 10K per second because tx
cleanup doesn't run in NAPI mode for MSI-X vectors; it does try to limit
the amount of work done at once so as not to hog a cpu.

> So, below some magic threshold (approx 2.3Mpps), the box is basically
> idle and happily routing all the packets (I can confirm that my
> network test device ixia is showing 0-loss).  Above the magic
> threshold, the box starts acting as described above and I'm unable to
> push it beyond that threshold.  While I understand that there are
> limits to how fast I can route packets (obviously), it seems very
> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
> "processes".
> 
> Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
> idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
> ksoftirqd processes at 100%.  Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
> 
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
> just routing packets back out the one NIC).

Do you have all six channels populated with memory?

you're probably just hitting the limits of the OS combined with the 
hardware.  You could try reducing your rx/tx queue count (have to change 
code, 'num_rx_queues =') - hope we get ethtool to do that someday.

and then assigning each rx queue to one core and a tx queue to another on 
a shared cache.

On a Nehalem, the kernel in NUMA mode (is your BIOS in NUMA mode?) may not
be balancing the memory utilization evenly between channels.  Are you
using slub or slqb?

Changing netdev_alloc_skb() to __alloc_skb() (be sure to specify node=-1),
and getting rid of the skb_reserve(NET_IP_ALIGN) and skb_reserve(16),
might help align rx packets for DMA.

hope this helps,
  Jesse

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..e2e9e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,11 +1735,12 @@  u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+	if (skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);