[net-next-2.6] net: speedup udp receive path

Message ID 1272463605.2267.70.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet April 28, 2010, 2:06 p.m. UTC
On Wednesday, April 28, 2010 at 08:36 -0400, jamal wrote:
> On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:
> 
> > If you wait a bit, I have another patch to speedup udp receive path ;)
> 
> Shoot whenever you are ready ;-> I will test with and without your
> patch..
> 

Here it is ;)

Thanks

[PATCH net-next-2.6] net: speedup udp receive path

Since commit 95766fff ([UDP]: Add memory accounting.),
each received packet needs one extra lock_sock()/release_sock() pair.

This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.

This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.

skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.

The UDP receive path now takes the socket spinlock only once.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   10 ++++++++++
 net/core/datagram.c |   10 +++++++---
 net/ipv4/udp.c      |   12 ++++++------
 net/ipv6/udp.c      |    4 ++--
 4 files changed, 25 insertions(+), 11 deletions(-)
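
The diff body is not reproduced on this page. As a rough user-space model of the pattern the commit message describes (orphan the buffer while holding the socket spinlock, then free it after the lock is dropped), here is a minimal C sketch. Every name in it (model_sock, model_buf, model_lock_bh() and so on) is a hypothetical stand-in and not the kernel's definition; a C11 atomic_flag spin loop stands in for the socket spinlock, and user space has no bottom halves or socket backlog to deal with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct sock. */
struct model_sock {
    atomic_flag slock;   /* models the socket spinlock */
    size_t rmem_alloc;   /* bytes currently charged to this socket */
};

/* Hypothetical stand-in for struct sk_buff. */
struct model_buf {
    struct model_sock *owner;  /* socket charged for this buffer, or NULL */
    size_t truesize;           /* bytes charged against the owner */
};

#define MODEL_SOCK_INIT { ATOMIC_FLAG_INIT, 0 }

/* Models lock_sock_bh(): take the spinlock only, no backlog processing. */
void model_lock_bh(struct model_sock *sk)
{
    while (atomic_flag_test_and_set_explicit(&sk->slock, memory_order_acquire))
        ;  /* spin until the holder releases the flag */
}

/* Models unlock_sock_bh(). */
void model_unlock_bh(struct model_sock *sk)
{
    atomic_flag_clear_explicit(&sk->slock, memory_order_release);
}

/* Allocate a buffer and charge its size to the socket. */
struct model_buf *model_buf_charge(struct model_sock *sk, size_t size)
{
    struct model_buf *buf = malloc(sizeof(*buf));

    if (buf == NULL)
        return NULL;
    buf->owner = sk;
    buf->truesize = size;
    model_lock_bh(sk);
    sk->rmem_alloc += size;
    model_unlock_bh(sk);
    return buf;
}

/* Orphan the buffer inside the locked section so the accounting stays
 * consistent, then free it outside the lock to keep the hold time short. */
void model_free_datagram_locked(struct model_sock *sk, struct model_buf *buf)
{
    model_lock_bh(sk);
    assert(buf->owner == sk);
    sk->rmem_alloc -= buf->truesize;  /* give the memory back to the socket */
    buf->owner = NULL;                /* buffer is now orphaned */
    model_unlock_bh(sk);

    free(buf);                        /* no lock held while freeing */
}
```

The point of the split is that only the accounting update needs the lock; the comparatively slow free happens after the lock has been released.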



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet April 28, 2010, 2:19 p.m. UTC | #1
On Wednesday, April 28, 2010 at 16:06 +0200, Eric Dumazet wrote:
> On Wednesday, April 28, 2010 at 08:36 -0400, jamal wrote:
> > On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:
> > 
> > > If you wait a bit, I have another patch to speedup udp receive path ;)
> > 
> > Shoot whenever you are ready ;-> I will test with and without your
> > patch..
> > 
> 
> Here it is ;)
> 
> Thanks

I forgot to say that with my previous DDOS test/bench (16 cpus trying to
feed one udp socket), my receiver can now process 420,000 pps instead of
200,000 ;)



Eric Dumazet April 28, 2010, 2:34 p.m. UTC | #2
On Wednesday, April 28, 2010 at 16:19 +0200, Eric Dumazet wrote:

> I forgot to say that with my previous DDOS test/bench (16 cpus trying to
> feed one udp socket), my receiver can now process 420,000 pps instead of
> 200,000 ;)

And here is perf top for the cpu dedicated to the thread doing the
recvmsg() (after patch):

----------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:98.0% [1000Hz cycles],  (all, cpu: 1)
----------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ____________________________

             5463.00 45.5% _raw_spin_lock_bh             vmlinux                     
              761.00  6.3% copy_user_generic_string      vmlinux                     
              662.00  5.5% sock_recv_ts_and_drops        vmlinux                     
              645.00  5.4% kfree                         vmlinux                     
              568.00  4.7% _raw_spin_lock                vmlinux                     
              494.00  4.1% __skb_recv_datagram           vmlinux                     
              488.00  4.1% skb_copy_datagram_iovec       vmlinux                     
              467.00  3.9% __slab_free                   vmlinux                     
              176.00  1.5% udp_recvmsg                   vmlinux                     
              168.00  1.4% ia32_sysenter_target          vmlinux                     
              161.00  1.3% kmem_cache_free               vmlinux                     
              161.00  1.3% _raw_spin_lock_irqsave        vmlinux                     
              151.00  1.3% memcpy_toiovec                vmlinux                     
              131.00  1.1% fget_light                    vmlinux                     
              130.00  1.1% sock_rfree                    vmlinux                     
              104.00  0.9% inet_recvmsg                  vmlinux                     
               99.00  0.8% dst_release                   vmlinux                     
               98.00  0.8% skb_release_head_state        vmlinux                     
               83.00  0.7% __sk_mem_reclaim              vmlinux                     
               75.00  0.6% sys_recvfrom                  vmlinux                     
               61.00  0.5% sysexit_from_sys_call         vmlinux                     
               59.00  0.5% fput                          vmlinux                     
               56.00  0.5% schedule                      vmlinux                     
               56.00  0.5% sock_recvmsg                  vmlinux                     
               54.00  0.4% move_addr_to_user             vmlinux                     
               51.00  0.4% compat_sys_socketcall         vmlinux                     
               48.00  0.4% _raw_spin_unlock_bh           vmlinux                    


David Miller April 28, 2010, 9:36 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 28 Apr 2010 16:06:45 +0200

> [PATCH net-next-2.6] net: speedup udp receive path
> 
> Since commit 95766fff ([UDP]: Add memory accounting.),
> each received packet needs one extra lock_sock()/release_sock() pair.
> 
> This added latency because of possible backlog handling. Then later,
> ticket spinlocks added yet another latency source in case of DDOS.
> 
> This patch introduces lock_sock_bh() and unlock_sock_bh()
> synchronization primitives, avoiding one atomic operation and backlog
> processing.
> 
> skb_free_datagram_locked() uses them instead of full blown
> lock_sock()/release_sock(). skb is orphaned inside locked section for
> proper socket memory reclaim, and finally freed outside of it.
> 
> The UDP receive path now takes the socket spinlock only once.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Clever, let's see what this breaks :-)

Applied, thanks Eric.
jamal April 28, 2010, 11:44 p.m. UTC | #4
On Wed, 2010-04-28 at 16:06 +0200, Eric Dumazet wrote:

> Here it is ;)

Sorry - things got a little hectic with TheMan.

I am afraid I don't have good news.
Actually, I should say I don't have good news with regard to rps.
For my sample app, two things seem to be happening:
a) The overall performance has gotten better for both rps
and non-rps.
b) non-rps is now performing relatively better

This is just what I see in net-next; it is not related to your patch.
It seems the kernels I tested prior to April 23 showed rps doing better.
The one I tested on Apr 23 showed rps being about the same as non-rps.
As I stated in my last result posting, I thought I hadn't tested properly,
but I tested again today and saw the same thing. And now non-rps is
_consistently_ better.
So some regression is going on...

Your patch has very slightly improved the performance of rps relative
to what is in net-next, but it has also improved the performance of
non-rps ;->
My traces look different for the app cpu than yours - likely because of
the apps being different.

At the moment I don't have time to dig deeper into the code, but I can
test as cycles show up.

I am attaching the profile traces and results.

cheers,
jamal
April 23 net-next

kernel           sink    cpu all     cpuint       cpuapp
---------------------------------------------------------
nn              93.95%   84.5%        99.8%        79.8%
nn-rps          96.41%   85.4%        95.5%        82.5%
nn-cl           97.29%   84.0%        99.9%        79.6%
nn-cl-rps       97.76%   86.5%        96.5%        84.8%

nn: Basic net-next from Apr23
nn-rps: Basic net-next from Apr23 with rps mask ee and irq affinity to cpu0
nn-cl: Basic net-next from Apr23 + Changli patch
nn-cl-rps: Basic net-next from Apr23 + Changli patch + rps mask ee,irq aff cpu0
sink: the amount of traffic the system was able to sink in.
cpu all: avg % system cpu consumed in test
cpuint: avg %cpu consumed by the cpu where interrupts happened
cpuapp: avg %cpu consumed by a sample cpu which did app processing
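
The "rps mask ee" in the legend above is a hexadecimal CPU bitmap, presumably the value written to the per-queue rps_cpus sysfs file. 0xee is binary 11101110, so on this 8-CPU machine it selects CPUs 1, 2, 3, 5, 6 and 7 and leaves out cpu0 (where the interrupt is pinned) and cpu4. A small C sketch of that decoding, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store the index of every bit set in an RPS-style
 * CPU mask into cpus[] (lowest CPU first) and return how many were stored.
 * At most max entries are written. */
size_t rps_mask_to_cpus(unsigned long mask, unsigned int *cpus, size_t max)
{
    size_t n = 0;

    assert(cpus != NULL);
    for (unsigned int cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
        if (mask & 1UL)
            cpus[n++] = cpu;  /* bit `cpu` is set, so this CPU is a target */
    }
    return n;
}
```

Calling rps_mask_to_cpus(0xee, cpus, 8) fills cpus[] with the six CPU numbers listed above.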

Now repeat with Eric's changes and the kernel from Apr 28

kernel           sink    cpu all     cpuint       cpuapp
---------------------------------------------------------
nn2             98.78%   83.6%       100.0%        82.8%
nn2-rps         94.43%   84.2%        98.1%        82.0%
nn2-ed          98.74%   83.2%        99.9%        81.6%
nn2-ed-rps      95.15%   84.5%        97.3%        82.1%


nn2: Basic net-next from Apr28
nn2-rps: Basic net-next from Apr28 with rps mask ee and irq affinity to cpu0
nn2-ed: Basic net-next from Apr28 + Eric patch
nn2-ed-rps: Basic net-next from Apr28 + Eric patch + rps mask ee,irq aff cpu0

I: net-next

Average udp sink: 98.78%

--------------------------------------------------------------------------------------------------
   PerfTop:    3632 irqs/sec  kernel:83.7% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             2738.00  9.8% sky2_poll                   [sky2]              
             1543.00  5.5% _raw_spin_lock_irqsave      [kernel]            
             1019.00  3.7% system_call                 [kernel]            
              740.00  2.7% copy_user_generic_string    [kernel]            
              687.00  2.5% fget                        [kernel]            
              640.00  2.3% _raw_spin_unlock_irqrestore [kernel]            
              634.00  2.3% sys_epoll_ctl               [kernel]            
              613.00  2.2% datagram_poll               [kernel]            
              553.00  2.0% _raw_spin_lock_bh           [kernel]            
              530.00  1.9% kmem_cache_free             [kernel]            
              522.00  1.9% schedule                    [kernel]            
              487.00  1.7% vread_tsc                   [kernel].vsyscall_fn
              467.00  1.7% _raw_spin_lock              [kernel]            
              432.00  1.5% udp_recvmsg                 [kernel]            
              426.00  1.5% kmem_cache_alloc            [kernel]            
              418.00  1.5% __udp4_lib_lookup           [kernel]            
              417.00  1.5% sys_epoll_wait              [kernel]            
              376.00  1.3% fput                        [kernel]            
              361.00  1.3% ip_route_input              [kernel]            
              344.00  1.2% local_bh_enable_ip          [kernel]            
              326.00  1.2% ip_rcv                      [kernel]            
              321.00  1.2% first_packet_length         [kernel]            
              307.00  1.1% ep_remove                   [kernel]            
              303.00  1.1% dst_release                 [kernel]            
              301.00  1.1% skb_copy_datagram_iovec     [kernel]            
              297.00  1.1% mutex_lock                  [kernel]            



--------------------------------------------------------------------------------------------------
   PerfTop:    4018 irqs/sec  kernel:83.3% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

             4274.00  9.7% sky2_poll                   [sky2]                
             2473.00  5.6% _raw_spin_lock_irqsave      [kernel]              
             1585.00  3.6% system_call                 [kernel]              
             1179.00  2.7% copy_user_generic_string    [kernel]              
             1089.00  2.5% fget                        [kernel]              
             1019.00  2.3% _raw_spin_unlock_irqrestore [kernel]              
             1011.00  2.3% sys_epoll_ctl               [kernel]              
              965.00  2.2% datagram_poll               [kernel]              
              902.00  2.0% kmem_cache_free             [kernel]              
              841.00  1.9% _raw_spin_lock_bh           [kernel]              
              837.00  1.9% schedule                    [kernel]              
              735.00  1.7% vread_tsc                   [kernel].vsyscall_fn  
              730.00  1.7% udp_recvmsg                 [kernel]              
              729.00  1.7% _raw_spin_lock              [kernel]              
              678.00  1.5% kmem_cache_alloc            [kernel]              
              651.00  1.5% sys_epoll_wait              [kernel]              
              635.00  1.4% __udp4_lib_lookup           [kernel]              
              595.00  1.3% fput                        [kernel]              
              568.00  1.3% local_bh_enable_ip          [kernel]              
              562.00  1.3% ip_route_input              [kernel]              
              516.00  1.2% dst_release                 [kernel]              
              502.00  1.1% ep_remove                   [kernel]              
              485.00  1.1% skb_copy_datagram_iovec     [kernel]              
              484.00  1.1% first_packet_length         [kernel]              
              476.00  1.1% ip_rcv                      [kernel]              
              470.00  1.1% __alloc_skb                 [kernel]              
              459.00  1.0% epoll_ctl                   /lib/libc-2.7.so      
              458.00  1.0% mutex_lock                  [kernel]              


--------------------------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             3534.00 34.7% sky2_poll                   [sky2]  
              545.00  5.3% __udp4_lib_lookup           [kernel]
              537.00  5.3% ip_route_input              [kernel]
              427.00  4.2% _raw_spin_lock_irqsave      [kernel]
              401.00  3.9% __alloc_skb                 [kernel]
              360.00  3.5% ip_rcv                      [kernel]
              332.00  3.3% _raw_spin_lock              [kernel]
              292.00  2.9% sock_queue_rcv_skb          [kernel]
              291.00  2.9% __udp4_lib_rcv              [kernel]
              273.00  2.7% sock_def_readable           [kernel]
              269.00  2.6% __netif_receive_skb         [kernel]
              209.00  2.1% __wake_up_common            [kernel]
              196.00  1.9% __kmalloc                   [kernel]
              164.00  1.6% _raw_read_lock              [kernel]
              157.00  1.5% kmem_cache_alloc            [kernel]
              157.00  1.5% ep_poll_callback            [kernel]
              133.00  1.3% resched_task                [kernel]
              128.00  1.3% task_rq_lock                [kernel]
              120.00  1.2% swiotlb_sync_single         [kernel]
              120.00  1.2% sky2_rx_submit              [sky2]  
              117.00  1.1% udp_queue_rcv_skb           [kernel]
              108.00  1.1% ip_local_deliver            [kernel]
              104.00  1.0% try_to_wake_up              [kernel]
              102.00  1.0% _raw_spin_unlock_irqrestore [kernel]
               98.00  1.0% select_task_rq_fair         [kernel]



--------------------------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             4601.00 34.0% sky2_poll                   [sky2]  
              732.00  5.4% __udp4_lib_lookup           [kernel]
              724.00  5.3% ip_route_input              [kernel]
              527.00  3.9% _raw_spin_lock_irqsave      [kernel]
              520.00  3.8% __alloc_skb                 [kernel]
              483.00  3.6% ip_rcv                      [kernel]
              441.00  3.3% _raw_spin_lock              [kernel]
              401.00  3.0% sock_queue_rcv_skb          [kernel]
              373.00  2.8% __udp4_lib_rcv              [kernel]
              365.00  2.7% sock_def_readable           [kernel]
              353.00  2.6% __netif_receive_skb         [kernel]
              285.00  2.1% __wake_up_common            [kernel]
              273.00  2.0% __kmalloc                   [kernel]
              230.00  1.7% _raw_read_lock              [kernel]
              208.00  1.5% ep_poll_callback            [kernel]
              199.00  1.5% kmem_cache_alloc            [kernel]
              180.00  1.3% task_rq_lock                [kernel]
              172.00  1.3% sky2_rx_submit              [sky2]  
              171.00  1.3% resched_task                [kernel]
              165.00  1.2% ip_local_deliver            [kernel]
              162.00  1.2% udp_queue_rcv_skb           [kernel]
              158.00  1.2% _raw_spin_unlock_irqrestore [kernel]
              148.00  1.1% select_task_rq_fair         [kernel]
              144.00  1.1% try_to_wake_up              [kernel]
              142.00  1.0% sky2_remove                 [sky2]  
              140.00  1.0% swiotlb_sync_single         [kernel]
               95.00  0.7% cache_alloc_refill          [kernel]
               92.00  0.7% dev_gro_receive             [kernel]
               82.00  0.6% is_swiotlb_buffer           [kernel]


--------------------------------------------------------------------------------------------------
   PerfTop:     622 irqs/sec  kernel:74.9% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _____________________________________

              113.00  6.5% _raw_spin_lock_irqsave      /lib/modules/2.6.34-rc5/build/vmlinux
              105.00  6.0% system_call                 /lib/modules/2.6.34-rc5/build/vmlinux
               69.00  3.9% fget                        /lib/modules/2.6.34-rc5/build/vmlinux
               64.00  3.7% datagram_poll               /lib/modules/2.6.34-rc5/build/vmlinux
               56.00  3.2% copy_user_generic_string    /lib/modules/2.6.34-rc5/build/vmlinux
               55.00  3.1% sys_epoll_ctl               /lib/modules/2.6.34-rc5/build/vmlinux
               53.00  3.0% _raw_spin_unlock_irqrestore /lib/modules/2.6.34-rc5/build/vmlinux
               46.00  2.6% _raw_spin_lock_bh           /lib/modules/2.6.34-rc5/build/vmlinux
               42.00  2.4% kmem_cache_free             /lib/modules/2.6.34-rc5/build/vmlinux
               37.00  2.1% dst_release                 /lib/modules/2.6.34-rc5/build/vmlinux
               37.00  2.1% schedule                    /lib/modules/2.6.34-rc5/build/vmlinux
               35.00  2.0% mutex_lock                  /lib/modules/2.6.34-rc5/build/vmlinux
               35.00  2.0% vread_tsc                   [kernel].vsyscall_fn                 
               35.00  2.0% udp_recvmsg                 /lib/modules/2.6.34-rc5/build/vmlinux
               34.00  1.9% sys_epoll_wait              /lib/modules/2.6.34-rc5/build/vmlinux
               31.00  1.8% local_bh_enable_ip          /lib/modules/2.6.34-rc5/build/vmlinux
               29.00  1.7% ep_remove                   /lib/modules/2.6.34-rc5/build/vmlinux
               28.00  1.6% kmem_cache_alloc            /lib/modules/2.6.34-rc5/build/vmlinux
               27.00  1.5% process_recv                /home/hadi/udp_sink/mcpudp           
               25.00  1.4% mutex_unlock                /lib/modules/2.6.34-rc5/build/vmlinux
               24.00  1.4% ep_send_events_proc         /lib/modules/2.6.34-rc5/build/vmlinux
               24.00  1.4% clock_gettime               /lib/librt-2.7.so                    
               23.00  1.3% fput                        /lib/modules/2.6.34-rc5/build/vmlinux
               23.00  1.3% skb_copy_datagram_iovec     /lib/modules/2.6.34-rc5/build/vmlinux
               20.00  1.1% sock_recv_ts_and_drops      /lib/modules/2.6.34-rc5/build/vmlinux
               20.00  1.1% inet_recvmsg                /lib/modules/2.6.34-rc5/build/vmlinux
               19.00  1.1% epoll_dispatch              /usr/lib/libevent-1.3e.so.1.0.3      
               19.00  1.1% first_packet_length         /lib/modules/2.6.34-rc5/build/vmlinux



--------------------------------------------------------------------------------------------------
   PerfTop:     625 irqs/sec  kernel:83.0% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _____________________________________

              315.00  6.8% _raw_spin_lock_irqsave      /lib/modules/2.6.34-rc5/build/vmlinux
              232.00  5.0% system_call                 /lib/modules/2.6.34-rc5/build/vmlinux
              175.00  3.8% fget                        /lib/modules/2.6.34-rc5/build/vmlinux
              174.00  3.8% datagram_poll               /lib/modules/2.6.34-rc5/build/vmlinux
              168.00  3.6% sys_epoll_ctl               /lib/modules/2.6.34-rc5/build/vmlinux
              155.00  3.4% copy_user_generic_string    /lib/modules/2.6.34-rc5/build/vmlinux
              144.00  3.1% kmem_cache_free             /lib/modules/2.6.34-rc5/build/vmlinux
              133.00  2.9% _raw_spin_lock_bh           /lib/modules/2.6.34-rc5/build/vmlinux
              126.00  2.7% _raw_spin_unlock_irqrestore /lib/modules/2.6.34-rc5/build/vmlinux
              113.00  2.4% vread_tsc                   [kernel].vsyscall_fn                 
              110.00  2.4% _raw_spin_unlock_bh         /lib/modules/2.6.34-rc5/build/vmlinux
              106.00  2.3% schedule                    /lib/modules/2.6.34-rc5/build/vmlinux
              103.00  2.2% local_bh_enable_ip          /lib/modules/2.6.34-rc5/build/vmlinux
              101.00  2.2% udp_recvmsg                 /lib/modules/2.6.34-rc5/build/vmlinux
               97.00  2.1% sys_epoll_wait              /lib/modules/2.6.34-rc5/build/vmlinux
               84.00  1.8% dst_release                 /lib/modules/2.6.34-rc5/build/vmlinux
               78.00  1.7% fput                        /lib/modules/2.6.34-rc5/build/vmlinux
               75.00  1.6% first_packet_length         /lib/modules/2.6.34-rc5/build/vmlinux
               74.00  1.6% kmem_cache_alloc            /lib/modules/2.6.34-rc5/build/vmlinux
               71.00  1.5% ep_remove                   /lib/modules/2.6.34-rc5/build/vmlinux
               69.00  1.5% epoll_ctl                   /lib/libc-2.7.so                     
               67.00  1.5% mutex_lock                  /lib/modules/2.6.34-rc5/build/vmlinux
               65.00  1.4% sock_recv_ts_and_drops      /lib/modules/2.6.34-rc5/build/vmlinux
               65.00  1.4% inet_recvmsg                /lib/modules/2.6.34-rc5/build/vmlinux
               64.00  1.4% process_recv                /home/hadi/udp_sink/mcpudp           
               62.00  1.3% skb_copy_datagram_iovec     /lib/modules/2.6.34-rc5/build/vmlinux
               60.00  1.3% clock_gettime               /lib/librt-2.7.so                    


--------------------------------------------------------------------------------------------------
   PerfTop:     700 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _____________________________________

              489.00  6.4% _raw_spin_lock_irqsave      /lib/modules/2.6.34-rc5/build/vmlinux
              376.00  4.9% system_call                 /lib/modules/2.6.34-rc5/build/vmlinux
              308.00  4.0% fget                        /lib/modules/2.6.34-rc5/build/vmlinux
              302.00  3.9% copy_user_generic_string    /lib/modules/2.6.34-rc5/build/vmlinux
              280.00  3.6% sys_epoll_ctl               /lib/modules/2.6.34-rc5/build/vmlinux
              274.00  3.6% datagram_poll               /lib/modules/2.6.34-rc5/build/vmlinux
              249.00  3.2% kmem_cache_free             /lib/modules/2.6.34-rc5/build/vmlinux
              223.00  2.9% _raw_spin_unlock_irqrestore /lib/modules/2.6.34-rc5/build/vmlinux
              221.00  2.9% _raw_spin_unlock_bh         /lib/modules/2.6.34-rc5/build/vmlinux
              221.00  2.9% local_bh_enable_ip          /lib/modules/2.6.34-rc5/build/vmlinux
              208.00  2.7% vread_tsc                   [kernel].vsyscall_fn                 
              200.00  2.6% _raw_spin_lock_bh           /lib/modules/2.6.34-rc5/build/vmlinux
              191.00  2.5% schedule                    /lib/modules/2.6.34-rc5/build/vmlinux
              188.00  2.4% sys_epoll_wait              /lib/modules/2.6.34-rc5/build/vmlinux
              177.00  2.3% udp_recvmsg                 /lib/modules/2.6.34-rc5/build/vmlinux
              141.00  1.8% fput                        /lib/modules/2.6.34-rc5/build/vmlinux
              140.00  1.8% first_packet_length         /lib/modules/2.6.34-rc5/build/vmlinux
              128.00  1.7% kmem_cache_alloc            /lib/modules/2.6.34-rc5/build/vmlinux
              119.00  1.5% dst_release                 /lib/modules/2.6.34-rc5/build/vmlinux
              105.00  1.4% ep_remove                   /lib/modules/2.6.34-rc5/build/vmlinux
              104.00  1.4% epoll_ctl                   /lib/libc-2.7.so                     
              102.00  1.3% skb_copy_datagram_iovec     /lib/modules/2.6.34-rc5/build/vmlinux
              100.00  1.3% mutex_lock                  /lib/modules/2.6.34-rc5/build/vmlinux
               95.00  1.2% mutex_unlock                /lib/modules/2.6.34-rc5/build/vmlinux
               94.00  1.2% sock_recv_ts_and_drops      /lib/modules/2.6.34-rc5/build/vmlinux
               92.00  1.2% ep_send_events_proc         /lib/modules/2.6.34-rc5/build/vmlinux
               92.00  1.2% clock_gettime               /lib/librt-2.7.so                    
               92.00  1.2% __skb_recv_datagram         /lib/modules/2.6.34-rc5/build/vmlinux
               91.00  1.2% process_recv                /home/hadi/udp_sink/mcpudp           
               88.00  1.1% kfree                       /lib/modules/2.6.34-rc5/build/vmlinux
               86.00  1.1% _raw_spin_lock              /lib/modules/2.6.34-rc5/build/vmlinux



II: net-next with rps = ee

Average udp sink: 94.43%



--------------------------------------------------------------------------------------------------
   PerfTop:    4328 irqs/sec  kernel:84.0% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ______________________

             3908.00 17.1% sky2_poll                      [sky2]                
              694.00  3.0% _raw_spin_lock_irqsave         [kernel]              
              584.00  2.6% sky2_intr                      [sky2]                
              557.00  2.4% system_call                    [kernel]              
              490.00  2.1% _raw_spin_unlock_irqrestore    [kernel]              
              488.00  2.1% fget                           [kernel]              
              425.00  1.9% ip_rcv                         [kernel]              
              405.00  1.8% sys_epoll_ctl                  [kernel]              
              398.00  1.7% __netif_receive_skb            [kernel]              
              375.00  1.6% _raw_spin_lock                 [kernel]              
              365.00  1.6% copy_user_generic_string       [kernel]              
              363.00  1.6% ip_route_input                 [kernel]              
              350.00  1.5% kmem_cache_free                [kernel]              
              346.00  1.5% schedule                       [kernel]              
              319.00  1.4% call_function_single_interrupt [kernel]              
              295.00  1.3% vread_tsc                      [kernel].vsyscall_fn  
              270.00  1.2% __udp4_lib_lookup              [kernel]              
              264.00  1.2% kmem_cache_alloc               [kernel]              
              235.00  1.0% fput                           [kernel]              
              219.00  1.0% datagram_poll                  [kernel]              


--------------------------------------------------------------------------------------------------
   PerfTop:    3791 irqs/sec  kernel:84.4% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ______________________

             6274.00 17.2% sky2_poll                      [sky2]                
             1139.00  3.1% _raw_spin_lock_irqsave         [kernel]              
              953.00  2.6% system_call                    [kernel]              
              942.00  2.6% sky2_intr                      [sky2]                
              785.00  2.2% _raw_spin_unlock_irqrestore    [kernel]              
              745.00  2.0% fget                           [kernel]              
              695.00  1.9% ip_rcv                         [kernel]              
              653.00  1.8% sys_epoll_ctl                  [kernel]              
              609.00  1.7% ip_route_input                 [kernel]              
              606.00  1.7% __netif_receive_skb            [kernel]              
              583.00  1.6% _raw_spin_lock                 [kernel]              
              569.00  1.6% kmem_cache_free                [kernel]              
              564.00  1.5% copy_user_generic_string       [kernel]              
              554.00  1.5% schedule                       [kernel]              
              510.00  1.4% call_function_single_interrupt [kernel]              
              488.00  1.3% vread_tsc                      [kernel].vsyscall_fn  
              459.00  1.3% kmem_cache_alloc               [kernel]              
              417.00  1.1% __udp4_lib_lookup              [kernel]              
              387.00  1.1% fput                           [kernel]              
              358.00  1.0% __udp4_lib_rcv                 [kernel]              
              347.00  1.0% event_base_loop                libevent-1.3e.so.1.0.3

-----------------------------------------------------------------------------------------------
   PerfTop:     997 irqs/sec  kernel:98.2% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ________

             3926.00 61.0% sky2_poll                           [sky2]  
              671.00 10.4% sky2_intr                           [sky2]  
              192.00  3.0% __alloc_skb                         [kernel]
              126.00  2.0% get_rps_cpu                         [kernel]
              111.00  1.7% __kmalloc                           [kernel]
               97.00  1.5% enqueue_to_backlog                  [kernel]
               95.00  1.5% _raw_spin_lock_irqsave              [kernel]
               93.00  1.4% _raw_spin_lock                      [kernel]
               79.00  1.2% kmem_cache_alloc                    [kernel]
               63.00  1.0% sky2_rx_submit                      [sky2]  

-----------------------------------------------------------------------------------------------
   PerfTop:     980 irqs/sec  kernel:98.0% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ____________________

             6945.00 61.4% sky2_poll                           [sky2]              
             1219.00 10.8% sky2_intr                           [sky2]              
              323.00  2.9% __alloc_skb                         [kernel]            
              243.00  2.1% get_rps_cpu                         [kernel]            
              195.00  1.7% __kmalloc                           [kernel]            
              161.00  1.4% _raw_spin_lock_irqsave              [kernel]            
              149.00  1.3% enqueue_to_backlog                  [kernel]            
              139.00  1.2% _raw_spin_lock                      [kernel]            
              136.00  1.2% kmem_cache_alloc                    [kernel]            
              135.00  1.2% irq_entries_start                   [kernel]            
              108.00  1.0% sky2_rx_submit                      [sky2]              


-----------------------------------------------------------------------------------------------
   PerfTop:     458 irqs/sec  kernel:80.8% [1000Hz cycles],  (all, cpu: 2)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _____________________________________

              130.00  4.7% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc5/build/vmlinux
              114.00  4.1% system_call                    /lib/modules/2.6.34-rc5/build/vmlinux
               91.00  3.3% ip_rcv                         /lib/modules/2.6.34-rc5/build/vmlinux
               82.00  3.0% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc5/build/vmlinux
               74.00  2.7% call_function_single_interrupt /lib/modules/2.6.34-rc5/build/vmlinux
               74.00  2.7% fget                           /lib/modules/2.6.34-rc5/build/vmlinux
               71.00  2.6% __netif_receive_skb            /lib/modules/2.6.34-rc5/build/vmlinux
               69.00  2.5% ip_route_input                 /lib/modules/2.6.34-rc5/build/vmlinux
               66.00  2.4% schedule                       /lib/modules/2.6.34-rc5/build/vmlinux
               63.00  2.3% kmem_cache_free                /lib/modules/2.6.34-rc5/build/vmlinux
               61.00  2.2% sys_epoll_ctl                  /lib/modules/2.6.34-rc5/build/vmlinux
               61.00  2.2% __udp4_lib_lookup              /lib/modules/2.6.34-rc5/build/vmlinux
               57.00  2.1% copy_user_generic_string       /lib/modules/2.6.34-rc5/build/vmlinux
               49.00  1.8% vread_tsc                      [kernel].vsyscall_fn                 
               49.00  1.8% _raw_spin_lock                 /lib/modules/2.6.34-rc5/build/vmlinux
               47.00  1.7% ep_remove                      /lib/modules/2.6.34-rc5/build/vmlinux
               45.00  1.6% fput                           /lib/modules/2.6.34-rc5/build/vmlinux
               44.00  1.6% sys_epoll_wait                 /lib/modules/2.6.34-rc5/build/vmlinux
               40.00  1.4% kmem_cache_alloc               /lib/modules/2.6.34-rc5/build/vmlinux
               40.00  1.4% local_bh_enable_ip             /lib/modules/2.6.34-rc5/build/vmlinux
               38.00  1.4% sock_recv_ts_and_drops         /lib/modules/2.6.34-rc5/build/vmlinux
               35.00  1.3% process_recv                   /home/hadi/udp_sink/mcpudp           
               34.00  1.2% mutex_unlock                   /lib/modules/2.6.34-rc5/build/vmlinux
               31.00  1.1% _raw_spin_unlock_bh            /lib/modules/2.6.34-rc5/build/vmlinux
               31.00  1.1% event_base_loop                /usr/lib/libevent-1.3e.so.1.0.3      


-----------------------------------------------------------------------------------------------
   PerfTop:     552 irqs/sec  kernel:82.4% [1000Hz cycles],  (all, cpu: 2)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _____________________________________

              204.00  4.7% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc5/build/vmlinux
              169.00  3.9% system_call                    /lib/modules/2.6.34-rc5/build/vmlinux
              151.00  3.5% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc5/build/vmlinux
              132.00  3.0% ip_rcv                         /lib/modules/2.6.34-rc5/build/vmlinux
              129.00  3.0% fget                           /lib/modules/2.6.34-rc5/build/vmlinux
              123.00  2.8% __netif_receive_skb            /lib/modules/2.6.34-rc5/build/vmlinux
              115.00  2.6% ip_route_input                 /lib/modules/2.6.34-rc5/build/vmlinux
              112.00  2.6% call_function_single_interrupt /lib/modules/2.6.34-rc5/build/vmlinux
              112.00  2.6% sys_epoll_ctl                  /lib/modules/2.6.34-rc5/build/vmlinux
              103.00  2.4% schedule                       /lib/modules/2.6.34-rc5/build/vmlinux
               94.00  2.2% kmem_cache_free                /lib/modules/2.6.34-rc5/build/vmlinux
               89.00  2.0% copy_user_generic_string       /lib/modules/2.6.34-rc5/build/vmlinux
               86.00  2.0% _raw_spin_lock                 /lib/modules/2.6.34-rc5/build/vmlinux
               83.00  1.9% __udp4_lib_lookup              /lib/modules/2.6.34-rc5/build/vmlinux
               76.00  1.7% vread_tsc                      [kernel].vsyscall_fn                 
               68.00  1.6% ep_remove                      /lib/modules/2.6.34-rc5/build/vmlinux
               67.00  1.5% fput                           /lib/modules/2.6.34-rc5/build/vmlinux
               64.00  1.5% kmem_cache_alloc               /lib/modules/2.6.34-rc5/build/vmlinux
               62.00  1.4% sys_epoll_wait                 /lib/modules/2.6.34-rc5/build/vmlinux
               60.00  1.4% dst_release                    /lib/modules/2.6.34-rc5/build/vmlinux
               60.00  1.4% sock_recv_ts_and_drops         /lib/modules/2.6.34-rc5/build/vmlinux
               56.00  1.3% _raw_spin_lock_bh              /lib/modules/2.6.34-rc5/build/vmlinux
               53.00  1.2% event_base_loop                /usr/lib/libevent-1.3e.so.1.0.3      
               51.00  1.2% datagram_poll                  /lib/modules/2.6.34-rc5/build/vmlinux
               48.00  1.1% epoll_ctl                      /lib/libc-2.7.so                     
               48.00  1.1% kfree                          /lib/modules/2.6.34-rc5/build/vmlinux
               47.00  1.1% _raw_spin_unlock_bh            /lib/modules/2.6.34-rc5/build/vmlinux
               47.00  1.1% mutex_unlock                   /lib/modules/2.6.34-rc5/build/vmlinux
               45.00  1.0% __udp4_lib_rcv                 /lib/modules/2.6.34-rc5/build/vmlinux
               45.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc5/build/vmlinux

-----------------------------------------------------------------------------------------------
   PerfTop:     408 irqs/sec  kernel:82.1% [1000Hz cycles],  (all, cpu: 2)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _____________________________________

              240.00  4.8% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc5/build/vmlinux
              200.00  4.0% system_call                    /lib/modules/2.6.34-rc5/build/vmlinux
              165.00  3.3% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc5/build/vmlinux
              161.00  3.2% ip_rcv                         /lib/modules/2.6.34-rc5/build/vmlinux
              158.00  3.1% fget                           /lib/modules/2.6.34-rc5/build/vmlinux
              150.00  3.0% sys_epoll_ctl                  /lib/modules/2.6.34-rc5/build/vmlinux
              135.00  2.7% __netif_receive_skb            /lib/modules/2.6.34-rc5/build/vmlinux
              122.00  2.4% ip_route_input                 /lib/modules/2.6.34-rc5/build/vmlinux
              117.00  2.3% call_function_single_interrupt /lib/modules/2.6.34-rc5/build/vmlinux
              114.00  2.3% schedule                       /lib/modules/2.6.34-rc5/build/vmlinux
              110.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc5/build/vmlinux
              108.00  2.1% copy_user_generic_string       /lib/modules/2.6.34-rc5/build/vmlinux
              101.00  2.0% kmem_cache_free                /lib/modules/2.6.34-rc5/build/vmlinux
               94.00  1.9% vread_tsc                      [kernel].vsyscall_fn                 
               90.00  1.8% __udp4_lib_lookup              /lib/modules/2.6.34-rc5/build/vmlinux
               85.00  1.7% fput                           /lib/modules/2.6.34-rc5/build/vmlinux
               78.00  1.5% dst_release                    /lib/modules/2.6.34-rc5/build/vmlinux
               77.00  1.5% ep_remove                      /lib/modules/2.6.34-rc5/build/vmlinux
               75.00  1.5% kmem_cache_alloc               /lib/modules/2.6.34-rc5/build/vmlinux
               74.00  1.5% _raw_spin_lock_bh              /lib/modules/2.6.34-rc5/build/vmlinux
               69.00  1.4% sys_epoll_wait                 /lib/modules/2.6.34-rc5/build/vmlinux
               68.00  1.3% event_base_loop                /usr/lib/libevent-1.3e.so.1.0.3      
               68.00  1.3% sock_recv_ts_and_drops         /lib/modules/2.6.34-rc5/build/vmlinux
               62.00  1.2% _raw_spin_unlock_bh            /lib/modules/2.6.34-rc5/build/vmlinux
               62.00  1.2% datagram_poll                  /lib/modules/2.6.34-rc5/build/vmlinux
               55.00  1.1% epoll_ctl                      /lib/libc-2.7.so                     
               53.00  1.1% local_bh_enable_ip             /lib/modules/2.6.34-rc5/build/vmlinux
               53.00  1.1% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc5/build/vmlinux
               52.00  1.0% mutex_unlock                   /lib/modules/2.6.34-rc5/build/vmlinux

-----------------------------------------------------------------------------------------------
   PerfTop:     440 irqs/sec  kernel:85.0% [1000Hz cycles],  (all, cpu: 2)
-----------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _____________________________________

              226.00  4.6% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc5/build/vmlinux
              213.00  4.3% system_call                    /lib/modules/2.6.34-rc5/build/vmlinux
              154.00  3.1% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc5/build/vmlinux
              148.00  3.0% ip_rcv                         /lib/modules/2.6.34-rc5/build/vmlinux
              143.00  2.9% fget                           /lib/modules/2.6.34-rc5/build/vmlinux
              143.00  2.9% ip_route_input                 /lib/modules/2.6.34-rc5/build/vmlinux
              140.00  2.8% __netif_receive_skb            /lib/modules/2.6.34-rc5/build/vmlinux
              124.00  2.5% call_function_single_interrupt /lib/modules/2.6.34-rc5/build/vmlinux
              124.00  2.5% sys_epoll_ctl                  /lib/modules/2.6.34-rc5/build/vmlinux
              104.00  2.1% copy_user_generic_string       /lib/modules/2.6.34-rc5/build/vmlinux
              103.00  2.1% vread_tsc                      [kernel].vsyscall_fn                 
              101.00  2.0% schedule                       /lib/modules/2.6.34-rc5/build/vmlinux
              100.00  2.0% kmem_cache_free                /lib/modules/2.6.34-rc5/build/vmlinux
               99.00  2.0% _raw_spin_lock                 /lib/modules/2.6.34-rc5/build/vmlinux
               93.00  1.9% __udp4_lib_lookup              /lib/modules/2.6.34-rc5/build/vmlinux
               80.00  1.6% fput                           /lib/modules/2.6.34-rc5/build/vmlinux
               76.00  1.5% kmem_cache_alloc               /lib/modules/2.6.34-rc5/build/vmlinux
               75.00  1.5% sock_recv_ts_and_drops         /lib/modules/2.6.34-rc5/build/vmlinux
               73.00  1.5% dst_release                    /lib/modules/2.6.34-rc5/build/vmlinux
               70.00  1.4% sys_epoll_wait                 /lib/modules/2.6.34-rc5/build/vmlinux
               69.00  1.4% datagram_poll                  /lib/modules/2.6.34-rc5/build/vmlinux
               65.00  1.3% event_base_loop                /usr/lib/libevent-1.3e.so.1.0.3      
               65.00  1.3% ep_remove                      /lib/modules/2.6.34-rc5/build/vmlinux



III: Kernel compiled with Eric's patch, rps mask 00

Avg udp packets sunk: 98.74%

-------------------------------------------------------------------------------
   PerfTop:    4202 irqs/sec  kernel:82.5% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

             1639.00  9.0% sky2_poll                   [sky2]                
             1051.00  5.8% _raw_spin_lock_irqsave      [kernel]              
              665.00  3.7% system_call                 [kernel]              
              578.00  3.2% fget                        [kernel]              
              476.00  2.6% _raw_spin_unlock_irqrestore [kernel]              
              457.00  2.5% copy_user_generic_string    [kernel]              
              427.00  2.4% sys_epoll_ctl               [kernel]              
              401.00  2.2% datagram_poll               [kernel]              
              391.00  2.2% kmem_cache_free             [kernel]              
              349.00  1.9% schedule                    [kernel]              
              339.00  1.9% vread_tsc                   [kernel].vsyscall_fn  
              323.00  1.8% udp_recvmsg                 [kernel]              
              292.00  1.6% kmem_cache_alloc            [kernel]              
              285.00  1.6% _raw_spin_lock              [kernel]              
              272.00  1.5% _raw_spin_lock_bh           [kernel]              
              268.00  1.5% sys_epoll_wait              [kernel]              
              260.00  1.4% fput                        [kernel]              
              234.00  1.3% ip_route_input              [kernel]              
              221.00  1.2% __udp4_lib_lookup           [kernel]              
              212.00  1.2% dst_release                 [kernel]              
              209.00  1.2% ip_rcv                      [kernel]              
              203.00  1.1% ep_remove                   [kernel]              
              202.00  1.1% first_packet_length         [kernel]              


-------------------------------------------------------------------------------
   PerfTop:    3999 irqs/sec  kernel:82.3% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

             3452.00  9.3% sky2_poll                   [sky2]                
             2212.00  5.9% _raw_spin_lock_irqsave      [kernel]              
             1350.00  3.6% system_call                 [kernel]              
             1187.00  3.2% fget                        [kernel]              
             1010.00  2.7% copy_user_generic_string    [kernel]              
              965.00  2.6% _raw_spin_unlock_irqrestore [kernel]              
              842.00  2.3% sys_epoll_ctl               [kernel]              
              833.00  2.2% datagram_poll               [kernel]              
              770.00  2.1% kmem_cache_free             [kernel]              
              710.00  1.9% vread_tsc                   [kernel].vsyscall_fn  
              688.00  1.8% schedule                    [kernel]              
              651.00  1.7% udp_recvmsg                 [kernel]              
              603.00  1.6% _raw_spin_lock_bh           [kernel]              
              599.00  1.6% _raw_spin_lock              [kernel]              
              597.00  1.6% sys_epoll_wait              [kernel]              
              594.00  1.6% kmem_cache_alloc            [kernel]              
              553.00  1.5% ip_route_input              [kernel]              
              528.00  1.4% fput                        [kernel]              
              496.00  1.3% __udp4_lib_lookup           [kernel]              
              444.00  1.2% dst_release                 [kernel]              
              433.00  1.2% ip_rcv                      [kernel]              
              408.00  1.1% first_packet_length         [kernel]              

-------------------------------------------------------------------------------
   PerfTop:    3765 irqs/sec  kernel:83.7% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

             4275.00  9.5% sky2_poll                   [sky2]                
             2684.00  6.0% _raw_spin_lock_irqsave      [kernel]              
             1654.00  3.7% system_call                 [kernel]              
             1447.00  3.2% fget                        [kernel]              
             1223.00  2.7% copy_user_generic_string    [kernel]              
             1146.00  2.5% _raw_spin_unlock_irqrestore [kernel]              
             1036.00  2.3% sys_epoll_ctl               [kernel]              
             1019.00  2.3% datagram_poll               [kernel]              
              974.00  2.2% kmem_cache_free             [kernel]              
              843.00  1.9% vread_tsc                   [kernel].vsyscall_fn  
              799.00  1.8% schedule                    [kernel]              
              761.00  1.7% udp_recvmsg                 [kernel]              
              736.00  1.6% kmem_cache_alloc            [kernel]              
              719.00  1.6% _raw_spin_lock_bh           [kernel]              
              716.00  1.6% _raw_spin_lock              [kernel]              
              696.00  1.5% sys_epoll_wait              [kernel]              
              680.00  1.5% ip_route_input              [kernel]              
              657.00  1.5% fput                        [kernel]              
              613.00  1.4% __udp4_lib_lookup           [kernel]              
              552.00  1.2% dst_release                 [kernel]              
              507.00  1.1% ip_rcv                      [kernel]            


-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.9% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

              669.00 32.2% sky2_poll                   [sky2]  
              128.00  6.2% ip_route_input              [kernel]
              106.00  5.1% ip_rcv                      [kernel]
              105.00  5.1% __udp4_lib_lookup           [kernel]
               86.00  4.1% _raw_spin_lock              [kernel]
               85.00  4.1% _raw_spin_lock_irqsave      [kernel]
               82.00  3.9% __alloc_skb                 [kernel]
               78.00  3.8% sock_queue_rcv_skb          [kernel]
               57.00  2.7% __netif_receive_skb         [kernel]
               53.00  2.6% __wake_up_common            [kernel]
               47.00  2.3% __udp4_lib_rcv              [kernel]
               42.00  2.0% sock_def_readable           [kernel]
               37.00  1.8% kmem_cache_alloc            [kernel]
               34.00  1.6% ep_poll_callback            [kernel]
               34.00  1.6% __kmalloc                   [kernel]
               34.00  1.6% select_task_rq_fair         [kernel]
               30.00  1.4% _raw_read_lock              [kernel]
               27.00  1.3% _raw_spin_unlock_irqrestore [kernel]
               24.00  1.2% sky2_rx_submit              [sky2]  
               22.00  1.1% udp_queue_rcv_skb           [kernel]
               21.00  1.0% try_to_wake_up              [kernel]


-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             3061.00 31.9% sky2_poll                   [sky2]  
              529.00  5.5% ip_route_input              [kernel]
              518.00  5.4% __udp4_lib_lookup           [kernel]
              424.00  4.4% ip_rcv                      [kernel]
              390.00  4.1% _raw_spin_lock_irqsave      [kernel]
              389.00  4.1% __alloc_skb                 [kernel]
              365.00  3.8% _raw_spin_lock              [kernel]
              326.00  3.4% sock_queue_rcv_skb          [kernel]
              297.00  3.1% __netif_receive_skb         [kernel]
              273.00  2.8% __udp4_lib_rcv              [kernel]
              223.00  2.3% sock_def_readable           [kernel]
              205.00  2.1% __wake_up_common            [kernel]
              181.00  1.9% __kmalloc                   [kernel]
              151.00  1.6% kmem_cache_alloc            [kernel]
              147.00  1.5% _raw_read_lock              [kernel]
              143.00  1.5% ep_poll_callback            [kernel]
              136.00  1.4% sky2_rx_submit              [sky2]  
              123.00  1.3% task_rq_lock                [kernel]
              118.00  1.2% _raw_spin_unlock_irqrestore [kernel]
              114.00  1.2% select_task_rq_fair         [kernel]
              104.00  1.1% resched_task                [kernel]
              104.00  1.1% sky2_remove                 [sky2]  
              102.00  1.1% udp_queue_rcv_skb           [kernel]


-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             3898.00 31.0% sky2_poll                   [sky2]  
              715.00  5.7% ip_route_input              [kernel]
              651.00  5.2% __udp4_lib_lookup           [kernel]
              576.00  4.6% ip_rcv                      [kernel]
              534.00  4.2% __alloc_skb                 [kernel]
              518.00  4.1% _raw_spin_lock_irqsave      [kernel]
              441.00  3.5% sock_queue_rcv_skb          [kernel]
              439.00  3.5% _raw_spin_lock              [kernel]
              396.00  3.1% __netif_receive_skb         [kernel]
              351.00  2.8% __udp4_lib_rcv              [kernel]
              300.00  2.4% sock_def_readable           [kernel]
              264.00  2.1% __wake_up_common            [kernel]
              260.00  2.1% __kmalloc                   [kernel]
              198.00  1.6% kmem_cache_alloc            [kernel]
              193.00  1.5% ep_poll_callback            [kernel]
              192.00  1.5% _raw_read_lock              [kernel]
              168.00  1.3% sky2_rx_submit              [sky2]  
              167.00  1.3% task_rq_lock                [kernel]
              153.00  1.2% udp_queue_rcv_skb           [kernel]
              149.00  1.2% _raw_spin_unlock_irqrestore [kernel]
              147.00  1.2% ip_local_deliver            [kernel]
              144.00  1.1% resched_task                [kernel]
              137.00  1.1% sky2_remove                 [sky2]  


-------------------------------------------------------------------------------
   PerfTop:     663 irqs/sec  kernel:81.9% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              129.00  7.0% _raw_spin_lock_irqsave      [kernel]            
               84.00  4.5% fget                        [kernel]            
               83.00  4.5% system_call                 [kernel]            
               82.00  4.4% copy_user_generic_string    [kernel]            
               67.00  3.6% _raw_spin_unlock_irqrestore [kernel]            
               63.00  3.4% datagram_poll               [kernel]            
               57.00  3.1% udp_recvmsg                 [kernel]            
               55.00  3.0% sys_epoll_ctl               [kernel]            
               55.00  3.0% vread_tsc                   [kernel].vsyscall_fn
               43.00  2.3% sys_epoll_wait              [kernel]            
               43.00  2.3% _raw_spin_lock_bh           [kernel]            
               41.00  2.2% first_packet_length         [kernel]            
               40.00  2.2% dst_release                 [kernel]            
               37.00  2.0% fput                        [kernel]            
               37.00  2.0% kmem_cache_free             [kernel]            
               36.00  1.9% mutex_unlock                [kernel]            
               35.00  1.9% schedule                    [kernel]            
               34.00  1.8% skb_copy_datagram_iovec     [kernel]            
               34.00  1.8% ep_remove                   [kernel]            
               29.00  1.6% mutex_lock                  [kernel]            
               29.00  1.6% _raw_spin_lock              [kernel]            
               28.00  1.5% __skb_recv_datagram         [kernel]            
               25.00  1.4% epoll_ctl                   /lib/libc-2.7.so    
               25.00  1.4% tick_nohz_stop_sched_tick   [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     629 irqs/sec  kernel:81.1% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

              351.00  7.9% _raw_spin_lock_irqsave      [kernel]              
              248.00  5.6% system_call                 [kernel]              
              219.00  5.0% fget                        [kernel]              
              194.00  4.4% copy_user_generic_string    [kernel]              
              184.00  4.2% datagram_poll               [kernel]              
              162.00  3.7% sys_epoll_ctl               [kernel]              
              159.00  3.6% _raw_spin_unlock_irqrestore [kernel]              
              129.00  2.9% udp_recvmsg                 [kernel]              
              129.00  2.9% kmem_cache_free             [kernel]              
              123.00  2.8% vread_tsc                   [kernel].vsyscall_fn  
              108.00  2.4% schedule                    [kernel]              
              107.00  2.4% _raw_spin_lock_bh           [kernel]              
              104.00  2.4% sys_epoll_wait              [kernel]              
              100.00  2.3% fput                        [kernel]              
               94.00  2.1% dst_release                 [kernel]              
               78.00  1.8% first_packet_length         [kernel]              
               73.00  1.7% ep_remove                   [kernel]              
               69.00  1.6% epoll_ctl                   /lib/libc-2.7.so      
               66.00  1.5% skb_copy_datagram_iovec     [kernel]              
               66.00  1.5% mutex_unlock                [kernel]              
               64.00  1.4% __skb_recv_datagram         [kernel]              
               64.00  1.4% mutex_lock                  [kernel]              
               57.00  1.3% sock_recv_ts_and_drops      [kernel]              
               51.00  1.2% kmem_cache_alloc            [kernel]              
               49.00  1.1% ep_send_events_proc         [kernel]              

-------------------------------------------------------------------------------
   PerfTop:     457 irqs/sec  kernel:72.0% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________

              411.00  7.8% _raw_spin_lock_irqsave      [kernel]              
              280.00  5.3% system_call                 [kernel]              
              269.00  5.1% fget                        [kernel]              
              239.00  4.5% copy_user_generic_string    [kernel]              
              232.00  4.4% datagram_poll               [kernel]              
              175.00  3.3% _raw_spin_unlock_irqrestore [kernel]              
              170.00  3.2% sys_epoll_ctl               [kernel]              
              169.00  3.2% kmem_cache_free             [kernel]              
              149.00  2.8% udp_recvmsg                 [kernel]              
              144.00  2.7% vread_tsc                   [kernel].vsyscall_fn  
              129.00  2.4% sys_epoll_wait              [kernel]              
              128.00  2.4% _raw_spin_lock_bh           [kernel]              
              115.00  2.2% fput                        [kernel]              
              112.00  2.1% schedule                    [kernel]              
              108.00  2.0% dst_release                 [kernel]              
               88.00  1.7% first_packet_length         [kernel]              
               86.00  1.6% ep_remove                   [kernel]              
               83.00  1.6% mutex_lock                  [kernel]              
               79.00  1.5% skb_copy_datagram_iovec     [kernel]              
               76.00  1.4% mutex_unlock                [kernel]              
               75.00  1.4% epoll_ctl                   /lib/libc-2.7.so      
               73.00  1.4% sock_recv_ts_and_drops      [kernel]              
               67.00  1.3% __skb_recv_datagram         [kernel]              
               65.00  1.2% tick_nohz_stop_sched_tick   [kernel]              


Interesting stuff; check cache miss contributions - wow, how low is eth_type_trans..
and yet we keep optimizing that!

-------------------------------------------------------------------------------
   PerfTop:    1021 irqs/sec  kernel:98.8% [1000Hz cache-misses],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ________

             5271.00 77.8% sky2_poll                       [sky2]  
              706.00 10.4% kmem_cache_alloc                [kernel]
              154.00  2.3% dev_gro_receive                 [kernel]
              149.00  2.2% __napi_gro_receive              [kernel]
              128.00  1.9% napi_gro_receive                [kernel]
              106.00  1.6% __alloc_skb                     [kernel]
               57.00  0.8% eth_type_trans                  [kernel]
               45.00  0.7% skb_gro_reset_offset            [kernel]
               26.00  0.4% drain_array                     [kernel]
               23.00  0.3% perf_session__mmap_read_counter perf    
               10.00  0.1% cache_alloc_refill              [kernel]
                9.00  0.1% __netdev_alloc_skb              [kernel]
                9.00  0.1% event__preprocess_sample        perf    


-------------------------------------------------------------------------------
   PerfTop:     997 irqs/sec  kernel:100.0% [1000Hz cache-misses],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function             DSO
             _______ _____ ____________________ ________

             3019.00 79.4% sky2_poll            [sky2]  
              360.00  9.5% kmem_cache_alloc     [kernel]
               91.00  2.4% dev_gro_receive      [kernel]
               86.00  2.3% __alloc_skb          [kernel]
               83.00  2.2% __napi_gro_receive   [kernel]
               69.00  1.8% napi_gro_receive     [kernel]
               45.00  1.2% eth_type_trans       [kernel]
               25.00  0.7% skb_gro_reset_offset [kernel]
                9.00  0.2% __netdev_alloc_skb   [kernel]
                5.00  0.1% cache_alloc_refill   [kernel]
                5.00  0.1% skb_pull             [kernel]


-------------------------------------------------------------------------------
   PerfTop:     997 irqs/sec  kernel:100.0% [1000Hz cache-misses],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function             DSO
             _______ _____ ____________________ ________

             8887.00 79.8% sky2_poll            [sky2]  
             1138.00 10.2% kmem_cache_alloc     [kernel]
              273.00  2.5% __napi_gro_receive   [kernel]
              246.00  2.2% dev_gro_receive      [kernel]
              189.00  1.7% napi_gro_receive     [kernel]
              159.00  1.4% __alloc_skb          [kernel]
              119.00  1.1% eth_type_trans       [kernel]
               86.00  0.8% skb_gro_reset_offset [kernel]
               13.00  0.1% __netdev_alloc_skb   [kernel]
                8.00  0.1% skb_pull             [kernel]
                7.00  0.1% cache_alloc_refill   [kernel]


Not much going on on the other cpus, i.e. hardly anything shows up in
the profile.

IV: rps with ee and irq affinity to cpu0

Avg udp packets sunk: 95.15%


-------------------------------------------------------------------------------
   PerfTop:    3558 irqs/sec  kernel:84.6% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

             3096.00 17.1% sky2_poll                     [sky2]                
              645.00  3.6% _raw_spin_lock_irqsave        [kernel]              
              493.00  2.7% system_call                   [kernel]              
              462.00  2.6% sky2_intr                     [sky2]                
              416.00  2.3% _raw_spin_unlock_irqrestore   [kernel]              
              382.00  2.1% fget                          [kernel]              
              361.00  2.0% __netif_receive_skb           [kernel]              
              342.00  1.9% ip_rcv                        [kernel]              
              334.00  1.8% _raw_spin_lock                [kernel]              
              320.00  1.8% sys_epoll_ctl                 [kernel]              
              298.00  1.6% copy_user_generic_string      [kernel]              
              288.00  1.6% call_function_single_interrup [kernel]              
              277.00  1.5% load_balance                  [kernel]              
              271.00  1.5% ip_route_input                [kernel]              
              270.00  1.5% vread_tsc                     [kernel].vsyscall_fn  
              256.00  1.4% kmem_cache_free               [kernel]              
              222.00  1.2% __udp4_lib_lookup             [kernel]              
              222.00  1.2% schedule                      [kernel]              
              194.00  1.1% fput                          [kernel]              
              189.00  1.0% kmem_cache_alloc              [kernel]              
              171.00  0.9% sys_epoll_wait                [kernel]              
              164.00  0.9% ep_remove                     [kernel]          

-------------------------------------------------------------------------------
   PerfTop:    3452 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

             5033.00 16.2% sky2_poll                     [sky2]                
             1147.00  3.7% _raw_spin_lock_irqsave        [kernel]              
              888.00  2.9% system_call                   [kernel]              
              774.00  2.5% sky2_intr                     [sky2]                
              757.00  2.4% _raw_spin_unlock_irqrestore   [kernel]              
              702.00  2.3% fget                          [kernel]              
              630.00  2.0% __netif_receive_skb           [kernel]              
              609.00  2.0% _raw_spin_lock                [kernel]              
              607.00  2.0% ip_rcv                        [kernel]              
              553.00  1.8% sys_epoll_ctl                 [kernel]              
              514.00  1.7% ip_route_input                [kernel]              
              508.00  1.6% call_function_single_interrup [kernel]              
              504.00  1.6% copy_user_generic_string      [kernel]              
              466.00  1.5% kmem_cache_free               [kernel]              
              452.00  1.5% schedule                      [kernel]              
              450.00  1.4% vread_tsc                     [kernel].vsyscall_fn  
              390.00  1.3% load_balance                  [kernel]              
              377.00  1.2% fput                          [kernel]              
              364.00  1.2% __udp4_lib_lookup             [kernel]              
              329.00  1.1% kmem_cache_alloc              [kernel]              
              314.00  1.0% ep_remove                     [kernel]              
              289.00  0.9% dst_release                   [kernel]              
              276.00  0.9% sys_epoll_wait                [kernel]              
              265.00  0.9% datagram_poll                 [kernel]              

-------------------------------------------------------------------------------
   PerfTop:    3328 irqs/sec  kernel:85.7% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

             6788.00 17.5% sky2_poll                     [sky2]                
             1413.00  3.6% _raw_spin_lock_irqsave        [kernel]              
             1042.00  2.7% system_call                   [kernel]              
              997.00  2.6% sky2_intr                     [sky2]                
              903.00  2.3% _raw_spin_unlock_irqrestore   [kernel]              
              837.00  2.2% fget                          [kernel]              
              740.00  1.9% _raw_spin_lock                [kernel]              
              725.00  1.9% __netif_receive_skb           [kernel]              
              722.00  1.9% ip_rcv                        [kernel]              
              651.00  1.7% sys_epoll_ctl                 [kernel]              
              609.00  1.6% call_function_single_interrup [kernel]              
              604.00  1.6% ip_route_input                [kernel]              
              601.00  1.5% copy_user_generic_string      [kernel]              
              573.00  1.5% schedule                      [kernel]              
              561.00  1.4% kmem_cache_free               [kernel]              
              538.00  1.4% load_balance                  [kernel]              
              515.00  1.3% vread_tsc                     [kernel].vsyscall_fn  
              480.00  1.2% fput                          [kernel]              
              421.00  1.1% kmem_cache_alloc              [kernel]              
              418.00  1.1% __udp4_lib_lookup             [kernel]              
              377.00  1.0% ep_remove                     [kernel]              
              347.00  0.9% datagram_poll                 [kernel]              
              335.00  0.9% dst_release                   [kernel]              

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:96.2% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

             2109.00 61.3% sky2_poll                     [sky2]                
              366.00 10.6% sky2_intr                     [sky2]                
               84.00  2.4% __alloc_skb                   [kernel]              
               57.00  1.7% _raw_spin_lock_irqsave        [kernel]              
               56.00  1.6% get_rps_cpu                   [kernel]              
               52.00  1.5% __kmalloc                     [kernel]              
               39.00  1.1% irq_entries_start             [kernel]              
               39.00  1.1% enqueue_to_backlog            [kernel]              
               34.00  1.0% kmem_cache_alloc              [kernel]              
               33.00  1.0% default_send_IPI_mask_sequenc [kernel]              
               32.00  0.9% sky2_rx_submit                [sky2]                
               30.00  0.9% swiotlb_sync_single           [kernel]              
               28.00  0.8% _raw_spin_lock                [kernel]              
               23.00  0.7% sky2_remove                   [sky2]                
               22.00  0.6% __smp_call_function_single    [kernel]              
               19.00  0.6% system_call                   [kernel]              
               18.00  0.5% sys_epoll_ctl                 [kernel]              
               18.00  0.5% fget                          [kernel]              
               17.00  0.5% cache_alloc_refill            [kernel]              
               16.00  0.5% copy_user_generic_string      [kernel]              
               16.00  0.5% _raw_spin_unlock_irqrestore   [kernel]              
               15.00  0.4% dev_gro_receive               [kernel]              
               14.00  0.4% net_rx_action                 [kernel]             

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:97.9% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ____________________

             4479.00 60.9% sky2_poll                       [sky2]              
              849.00 11.5% sky2_intr                       [sky2]              
              163.00  2.2% __alloc_skb                     [kernel]            
              155.00  2.1% get_rps_cpu                     [kernel]            
              121.00  1.6% _raw_spin_lock_irqsave          [kernel]            
               92.00  1.3% __kmalloc                       [kernel]            
               89.00  1.2% _raw_spin_lock                  [kernel]            
               83.00  1.1% enqueue_to_backlog              [kernel]            
               79.00  1.1% irq_entries_start               [kernel]            
               78.00  1.1% kmem_cache_alloc                [kernel]            
               69.00  0.9% sky2_rx_submit                  [sky2]              
               65.00  0.9% swiotlb_sync_single             [kernel]            
               58.00  0.8% default_send_IPI_mask_sequence_ [kernel]            
               50.00  0.7% system_call                     [kernel]            
               45.00  0.6% fget                            [kernel]            
               40.00  0.5% sky2_remove                     [sky2]              
               37.00  0.5% __smp_call_function_single      [kernel]            
               36.00  0.5% datagram_poll                   [kernel]            
               36.00  0.5% _raw_spin_unlock_irqrestore     [kernel]            
               34.00  0.5% cache_alloc_refill              [kernel]            
               31.00  0.4% net_rx_action                   [kernel]            
               28.00  0.4% kmem_cache_free                 [kernel]            
               27.00  0.4% _raw_spin_lock_bh               [kernel]            
               27.00  0.4% copy_user_generic_string        [kernel]            
               25.00  0.3% dev_gro_receive                 [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     980 irqs/sec  kernel:97.3% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                        DSO
             _______ _____ _______________________________ ____________________

             6544.00 61.6% sky2_poll                       [sky2]              
             1098.00 10.3% sky2_intr                       [sky2]              
              248.00  2.3% __alloc_skb                     [kernel]            
              198.00  1.9% get_rps_cpu                     [kernel]            
              182.00  1.7% _raw_spin_lock_irqsave          [kernel]            
              144.00  1.4% __kmalloc                       [kernel]            
              138.00  1.3% _raw_spin_lock                  [kernel]            
              127.00  1.2% kmem_cache_alloc                [kernel]            
              125.00  1.2% irq_entries_start               [kernel]            
              119.00  1.1% enqueue_to_backlog              [kernel]            
               93.00  0.9% sky2_rx_submit                  [sky2]              
               91.00  0.9% swiotlb_sync_single             [kernel]            
               83.00  0.8% default_send_IPI_mask_sequence_ [kernel]            
               82.00  0.8% system_call                     [kernel]            
               64.00  0.6% sky2_remove                     [sky2]              
               60.00  0.6% fget                            [kernel]            
               58.00  0.5% cache_alloc_refill              [kernel]            
               57.00  0.5% _raw_spin_unlock_irqrestore     [kernel]            
               51.00  0.5% datagram_poll                   [kernel]            
               47.00  0.4% copy_user_generic_string        [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     315 irqs/sec  kernel:81.0% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

              114.00  4.5% system_call                   [kernel]              
               98.00  3.9% _raw_spin_lock_irqsave        [kernel]              
               89.00  3.5% _raw_spin_unlock_irqrestore   [kernel]              
               89.00  3.5% ip_rcv                        [kernel]              
               83.00  3.3% call_function_single_interrup [kernel]              
               76.00  3.0% __netif_receive_skb           [kernel]              
               67.00  2.6% fget                          [kernel]              
               62.00  2.4% ip_route_input                [kernel]              
               59.00  2.3% vread_tsc                     [kernel].vsyscall_fn  
               54.00  2.1% kmem_cache_free               [kernel]              
               54.00  2.1% sys_epoll_ctl                 [kernel]              
               51.00  2.0% schedule                      [kernel]              
               49.00  1.9% _raw_spin_lock                [kernel]              
               49.00  1.9% __udp4_lib_lookup             [kernel]              
               44.00  1.7% ep_remove                     [kernel]              
               44.00  1.7% copy_user_generic_string      [kernel]              
               41.00  1.6% fput                          [kernel]              
               38.00  1.5% sys_epoll_wait                [kernel]              
               37.00  1.5% tick_nohz_stop_sched_tick     [kernel]              
               36.00  1.4% kmem_cache_alloc              [kernel]              
               34.00  1.3% datagram_poll                 [kernel]              
               33.00  1.3% __udp4_lib_rcv                [kernel]              
               31.00  1.2% process_recv                  mcpudp               

-------------------------------------------------------------------------------
   PerfTop:     292 irqs/sec  kernel:82.9% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

              154.00  4.7% _raw_spin_lock_irqsave        [kernel]              
              140.00  4.2% system_call                   [kernel]              
              111.00  3.4% ip_rcv                        [kernel]              
              106.00  3.2% _raw_spin_unlock_irqrestore   [kernel]              
               96.00  2.9% call_function_single_interrup [kernel]              
               95.00  2.9% fget                          [kernel]              
               90.00  2.7% __netif_receive_skb           [kernel]              
               89.00  2.7% sys_epoll_ctl                 [kernel]              
               77.00  2.3% copy_user_generic_string      [kernel]              
               77.00  2.3% ip_route_input                [kernel]              
               76.00  2.3% kmem_cache_free               [kernel]              
               74.00  2.2% _raw_spin_lock                [kernel]              
               71.00  2.1% schedule                      [kernel]              
               69.00  2.1% vread_tsc                     [kernel].vsyscall_fn  
               58.00  1.8% __udp4_lib_lookup             [kernel]              
               52.00  1.6% __udp4_lib_rcv                [kernel]              
               51.00  1.5% fput                          [kernel]              
               47.00  1.4% ep_remove                     [kernel]              
               47.00  1.4% event_base_loop               libevent-1.3e.so.1.0.3
               39.00  1.2% process_recv                  mcpudp                
               39.00  1.2% sys_epoll_wait                [kernel]              
               38.00  1.2% udp_recvmsg                   [kernel]              
               38.00  1.2% sock_recv_ts_and_drops        [kernel]              
               37.00  1.1% __switch_to                   [kernel]              

-------------------------------------------------------------------------------
   PerfTop:     290 irqs/sec  kernel:82.1% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________

              175.00  4.7% _raw_spin_lock_irqsave        [kernel]              
              153.00  4.2% system_call                   [kernel]              
              122.00  3.3% ip_rcv                        [kernel]              
              114.00  3.1% _raw_spin_unlock_irqrestore   [kernel]              
              114.00  3.1% fget                          [kernel]              
              105.00  2.8% __netif_receive_skb           [kernel]              
              101.00  2.7% sys_epoll_ctl                 [kernel]              
              100.00  2.7% call_function_single_interrup [kernel]              
               90.00  2.4% copy_user_generic_string      [kernel]              
               84.00  2.3% schedule                      [kernel]              
               76.00  2.1% kmem_cache_free               [kernel]              
               76.00  2.1% _raw_spin_lock                [kernel]              
               72.00  2.0% ip_route_input                [kernel]              
               70.00  1.9% vread_tsc                     [kernel].vsyscall_fn  
               68.00  1.8% __udp4_lib_lookup             [kernel]              
               68.00  1.8% __udp4_lib_rcv                [kernel]              
               57.00  1.5% ep_remove                     [kernel]              
               57.00  1.5% fput                          [kernel]              
               55.00  1.5% kmem_cache_alloc              [kernel]              
               51.00  1.4% process_recv                  mcpudp
jamal April 29, 2010, midnight UTC | #5
On Wed, 2010-04-28 at 19:45 -0400, jamal wrote:

> Your patch has improved the performance of rps relative to what is in
> net-next very lightly; but it has also improved the performance of
> non-rps;->

Correction: the last part of that sentence is not true (obvious if you
look at the results I attached).

cheers,
jamal


Eric Dumazet April 29, 2010, 4:09 a.m. UTC | #6
On Wednesday 28 April 2010 at 19:44 -0400, jamal wrote:
> On Wed, 2010-04-28 at 16:06 +0200, Eric Dumazet wrote:
> 
> > Here it is ;)
> 
> Sorry - things got a little hectic with TheMan.
> 
> I am afraid i dont have good news.
> Actually, I should say i dont have good news in regards to rps.
> For my sample app, two things seem to be happening:
> a) The overall performance has gotten better for both rps
> and non-rps.
> b) non-rps is now performing relatively better
> 
> This is just what i see in net-next not related to your patch.
> It seems the kernels i tested prior to April 23 showed rps better.
> The one i tested on Apr23 showed rps being about the same as non-rps.
> As i stated in my last result posting, I thought i didnt test properly
> but i did again today and saw the same thing. And now non-rps is
> _consistently_ better.
> So some regression is going on...
> 
> Your patch has improved the performance of rps relative to what is in
> net-next very lightly; but it has also improved the performance of
> non-rps;->
> My traces look different for the app cpu than yours - likely because of
> the apps being different.
> 
> At the moment i dont have time to dig deeper into code, but i could
> test as cycles show up.
> 
> I am attaching the profile traces and results.
> 
> cheers,
> jamal

Hi Jamal

I don't see the pps rate, the number of udp ports, or the number of
flows in your results.

In my latest results, I can handle more pps than before, regardless of
rps being on or off, with various numbers of udp ports (one user thread
per port) and various numbers of flows (many src addrs, so that rps
spreads packets over many cpus).

If/when contention windows are smaller, a cpu can run uncontended, and
so can spend more cycles processing more frames?

With a not yet published patch, I can even reach 600.000 pps in DDOS
situations, instead of 400.000.

Thanks !


jamal April 29, 2010, 11:35 a.m. UTC | #7
On Thu, 2010-04-29 at 06:09 +0200, Eric Dumazet wrote:


> I dont see in your results the number of pps, number of udp ports,
> number of flows.

My test scenario is still the same: send 1M packets across 8 flows,
round-robin, at 750Kpps. Repeat the test 4-6 times and average the
results. The 8 flows map to 8 cpus. At any rate above 750Kpps the driver
starts dropping. The flows are {fixed dst IP, fixed src IP, fixed src
port, 8 variable dst ports}. ip_rcv and friends show up in the profile,
as we have already discussed, but I don't want to change the test
characteristics because then I can't do a fair backward comparison.
Also, I use rps mask ee to use all the cpus except the core doing demux
(core 0).
In the results, when I say "udp sink 90%" it means 90% of 750Kpps was
successfully received by the app (on the multiple cpus).
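
To make the mask concrete, here is a minimal sketch of how a hex cpu
bitmap such as ee decodes into cpu numbers. The helper name
rps_mask_to_cpus is hypothetical (it is not a kernel function), and it
only handles a single-word mask without the comma-separated groups used
on larger machines.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Decode a hex cpu bitmap such as "ee" into a list of cpu numbers.
 * Returns how many cpus are set, or -1 if the string is not valid hex.
 * Hypothetical helper, for illustration only.
 */
int rps_mask_to_cpus(const char *hex, int *cpus, int max_cpus)
{
	char *end;
	unsigned long mask;
	int cpu, n = 0;

	/* Only masks that fit in one unsigned long are supported here. */
	assert(max_cpus >= 0 && max_cpus <= (int)(sizeof(unsigned long) * 8));

	mask = strtoul(hex, &end, 16);
	if (end == hex || *end != '\0')
		return -1;

	/* Bit N set means cpu N is allowed to process packets. */
	for (cpu = 0; cpu < max_cpus; cpu++)
		if (mask & (1UL << cpu))
			cpus[n++] = cpu;
	return n;
}
```

For "ee" on an 8-cpu box this yields cpus 1, 2, 3, 5, 6 and 7. Bit 0 is
clear, matching the demux core being excluded; bit 4 is clear as well,
which the description above does not explain.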

> In my latest results, I can handle more pps than before, regardless of
> rps being on or off, 

Same here - even in my worst-case scenario, 88.5% of 750Kpps > 600Kpps.
Attached are historical results to make more sense of what I am saying:
we have net-next kernels from apr14, apr23, apr23 with Changli's change,
apr28, and apr28 with your change. What you'll see is that non-rps (blue)
keeps getting better, while rps (orange) gets better slowly and then by
apr28 is worse.
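
The sink percentages convert to packet rates as sketched below. This is
only the arithmetic behind the comparison above; the function name
sunk_pps is made up for illustration, and the percentage is passed in
hundredths of a percent (88.5% becomes 8850) to keep everything in
integers.

```c
#include <assert.h>

/*
 * Packets per second actually received by the app, given the offered
 * rate and the sink percentage in hundredths of a percent.
 * Hypothetical helper, for illustration only.
 */
unsigned long long sunk_pps(unsigned long long offered_pps,
			    unsigned int sink_centipercent)
{
	assert(sink_centipercent <= 10000);
	return offered_pps * sink_centipercent / 10000;
}
```

With an offered rate of 750000 pps, 88.5% comes to 663750 pps, which is
indeed above 600Kpps, and the 95.15% figure from the rps run comes to
713625 pps.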

> and with various number of udp ports (one user
> thread per port), number of flows (many src addr so that rps spread
> packets on many cpus)
> 

This is true for me too, except that non-rps is getting relatively
better and rps is getting worse in plain net-next for Apr 28. Sorry, I
don't have time to dissect where things changed, but I figured that if I
reported it, it might point to something obvious.

> If/when contention windows are smaller, cpu can run uncontended, and can
> consume more cycles to process more frames ?
> 
> With a non yet published patch, I even can reach 600.000 pps in DDOS
> situations, instead of 400.000.

So my tests are simpler. What I was hoping to see was that, at minimum,
rps maintains its gap of 6-7% more capacity. I don't mind seeing rps get
better. If both rps and non-rps get better, that is even more
interesting.

cheers,
jamal
Changli Gao April 29, 2010, 12:12 p.m. UTC | #8
On Thu, Apr 29, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
>
> Same here - even in my worst case scenario 88.5% of 750Kpps > 600Kpps.
> Attached is history results to make more sense of what i am saying:
> we have net-next kernels from apr14, apr23, apr23 with changlis change,
> apr28, apr28 with your change. What you'll see is non-rps (blue) gets
> better and rps (Orange) gets better slowly then by apr28 it is worse.

Did the number of IPIs increase in the apr28 test? The final patch
with Eric's change may introduce more IPIs. And I am wondering why
23rdcl-non-rps is better than before. Maybe it is a side effect of
my patch: enlarge the netdev_max_backlog.
Eric Dumazet April 29, 2010, 12:45 p.m. UTC | #9
On Thursday 29 April 2010 at 20:12 +0800, Changli Gao wrote:
> On Thu, Apr 29, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> >
> > Same here - even in my worst case scenario 88.5% of 750Kpps > 600Kpps.
> > Attached is history results to make more sense of what i am saying:
> > we have net-next kernels from apr14, apr23, apr23 with changlis change,
> > apr28, apr28 with your change. What you'll see is non-rps (blue) gets
> > better and rps (Orange) gets better slowly then by apr28 it is worse.
> 
> Did the number of IPIs increase in the apr28 test? The final patch
> with Eric's change may introduce more IPIs. And I am wondering why
> 23rdcl-non-rps is better than before. Maybe it is the side effect of
> my patch: enlarge the netdev_max_backlog.
> 
> 

Changli, I wonder how you can cook "performance" patches without testing
them at all for real... This cannot be true ?

When the cpu doing the device softirq is flooded, it handles 300 packets
per net_rx_action() round (netdev_budget), so it sends at most 6 IPIs per
300 packets, with or without my patch, and with or without your patch as
well.

(At most, because if the remote cpus are flooded as well, they don't
napi_complete, so no IPI is needed at all.)

(My patch only has an effect under normal load, i.e. one packet received
once in a while... up to 50.000 pps I would say.) It also has a nice
effect on non-RPS loads (which will mostly remain the more typical load
in the coming years).
If a second packet comes 3us after the first one, and before the 2nd CPU
has handled it, we _can_ afford an extra IPI.

That is one IPI per 50 packets (300/6), so 750.000/50 = 15.000 IPIs per
second.
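
The same bound written out as a small sketch, assuming the softirq cpu
is saturated so that every round consumes the full budget. The function
name max_ipis_per_second and its parameters are made up for
illustration; only the numbers (300 packets per round, at most 6 IPIs
per round, 750.000 pps) come from this thread.

```c
#include <assert.h>

/*
 * Upper bound on IPIs per second sent by the cpu running the device
 * softirq, assuming each net_rx_action() round handles `budget` packets
 * and sends at most `max_ipis_per_round` IPIs.
 * Hypothetical helper, for illustration only.
 */
unsigned long max_ipis_per_second(unsigned long pps,
				  unsigned long budget,
				  unsigned long max_ipis_per_round)
{
	assert(budget > 0);
	return pps * max_ipis_per_round / budget;
}
```

max_ipis_per_second(750000, 300, 6) gives 15000. It is an upper bound
only: when the remote cpus are flooded too, fewer IPIs are sent.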

Even with 200.000 IPIs per second, 'perf top -C CPU_IPI_sender' shows
that sending IPIs is very cheap (maybe ~1% of cpu cycles):

# Samples: 32033467127
#
# Overhead         Command      Shared Object  Symbol
# ........  ..............  .................  ......
#
    18.05%            init  [kernel.kallsyms]  [k] poll_idle
    10.91%            init  [kernel.kallsyms]  [k] bnx2x_rx_int
    10.42%            init  [kernel.kallsyms]  [k] eth_type_trans
     5.72%            init  [kernel.kallsyms]  [k] kmem_cache_alloc_node
     5.43%            init  [kernel.kallsyms]  [k] __memset
     5.20%            init  [kernel.kallsyms]  [k] get_rps_cpu
     4.82%            init  [kernel.kallsyms]  [k] __slab_alloc
     4.34%            init  [kernel.kallsyms]  [k] get_partial_node
     4.22%            init  [kernel.kallsyms]  [k] _raw_spin_lock
     3.41%            init  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
     3.01%            init  [kernel.kallsyms]  [k] __alloc_skb
     2.22%            init  [kernel.kallsyms]  [k] enqueue_to_backlog
     2.10%            init  [kernel.kallsyms]  [k] vlan_gro_common
     1.34%            init  [kernel.kallsyms]  [k] swiotlb_map_page
     1.25%            init  [kernel.kallsyms]  [k] skb_put
     1.06%            init  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.92%            init  [kernel.kallsyms]  [k] dev_gro_receive
     0.88%            init  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error
     0.83%            init  [kernel.kallsyms]  [k] vlan_gro_receive
     0.83%            init  [kernel.kallsyms]  [k] __phys_addr
     0.83%            init  [kernel.kallsyms]  [k] __napi_complete
     0.83%            init  [kernel.kallsyms]  [k] default_send_IPI_mask_sequence_phys
     0.77%            init  [kernel.kallsyms]  [k] is_swiotlb_buffer
     0.76%            init  [kernel.kallsyms]  [k] __netdev_alloc_skb
     0.74%            init  [kernel.kallsyms]  [k] deactivate_slab
     0.73%            init  [kernel.kallsyms]  [k] netif_receive_skb
     0.72%            init  [kernel.kallsyms]  [k] unmap_single
     0.69%            init  [kernel.kallsyms]  [k] csd_lock
     0.63%            init  [kernel.kallsyms]  [k] bnx2x_poll
     0.61%            init  [kernel.kallsyms]  [k] bnx2x_msix_fp_int
     0.59%            init  [kernel.kallsyms]  [k] irq_entries_start
     0.59%            init  [kernel.kallsyms]  [k] swiotlb_sync_single
     0.54%            init  [kernel.kallsyms]  [k] get_slab
     0.46%            init  [kernel.kallsyms]  [k] napi_skb_finish



jamal April 29, 2010, 1:17 p.m. UTC | #10
On Thu, 2010-04-29 at 14:45 +0200, Eric Dumazet wrote:

> 
> Changli, I wonder how you can cook "performance" patches without testing
> them at all for real... This cannot be true ?

Eric, I am with you, however you are in the minority of people who test
and produce numbers ;-> The system rewards people for sending patches,
not much for anything else - so I can't blame Changli ;->

> When the cpu doing the device softirq is flooded, it handles 300 packets
> per net_rx_action() round (netdev_budget), so sends at most 6 ipis per
> 300 packets, with or without my patch, with or without your patch as
> well.
> 
> (At most because if remote cpus are flooded as well, they dont
> napi_complete so no IPI needed at all)
>
> (My patch had an effect only on normal load, ie one packet received in a
> while... up to 50.000 pps I would say). And it also has a nice effect on
> non RPS loads (mostly the more typical load for following years).
> If a second packet comes 3us after the first one, and before 2nd CPU
> handled it, we _can_ afford an extra IPI.
> 
> 750.000/50 = 15.000 IPI per second.

Could we have some stat in there that shows IPIs being produced? I think
it would help to at least observe any changes over variety of tests.
I did try to patch my system during the first few tests to record IPIs
but it seems to make more sense to have it as a perf stat.

> Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
> that sending IPI is very cheap (maybe ~1% of cpu cycles)
> 
> # Samples: 32033467127
> #

One thing i observed is our profiles seem different. Could you send me
your .config for a single nehalem and i will try to go as close as
possible to it? I have a sky2 instead of bnx - but i suspect everything
else will be very similar...
I apologize that I don't have much time to look into details - but what
I can do is at least test.

cheers,
jamal



Eric Dumazet April 29, 2010, 1:21 p.m. UTC | #11
On Thursday, 29 April 2010 at 09:17 -0400, jamal wrote:

> Could we have some stat in there that shows IPIs being produced? I think
> it would help to at least observe any changes over variety of tests.
> I did try to patch my system during the first few tests to record IPIs
> but it seems to make more sense to have it as a perf stat.
> 
> > Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
> > that sending IPI is very cheap (maybe ~1% of cpu cycles)
> > 
> > # Samples: 32033467127
> > #
> 
> One thing i observed is our profiles seem different. Could you send me
> your .config for a single nehalem and i will try to go as close as
> possible to it? I have a sky2 instead of bnx - but i suspect everything
> else will be very similar...
> I apologize i dont have much time to look into details - but what i can
> do is test at least.

I am going to redo some tests on my 'old machine', with the tg3 driver.

You could try the following program:


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

struct softnet_stat_vals {
	int flip;
	unsigned int tab[2][10];
};

/*
 * Sum the first 10 columns of /proc/net/softnet_stat over all cpus into
 * the alternate half of v->tab. Returns 0 on success, -1 on error.
 */
int read_file(struct softnet_stat_vals *v)
{
	char buffer[1024];
	FILE *F = fopen("/proc/net/softnet_stat", "r");

	v->flip ^= 1;
	if (!F)
		return -1;

	memset(v->tab[v->flip], 0, 10 * sizeof(unsigned int));
	while (fgets(buffer, sizeof(buffer), F)) {
		int i, pos = 0;
		unsigned int val;

		/* Each column is 8 hex digits followed by one separator. */
		for (i = 0; i < 10; i++) {
			if (sscanf(buffer + pos, "%08x", &val) != 1)
				break;
			v->tab[v->flip][i] += val;
			pos += 9;
		}
	}
	fclose(F);
	return 0;
}


int main(int argc, char *argv[])
{
	struct softnet_stat_vals *v = calloc(1, sizeof(struct softnet_stat_vals));

	if (!v)
		return 1;
	read_file(v);
	for (;;) {
		sleep(1);
		read_file(v);
		/* Difference of column 9 between two samples taken 1s apart. */
		printf("%u rps\n", v->tab[v->flip][9] - v->tab[v->flip^1][9]);
	}
}


jamal April 29, 2010, 1:37 p.m. UTC | #12
On Thu, 2010-04-29 at 15:21 +0200, Eric Dumazet wrote:


> 
> You could try following program :
> 

Will do later today (test machine is not on the network and is about 20
minutes from here; so worst case i will get you results by end of day)
I guess this program is good enough since it tells me the system wide
ipi count - what my patch did was also to break it down by which cpu got
how many IPIs (served to check if there was uneven distribution)
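
A per-cpu breakdown like the one described above can also be read from user
space. A minimal sketch, assuming (as Eric's program does) that each line of
/proc/net/softnet_stat holds space-separated hex counters for one cpu and
that column 9 is the counter of interest. softnet_column() and
print_per_cpu() are hypothetical helper names, and taking the line index as
the cpu number is an assumption that only holds when no cpu is offline.

```c
#include <stdio.h>

/*
 * Return column `col` (0-based) of one softnet_stat line,
 * or -1 if the line has fewer columns than that.
 */
long long softnet_column(const char *line, int col)
{
	unsigned int val;
	int consumed;
	int i;

	for (i = 0; ; i++) {
		/* %x skips leading blanks; %n reports how far we parsed. */
		if (sscanf(line, "%x%n", &val, &consumed) != 1)
			return -1;
		if (i == col)
			return (long long)val;
		line += consumed;
	}
}

/* Print one value per line of the file, labelled by line index. */
void print_per_cpu(FILE *f, int col)
{
	char line[1024];
	int cpu = 0;

	while (fgets(line, sizeof(line), f))
		printf("cpu%d: %lld\n", cpu++, softnet_column(line, col));
}
```

Calling print_per_cpu() on an open /proc/net/softnet_stat once per second
and subtracting successive values per cpu would show whether the
distribution is uneven.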

> 
> Is your application mono threaded and receiving data to 8 sockets ?
> 

I fork one instance per detected cpu and bind to different ports each
time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.

cheers,
jamal

Eric Dumazet April 29, 2010, 1:49 p.m. UTC | #13
On Thursday, 29 April 2010 at 09:37 -0400, jamal wrote:
> On Thu, 2010-04-29 at 15:21 +0200, Eric Dumazet wrote:
> 
> 
> > 
> > You could try following program :
> > 
> 
> Will do later today (test machine is not on the network and is about 20
> minutes from here; so worst case i will get you results by end of day)
> I guess this program is good enough since it tells me the system wide
> ipi count - what my patch did was also to break it down by which cpu got
> how many IPIs (served to check if there was uneven distribution)
> 
> > 
> > Is your application mono threaded and receiving data to 8 sockets ?
> > 
> 
> I fork one instance per detected cpu and bind to different ports each
> time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.
> 

I guess this is the problem ;)

With RPS, you should not bind your threads to a cpu.
It is the rps hash that will decide for you.


I am using the following program:

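/*
 *  Usage: udpsink [-c] [-v] [-p baseport] nbports
 *
 *  -c  all threads read from the first socket (concurrent mode)
 *  -v  also print per-thread packet counts
 *
 *  Uses pthreads: build with -pthread.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>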

struct worker_data {
	int fd;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long _padd[16 - 3]; /* alignment: pad each worker's counters apart */
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-c] [-v] [-p baseport] nbports\n");
	exit(code);
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;
	char buffer[4096];
	struct sockaddr_in addr;
	int lu;

	while (1) {
		socklen_t len = sizeof(addr);
		lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0, (struct sockaddr *)&addr, &len);
		if (lu > 0) {
			wdata->pack_count++;
			wdata->bytes_count += lu;
		}
	}
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;

	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	if (nbthreads <= 0)
		usage(1);
	wdata = calloc(nbthreads, sizeof(struct worker_data));
	if (!wdata) {
		perror("calloc");
		return 1;
	}
	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd ;
		} else {
			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//			addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind(wdata[i].fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
				perror("bind");
				return 1;
			}
//			fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}
	for (;;) {
		unsigned long total;
		unsigned long delta;	/* printed with %lu below */

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads;i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%lu pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads;i++) { 
					if (wdata[i].pack_count)
						printf(" %d:%lu", i, wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}



jamal April 29, 2010, 1:56 p.m. UTC | #14
On Thu, 2010-04-29 at 15:49 +0200, Eric Dumazet wrote:

> > I fork one instance per detected cpu and bind to different ports each
> > time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.
> > 
> 
> I guess this is the problem ;)
> 
> With RPS, you should not bind your threads to cpu.
> This is the rps hash who will decide for you.
> 

Sorry - I was not clear; i have the option of binding to cpu
vs the setsched api; but what i meant in this case is:
- for each cpu detected, fork
-- open socket
---bind to udp port cpu# + 8200

I could also bind to a cpu in the last step and I did notice it
improved distribution - but all my tests since apr23 don't do that ;->
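
The steps listed above can be sketched as follows. This is an illustration
under assumptions, not jamal's actual test program (which is not shown in
this thread): sink_port(), open_sink_socket() and spawn_sinks() are
hypothetical names, the receive loop is left to the caller, and no cpu
binding is done, matching the tests since apr23.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

#define BASE_PORT 8200	/* base port from the description above */

/* UDP port used by the instance forked for a given cpu number. */
int sink_port(int cpu)
{
	return BASE_PORT + cpu;
}

/* Open a UDP socket bound to sink_port(cpu); returns the fd or -1. */
int open_sink_socket(int cpu)
{
	struct sockaddr_in addr;
	int fd = socket(PF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(sink_port(cpu));
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Fork one child per cpu; each child opens its own socket and runs
 * sink_loop() on it. Returns the number of children started.
 */
int spawn_sinks(int ncpus, void (*sink_loop)(int fd))
{
	int cpu, started = 0;

	for (cpu = 0; cpu < ncpus; cpu++) {
		pid_t pid = fork();

		if (pid == 0) {
			int fd = open_sink_socket(cpu);

			if (fd < 0)
				_exit(1);
			sink_loop(fd);
			_exit(0);
		}
		if (pid > 0)
			started++;
	}
	return started;
}
```

The cpu count would typically come from sysconf(_SC_NPROCESSORS_ONLN), so
on an 8-cpu box the children listen on ports 8200 to 8207.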

> 
> I am using following program :
> 

I will try your program instead so we can reduce the variables

cheers,
jamal

jamal April 29, 2010, 8:36 p.m. UTC | #15
On Thu, 2010-04-29 at 09:56 -0400, jamal wrote:

> 
> I will try your program instead so we can reduce the variables

Results attached.
With your app rps does a hell of a lot better and non-rps worse ;->
With my proggie, non-rps does much better than yours and rps does
a lot worse for the same setup. I see the scheduler kicking in quite a bit
in non-rps for you...

The main difference between us as i see it is:
a) i use epoll - actually linked to libevent (1.0.something)
b) I fork processes and you use pthreads.

I don't have time to chase it today, but I am going to 1) either change
yours to use libevent or make mine get rid of it, then 2) either move mine
to pthreads or have yours fork,
and then observe whether that makes any difference.
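
To make difference a) concrete, here is a rough sketch of an epoll-driven
receive path. It uses the epoll system calls directly rather than libevent,
so it is not the code from either test program; epoll_sink() is a
hypothetical name. The point it illustrates is that every wakeup costs an
epoll_wait() plus a recv(), where the pthread program above issues a single
blocking recvfrom() per packet. Whether that explains the measured
difference is not established here.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Count datagrams arriving on fd until `want` have been received or
 * epoll_wait() times out. Returns the number received, or -1 on error.
 */
long epoll_sink(int fd, long want, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	char buffer[4096];
	long count = 0;
	int ep = epoll_create1(0);

	if (ep < 0)
		return -1;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(ep);
		return -1;
	}
	while (count < want) {
		struct epoll_event out;
		int n = epoll_wait(ep, &out, 1, timeout_ms);

		if (n <= 0)
			break;	/* timeout, signal or error */
		/* Level-triggered: one recv() per wakeup is sufficient. */
		if (recv(out.data.fd, buffer, sizeof(buffer), MSG_DONTWAIT) > 0)
			count++;
	}
	close(ep);
	return count;
}
```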


cheers,
jamal
No RPS; same kernel as yesterday with Eric's changes

-------------------------------------------------------------------------------
   PerfTop:    2572 irqs/sec  kernel:94.7% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             2901.00 17.4% sky2_poll                   [sky2]  
              781.00  4.7% schedule                    [kernel]
              574.00  3.4% __skb_recv_datagram         [kernel]
              518.00  3.1% _raw_spin_lock_irqsave      [kernel]
              460.00  2.8% udp_recvmsg                 [kernel]
              457.00  2.7% copy_user_generic_string    [kernel]
              397.00  2.4% _raw_spin_lock_bh           [kernel]
              340.00  2.0% __udp4_lib_lookup           [kernel]
              320.00  1.9% ip_route_input              [kernel]
              295.00  1.8% _raw_spin_lock              [kernel]
              293.00  1.8% dst_release                 [kernel]
              282.00  1.7% ip_rcv                      [kernel]
              275.00  1.6% skb_copy_datagram_iovec     [kernel]
              263.00  1.6% __switch_to                 [kernel]
              257.00  1.5% __alloc_skb                 [kernel]
              256.00  1.5% system_call                 [kernel]
              243.00  1.5% sock_recv_ts_and_drops      [kernel]
              227.00  1.4% sock_queue_rcv_skb          [kernel]
              225.00  1.3% _raw_spin_unlock_irqrestore [kernel]
              220.00  1.3% fget_light                  [kernel]
              218.00  1.3% pick_next_task_fair         [kernel]

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             1508.00 37.9% sky2_poll                   [sky2]  
              198.00  5.0% ip_route_input              [kernel]
              184.00  4.6% __udp4_lib_lookup           [kernel]
              172.00  4.3% ip_rcv                      [kernel]
              139.00  3.5% _raw_spin_lock              [kernel]
              131.00  3.3% __alloc_skb                 [kernel]
              130.00  3.3% sock_queue_rcv_skb          [kernel]
              111.00  2.8% __udp4_lib_rcv              [kernel]
              101.00  2.5% __netif_receive_skb         [kernel]
               78.00  2.0% select_task_rq_fair         [kernel]
               74.00  1.9% try_to_wake_up              [kernel]
               73.00  1.8% sock_def_readable           [kernel]
               72.00  1.8% _raw_spin_lock_irqsave      [kernel]
               67.00  1.7% task_rq_lock                [kernel]
               66.00  1.7% _raw_read_lock              [kernel]
               64.00  1.6% __kmalloc                   [kernel]
               62.00  1.6% resched_task                [kernel]
               61.00  1.5% sky2_rx_submit              [sky2]  
               52.00  1.3% ip_local_deliver            [kernel]
               51.00  1.3% kmem_cache_alloc            [kernel]
               51.00  1.3% swiotlb_sync_single         [kernel]
               43.00  1.1% sky2_remove                 [sky2]  
               41.00  1.0% udp_queue_rcv_skb           [kernel]
               39.00  1.0% __wake_up_common            [kernel]


-------------------------------------------------------------------------------
   PerfTop:     368 irqs/sec  kernel:95.9% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

              279.00  8.2% schedule                    [kernel]
              260.00  7.7% __skb_recv_datagram         [kernel]
              196.00  5.8% _raw_spin_lock_bh           [kernel]
              180.00  5.3% copy_user_generic_string    [kernel]
              176.00  5.2% udp_recvmsg                 [kernel]
              150.00  4.4% _raw_spin_lock_irqsave      [kernel]
              142.00  4.2% dst_release                 [kernel]
              106.00  3.1% skb_copy_datagram_iovec     [kernel]
               97.00  2.9% sock_recv_ts_and_drops      [kernel]
               93.00  2.7% tick_nohz_stop_sched_tick   [kernel]
               89.00  2.6% sys_recvfrom                [kernel]
               89.00  2.6% __switch_to                 [kernel]
               86.00  2.5% pick_next_task_fair         [kernel]
               82.00  2.4% sock_rfree                  [kernel]
               75.00  2.2% system_call                 [kernel]
               73.00  2.2% fget_light                  [kernel]
               70.00  2.1% _raw_spin_lock_irq          [kernel]
               63.00  1.9% kmem_cache_free             [kernel]
               61.00  1.8% _raw_spin_unlock_irqrestore [kernel]
               60.00  1.8% kfree                       [kernel]
               56.00  1.7% select_nohz_load_balancer   [kernel]
               55.00  1.6% finish_task_switch          [kernel]
               48.00  1.4% inet_recvmsg                [kernel]
               41.00  1.2% security_socket_recvmsg     [kernel]


-------------------------------------------------------------------------------
   PerfTop:      97 irqs/sec  kernel:81.4% [1000Hz cycles],  (all, cpu: 7)
-------------------------------------------------------------------------------

             samples  pcnt function                     DSO
             _______ _____ ____________________________ ________

               55.00 10.8% schedule                     [kernel]
               38.00  7.5% __skb_recv_datagram          [kernel]
               36.00  7.1% udp_recvmsg                  [kernel]
               32.00  6.3% _raw_spin_lock_irqsave       [kernel]
               31.00  6.1% _raw_spin_lock_bh            [kernel]
               30.00  5.9% copy_user_generic_string     [kernel]
               29.00  5.7% sock_recv_ts_and_drops       [kernel]
               27.00  5.3% skb_copy_datagram_iovec      [kernel]
               17.00  3.3% system_call                  [kernel]
               17.00  3.3% dst_release                  [kernel]
               14.00  2.7% _raw_spin_unlock_irqrestore  [kernel]
               12.00  2.4% __switch_to                  [kernel]
               12.00  2.4% pick_next_task_fair          [kernel]
               11.00  2.2% inet_recvmsg                 [kernel]
               11.00  2.2% sys_recvfrom                 [kernel]
               10.00  2.0% finish_task_switch           [kernel]
               10.00  2.0% sock_rfree                   [kernel]
               10.00  2.0% select_nohz_load_balancer    [kernel]
                7.00  1.4% rcu_enter_nohz               [kernel]
                7.00  1.4% tick_nohz_stop_sched_tick    [kernel]
                7.00  1.4% tick_nohz_restart_sched_tick [kernel]
                5.00  1.0% ktime_get                    [kernel]

Run1
----
557257 pps (557257 0:69750 1:69417 2:69063 3:68818 4:70139 5:69824 6:70135 7:70113)
737468 pps (1294725 0:162765 1:162430 2:162075 3:155770 4:163150 5:162838 6:163150 7:162549)
744238 pps (2038963 0:255795 1:255460 2:255105 3:248800 4:256180 5:255867 6:256180 7:255579)
719343 pps (2758306 0:348825 1:348202 2:348135 3:338166 4:349210 5:333030 6:349210 7:343528)
741830 pps (3500136 0:440870 1:440933 2:441165 3:430162 4:442240 5:425970 6:442240 7:436558)
686289 pps (4186425 0:533900 1:533749 2:515637 3:511486 4:531997 5:504717 6:525536 7:529406)
681708 pps (4868133 0:613701 1:617409 2:608667 3:599774 4:607480 5:589487 6:609802 7:621817)
697577 pps (5565710 0:704183 1:710439 2:688904 3:681696 4:689120 5:673932 6:702448 7:714988)
729284 pps (6294994 0:797213 1:803469 2:775863 3:770959 4:781160 5:766105 6:792207 7:808018)
734160 pps (7029154 0:886389 1:896504 2:868898 3:863506 4:868426 5:859138 6:885242 7:901053)
728541 pps (7757695 0:978789 1:989534 2:961928 3:946834 4:961458 5:952170 6:978272 7:988714)
709578 pps (8467273 0:1071819 1:1079000 2:1041101 3:1038974 4:1047215 5:1037254 6:1070168 7:1081744)
684154 pps (9151427 0:1160855 1:1158471 2:1122874 3:1129012 4:1136563 5:1120258 6:1153624 7:1169773)
498291 pps (9649718 0:1224303 1:1214178 2:1185737 3:1191467 4:1200058 5:1183753 6:1217121 7:1233101)

Essentially sinks about 96.5% of the 10M packets

run2
---
402553 pps (402553 0:51530 1:53289 2:53625 3:45748 4:53625 5:49484 6:42292 7:52960)
711539 pps (1114092 0:144028 1:146426 2:144237 3:124551 4:146760 5:142619 6:119376 7:146095)
692319 pps (1806411 0:208285 1:239557 2:220103 3:211096 4:239890 5:235749 6:212506 7:239225)
731896 pps (2538307 0:301450 1:332723 2:308718 3:304264 4:333055 5:320036 6:305671 7:332390)
712869 pps (3251176 0:393270 1:418806 2:397578 3:396844 4:426245 5:406943 6:398861 7:412629)
681513 pps (3932689 0:486300 1:501926 2:490613 3:489874 4:466455 5:499973 6:491891 7:505659)
697308 pps (4629997 0:567969 1:585032 2:583643 3:576712 4:548243 5:589399 6:581080 7:597922)
712903 pps (5342900 0:657579 1:660221 2:676673 3:669744 4:641273 5:682222 6:674110 7:681082)
687765 pps (6030665 0:744421 1:752470 2:764631 3:751445 4:722250 5:771799 6:761224 7:762426)
695799 pps (6726464 0:832438 1:842797 2:853337 3:844470 4:804427 5:857412 6:846918 7:844668)
720011 pps (7446475 0:925210 1:934696 2:934883 3:937280 4:894644 5:949883 6:932740 7:937142)
712021 pps (8158496 0:1017246 1:1027726 2:1016841 3:1024712 4:978513 5:1042913 6:1023516 7:1027031)
709810 pps (8868306 0:1098522 1:1111823 2:1109871 3:1117444 4:1070124 5:1131774 6:1109841 7:1118909)
591817 pps (9460123 0:1178005 1:1185698 2:1189381 3:1196367 4:1143880 5:1198406 6:1176121 7:1192265)

94.6%

run3
---
682714 pps (682714 0:83336 1:86683 2:86895 3:86243 4:84616 5:81152 6:86895 7:86895)
691212 pps (1373926 0:164602 1:179240 2:171897 3:174162 4:176509 5:158115 6:174083 7:175321)
661913 pps (2035839 0:243004 1:263829 2:259312 3:267160 4:268875 5:231009 6:253411 7:249239)
715612 pps (2751451 0:336034 1:350220 2:346461 3:360190 4:359219 5:317625 6:346441 7:335265)
655354 pps (3406805 0:419339 1:434934 2:432010 3:442138 4:437837 5:394805 6:427064 7:418679)
592126 pps (3998931 0:494253 1:511454 2:508829 3:511992 4:508978 5:474866 6:496884 7:491679)
697177 pps (4696108 0:584474 1:601703 2:589111 3:602252 4:598767 5:565114 6:582153 7:572539)
681004 pps (5377112 0:662864 1:684427 2:678825 3:688402 4:685441 5:651962 6:673697 7:651495)
669622 pps (6046734 0:740275 1:765126 2:762764 3:773772 4:772144 5:731330 6:762339 7:738987)
645906 pps (6692640 0:825606 1:850550 2:846793 3:858243 4:850408 5:812402 6:838248 7:810391)
705873 pps (7398513 0:916877 1:937693 2:929956 3:950433 4:938179 5:894913 6:928125 7:902337)
735460 pps (8133973 0:1009907 1:1030722 2:1022986 3:1037959 4:1031209 5:987943 6:1021155 7:992092)
707605 pps (8841578 0:1102933 1:1122367 2:1101160 3:1129212 4:1124239 5:1063617 6:1112929 7:1085122)
347807 pps (9189385 0:1149677 1:1168026 2:1147905 3:1170556 4:1158858 5:1110362 6:1152134 7:1131867)

91.9%

run4
----
552606 pps (552606 0:72743 1:75411 2:67732 3:70204 4:63741 5:64934 6:66096 7:71746)
684450 pps (1237056 0:162839 1:165064 2:148974 3:160417 4:153919 5:135895 6:156238 7:153710)
696799 pps (1933855 0:254440 1:252304 2:240107 3:249399 4:246028 5:228009 6:247409 7:216161)
676546 pps (2610401 0:341132 1:336959 2:325332 3:330438 4:336250 5:305238 6:336208 7:298848)
712251 pps (3322652 0:432976 1:428990 2:413228 3:419977 4:425918 5:386917 6:426275 7:388371)
615680 pps (3938332 0:515679 1:497421 2:491618 3:505449 4:489452 5:462820 6:505336 7:470561)
635467 pps (4573799 0:597340 1:582917 2:555389 3:582751 4:573273 5:545378 6:584378 7:552373)
725581 pps (5299380 0:690038 1:675870 2:636347 3:676029 4:666231 5:632208 6:677337 7:645324)
699015 pps (5998395 0:783068 1:763654 2:725184 3:762784 4:752559 5:709123 6:764439 7:737586)
674472 pps (6672867 0:872645 1:847669 2:808333 3:827766 4:842267 5:798997 6:853779 7:821412)
680913 pps (7353780 0:961487 1:926760 2:887273 3:919158 4:925165 5:891082 6:929793 7:913064)
666279 pps (8020059 0:1050823 1:1012028 2:972691 3:988738 4:1009904 5:974127 6:1017940 7:993808)
680615 pps (8700674 0:1124223 1:1087779 2:1057541 3:1080546 4:1094373 5:1066880 6:1102496 7:1086838)
420306 pps (9120980 0:1177541 1:1130287 2:1111621 3:1134624 4:1148453 5:1120960 6:1156576 7:1140918)

91.2%

run5
------
294229 pps (294229 0:38805 1:30946 2:32655 3:36613 4:38805 5:38805 6:38800 7:38801)
694748 pps (988977 0:124394 1:123976 2:114107 3:128079 4:111317 5:131835 6:131835 7:123434)
690185 pps (1679162 0:217405 1:216988 2:194192 3:204091 4:195948 5:224678 6:220924 7:204937)
726561 pps (2405723 0:307828 1:309671 2:278163 3:296811 4:286642 5:317346 6:311296 7:297967)
695974 pps (3101697 0:391228 1:395256 2:371056 3:388790 4:379533 5:410242 6:393051 7:372541)
665395 pps (3767092 0:473134 1:484367 2:447394 3:462837 4:471026 5:491170 6:473947 7:463219)
671483 pps (4438575 0:562883 1:574014 2:534258 3:544512 4:534064 5:581420 6:560073 7:547353)
679400 pps (5117975 0:641135 1:663809 2:618019 3:633448 4:605085 5:674433 6:649865 7:632183)
696263 pps (5814238 0:734516 1:743715 2:711049 3:717481 4:693193 5:758493 6:740374 7:715417)
681791 pps (6496029 0:823596 1:836004 2:795579 3:809104 4:783457 5:820061 6:820219 7:808010)
670672 pps (7166701 0:911202 1:927618 2:888127 3:875504 4:874363 5:889342 6:911838 7:888707)
743444 pps (7910145 0:1004233 1:1020652 2:981157 3:968534 4:967393 5:982078 6:1004362 7:981737)
725623 pps (8635768 0:1096546 1:1113682 2:1059978 3:1061564 4:1060423 5:1072761 6:1097392 7:1073423)
662504 pps (9298272 0:1171688 1:1197579 2:1137559 3:1154595 4:1146405 5:1161670 6:1176001 7:1152776)
12979 pps (9311251 0:1173488 1:1199379 2:1137914 3:1156399 4:1148209 5:1163475 6:1177806 7:1154581)

93.1%

Average for no-rps 93.5% of 10M incoming at ~ 750Kpps.
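
The per-run percentages and the average can be re-derived from the final
cumulative counts printed at the end of each run. A small sketch, assuming
10M packets were sent per run as stated above; received_percent() and
mean_percent() are names chosen here for illustration.

```c
#include <stdio.h>

/* Final cumulative packet counts of runs 1 to 5 above. */
const unsigned long run_totals[5] = {
	9649718, 9460123, 9189385, 9120980, 9311251
};

/* Share of sent packets that the sink received, in percent. */
double received_percent(unsigned long received, unsigned long sent)
{
	return 100.0 * (double)received / (double)sent;
}

/* Mean of the per-run percentages. */
double mean_percent(const unsigned long *totals, int n, unsigned long sent)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += received_percent(totals[i], sent);
	return n > 0 ? sum / n : 0.0;
}
```

Rounded to one decimal place this gives 96.5, 94.6, 91.9, 91.2 and 93.1 for
the five runs, with a mean of about 93.5.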


# echo 1 >  /proc/irq/55/smp_affinity 
# echo ee  > /sys/class/net/eth0/queues/rx-0/rps_cpus
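
Both values written above are hexadecimal cpu bitmasks: 1 selects cpu 0 for
IRQ 55, and ee selects the cpus RPS may steer packets to. A short sketch
that decodes such a mask; mask_to_cpus() is a hypothetical helper, and it
only handles masks that fit in one unsigned long (no comma-separated
groups).

```c
#include <stdlib.h>

/*
 * Decode a hex cpu bitmask such as "ee" into cpu numbers.
 * Stores at most `max` entries in cpus[] and returns how many were stored.
 */
int mask_to_cpus(const char *hex, int *cpus, int max)
{
	unsigned long mask = strtoul(hex, NULL, 16);
	int cpu, n = 0;

	for (cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1)
		if (mask & 1)
			cpus[n++] = cpu;
	return n;
}
```

For "ee" (binary 11101110) this yields cpus 1, 2, 3, 5, 6 and 7, so the
cpu taking the interrupt (cpu 0) and cpu 4 are left out of the RPS set.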


-------------------------------------------------------------------------------
   PerfTop:    2273 irqs/sec  kernel:93.7% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

              922.00 10.3% sky2_poll                      [sky2]  
              402.00  4.5% __netif_receive_skb            [kernel]
              400.00  4.4% ip_rcv                         [kernel]
              356.00  4.0% call_function_single_interrupt [kernel]
              339.00  3.8% ip_route_input                 [kernel]
              282.00  3.1% schedule                       [kernel]
              194.00  2.2% _raw_spin_lock_irqsave         [kernel]
              180.00  2.0% sock_recv_ts_and_drops         [kernel]
              178.00  2.0% _raw_spin_lock                 [kernel]
              173.00  1.9% __udp4_lib_lookup              [kernel]
              171.00  1.9% __udp4_lib_rcv                 [kernel]
              162.00  1.8% system_call                    [kernel]
              154.00  1.7% kfree                          [kernel]
              147.00  1.6% __skb_recv_datagram            [kernel]
              146.00  1.6% copy_user_generic_string       [kernel]
              136.00  1.5% dst_release                    [kernel]
              136.00  1.5% _raw_spin_unlock_irqrestore    [kernel]
              126.00  1.4% fget_light                     [kernel]
              126.00  1.4% sky2_intr                      [sky2]  
              122.00  1.4% udp_recvmsg                    [kernel]
              111.00  1.2% sock_queue_rcv_skb             [kernel]



-------------------------------------------------------------------------------
   PerfTop:     325 irqs/sec  kernel:93.2% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ________

             1033.00 62.9% sky2_poll                           [sky2]  
              159.00  9.7% sky2_intr                           [sky2]  
              119.00  7.3% irq_entries_start                   [kernel]
               51.00  3.1% __alloc_skb                         [kernel]
               48.00  2.9% get_rps_cpu                         [kernel]
               24.00  1.5% __kmalloc                           [kernel]
               23.00  1.4% swiotlb_sync_single                 [kernel]
               20.00  1.2% _raw_spin_lock                      [kernel]
               17.00  1.0% sky2_rx_submit                      [sky2]  
               15.00  0.9% enqueue_to_backlog                  [kernel]
               14.00  0.9% kmem_cache_alloc                    [kernel]
               11.00  0.7% default_send_IPI_mask_sequence_phys [kernel]
               10.00  0.6% sky2_remove                         [sky2]  
               10.00  0.6% cache_alloc_refill                  [kernel]
                8.00  0.5% _raw_spin_lock_irqsave              [kernel]
                7.00  0.4% dev_gro_receive                     [kernel]
                6.00  0.4% net_rx_action                       [kernel]
                6.00  0.4% __netdev_alloc_skb                  [kernel]
                6.00  0.4% load_balance                        [kernel]
                5.00  0.3% __smp_call_function_single          [kernel]


-------------------------------------------------------------------------------
   PerfTop:     347 irqs/sec  kernel:96.3% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

              104.00  6.7% call_function_single_interrupt [kernel]
              104.00  6.7% __netif_receive_skb            [kernel]
               95.00  6.1% ip_rcv                         [kernel]
               93.00  6.0% ip_route_input                 [kernel]
               62.00  4.0% schedule                       [kernel]
               49.00  3.2% sock_recv_ts_and_drops         [kernel]
               46.00  3.0% system_call                    [kernel]
               46.00  3.0% dst_release                    [kernel]
               45.00  2.9% _raw_spin_lock                 [kernel]
               41.00  2.7% _raw_spin_lock_irqsave         [kernel]
               40.00  2.6% _raw_spin_unlock_irqrestore    [kernel]
               36.00  2.3% copy_user_generic_string       [kernel]
               34.00  2.2% __udp4_lib_rcv                 [kernel]
               30.00  1.9% fget_light                     [kernel]
               30.00  1.9% sock_queue_rcv_skb             [kernel]
               28.00  1.8% udp_recvmsg                    [kernel]
               28.00  1.8% __udp4_lib_lookup              [kernel]
               26.00  1.7% select_task_rq_fair            [kernel]
               25.00  1.6% tick_nohz_stop_sched_tick      [kernel]
               23.00  1.5% __napi_complete                [kernel]
               20.00  1.3% __switch_to                    [kernel]
               20.00  1.3% finish_task_switch             [kernel]
               20.00  1.3% kmem_cache_free                [kernel]
               20.00  1.3% sys_recvfrom                   [kernel]
               19.00  1.2% kfree                          [kernel]
               19.00  1.2% __skb_recv_datagram            [kernel]

-------------------------------------------------------------------------------
   PerfTop:     243 irqs/sec  kernel:95.5% [1000Hz cycles],  (all, cpu: 7)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

               92.00  7.3% ip_rcv                         [kernel]
               74.00  5.9% __netif_receive_skb            [kernel]
               57.00  4.6% ip_route_input                 [kernel]
               49.00  3.9% sock_recv_ts_and_drops         [kernel]
               49.00  3.9% system_call                    [kernel]
               47.00  3.8% schedule                       [kernel]
               39.00  3.1% _raw_spin_lock_irqsave         [kernel]
               36.00  2.9% call_function_single_interrupt [kernel]
               34.00  2.7% udp_recvmsg                    [kernel]
               32.00  2.6% __udp4_lib_rcv                 [kernel]
               31.00  2.5% copy_user_generic_string       [kernel]
               31.00  2.5% fget_light                     [kernel]
               30.00  2.4% __udp4_lib_lookup              [kernel]
               26.00  2.1% kfree                          [kernel]
               25.00  2.0% __skb_recv_datagram            [kernel]
               25.00  2.0% sock_queue_rcv_skb             [kernel]
               23.00  1.8% __switch_to                    [kernel]
               22.00  1.8% sock_recvmsg                   [kernel]
               22.00  1.8% _raw_spin_unlock_irqrestore    [kernel]
               21.00  1.7% select_task_rq_fair            [kernel]
               18.00  1.4% _raw_spin_lock                 [kernel]
               17.00  1.4% process_backlog                [kernel]
               17.00  1.4% sys_recvfrom                   [kernel]
               17.00  1.4% _raw_spin_lock_bh              [kernel]

run1
----
590479 pps (590479 0:73820 1:73817 2:73820 3:73819 4:73815 5:73815 6:73815 7:73815)
744641 pps (1335120 0:166895 1:166895 2:166895 3:166895 4:166895 5:166895 6:166895 7:166895)
744374 pps (2079494 0:259940 1:259940 2:259940 3:259940 4:259940 5:259940 6:259940 7:259940)
744340 pps (2823834 0:352985 1:352985 2:352985 3:352985 4:352985 5:352985 6:352980 7:352985)
744390 pps (3568224 0:446035 1:446035 2:446035 3:446035 4:446035 5:446035 6:446032 7:446030)
744404 pps (4312628 0:539085 1:539085 2:539085 3:539081 4:539085 5:539085 6:539085 7:539085)
744369 pps (5056997 0:632130 1:632130 2:632130 3:632130 4:632130 5:632130 6:632130 7:632130)
744394 pps (5801391 0:725180 1:725180 2:725180 3:725180 4:725180 5:725180 6:725180 7:725180)
744399 pps (6545790 0:818230 1:818230 2:818229 3:818230 4:818230 5:818226 6:818225 7:818225)
744354 pps (7290144 0:911275 1:911275 2:911275 3:911275 4:911270 5:911270 6:911270 7:911270)
744363 pps (8034507 0:1004320 1:1004320 2:1004320 3:1004320 4:1004320 5:1004306 6:1004320 7:1004317)
744379 pps (8778886 0:1097370 1:1097368 2:1097370 3:1097370 4:1097370 5:1097356 6:1097367 7:1097365)
744449 pps (9523335 0:1190425 1:1190425 2:1190425 3:1190421 4:1190425 5:1190411 6:1190425 7:1190425)
476651 pps (9999986 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1249986 6:1250000 7:1250000)

99.9% !

rps counter..
865721 rps
1067721 rps

run2
----
573759 pps (573759 0:71720 1:71720 2:71720 3:71723 4:71721 5:71720 6:71720 7:71719)
744249 pps (1318008 0:164755 1:164753 2:164750 3:164750 4:164750 5:164750 6:164750 7:164750)
744260 pps (2062268 0:257785 1:257785 2:257785 3:257785 4:257785 5:257783 6:257780 7:257780)
744238 pps (2806506 0:350815 1:350815 2:350815 3:350815 4:350815 5:350811 6:350810 7:350810)
744233 pps (3550739 0:443845 1:443845 2:443845 3:443845 4:443844 5:443841 6:443841 7:443840)
744236 pps (4294975 0:536875 1:536875 2:536875 3:536870 4:536870 5:536870 6:536870 7:536870)
744244 pps (5039219 0:629905 1:629905 2:629905 3:629905 4:629905 5:629901 6:629901 7:629900)
744240 pps (5783459 0:722935 1:722935 2:722935 3:722934 4:722930 5:722930 6:722930 7:722930)
744214 pps (6527673 0:815962 1:815960 2:815965 3:815963 4:815962 5:815960 6:815955 7:815955)
744268 pps (7271941 0:908995 1:908995 2:908995 3:908995 4:908991 5:908990 6:908990 7:908990)
744239 pps (8016180 0:1002025 1:1002025 2:1002025 3:1002025 4:1002020 5:1002020 6:1002020 7:1002020)
744241 pps (8760421 0:1095055 1:1095055 2:1095052 3:1095055 4:1095055 5:1095050 6:1095050 7:1095050)
744234 pps (9504655 0:1188085 1:1188085 2:1188084 3:1188085 4:1188085 5:1188081 6:1188080 7:1188080)
495345 pps (10000000 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1250000 6:1250000 7:1250000)

100.0% !!!

rps count ..
3651 rps
1455997 rps
498777 rps

run3
----
72947 pps (72947 0:9120 1:9120 2:9120 3:9120 4:9120 5:9117 6:9115 7:9115)
744616 pps (817563 0:102198 1:102195 2:102195 3:102195 4:102195 5:102195 6:102195 7:102195)
744710 pps (1562273 0:195285 1:195285 2:195285 3:195285 4:195285 5:195285 6:195285 7:195283)
744478 pps (2306751 0:288345 1:288345 2:288345 3:288345 4:288345 5:288345 6:288341 7:288340)
744603 pps (3051354 0:381422 1:381420 2:381420 3:381414 4:381420 5:381420 6:381420 7:381420)
744475 pps (3795829 0:474480 1:474480 2:474480 3:474472 4:474480 5:474480 6:474480 7:474477)
744740 pps (4540569 0:567575 1:567575 2:567575 3:567564 4:567570 5:567570 6:567570 7:567570)
744641 pps (5285210 0:660655 1:660655 2:660655 3:660646 4:660650 5:660650 6:660650 7:660650)
744300 pps (6029510 0:753695 1:753690 2:753690 3:753682 4:753690 5:753690 6:753690 7:753690)
744249 pps (6773759 0:846725 1:846725 2:846725 3:846712 4:846720 5:846720 6:846720 7:846720)
744709 pps (7518468 0:939814 1:939810 2:939810 3:939802 4:939810 5:939810 6:939810 7:939810)
744647 pps (8263115 0:1032893 1:1032890 2:1032890 3:1032882 4:1032890 5:1032890 6:1032890 7:1032890)
744672 pps (9007787 0:1125976 1:1125975 2:1125975 3:1125967 4:1125975 5:1125975 6:1125975 7:1125970)
744692 pps (9752479 0:1219065 1:1219065 2:1219062 3:1219056 4:1219060 5:1219060 6:1219060 7:1219060)
247513 pps (9999992 0:1250000 1:1250000 2:1250000 3:1249992 4:1250000 5:1250000 6:1250000 7:1250000)

99.9%!
rps count ...
1118484 rps
842940 rps

run4
----
288558 pps (288558 0:36070 1:36070 2:36070 3:36070 4:36070 5:36070 6:36070 7:36068)
744237 pps (1032795 0:129103 1:129100 2:129105 3:129100 4:129100 5:129100 6:129095 7:129095)
742988 pps (1775783 0:222135 1:222135 2:222135 3:222135 4:220853 5:222130 6:222130 7:222130)
744210 pps (2519993 0:315160 1:315160 2:315160 3:315160 4:313883 5:315160 6:315155 7:315155)
744214 pps (3264207 0:408189 1:408185 2:408185 3:408185 4:406908 5:408185 6:408185 7:408185)
744278 pps (4008485 0:501223 1:501220 2:501220 3:501220 4:499943 5:501220 6:501220 7:501220)
743699 pps (4752184 0:594252 1:594250 2:593718 3:594250 4:592973 5:594250 6:594248 7:594245)
744243 pps (5496427 0:687280 1:687280 2:686748 3:687280 4:686003 5:687280 6:687280 7:687276)
744231 pps (6240658 0:780310 1:780310 2:779778 3:780310 4:779033 5:780300 6:780310 7:780307)
743958 pps (6984616 0:873342 1:873340 2:872808 3:873340 4:872063 5:873043 6:873340 7:873340)
744241 pps (7728857 0:966373 1:966370 2:965838 3:966370 4:965093 5:966073 6:966370 7:966370)
744232 pps (8473089 0:1059400 1:1059400 2:1058868 3:1059400 4:1058123 5:1059103 6:1059397 7:1059398)
743660 pps (9216749 0:1152434 1:1152430 2:1151898 3:1152430 4:1151153 5:1151556 6:1152427 7:1152430)
744251 pps (9961000 0:1245463 1:1245460 2:1244928 3:1245460 4:1244183 5:1244586 6:1245460 7:1245460)
36317 pps (9997317 0:1250000 1:1250000 2:1249468 3:1250000 4:1248723 5:1249126 6:1250000 7:1250000)

99.9%!
rps count
818552 rps
1146570 rps

run5
----
686211 pps (686211 0:85780 1:85780 2:85775 3:85779 4:85780 5:85780 6:85775 7:85775)
744260 pps (1430471 0:178810 1:178810 2:178810 3:178810 4:178810 5:178810 6:178806 7:178805)
744242 pps (2174713 0:271840 1:271840 2:271840 3:271840 4:271840 5:271840 6:271838 7:271835)
744241 pps (2918954 0:364870 1:364870 2:364870 3:364870 4:364870 5:364870 6:364869 7:364865)
744238 pps (3663192 0:457900 1:457900 2:457900 3:457900 4:457900 5:457900 6:457900 7:457899)
744240 pps (4407432 0:550930 1:550930 2:550930 3:550930 4:550930 5:550930 6:550927 7:550925)
744244 pps (5151676 0:643960 1:643960 2:643960 3:643960 4:643960 5:643960 6:643960 7:643956)
744236 pps (5895912 0:736990 1:736990 2:736990 3:736990 4:736990 5:736990 6:736987 7:736985)
744241 pps (6640153 0:830020 1:830020 2:830020 3:830020 4:830020 5:830020 6:830018 7:830015)
744235 pps (7384388 0:923050 1:923050 2:923050 3:923050 4:923050 5:923049 6:923045 7:923047)
744244 pps (8128632 0:1016080 1:1016080 2:1016080 3:1016080 4:1016080 5:1016080 6:1016079 7:1016075)
744231 pps (8872863 0:1109110 1:1109110 2:1109110 3:1109110 4:1109108 5:1109105 6:1109105 7:1109105)
744258 pps (9617121 0:1202141 1:1202140 2:1202140 3:1202140 4:1202140 5:1202140 6:1202140 7:1202140)
382879 pps (10000000 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1250000 6:1250000 7:1250000)

100%
rpsipi count ..
768383 rps
1178132 rps
Changli Gao April 29, 2010, 11:07 p.m. UTC | #16
On Thu, Apr 29, 2010 at 8:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Changli, I wonder how you can cook "performance" patches without testing
> them at all for real... This cannot be true ?
>

I am sorry. But I wasn't against your patch, and I just wanted to
understand the test result from jamal. It is my fault for submitting a
performance patch without testing it. I should not rely on code
inspection for a performance patch.
Brian Bloniarz April 30, 2010, 1:55 p.m. UTC | #17
Eric Dumazet wrote:
> Here is last 'patch of the day' for me ;)
> Next one will be able to coalesce wakeup calls (they'll be delayed at
> the end of net_rx_action(), like a patch I did last year to help
> multicast reception)
>
> vger seems to be down, I suspect I'll have to resend it later.
>
> [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
>
> sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
> need two atomic operations (and associated dirtying) per incoming
> packet.
>   

This patch boots for me, I haven't noticed any strangeness yet.

I ran a few benchmarks (the multicast fan-out mcasttest.c
from last year, a few other things we have lying around).
I think I see a modest improvement from this and your other
2 patches. Presumably the big wins are where multiple cores
perform bh for the same socket; that's not the case in
these benchmarks. If it's appropriate:

Tested-by: Brian Bloniarz <bmb@athenacr.com>

> Next one will be able to coalesce wakeup calls (they'll be delayed at
> the end of net_rx_action(), like a patch I did last year to help
> multicast reception)

Keep em coming :)
jamal April 30, 2010, 7:30 p.m. UTC | #18
Eric!

I managed to mod your program to look conceptually similar to mine
and I reproduced the results with the same test kernel from yesterday.
So it is likely the issue is in using epoll vs not using any async as
in your case.
Results attached as well as modified program.

Note, the key thing to remember:
rps with this program has gotten worse over time across different
net-next kernels since Apr14 (look at the graph I supplied). Sorry, I am
really too busy to dig any further.

cheers,
jamal



On Thu, 2010-04-29 at 16:36 -0400, jamal wrote:
> On Thu, 2010-04-29 at 09:56 -0400, jamal wrote:
> 
> > 
> > I will try your program instead so we can reduce the variables
> 
> Results attached.
> With your app rps does a hell lot better and non-rps worse ;->
> With my proggie, non-rps does much better than yours and rps does
> a lot worse for same setup. I see the scheduler kicking quite a bit in
> non-rps for you...
> 
> The main difference between us as i see it is:
> a) i use epoll - actually linked to libevent (1.0.something)
> b) I fork processes and you use pthreads.
> 
> I dont have time to chase it today, but 1) I am either going to change
> yours to use libevent or make mine get rid of it then 2) move towards
> pthreads or have yours fork..
> then observe if that makes any difference..
> 
> 
> cheers,
> jamal
First a few runs with Eric's code + epoll/libevent

-------------------------------------------------------------------------------
   PerfTop:    4009 irqs/sec  kernel:83.4% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             2097.00  8.6% sky2_poll                   [sky2]              
             1742.00  7.2% _raw_spin_lock_irqsave      [kernel]            
              831.00  3.4% system_call                 [kernel]            
              654.00  2.7% copy_user_generic_string    [kernel]            
              654.00  2.7% datagram_poll               [kernel]            
              647.00  2.7% fget                        [kernel]            
              623.00  2.6% _raw_spin_unlock_irqrestore [kernel]            
              547.00  2.3% _raw_spin_lock_bh           [kernel]            
              506.00  2.1% sys_epoll_ctl               [kernel]            
              475.00  2.0% kmem_cache_free             [kernel]            
              466.00  1.9% schedule                    [kernel]            
              436.00  1.8% vread_tsc                   [kernel].vsyscall_fn
              417.00  1.7% fput                        [kernel]            
              415.00  1.7% sys_epoll_wait              [kernel]            
              402.00  1.7% _raw_spin_lock              [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     616 irqs/sec  kernel:98.7% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             2534.00 28.6% sky2_poll              [sky2]  
              503.00  5.7% ip_route_input         [kernel]
              438.00  4.9% _raw_spin_lock_irqsave [kernel]
              418.00  4.7% __udp4_lib_lookup      [kernel]
              378.00  4.3% __alloc_skb            [kernel]
              364.00  4.1% ip_rcv                 [kernel]
              323.00  3.6% _raw_spin_lock         [kernel]
              315.00  3.5% sock_queue_rcv_skb     [kernel]
              284.00  3.2% __netif_receive_skb    [kernel]
              281.00  3.2% __udp4_lib_rcv         [kernel]
              266.00  3.0% __wake_up_common       [kernel]
              238.00  2.7% sock_def_readable      [kernel]
              181.00  2.0% __kmalloc              [kernel]
              163.00  1.8% kmem_cache_alloc       [kernel]
              150.00  1.7% ep_poll_callback       [kernel]


-------------------------------------------------------------------------------
   PerfTop:     854 irqs/sec  kernel:80.2% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              341.00  8.0% _raw_spin_lock_irqsave      [kernel]            
              235.00  5.5% system_call                 [kernel]            
              174.00  4.1% datagram_poll               [kernel]            
              174.00  4.1% fget                        [kernel]            
              173.00  4.1% copy_user_generic_string    [kernel]            
              135.00  3.2% _raw_spin_unlock_irqrestore [kernel]            
              125.00  2.9% _raw_spin_lock_bh           [kernel]            
              122.00  2.9% schedule                    [kernel]            
              113.00  2.6% sys_epoll_ctl               [kernel]            
              113.00  2.6% kmem_cache_free             [kernel]            
              108.00  2.5% vread_tsc                   [kernel].vsyscall_fn
              105.00  2.5% sys_epoll_wait              [kernel]            
              102.00  2.4% udp_recvmsg                 [kernel]            
               95.00  2.2% mutex_lock                  [kernel]            

Average 97.55% of 10M packets at 750Kpps

Turn on rps mask ee and irq affinity to cpu0

-------------------------------------------------------------------------------
   PerfTop:    3885 irqs/sec  kernel:83.6% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

             2945.00 16.7% sky2_poll                      [sky2]  
              653.00  3.7% _raw_spin_lock_irqsave         [kernel]
              460.00  2.6% system_call                    [kernel]
              420.00  2.4% _raw_spin_unlock_irqrestore    [kernel]
              414.00  2.3% sky2_intr                      [sky2]  
              392.00  2.2% fget                           [kernel]
              360.00  2.0% ip_rcv                         [kernel]
              324.00  1.8% sys_epoll_ctl                  [kernel]
              323.00  1.8% __netif_receive_skb            [kernel]
              310.00  1.8% schedule                       [kernel]
              292.00  1.7% ip_route_input                 [kernel]
              292.00  1.7% _raw_spin_lock                 [kernel]
              291.00  1.7% copy_user_generic_string       [kernel]
              284.00  1.6% kmem_cache_free                [kernel]
              262.00  1.5% call_function_single_interrupt [kernel]

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:98.1% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ________

             4170.00 61.9% sky2_poll                           [sky2]  
              723.00 10.7% sky2_intr                           [sky2]  
              159.00  2.4% __alloc_skb                         [kernel]
              140.00  2.1% get_rps_cpu                         [kernel]
              106.00  1.6% __kmalloc                           [kernel]
               95.00  1.4% enqueue_to_backlog                  [kernel]
               86.00  1.3% kmem_cache_alloc                    [kernel]
               85.00  1.3% irq_entries_start                   [kernel]
               85.00  1.3% _raw_spin_lock_irqsave              [kernel]
               82.00  1.2% _raw_spin_lock                      [kernel]
               66.00  1.0% swiotlb_sync_single                 [kernel]
               58.00  0.9% sky2_remove                         [sky2]  
               49.00  0.7% default_send_IPI_mask_sequence_phys [kernel]
               47.00  0.7% sky2_rx_submit                      [sky2]  
               36.00  0.5% _raw_spin_unlock_irqrestore         [kernel]

-------------------------------------------------------------------------------
   PerfTop:     344 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ____________________

              114.00  5.2% _raw_spin_lock_irqsave         [kernel]            
               79.00  3.6% fget                           [kernel]            
               78.00  3.6% ip_rcv                         [kernel]            
               78.00  3.6% system_call                    [kernel]            
               75.00  3.4% _raw_spin_unlock_irqrestore    [kernel]            
               67.00  3.1% sys_epoll_ctl                  [kernel]            
               65.00  3.0% schedule                       [kernel]            
               61.00  2.8% ip_route_input                 [kernel]            
               48.00  2.2% vread_tsc                      [kernel].vsyscall_fn
               48.00  2.2% call_function_single_interrupt [kernel]            
               46.00  2.1% kmem_cache_free                [kernel]            
               45.00  2.1% __netif_receive_skb            [kernel]            
               41.00  1.9% process_recv                   snkudp              
               40.00  1.8% kfree                          [kernel]            
               39.00  1.8% _raw_spin_lock                 [kernel]            

92.97% of 10M packets at 750Kpps


Ok, so this is exactly what i saw with my app. non-rps is better.
To summarize: It used to be the opposite on net-next before around
Apr14. rps has gotten worse.
Eric Dumazet April 30, 2010, 8:40 p.m. UTC | #19
On Friday, April 30, 2010 at 15:30 -0400, jamal wrote:
> Eric!
> 
> I managed to mod your program to look conceptually similar to mine
> and i reproduced the results with same test kernel from yesterday. 
> So it is likely the issue is in using epoll vs not using any async as
> in your case.
> Results attached as well as modified program.
> 
> Note: the key things to remember:
> rps with this program gets worse over time and different net-next
> kernels since Apr14 (look at graph i supplied). Sorry, I am really
> busy-ed out to dig any further.
> 
> cheers,
> jamal
> 

I am lost.

I used your program, and with RPS off, I can get at most 220,000 pps
with my "old" hardware. I don't understand how you can reach 700,000 pps
with RPS off. Or is it with your Nehalem?



jamal May 1, 2010, 12:06 a.m. UTC | #20
On Fri, 2010-04-30 at 22:40 +0200, Eric Dumazet wrote:

> 
> I used your program, and with RPS off, I can get at most 220.000 pps
> with my "old" hardware. I dont understand how you can reach 700.000 pps
> with RPS off. Or is it with your Nehalem ?

Yes, Nehalem. 
RPS off is better (~700Kpps) than RPS on (~650Kpps). Are you seeing the
same trend on the old hardware?

cheers,
jamal

Eric Dumazet May 1, 2010, 5:57 a.m. UTC | #21
On Friday, April 30, 2010 at 20:06 -0400, jamal wrote:

> Yes, Nehalem. 
> RPS off is better (~700Kpps) than RPS on (~650Kpps). Are you seeing the
> same trend on the old hardware?
> 

Of course not ! Or else RPS would be useless :(

I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
overhead for each packet...)
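
As a rough illustration of why a persistent registration helps, here is a
minimal sketch using raw epoll instead of libevent, so it is not the code
used in this thread. A level-triggered epoll registration stays armed after
each wakeup, which is roughly what EV_PERSIST relies on; without it the
event has to be re-added after every callback, which on the epoll backend
can mean an epoll_ctl() call per packet. The names
drain_with_one_registration() and demo_persistent(), and the use of a
socketpair in place of a UDP socket, are assumptions made for the demo.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Register fd once (level-triggered), then receive up to `want` datagrams.
 * Returns the number of datagrams read, or -1 on setup error.
 * *ctl_calls is set to the number of epoll_ctl() calls that were made.
 */
static int drain_with_one_registration(int fd, int want, int *ctl_calls)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out;
	char buf[64];
	int got = 0;
	int epfd;

	assert(ctl_calls != NULL);
	*ctl_calls = 0;

	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}
	(*ctl_calls)++;

	while (got < want) {
		/* the registration stays armed: no epoll_ctl() between packets */
		int n = epoll_wait(epfd, &out, 1, 1000);

		if (n <= 0)
			break;
		if (recv(fd, buf, sizeof(buf), 0) > 0)
			got++;
	}
	close(epfd);
	return got;
}

/* Demo helper: queue `count` datagrams on a socketpair, then drain them. */
static int demo_persistent(int count, int *ctl_calls)
{
	int sv[2], i, got;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < count; i++)
		if (send(sv[0], "x", 1, 0) != 1)
			break;
	got = drain_with_one_registration(sv[1], i, ctl_calls);
	close(sv[0]);
	close(sv[1]);
	return got;
}
```

Calling demo_persistent(5, &calls) should drain all five datagrams with a
single epoll_ctl() call, whereas a one-shot scheme would need at least one
more call per packet.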

RPS off : 220,000 pps

RPS on (ee mask) : 700,000 pps  (with a slightly modified tg3 driver),
96% of packets delivered

This is on a tg3 adapter, and tg3 has a copybreak feature: small packets
are copied into an skb of the right size.

#define TG3_RX_COPY_THRESHOLD       256 -> 40 ...

We really should disable this feature for RPS workloads;
unfortunately ethtool cannot tweak this.
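
For readers unfamiliar with copybreak, the idea can be sketched in userspace
C as below. This is not the tg3 driver code: rx_copybreak(), struct rx_slot
and RING_BUF_SIZE are hypothetical names chosen for illustration, and the
exact comparison used by the driver is an assumption. Packets at or below
the threshold are copied into a right-sized buffer so the large ring buffer
can be reused; longer packets are handed up as-is and the ring slot gets a
fresh buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RING_BUF_SIZE 2048	/* hypothetical size of a full rx buffer */

struct rx_slot {
	unsigned char *buf;	/* RING_BUF_SIZE bytes owned by the ring */
};

/*
 * Return a buffer holding the received packet (the caller frees it), or
 * NULL on allocation failure. *copied reports which path was taken.
 */
static unsigned char *rx_copybreak(struct rx_slot *slot, size_t len,
				   size_t threshold, int *copied)
{
	unsigned char *pkt, *fresh;

	assert(len <= RING_BUF_SIZE);

	if (len <= threshold) {
		/* small packet: copy it out, keep the big buffer in the ring */
		pkt = malloc(len ? len : 1);
		if (!pkt)
			return NULL;
		memcpy(pkt, slot->buf, len);
		*copied = 1;
		return pkt;
	}

	/* large packet: pass the ring buffer up and refill the slot */
	fresh = malloc(RING_BUF_SIZE);
	if (!fresh)
		return NULL;
	pkt = slot->buf;
	slot->buf = fresh;
	*copied = 0;
	return pkt;
}
```

In this sketch a 60-byte frame takes the copy path with a threshold of 256
and skips it with a threshold of 40, which is presumably the point of the
256 -> 40 change: the per-packet copy no longer happens on the CPU that
services the NIC.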

So profile of cpu 0 (RPS ON) looks like :

------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.7% [1000Hz cycles],  (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ _______

              819.00 12.6% __alloc_skb            vmlinux
              592.00  9.1% eth_type_trans         vmlinux
              509.00  7.8% _raw_spin_lock         vmlinux
              475.00  7.3% __kmalloc_track_caller vmlinux
              358.00  5.5% tg3_read32             vmlinux
              345.00  5.3% __netdev_alloc_skb     vmlinux
              329.00  5.0% kmem_cache_alloc       vmlinux
              307.00  4.7% _raw_spin_lock_irqsave vmlinux
              284.00  4.4% bnx2_interrupt         vmlinux
              277.00  4.2% skb_pull               vmlinux
              248.00  3.8% tg3_poll_work          vmlinux
              202.00  3.1% __slab_alloc           vmlinux
              197.00  3.0% get_rps_cpu            vmlinux
              106.00  1.6% enqueue_to_backlog     vmlinux
               87.00  1.3% _raw_spin_lock_bh      vmlinux
               80.00  1.2% __copy_to_user_ll      vmlinux
               77.00  1.2% nommu_map_page         vmlinux
               77.00  1.2% __napi_gro_receive     vmlinux
               65.00  1.0% tg3_alloc_rx_skb       vmlinux
               60.00  0.9% skb_gro_reset_offset   vmlinux
               57.00  0.9% skb_put                vmlinux
               57.00  0.9% __slab_free            vmlinux


/*
 *  Usage: udpsnkfrk [ -p baseport] nbports
*/
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>
#include <event.h>

struct worker_data {
	struct event *snk_ev;
	struct event_base *base;
	struct timeval t;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long tout;
	int fd;			/* move to avoid hole on 64-bit */
	int pad1;	
	unsigned long _padd[99]; /* avoid false sharing */
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-p baseport] nbports\n");
	exit(code);
}

void process_recv(int fd, short ev, void *arg)
{
	char buffer[4096];
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct worker_data *wdata = (struct worker_data *)arg;
	int lu = 0;


	if (ev == EV_TIMEOUT) {
		wdata->tout++;
		if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
			perror("cb event_add");
			return;
		}
	} else {
		do {
			lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0,
			      (struct sockaddr *)&addr, &len);
			if (lu > 0) {
				wdata->pack_count++;
				wdata->bytes_count += lu;
			}
		} while (lu > 0);
	}
}

int prep_thread(struct worker_data *wdata)
{
	wdata->t.tv_sec = 1;
	wdata->t.tv_usec = random() % 50000L;

	wdata->base = event_init();
	event_set(wdata->snk_ev, wdata->fd, EV_READ|EV_PERSIST, process_recv, wdata);
	event_base_set(wdata->base, wdata->snk_ev);
	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("event_add");
		return -1;
	}
	return 0;
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;

	/* widen the int result before casting to avoid a size-mismatch warning */
	return (void *)(long)event_base_loop(wdata->base, 0);
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;
	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else
			usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(nbthreads, sizeof(struct worker_data));
	if (!wdata) {
		perror("calloc");
		return 1;
	}

	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		/* every worker needs its own event, even when sharing a socket */
		wdata[i].snk_ev = malloc(sizeof(struct event));
		if (!wdata[i].snk_ev)
			return 1;
		memset(wdata[i].snk_ev, 0, sizeof(struct event));

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd;
		} else {
			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				perror("socket");
				free(wdata[i].snk_ev);
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//                      addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind
			    (wdata[i].fd, (struct sockaddr *)&addr,
			     sizeof(addr)) < 0) {
				perror("bind");
				free(wdata[i].snk_ev);
				return 1;
			}
			/* non-blocking, so the recvfrom() loop ends once the queue is empty */
			fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		if (prep_thread(wdata + i)) {
			printf("failed to allocate thread %d, exit\n", i);
			exit(1);
		}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}

	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads; i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%lu pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads; i++) {
					if (wdata[i].pack_count)
						printf(" %d:%lu", i,
						       wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}



Eric Dumazet May 1, 2010, 6:14 a.m. UTC | #22
On Saturday, May 1, 2010 at 07:57 +0200, Eric Dumazet wrote:
> On Friday, April 30, 2010 at 20:06 -0400, jamal wrote:
> 
> > Yes, Nehalem. 
> > RPS off is better (~700Kpps) than RPS on (~650Kpps). Are you seeing the
> > same trend on the old hardware?
> > 
> 
> Of course not ! Or else RPS would be useless :(
> 
> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)
> 
> RPS off : 220.000 pps 
> 
> RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> 96% of delivered packets

BTW, using ee mask, cpu4 is not used at _all_, even for the user
threads. Scheduler does a bad job IMHO.

Using fe mask, I get all packets (sent at 733311pps by my pktgen
machine), and my CPU0 even has idle time !!!
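
To see which CPUs each mask selects, it helps to expand the masks bit by
bit. Below is a small sketch; the helper names mask_to_cpus() and
print_mask() are made up for this example. Bit n set means CPU n may be
given RPS work. This only covers the softirq side and says nothing about
where the scheduler places the user threads.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the CPU numbers selected by an RPS-style bitmask into out[]
 * (room for at least 8 * sizeof(mask) entries). Returns how many are set.
 */
static int mask_to_cpus(unsigned long mask, int *out)
{
	int cpu, n = 0;

	for (cpu = 0; mask; cpu++, mask >>= 1)
		if (mask & 1)
			out[n++] = cpu;
	return n;
}

/* Print a mask followed by the list of CPUs it selects. */
static void print_mask(unsigned long mask)
{
	int cpus[8 * sizeof(mask)];
	int i, n = mask_to_cpus(mask, cpus);

	assert(n <= (int)(8 * sizeof(mask)));
	printf("%lx ->", mask);
	for (i = 0; i < n; i++)
		printf(" %d", cpus[i]);
	printf("\n");
}
```

With this helper, mask ee expands to CPUs 1 2 3 5 6 7 (cpu0 and cpu4 are
left out), while mask fe expands to CPUs 1 through 7, so only cpu0 is
excluded.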

Limit seems to be around 800,000 pps

------------------------------------------------------------------------------------------------------------------------
   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _______

             3492.00  6.2% __slab_free                 vmlinux
             2334.00  4.2% _raw_spin_lock              vmlinux
             2314.00  4.1% _raw_spin_lock_irqsave      vmlinux
             1807.00  3.2% ip_rcv                      vmlinux
             1605.00  2.9% schedule                    vmlinux
             1474.00  2.6% __netif_receive_skb         vmlinux
             1464.00  2.6% kfree                       vmlinux
             1405.00  2.5% ip_route_input              vmlinux
             1318.00  2.4% __copy_to_user_ll           vmlinux
             1214.00  2.2% __alloc_skb                 vmlinux
             1160.00  2.1% nf_hook_slow                vmlinux
             1020.00  1.8% eth_type_trans              vmlinux
              860.00  1.5% sched_clock_local           vmlinux
              775.00  1.4% read_tsc                    vmlinux
              773.00  1.4% ipt_do_table                vmlinux
              766.00  1.4% _raw_spin_unlock_irqrestore vmlinux
              748.00  1.3% sock_recv_ts_and_drops      vmlinux
              747.00  1.3% ia32_sysenter_target        vmlinux
              740.00  1.3% select_nohz_load_balancer   vmlinux
              644.00  1.2% __kmalloc_track_caller      vmlinux
              596.00  1.1% tg3_read32                  vmlinux
              566.00  1.0% __udp4_lib_lookup           vmlinux




Changli Gao May 1, 2010, 10:24 a.m. UTC | #23
On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.
>
> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!
>
> Limit seems to be around 800.000 pps
>
> ------------------------------------------------------------------------------------------------------------------------
>   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------------------
>

Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2% (am I right?). If we
could do weighted packet distribution, with cpu0's weight at 1 and the
other cpus' at 2, maybe we could utilize all the cpu power.
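
To make the weighted-distribution idea concrete, here is a minimal userspace sketch. It is not how the kernel selects an RPS target; the function names and the modulo-based slot selection are assumptions made for illustration only. Each CPU owns a number of slots equal to its weight, and the flow hash picks a slot.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick a CPU index for a flow hash so that CPU i gets a share of the
 * hash space proportional to weights[i]. Returns -1 if every weight is 0.
 */
int pick_cpu_weighted(uint32_t hash, const unsigned int *weights, size_t ncpus)
{
	unsigned int total = 0, slot;
	size_t i;

	for (i = 0; i < ncpus; i++)
		total += weights[i];
	if (total == 0)
		return -1;

	slot = hash % total;
	for (i = 0; i < ncpus; i++) {
		if (slot < weights[i])
			return (int)i;
		slot -= weights[i];
	}
	return -1; /* not reached when total > 0 */
}

/* The example from the mail: 8 CPUs, cpu0 has weight 1, the others 2. */
int pick_cpu_example(uint32_t hash)
{
	static const unsigned int weights[8] = { 1, 2, 2, 2, 2, 2, 2, 2 };

	return pick_cpu_weighted(hash, weights, 8);
}
```

With these weights, cpu0 owns 1 of 15 slots and every other CPU owns 2, so cpu0 would see roughly half the flows of any other CPU, assuming the hash is evenly spread.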
Eric Dumazet May 1, 2010, 10:47 a.m. UTC | #24
On Saturday, 1 May 2010 at 18:24 +0800, Changli Gao wrote:
> On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > BTW, using ee mask, cpu4 is not used at _all_, even for the user
> > threads. Scheduler does a bad job IMHO.
> >
> > Using fe mask, I get all packets (sent at 733311pps by my pktgen
> > machine), and my CPU0 even has idle time !!!
> >
> > Limit seems to be around 800.000 pps
> >
> > ------------------------------------------------------------------------------------------------------------------------
> >   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> 
> Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2%(Am I right?). If we
> can do weighted packet distributing: cpu0's weight is 1, and other
> cpus are 2. maybe we can utilize all the cpu power.
> 

Nope, cpu0 was at 100% in this test, other cpus were about at 50% each.

Weighted distribution would be ok if I wanted to use cpu0 as one of the
'slave' cpus (RPS targets). But given the workload I am interested in,
and the need to resist DDoS, I want to keep cpu0 out of the IP/TCP/UDP
stack.


Later, inlining skb_pull() in eth_type_trans() made it possible to reach
840.000 pps.

top - 12:42:55 up  3:00,  2 users,  load average: 0.44, 0.11, 0.03
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.2%us, 16.5%sy,  0.0%ni, 46.5%id, 11.4%wa,  0.9%hi, 22.5%si,  0.0%st
Mem:   4148112k total,   211152k used,  3936960k free,    15228k buffers
Swap:  4192928k total,        0k used,  4192928k free,   121804k cached

You can see an average idle of 46%, so there are probably more
optimizations to do, to reach maybe 1.300.000 pps ;)



jamal May 1, 2010, 11:23 a.m. UTC | #25
On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:

> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)

That's a different test case then ;-> You can also get rid of the timer
(I doubt it will show much difference in results) - I have it in there
because i am trying to replicate what i saw causing the regression.

> RPS off : 220.000 pps 
> 
> RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> 96% of delivered packets
> 

That's a very very huge gap. What were the numbers before you changed to
EV_PERSIST?
Note: i did not add any of your other patches for dst refcnt, sockets
etc. Were you running with those patches in these tests? I will try the
next opportunity i get to have latest kernel + those patches. 

> This is on tg3 adapter, and tg3 has copybreak feature : small packets
> are copied into skb of the right size.

Ok, so the driver tuning is also important then (and it shows in the
profile).

cheers,
jamal

jamal May 1, 2010, 11:29 a.m. UTC | #26
On Sat, 2010-05-01 at 08:14 +0200, Eric Dumazet wrote:

> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.

I have the opposite frustration ;->
I did notice it got used. My goal was to totally avoid using it, for
simple reason it is an SMT thread that shares same core as cpu0.
In retrospect i should probably set irq affinity then to cpu0 and 4.

> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!

I will try this next time i get the chance.

cheers,
jamal

Eric Dumazet May 1, 2010, 11:42 a.m. UTC | #27
On Saturday, 1 May 2010 at 07:23 -0400, jamal wrote:
> On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:
> 
> > I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> > overhead for each packet...)
> 
> That's a different test case then ;-> You can also get rid of the timer
> (I doubt it will show much difference in results) - I have it in there
> because i am trying to replicate what i saw causing the regression.
> 
> > RPS off : 220.000 pps 
> > 
> > RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> > 96% of delivered packets
> > 
> 
> That's a very very huge gap. What were the numbers before you changed to
> EV_PERSIST?

But the whole point of epoll is to not change the interest set each time
you get an event.

Without EV_PERSIST, you need two more syscalls per recvfrom():

epoll_wait()
epoll_ctl(REMOVE)
epoll_ctl(ADD)
recvfrom()

Even poll() would be faster in your case

poll(one fd)
recvfrom()
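
As a rough sketch of the persistent case: with raw epoll, a registration made once with EPOLL_CTL_ADD stays in effect (unless EPOLLONESHOT is used), so the loop needs only epoll_wait() plus the receive call per packet. The function names below are hypothetical, and the helper uses an AF_UNIX datagram socketpair as a stand-in for a UDP socket so that the example is self-contained.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Receive up to `want` datagrams from fd, registering it with epoll
 * only once. Returns the number received, or -1 on setup failure.
 */
int recv_with_persistent_epoll(int fd, int want)
{
	struct epoll_event ev, out;
	char buf[2048];
	int got = 0;
	int epfd = epoll_create1(0);

	if (epfd < 0)
		return -1;

	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLIN;	/* level-triggered, stays registered */
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	while (got < want) {
		if (epoll_wait(epfd, &out, 1, 1000) <= 0)
			break;	/* timeout or error */
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			break;
		got++;
	}
	close(epfd);
	return got;
}

/* Helper for trying it out: queue n datagrams, then drain them. */
int demo_persistent_epoll(int n)
{
	int sv[2], i, got;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < n; i++)
		if (send(sv[0], "x", 1, MSG_DONTWAIT) != 1)
			break;
	got = recv_with_persistent_epoll(sv[1], i);
	close(sv[0]);
	close(sv[1]);
	return got;
}
```

The per-packet cost here is two syscalls (epoll_wait() and recv()), compared with four in the remove/add sequence listed above.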



> Note: i did not add any of your other patches for dst refcnt, sockets
> etc. Were you running with those patches in these tests? I will try the
> next opportunity i get to have latest kernel + those patches. 
> 
> > This is on tg3 adapter, and tg3 has copybreak feature : small packets
> > are copied into skb of the right size.
> 
> Ok, so the driver tuning is also important then (and it shows in the
> profile).

I always thought copybreak was borderline...

It can help to reduce memory footprint (allocating 128 bytes instead of
2048/4096 bytes per frame), but with RPS, it would make sense to perform
copybreak after RPS, not before.

Reducing memory footprint also means fewer changes to
udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)
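
For readers unfamiliar with copybreak, the idea can be sketched in userspace as follows. This is not the tg3 driver's code: the buffer size, the threshold value and all names are hypothetical. Short frames are copied into a right-sized buffer so the large receive buffer can be reused; longer frames hand the large buffer itself up the stack.

```c
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE     2048	/* hypothetical full-size receive buffer */
#define COPYBREAK_BYTES 128	/* hypothetical copybreak threshold */

struct rx_frame {
	unsigned char *data;	/* buffer handed to the stack */
	size_t len;		/* payload length */
	size_t alloc;		/* bytes allocated behind data */
};

/*
 * ring_buf is an RX_BUF_SIZE buffer owned by the (simulated) driver.
 * Sets *reusable to 1 if the driver keeps ring_buf, 0 if it was given
 * away. Returns 0 on success, -1 on failure.
 */
int rx_copybreak(unsigned char *ring_buf, size_t len,
		 struct rx_frame *out, int *reusable)
{
	if (len > RX_BUF_SIZE)
		return -1;

	if (len <= COPYBREAK_BYTES) {
		size_t n = len ? len : 1;
		unsigned char *small = malloc(n);

		if (!small)
			return -1;
		memcpy(small, ring_buf, len);
		out->data = small;
		out->len = len;
		out->alloc = n;
		*reusable = 1;
		return 0;
	}

	out->data = ring_buf;
	out->len = len;
	out->alloc = RX_BUF_SIZE;
	*reusable = 0;
	return 0;
}

/* Helper: how many bytes end up charged to a frame of this length? */
size_t copybreak_alloc_for(size_t len)
{
	unsigned char *ring_buf = calloc(1, RX_BUF_SIZE);
	struct rx_frame f;
	int reusable = 0;
	size_t alloc = 0;

	if (!ring_buf)
		return 0;
	if (rx_copybreak(ring_buf, len, &f, &reusable) == 0) {
		alloc = f.alloc;
		if (reusable)
			free(f.data);	/* the small copy */
	}
	free(ring_buf);	/* also covers the case f.data == ring_buf */
	return alloc;
}
```

In this sketch a 64-byte frame is charged 64 bytes instead of 2048, which is the footprint reduction described above; the price is one extra copy on the CPU that runs the driver.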



jamal May 1, 2010, 11:56 a.m. UTC | #28
On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:

> But, whole point of epoll is to not change interest each time you get an
> event.
> 
> Without EV_PERSIST, you need two more syscalls per recvfrom()
> 
> epoll_wait()
>  epoll_ctl(REMOVE)
>  epoll_ctl(ADD)
>  recvfrom()
> 
> Even poll() would be faster in your case
> 
> poll(one fd)
> recvfrom()
> 

This is true - but my goal was/is to replicate the regression i was
seeing[1]. 
I will try with PERSIST next opportunity. If it gets better
then it is something that needs documentation in the doc Tom
promised ;->

> I always thought copybreak was borderline...
> It can help to reduce memory footprint (allocating 128 bytes instead of
> 2048/4096 bytes per frame), but with RPS, it would make sense to perform
> copybreak after RPS, not before.
> 
> Reducing memory footprint also means fewer changes to
> udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)

Indeed, something that didnt cross my mind in the rush to test - it is
one of those things that need to be mentioned in some doc somewhere.
Tom, are you listening? ;->

cheers,
jamal

[1] i.e. with this program, rps has been getting worse (it was much
better before, say, net-next of Apr 14), while non-rps has been getting
better numbers since. The regression is real - but it is likely in
another subsystem.

Eric Dumazet May 1, 2010, 1:22 p.m. UTC | #29
On Saturday, 1 May 2010 at 07:56 -0400, jamal wrote:

> 
> [1]i.e with this program rps was getting worse (it was much better
> before say net-next of apr14) and that non-rps has been getting better
> numbers since. The regression is real - but it is likely in another
> subsystem.
> 

You must understand that the whole 'bench' is mostly governed by
scheduler artifacts. The regression you mention is probably a side
effect.

By slowing down one part, it's possible to zap all calls to the scheduler
and go maybe 300% faster (because consumer threads can avoid scheduling
3/4 of the time).

Conversely, optimizing one part of the network stack might make threads
hit an empty queue and call the scheduler more often.

This is why some highly specialized programs never block/schedule and
perform busy loops instead.
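
A minimal sketch of such a busy loop, assuming a datagram socket and using hypothetical function names: the receiver polls with MSG_DONTWAIT and spins on EAGAIN instead of sleeping, so it never blocks in the kernel waiting for a packet. The helper uses an AF_UNIX datagram socketpair as a stand-in for a UDP socket.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Spin on a non-blocking recv() until `want` datagrams have arrived or
 * max_spins attempts are used up. Returns the count, or -1 on error.
 */
int recv_busy_poll(int fd, int want, long max_spins)
{
	char buf[2048];
	int got = 0;

	while (got < want && max_spins-- > 0) {
		ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

		if (n >= 0)
			got++;
		else if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;	/* real error */
		/* otherwise the queue was empty: spin again, do not block */
	}
	return got;
}

/* Helper for trying it out: queue n datagrams, then spin-receive them. */
int demo_busy_poll(int n)
{
	int sv[2], i, got;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < n; i++)
		if (send(sv[0], "x", 1, MSG_DONTWAIT) != 1)
			break;
	got = recv_busy_poll(sv[1], i, 1000000L);
	close(sv[0]);
	close(sv[1]);
	return got;
}
```

The trade-off is that the spinning thread burns a full CPU even when no packets arrive, which is why this pattern tends to be reserved for dedicated cores.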



jamal May 1, 2010, 1:49 p.m. UTC | #30
On Sat, 2010-05-01 at 15:22 +0200, Eric Dumazet wrote:

> You must understand that the whole 'bench' is mostly governed by
> scheduler artifacts. The regression you mention is probably a side
> effect.

likely.

> By slowing down one part, its possible to zap all calls to scheduler and
> go maybe 300% faster (Because consumer threads can avoid 3/4 of the time
> to schedule)
> 
> Reciprocally, optimizing one part of the network stack might make
> threads hitting an empty queue, and need to call more often the
> scheduler.

It is fair to say that what i am seeing is _not_ fatal because it is rps
that is regressing; non-rps is fine. I would consider non-rps to be the
common use scenario and if that was doing badly then it is a problem.
The good news is it is getting better - likely because of some changes
made on behalf of rps ;->
With rps, one could follow some instructions on how to make it better.
I am hoping that some of the system "magic" gets documented, as Tom
mentioned he would do.

> This is why some highly specialized programs never block/schedule and
> perform busy loops instead.

Agreed. My brain cells should learn to accept this fact ;->

cheers,
jamal

jamal May 3, 2010, 8:10 p.m. UTC | #31
On Sat, 2010-05-01 at 07:56 -0400, jamal wrote:
> On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:
> 
> > But, whole point of epoll is to not change interest each time you get an
> > event.
> > 
> > Without EV_PERSIST, you need two more syscalls per recvfrom()
> > 
> > epoll_wait()
> >  epoll_ctl(REMOVE)
> >  epoll_ctl(ADD)
> >  recvfrom()
> > 
> > Even poll() would be faster in your case
> > 
> > poll(one fd)
> > recvfrom()
> > 
> 
> This is true - but my goal was/is to replicate the regression i was
> seeing[1]. 
> I will try with PERSIST next opportunity. If it gets better
> then it is something that needs documentation in the doc Tom
> promised ;->

I tried it with PERSIST and today's net-next and you are right:
rps was better than non-rps (99.4% vs 98.1% of 750Kpps).
If however i removed the PERSIST, i.e. both rps and non-rps
have two extra syscalls, again rps performed worse (93.2% vs 97.8%
of 750Kpps). Eric, I know the answer is not to do the non-PERSIST mode
for rps ;-> But lets just ignore that for a sec:
what the heck is going on? I would expect the degradation to be the same
for both rps and non-rps.
I also wanna do the broken record reminder that kernels before net-next
of Apr14 were doing about 97% (as opposed to 93% currently for same
test).

cheers,
jamal

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index cf12b1e..d361c77 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1021,6 +1021,16 @@  extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
+static inline void lock_sock_bh(struct sock *sk)
+{
+	spin_lock_bh(&sk->sk_lock.slock);
+}
+
+static inline void unlock_sock_bh(struct sock *sk)
+{
+	spin_unlock_bh(&sk->sk_lock.slock);
+}
+
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
 					  struct proto *prot);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 5574a5d..95b851f 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,9 +229,13 @@  EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
-	lock_sock(sk);
-	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	lock_sock_bh(sk);
+	skb_orphan(skb);
+	sk_mem_reclaim_partial(sk);
+	unlock_sock_bh(sk);
+
+	/* skb is now orphaned, might be freed outside of locked section */
+	consume_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 63eb56b..1f86965 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1062,10 +1062,10 @@  static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock(sk);
+		lock_sock_bh(sk);
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		release_sock(sk);
+		unlock_sock_bh(sk);
 	}
 	return res;
 }
@@ -1196,10 +1196,10 @@  out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
+	unlock_sock_bh(sk);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1624,9 +1624,9 @@  int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	udp_flush_pending_frames(sk);
-	release_sock(sk);
+	unlock_sock_bh(sk);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3ead20a..91c60f0 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -424,7 +424,7 @@  out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +433,7 @@  csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	release_sock(sk);
+	unlock_sock_bh(sk);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;