Message ID | 1307600810.3980.85.camel@edumazet-laptop
---|---
State | Accepted, archived
Delegated to | David Miller
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 09 Jun 2011 08:26:50 +0200

> Profiles show false sharing in addr_compare() because refcnt/dtime
> changes dirty the first inet_peer cache line, where the keys used at
> lookup time lie. If many cpus are calling inet_getpeer() and
> inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> "perf top".
>
> Before patch, my udpflood bench (16 threads) on my 2x4x2 machine : ..
>
> After patch, no more addr_compare() in profiles : ..
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Tim Chen <tim.c.chen@linux.intel.com>

Applied, thanks!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-06-09 at 08:26 +0200, Eric Dumazet wrote:
> Profiles show false sharing in addr_compare() because refcnt/dtime
> changes dirty the first inet_peer cache line, where the keys used at
> lookup time lie. If many cpus are calling inet_getpeer() and
> inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> "perf top".

I've applied both inetpeer patches. I also no longer have inet_getpeer,
inet_putpeer, and addr_compare in my profile. Instead, neighbor lookup
is now dominant. See profile below.

When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
shows up; I wonder if dst->peer was not set for some reason.

Tim

- 27.06% memcached [kernel.kallsyms] [k] atomic_add_unless.clone.34
   - atomic_add_unless.clone.34
      - 99.97% neigh_lookup
           __neigh_lookup_errno.clone.17
           arp_bind_neighbour
           rt_intern_hash
           __ip_route_output_key
           ip_route_output_flow
           udp_sendmsg
           inet_sendmsg
           __sock_sendmsg
           sock_sendmsg
           __sys_sendmsg
           sys_sendmsg
           system_call_fastpath
           __sendmsg
- 13.33% memcached [kernel.kallsyms] [k] atomic_dec_and_test
   - atomic_dec_and_test
      - 99.89% dst_destroy
         - dst_release
            - 98.12% skb_dst_drop.clone.55
                 dev_hard_start_xmit
               + sch_direct_xmit
            + 1.88% skb_release_head_state
- 3.26% memcached [kernel.kallsyms] [k] do_raw_spin_lock
   - do_raw_spin_lock
      - 92.24% _raw_spin_lock
         + 41.39% sch_direct_xmit
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 09 Jun 2011 17:03:55 -0700

> When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
> shows up; I wonder if dst->peer was not set for some reason.

The overhead will only show up if an inetpeer entry exists for the
destination IP address.

You can force one to be created, for example, by making a TCP
connection to that destination.
On Thu, 09 Jun 2011 at 17:03 -0700, Tim Chen wrote:
> On Thu, 2011-06-09 at 08:26 +0200, Eric Dumazet wrote:
> > Profiles show false sharing in addr_compare() because refcnt/dtime
> > changes dirty the first inet_peer cache line, where the keys used at
> > lookup time lie. If many cpus are calling inet_getpeer() and
> > inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> > "perf top".
>
> I've applied both inetpeer patches. I also no longer have inet_getpeer,
> inet_putpeer, and addr_compare in my profile. Instead, neighbor lookup
> is now dominant. See profile below.
>
> When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
> shows up; I wonder if dst->peer was not set for some reason.
>
> Tim
>
> [profile snipped]

Thanks Tim

I have some questions for further optimizations.

1) How many different destinations are used in your stress load?
2) Could you provide a distribution of packet lengths? Or maybe the
   average length would be OK.
On Thu, 09 Jun 2011 at 20:43 -0700, David Miller wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Thu, 09 Jun 2011 17:03:55 -0700
>
> > When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
> > shows up; I wonder if dst->peer was not set for some reason.
>
> The overhead will only show up if an inetpeer entry exists for
> the destination IP address.
>
> You can force one to be created, for example, by making a TCP
> connection to that destination.

Or by sending big frames, to force IP fragmentation (we store the id
generator in the inetpeer cache).
On Fri, 2011-06-10 at 06:31 +0200, Eric Dumazet wrote:
> Thanks Tim
>
> I have some questions for further optimizations.
>
> 1) How many different destinations are used in your stress load?
> 2) Could you provide a distribution of packet lengths? Or maybe the
>    average length would be OK.

Actually I have one load generator and one server connected to each
other via a 10Gb link.

The server is a 40 core 4 socket Westmere-EX machine and the load
generator is a 12 core 2 socket Westmere-EP machine.

There are 40 memcached daemons on the server, each bound to a cpu core
and listening on a distinct UDP port. The load generator has 40
threads, with each thread sending memcache requests to a particular UDP
port.

The load generator's memcache request packet has a UDP payload of 25
bytes. The response packet from the daemon has a UDP payload of 13
bytes.

The UDP packets on the load generator and server are distributed across
16 Tx-Rx queues by hashing on the UDP ports (with a slight modification
of the hash flags of ixgbe).

Thanks.

Tim
On Fri, 10 Jun 2011 at 10:05 -0700, Tim Chen wrote:
> On Fri, 2011-06-10 at 06:31 +0200, Eric Dumazet wrote:
> > Thanks Tim
> >
> > I have some questions for further optimizations.
> >
> > 1) How many different destinations are used in your stress load?
> > 2) Could you provide a distribution of packet lengths? Or maybe the
> >    average length would be OK.
>
> Actually I have one load generator and one server connected to each
> other via a 10Gb link.
>
> The server is a 40 core 4 socket Westmere-EX machine and the load
> generator is a 12 core 2 socket Westmere-EP machine.
>
> There are 40 memcached daemons on the server, each bound to a cpu core
> and listening on a distinct UDP port. The load generator has 40
> threads, with each thread sending memcache requests to a particular
> UDP port.
>
> The load generator's memcache request packet has a UDP payload of 25
> bytes. The response packet from the daemon has a UDP payload of 13
> bytes.
>
> The UDP packets on the load generator and server are distributed
> across 16 Tx-Rx queues by hashing on the UDP ports (with a slight
> modification of the hash flags of ixgbe).

Excellent, thanks for all these details.

I had the idea some weeks ago to add a fast path to udp_sendmsg() for
small messages [doing the user->kernel copy before the RCU route
lookup], which should fit your workload (and other typical UDP
workloads).

I'll try to cook up patches in the next few days.

Thanks
On Thu, 2011-06-09 at 20:43 -0700, David Miller wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Thu, 09 Jun 2011 17:03:55 -0700
>
> > When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
> > shows up; I wonder if dst->peer was not set for some reason.
>
> The overhead will only show up if an inetpeer entry exists for
> the destination IP address.
>
> You can force one to be created, for example, by making a TCP
> connection to that destination.

You're right. With the TCP connection added, inet peer now shows up in
my profile of the kernel patched with Eric's two patches.

Eric's patches produced much better cpu utilization. The addr_compare
(which used to consume 10% cpu) and atomic_dec_and_lock (which used to
consume 20.5% cpu) in inet_putpeer are eliminated, and inet_putpeer
uses only 10% cpu now. Though inet_getpeer and inet_putpeer still
consume significant cpu compared to the other test case, where the peer
is not present.

Tim

Profile with Eric's two patches and the peer forced to be present via
TCP looks like this:

- 19.38% memcached [kernel.kallsyms] [k] inet_getpeer
   - inet_getpeer
      + 99.97% inet_getpeer_v4
- 11.49% memcached [kernel.kallsyms] [k] inet_putpeer
   - inet_putpeer
      - 99.96% ipv4_dst_destroy
           dst_destroy
         + dst_release
- 5.71% memcached [kernel.kallsyms] [k] rt_set_nexthop.clone.30
   - rt_set_nexthop.clone.30
      + 99.89% __ip_route_output_key
- 5.60% memcached [kernel.kallsyms] [k] atomic_add_unless.clone.34
   - atomic_add_unless.clone.34
      + 99.94% neigh_lookup
+ 3.02% memcached [kernel.kallsyms] [k] do_raw_spin_lock
+ 2.87% memcached [kernel.kallsyms] [k] atomic_dec_and_test
+ 1.45% memcached [kernel.kallsyms] [k] atomic_add
+ 1.04% memcached [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 1.03% memcached [kernel.kallsyms] [k] bit_spin_lock.clone.41
On Sat, Jun 11, 2011 at 6:33 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> You're right. With the TCP connection added, inet peer now shows up in
> my profile of the kernel patched with Eric's two patches.
>
> Eric's patches produced much better cpu utilization. The addr_compare
> (which used to consume 10% cpu) and atomic_dec_and_lock (which used to
> consume 20.5% cpu) in inet_putpeer are eliminated, and inet_putpeer
> uses only 10% cpu now. Though inet_getpeer and inet_putpeer still
> consume significant cpu compared to the other test case, where the peer
> is not present.
>
> Tim
>
> [profile snipped]

Did you disable the routing cache when profiling? If so, enable it and
try again.
On Sat, 11 Jun 2011 at 08:54 +0800, Changli Gao wrote:
> Did you disable the routing cache when profiling? If so, enable it and
> try again.

The whole point of the exercise is to prepare the ground for routing
cache removal :)

If you want a server being hit by millions of clients around the world,
the routing cache is a real pain because of its memory needs.
On Sat, Jun 11, 2011 at 12:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 11 Jun 2011 at 08:54 +0800, Changli Gao wrote:
>
> The whole point of the exercise is to prepare the ground for routing
> cache removal :)
>
> If you want a server being hit by millions of clients around the
> world, the routing cache is a real pain because of its memory needs.

Yes, I know routing cache removal is our goal. But for his scenario
there aren't that many routing cache entries, so the routing cache may
be a better option currently. However, if he just wants to evaluate the
effect of routing cache removal, that is another thing. :)
On Sat, Jun 11, 2011 at 02:17:24PM +0800, Changli Gao wrote:
> On Sat, Jun 11, 2011 at 12:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > The whole point of the exercise is to prepare the ground for routing
> > cache removal :)
> >
> > If you want a server being hit by millions of clients around the
> > world, the routing cache is a real pain because of its memory needs.
>
> Yes, I know routing cache removal is our goal. But for his scenario
> there aren't that many routing cache entries, so the routing cache may
> be a better option currently. However, if he just wants to evaluate
> the effect of routing cache removal, that is another thing. :)

The routing cache has terrible cache line bouncing in the reference
count too.

-andi
diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 1f0966f..39d1230 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -32,12 +32,17 @@ struct inet_peer {
 	struct inet_peer __rcu	*avl_left, *avl_right;
 	struct inetpeer_addr	daddr;
 	__u32			avl_height;
-	__u32			dtime;	/* the time of last use of not
-					 * referenced entries */
-	atomic_t		refcnt;
+
+	u32			metrics[RTAX_MAX];
+	u32			rate_tokens;	/* rate limiting for ICMP */
+	unsigned long		rate_last;
+	unsigned long		pmtu_expires;
+	u32			pmtu_orig;
+	u32			pmtu_learned;
+	struct inetpeer_addr_base redirect_learned;
 	/*
 	 * Once inet_peer is queued for deletion (refcnt == -1), following fields
-	 * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp, metrics
+	 * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp
 	 * We can share memory with rcu_head to help keep inet_peer small.
 	 */
 	union {
@@ -46,17 +51,14 @@ struct inet_peer {
 			atomic_t	ip_id_count;	/* IP ID for the next packet */
 			__u32		tcp_ts;
 			__u32		tcp_ts_stamp;
-			u32		metrics[RTAX_MAX];
-			u32		rate_tokens;	/* rate limiting for ICMP */
-			unsigned long	rate_last;
-			unsigned long	pmtu_expires;
-			u32		pmtu_orig;
-			u32		pmtu_learned;
-			struct inetpeer_addr_base redirect_learned;
 		};
 		struct rcu_head		rcu;
 		struct inet_peer	*gc_next;
 	};
+
+	/* following fields might be frequently dirtied */
+	__u32			dtime;	/* the time of last use of not referenced entries */
+	atomic_t		refcnt;
 };

 void	inet_initpeers(void) __init;
Profiles show false sharing in addr_compare() because refcnt/dtime
changes dirty the first inet_peer cache line, where the keys used at
lookup time lie. If many cpus are calling inet_getpeer() and
inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
"perf top".

Before patch, my udpflood bench (16 threads) on my 2x4x2 machine :

5784.00  9.7% csum_partial_copy_generic [kernel]
3356.00  5.6% addr_compare              [kernel]
2638.00  4.4% fib_table_lookup          [kernel]
2625.00  4.4% ip_fragment               [kernel]
1934.00  3.2% neigh_lookup              [kernel]
1617.00  2.7% udp_sendmsg               [kernel]
1608.00  2.7% __ip_route_output_key     [kernel]
1480.00  2.5% __ip_append_data          [kernel]
1396.00  2.3% kfree                     [kernel]
1195.00  2.0% kmem_cache_free           [kernel]
1157.00  1.9% inet_getpeer              [kernel]
1121.00  1.9% neigh_resolve_output      [kernel]
1012.00  1.7% dev_queue_xmit            [kernel]

# time ./udpflood.sh

real	0m44.511s
user	0m20.020s
sys	11m22.780s

# time ./udpflood.sh

real	0m44.099s
user	0m20.140s
sys	11m15.870s

After patch, no more addr_compare() in profiles :

4171.00 10.7% csum_partial_copy_generic [kernel]
1787.00  4.6% fib_table_lookup          [kernel]
1756.00  4.5% ip_fragment               [kernel]
1234.00  3.2% udp_sendmsg               [kernel]
1191.00  3.0% neigh_lookup              [kernel]
1118.00  2.9% __ip_append_data          [kernel]
1022.00  2.6% kfree                     [kernel]
 993.00  2.5% __ip_route_output_key     [kernel]
 841.00  2.2% neigh_resolve_output      [kernel]
 816.00  2.1% kmem_cache_free           [kernel]
 658.00  1.7% ia32_sysenter_target      [kernel]
 632.00  1.6% kmem_cache_alloc_node     [kernel]

# time ./udpflood.sh

real	0m41.587s
user	0m19.190s
sys	10m36.370s

# time ./udpflood.sh

real	0m41.486s
user	0m19.290s
sys	10m33.650s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/net/inetpeer.h |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)