Patchwork [net-next-2.6] inetpeer: lower false sharing effect

Submitter Eric Dumazet
Date June 9, 2011, 6:26 a.m.
Message ID <1307600810.3980.85.camel@edumazet-laptop>
Download mbox | patch
Permalink /patch/99623/
State Accepted
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - June 9, 2011, 6:26 a.m.
Profiles show false sharing in addr_compare() because refcnt/dtime
changes dirty the first inet_peer cache line, where the keys used at
lookup time are located. If many cpus are calling inet_getpeer() and
inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
"perf top".

Before patch, my udpflood bench (16 threads) on my 2x4x2 machine:

             5784.00  9.7% csum_partial_copy_generic [kernel]
             3356.00  5.6% addr_compare              [kernel]
             2638.00  4.4% fib_table_lookup          [kernel]
             2625.00  4.4% ip_fragment               [kernel]
             1934.00  3.2% neigh_lookup              [kernel]
             1617.00  2.7% udp_sendmsg               [kernel]
             1608.00  2.7% __ip_route_output_key     [kernel]
             1480.00  2.5% __ip_append_data          [kernel]
             1396.00  2.3% kfree                     [kernel]
             1195.00  2.0% kmem_cache_free           [kernel]
             1157.00  1.9% inet_getpeer              [kernel]
             1121.00  1.9% neigh_resolve_output      [kernel]
             1012.00  1.7% dev_queue_xmit            [kernel]
# time ./udpflood.sh

real	0m44.511s
user	0m20.020s
sys	11m22.780s

# time ./udpflood.sh

real	0m44.099s
user	0m20.140s
sys	11m15.870s

After patch, no more addr_compare() in profiles:

             4171.00 10.7% csum_partial_copy_generic   [kernel]
             1787.00  4.6% fib_table_lookup            [kernel]
             1756.00  4.5% ip_fragment                 [kernel]
             1234.00  3.2% udp_sendmsg                 [kernel]
             1191.00  3.0% neigh_lookup                [kernel]
             1118.00  2.9% __ip_append_data            [kernel]
             1022.00  2.6% kfree                       [kernel]
              993.00  2.5% __ip_route_output_key       [kernel]
              841.00  2.2% neigh_resolve_output        [kernel]
              816.00  2.1% kmem_cache_free             [kernel]
              658.00  1.7% ia32_sysenter_target        [kernel]
              632.00  1.6% kmem_cache_alloc_node       [kernel]

# time ./udpflood.sh

real	0m41.587s
user	0m19.190s
sys	10m36.370s

# time ./udpflood.sh

real	0m41.486s
user	0m19.290s
sys	10m33.650s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/net/inetpeer.h |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - June 9, 2011, 6:31 a.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 09 Jun 2011 08:26:50 +0200

> Profiles show false sharing in addr_compare() because refcnt/dtime
> changes dirty the first inet_peer cache line, where are lying the keys
> used at lookup time. If many cpus are calling inet_getpeer() and
> inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> "perf top".
> 
> Before patch, my udpflood bench (16 threads) on my 2x4x2 machine :
 ..
> After patch, no more addr_compare() in profiles :
 ..
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Tim Chen <tim.c.chen@linux.intel.com>

Applied, thanks!
Tim Chen - June 10, 2011, 12:03 a.m.
On Thu, 2011-06-09 at 08:26 +0200, Eric Dumazet wrote:
> Profiles show false sharing in addr_compare() because refcnt/dtime
> changes dirty the first inet_peer cache line, where are lying the keys
> used at lookup time. If many cpus are calling inet_getpeer() and
> inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> "perf top".
> 

I've applied both inetpeer patches.  I no longer have inet_getpeer,
inet_putpeer, and addr_compare in my profile.  Instead, neighbor
lookup is now dominant.  See the profile below.

When I retest with the original 3.0-rc2 kernel, inet_putpeer no longer
shows up; I wonder if dst->peer was not set for some reason.

Tim

-     27.06%     memcached  [kernel.kallsyms]             [k] atomic_add_unless.clone.34
   - atomic_add_unless.clone.34
      - 99.97% neigh_lookup
           __neigh_lookup_errno.clone.17
           arp_bind_neighbour
           rt_intern_hash
           __ip_route_output_key
           ip_route_output_flow
           udp_sendmsg
           inet_sendmsg
           __sock_sendmsg
           sock_sendmsg
           __sys_sendmsg
           sys_sendmsg
           system_call_fastpath
           __sendmsg
-     13.33%     memcached  [kernel.kallsyms]             [k] atomic_dec_and_test
   - atomic_dec_and_test
      - 99.89% dst_destroy
         - dst_release
            - 98.12% skb_dst_drop.clone.55
                 dev_hard_start_xmit
               + sch_direct_xmit
            + 1.88% skb_release_head_state
-      3.26%     memcached  [kernel.kallsyms]             [k] do_raw_spin_lock
   - do_raw_spin_lock
      - 92.24% _raw_spin_lock
         + 41.39% sch_direct_xmit


David Miller - June 10, 2011, 3:43 a.m.
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 09 Jun 2011 17:03:55 -0700

> When I retest with original 3.0-rc2 kernel, inet_putpeer no longer shows
> up, wonder if dst->peer was not set for some reason. 

The overhead will only show up if an inetpeer entry exists for
the destination IP address.

You can force one to be created, for example, by making a TCP
connection to that destination.
Eric Dumazet - June 10, 2011, 4:31 a.m.
Le jeudi 09 juin 2011 à 17:03 -0700, Tim Chen a écrit :
> On Thu, 2011-06-09 at 08:26 +0200, Eric Dumazet wrote:
> > Profiles show false sharing in addr_compare() because refcnt/dtime
> > changes dirty the first inet_peer cache line, where are lying the keys
> > used at lookup time. If many cpus are calling inet_getpeer() and
> > inet_putpeer(), or need frag ids, addr_compare() is in 2nd position in
> > "perf top".
> > 
> 
> I've applied both inetpeer patches.  I also no longer have inet_getpeer
> and inet_putpeer and addr_compare in my profile.  Instead, neighbor
> lookup is now dominant.  See profile below.
> 
> When I retest with original 3.0-rc2 kernel, inet_putpeer no longer shows
> up, wonder if dst->peer was not set for some reason. 
> 
> Tim
> 
> -     27.06%     memcached  [kernel.kallsyms]             [k] atomic_add_unless.clone.34
>    - atomic_add_unless.clone.34
>       - 99.97% neigh_lookup
>            __neigh_lookup_errno.clone.17
>            arp_bind_neighbour
>            rt_intern_hash
>            __ip_route_output_key
>            ip_route_output_flow
>            udp_sendmsg
>            inet_sendmsg
>            __sock_sendmsg
>            sock_sendmsg
>            __sys_sendmsg
>            sys_sendmsg
>            system_call_fastpath
>            __sendmsg
> -     13.33%     memcached  [kernel.kallsyms]             [k] atomic_dec_and_test
>    - atomic_dec_and_test
>       - 99.89% dst_destroy
>          - dst_release
>             - 98.12% skb_dst_drop.clone.55
>                  dev_hard_start_xmit
>                + sch_direct_xmit
>             + 1.88% skb_release_head_state
> -      3.26%     memcached  [kernel.kallsyms]             [k] do_raw_spin_lock
>    - do_raw_spin_lock
>       - 92.24% _raw_spin_lock
>          + 41.39% sch_direct_xmit
> 
> 

Thanks Tim

I have some questions for further optimizations.

1) How many different destinations are used in your stress load?
2) Could you provide a distribution of packet lengths?
   Or maybe the average length would be OK




Eric Dumazet - June 10, 2011, 4:47 a.m.
Le jeudi 09 juin 2011 à 20:43 -0700, David Miller a écrit :
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Thu, 09 Jun 2011 17:03:55 -0700
> 
> > When I retest with original 3.0-rc2 kernel, inet_putpeer no longer shows
> > up, wonder if dst->peer was not set for some reason. 
> 
> The overhead will only show up if an inetpeer entry exists for
> the destination IP address.
> 
> You can force one to be created, for example, by making a TCP
> connection to that destination.

Or sending big frames, to force IP fragmentation (we store the id
generator in the inetpeer cache)



Tim Chen - June 10, 2011, 5:05 p.m.
On Fri, 2011-06-10 at 06:31 +0200, Eric Dumazet wrote:

> 
> Thanks Tim
> 
> I have some questions for further optimizations.
> 
> 1) How many different destinations are used in your stress load ?
> 2) Could you provide a distribution of packet lengths?
>    Or maybe the average length would be OK
> 
> 
> 

Actually I have one load generator and one server connected to each
other via a 10Gb link.

The server is a 40 core 4 socket Westmere-EX machine and the load
generator is a 12 core 2 socket Westmere-EP machine.

There are 40 memcached daemons on the server, each bound to a cpu core
and listening on a distinct UDP port.  The load generator has 40
threads, with each thread sending memcache requests to a particular UDP
port.


The load generator's memcache request packet has a UDP payload of 25
bytes.  The response packet from the daemon has a UDP payload of 13
bytes.

The UDP packets on the load generator and server are distributed across
16 Tx-Rx queues by hashing on the UDP ports (with a slight modification
of the hash flags of ixgbe).

Thanks.

Tim


Eric Dumazet - June 10, 2011, 5:17 p.m.
Le vendredi 10 juin 2011 à 10:05 -0700, Tim Chen a écrit :
> On Fri, 2011-06-10 at 06:31 +0200, Eric Dumazet wrote:
> 
> > 
> > Thanks Tim
> > 
> > I have some questions for further optimizations.
> > 
> > 1) How many different destinations are used in your stress load ?
> > 2) Could you provide a distribution of packet lengths?
> >    Or maybe the average length would be OK
> > 
> > 
> > 
> 
> Actually I have one load generator and one server connected to each
> other via a 10Gb link.
> 
> The server is a 40 core 4 socket Westmere-EX machine and the load
> generator is a 12 core 2 socket Westmere-EP machine.
> 
> There are 40 memcached daemons on the server each bound to a cpu core
> and listening on a distinctive UDP port.  The load generator has 40
> threads, with each thread sending memcache request to a particular UDP
> port.
> 
> 
> The load generator's memcache request packet has a UDP payload of 25
> bytes.  The response packet from the daemon has a UDP payload of 13
> bytes.
> 
> The UDP packets on the load generator and server are distributed across
> 16 Tx-Rx queues by hashing on the UDP ports (with a slight modification
> of the hash flags of ixgbe).

Excellent, thanks for all these details.

I had the idea some weeks ago to add a fast path to udp_sendmsg() for
small messages [doing the user->kernel copy before the RCU route
lookup]; that should fit your workload (and other typical UDP
workloads).

I'll try to cook patches in the next few days.

Thanks



Tim Chen - June 10, 2011, 10:33 p.m.
On Thu, 2011-06-09 at 20:43 -0700, David Miller wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Thu, 09 Jun 2011 17:03:55 -0700
> 
> > When I retest with original 3.0-rc2 kernel, inet_putpeer no longer shows
> > up, wonder if dst->peer was not set for some reason. 
> 
> The overhead will only show up if an inetpeer entry exists for
> the destination IP address.
> 
> You can force one to be created, for example, by making a TCP
> connection to that destination.

You're right.  By adding the TCP connection, inet peer shows up now in
my profile of the patched kernel with Eric's two patches.

Eric's patches produced much better cpu utilization.
The addr_compare (which used to consume 10% cpu) and atomic_dec_and_lock
(which used to consume 20.5% cpu) in inet_putpeer are eliminated, and
inet_putpeer uses only 10% cpu now.  Though inet_getpeer and
inet_putpeer still consume significant cpu compared to the other test
case when the peer is not present.

Tim

Profile with Eric's two patches and peer forced to be present with TCP
added looks like this:

-     19.38%     memcached  [kernel.kallsyms]             [k] inet_getpeer
   - inet_getpeer                                                
      + 99.97% inet_getpeer_v4                                                             
-     11.49%     memcached  [kernel.kallsyms]             [k] inet_putpeer
   - inet_putpeer                                                                          
      - 99.96% ipv4_dst_destroy                                  
           dst_destroy                                                                     
         + dst_release                                           
-      5.71%     memcached  [kernel.kallsyms]             [k] rt_set_nexthop.clone.30
   - rt_set_nexthop.clone.30                                     
      + 99.89% __ip_route_output_key                                                       
-      5.60%     memcached  [kernel.kallsyms]             [k] atomic_add_unless.clone.34
   - atomic_add_unless.clone.34                                                            
      + 99.94% neigh_lookup                                      
+      3.02%     memcached  [kernel.kallsyms]             [k] do_raw_spin_lock
+      2.87%     memcached  [kernel.kallsyms]             [k] atomic_dec_and_test
+      1.45%     memcached  [kernel.kallsyms]             [k] atomic_add
+      1.04%     memcached  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
+      1.03%     memcached  [kernel.kallsyms]             [k] bit_spin_lock.clone.41


Changli Gao - June 11, 2011, 12:54 a.m.
On Sat, Jun 11, 2011 at 6:33 AM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> You're right.  By adding the TCP connection, inet peer shows up now in
> my profile of the patched kernel with Eric's two patches.
>
> Eric's patches produced much better cpu utilization.
> The addr_compare (used to consume 10% cpu) and atomic_dec_and_lock (used
> to consume 20.5% cpu) in inet_putpeer is eliminated and inet_putpeer
> uses only 10% cpu now.  Though inet_getpeer and inet_putpeer still
> consumes significant cpu compared to the other test case when peer is
> not present.
>
> Tim
>
> Profile with Eric's two patches and peer forced to be present with TCP
> added looks like this:
>
> -     19.38%     memcached  [kernel.kallsyms]             [k] inet_getpeer
>   - inet_getpeer
>      + 99.97% inet_getpeer_v4
> -     11.49%     memcached  [kernel.kallsyms]             [k] inet_putpeer
>   - inet_putpeer
>      - 99.96% ipv4_dst_destroy
>           dst_destroy
>         + dst_release
> -      5.71%     memcached  [kernel.kallsyms]             [k] rt_set_nexthop.clone.30
>   - rt_set_nexthop.clone.30
>      + 99.89% __ip_route_output_key
> -      5.60%     memcached  [kernel.kallsyms]             [k] atomic_add_unless.clone.34
>   - atomic_add_unless.clone.34
>      + 99.94% neigh_lookup
> +      3.02%     memcached  [kernel.kallsyms]             [k] do_raw_spin_lock
> +      2.87%     memcached  [kernel.kallsyms]             [k] atomic_dec_and_test
> +      1.45%     memcached  [kernel.kallsyms]             [k] atomic_add
> +      1.04%     memcached  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
> +      1.03%     memcached  [kernel.kallsyms]             [k] bit_spin_lock.clone.41
>
>

Did you disable routing cache when profiling? If so, enable it and try again.
Eric Dumazet - June 11, 2011, 4:54 a.m.
Le samedi 11 juin 2011 à 08:54 +0800, Changli Gao a écrit :

> Did you disable routing cache when profiling? If so, enable it and try again.
> 

Whole point of the exercise is to prepare ground for routing cache
removal :)

If you want a server being hit by millions of clients around the world,
the routing cache is a real pain because of its memory needs.


Changli Gao - June 11, 2011, 6:17 a.m.
On Sat, Jun 11, 2011 at 12:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le samedi 11 juin 2011 à 08:54 +0800, Changli Gao a écrit :
>
>
> Whole point of the exercise is to prepare ground for routing cache
> removal :)
>
> If you want a server being hit by millions of clients around the world,
> routing cache is a real pain because of memory needs.
>

Yes. I know the routing cache removal is our goal. But for his
scenario, there aren't so many routing cache entries, so the routing
cache may be a better option currently. However, if he just wants to
evaluate the effect of the routing cache removal, that is another
thing. :)
Andi Kleen - June 11, 2011, 7:09 a.m.
On Sat, Jun 11, 2011 at 02:17:24PM +0800, Changli Gao wrote:
> On Sat, Jun 11, 2011 at 12:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le samedi 11 juin 2011 à 08:54 +0800, Changli Gao a écrit :
> >
> >
> > Whole point of the exercice is to prepare ground for routing cache
> > removal :)
> >
> > If you want a server being hit by millions of clients around the world,
> > routing cache is a real pain because of memory needs.
> >
> 
> Yes. I know the routing cache removal is our goal. But for his
> scenario, there aren't so many routing cache entries, so routing cache
> may be a better option currently. However, if he just wants to

The routing cache has terrible cache line bouncing in the reference
count too.

-andi

Patch

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 1f0966f..39d1230 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -32,12 +32,17 @@  struct inet_peer {
 	struct inet_peer __rcu	*avl_left, *avl_right;
 	struct inetpeer_addr	daddr;
 	__u32			avl_height;
-	__u32			dtime;		/* the time of last use of not
-						 * referenced entries */
-	atomic_t		refcnt;
+
+	u32			metrics[RTAX_MAX];
+	u32			rate_tokens;	/* rate limiting for ICMP */
+	unsigned long		rate_last;
+	unsigned long		pmtu_expires;
+	u32			pmtu_orig;
+	u32			pmtu_learned;
+	struct inetpeer_addr_base redirect_learned;
 	/*
 	 * Once inet_peer is queued for deletion (refcnt == -1), following fields
-	 * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp, metrics
+	 * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp
 	 * We can share memory with rcu_head to help keep inet_peer small.
 	 */
 	union {
@@ -46,17 +51,14 @@  struct inet_peer {
 			atomic_t			ip_id_count;	/* IP ID for the next packet */
 			__u32				tcp_ts;
 			__u32				tcp_ts_stamp;
-			u32				metrics[RTAX_MAX];
-			u32				rate_tokens;	/* rate limiting for ICMP */
-			unsigned long			rate_last;
-			unsigned long			pmtu_expires;
-			u32				pmtu_orig;
-			u32				pmtu_learned;
-			struct inetpeer_addr_base	redirect_learned;
 		};
 		struct rcu_head         rcu;
 		struct inet_peer	*gc_next;
 	};
+
+	/* following fields might be frequently dirtied */
+	__u32			dtime;	/* the time of last use of not referenced entries */
+	atomic_t		refcnt;
 };
 
 void			inet_initpeers(void) __init;