
[net-next] net: percpu net_device refcount

Message ID 1286471555.2912.291.camel@edumazet-laptop
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 7, 2010, 5:12 p.m. UTC
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
the network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers).

We can switch to a percpu refcount implementation, now that the dynamic
per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
256 bytes per device.

On x86, dev_hold(dev) code:

before:
	lock	incl 0x280(%ebx)
after:
	movl	0x260(%ebx),%eax
	incl	fs:(%eax)

Stress bench:

(Sending 160,000,000 UDP frames,
IP route cache disabled, dual E5540 @ 2.53GHz,
32-bit kernel, FIB_TRIE)

Before:

real	1m1.662s
user	0m14.373s
sys	12m55.960s

After:

real	0m51.179s
user	0m15.329s
sys	10m15.942s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    6 ++--
 net/core/dev.c            |   47 ++++++++++++++++++++++++++++++------
 2 files changed, 43 insertions(+), 10 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

stephen hemminger Oct. 7, 2010, 5:30 p.m. UTC | #1
On Thu, 07 Oct 2010 19:12:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> the network stack, using RCU conversions.
> 
> There is still an unavoidable device refcount change for every dst we
> create/destroy, and this can slow down some workloads (routers or some
> app servers).
> 
> We can switch to a percpu refcount implementation, now that the dynamic
> per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
> 256 bytes per device.
> 

It makes sense, but what about 256 cores and 1024 Vlans?
That adds up to 4M of memory, which might be noticeable.
Eric Dumazet Oct. 7, 2010, 6:06 p.m. UTC | #2
On Thursday, October 7, 2010 at 10:30 -0700, Stephen Hemminger wrote:

> It makes sense, but what about 256 cores and 1024 Vlans?
> That adds up to 4M of memory, which might be noticeable.
> 
> 

Well, 256 cores and 1024 Vlans -> 1 Mbyte of memory, not 4 ;)

This seems reasonable to me.

Eventually we could add a fallback: if the percpu allocation fails,
use a plain field in net_device.



Paul E. McKenney Oct. 8, 2010, 9:56 p.m. UTC | #3
On Thu, Oct 07, 2010 at 10:30:51AM -0700, Stephen Hemminger wrote:
> On Thu, 07 Oct 2010 19:12:35 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> > the network stack, using RCU conversions.
> > 
> > There is still an unavoidable device refcount change for every dst we
> > create/destroy, and this can slow down some workloads (routers or some
> > app servers).
> > 
> > We can switch to a percpu refcount implementation, now that the dynamic
> > per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
> > 256 bytes per device.
> 
> It makes sense, but what about 256 cores and 1024 Vlans?
> That adds up to 4M of memory, which might be noticeable.

I bet that systems that have 256 cores have >100GB of memory, at which
point 4MB is way down in the noise.

							Thanx, Paul
Eric Dumazet Oct. 9, 2010, 6:23 a.m. UTC | #4
On Friday, October 8, 2010 at 14:56 -0700, Paul E. McKenney wrote:
> On Thu, Oct 07, 2010 at 10:30:51AM -0700, Stephen Hemminger wrote:
> > On Thu, 07 Oct 2010 19:12:35 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> > > the network stack, using RCU conversions.
> > > 
> > > There is still an unavoidable device refcount change for every dst we
> > > create/destroy, and this can slow down some workloads (routers or some
> > > app servers).
> > > 
> > > We can switch to a percpu refcount implementation, now that the dynamic
> > > per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
> > > 256 bytes per device.
> > 
> > It makes sense, but what about 256 cores and 1024 Vlans?
> > That adds up to 4M of memory, which might be noticeable.
> 
> I bet that systems that have 256 cores have >100GB of memory, at which
> point 4MB is way down in the noise.

Well, first it's 1MB added, and secondly we added percpu stats for vlan
devices, and this consumed 8x more:

(struct vlan_rx_stats is 32 bytes per cpu and per vlan:
32*256*1024  ->  8 Mbytes)

Some strange machines have many cores sharing a small amount of memory,
but I am not sure they want to run many net devices ;)



Paul E. McKenney Oct. 9, 2010, 4:58 p.m. UTC | #5
On Sat, Oct 09, 2010 at 08:23:16AM +0200, Eric Dumazet wrote:
> On Friday, October 8, 2010 at 14:56 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 07, 2010 at 10:30:51AM -0700, Stephen Hemminger wrote:
> > > On Thu, 07 Oct 2010 19:12:35 +0200
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > 
> > > > We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> > > > the network stack, using RCU conversions.
> > > > 
> > > > There is still an unavoidable device refcount change for every dst we
> > > > create/destroy, and this can slow down some workloads (routers or some
> > > > app servers).
> > > > 
> > > > We can switch to a percpu refcount implementation, now that the dynamic
> > > > per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
> > > > 256 bytes per device.
> > > 
> > > It makes sense, but what about 256 cores and 1024 Vlans?
> > > That adds up to 4M of memory, which might be noticeable.
> > 
> > I bet that systems that have 256 cores have >100GB of memory, at which
> > point 4MB is way down in the noise.
> 
> Well, first it's 1MB added, and secondly we added percpu stats for vlan
> devices, and this consumed 8x more:
> 
> (struct vlan_rx_stats is 32 bytes per cpu and per vlan:
> 32*256*1024 -> 8 Mbytes)
> 
> Some strange machines have many cores sharing a small amount of memory,
> but I am not sure they want to run many net devices ;)

I do have to admit that the rapid growth rate in the data required might
well be cause for concern.  But only if it continues.  ;-)

							Thanx, Paul
David Miller Oct. 11, 2010, 7:13 p.m. UTC | #6
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Oct 2010 19:12:35 +0200

> We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> the network stack, using RCU conversions.
> 
> There is still an unavoidable device refcount change for every dst we
> create/destroy, and this can slow down some workloads (routers or some
> app servers).
> 
> We can switch to a percpu refcount implementation, now that the dynamic
> per_cpu infrastructure is mature. On a 64-cpu machine, this consumes
> 256 bytes per device.
 ...
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Ok, I'm fine with this.

But could you please get rid of that "#if 0" code block?  The comment
is fine and should stay, but the commented out code shouldn't really
stay there.

Thanks!

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6abcef6..fa1d88d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1020,7 +1020,7 @@  struct net_device {
 	struct timer_list	watchdog_timer;
 
 	/* Number of references to this device */
-	atomic_t		refcnt ____cacheline_aligned_in_smp;
+	int __percpu		*pcpu_refcnt;
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
@@ -1792,7 +1792,7 @@  extern void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	atomic_dec(&dev->refcnt);
+	irqsafe_cpu_dec(*dev->pcpu_refcnt);
 }
 
 /**
@@ -1803,7 +1803,7 @@  static inline void dev_put(struct net_device *dev)
  */
 static inline void dev_hold(struct net_device *dev)
 {
-	atomic_inc(&dev->refcnt);
+	irqsafe_cpu_inc(*dev->pcpu_refcnt);
 }
 
 /* Carrier loss detection, dial on demand. The functions netif_carrier_on
diff --git a/net/core/dev.c b/net/core/dev.c
index 7d14955..2186e1e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5195,9 +5195,6 @@  int init_dummy_netdev(struct net_device *dev)
 	 */
 	dev->reg_state = NETREG_DUMMY;
 
-	/* initialize the ref count */
-	atomic_set(&dev->refcnt, 1);
-
 	/* NAPI wants this */
 	INIT_LIST_HEAD(&dev->napi_list);
 
@@ -5205,6 +5202,19 @@  int init_dummy_netdev(struct net_device *dev)
 	set_bit(__LINK_STATE_PRESENT, &dev->state);
 	set_bit(__LINK_STATE_START, &dev->state);
 
+#if 0
+	/* We dont allocate pcpu_refcnt for dummy devices,
+	 * because users of this 'device' dont need to change
+	 * its refcount
+	 */
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
+		return -ENOMEM;
+
+	/* initialize the ref count */
+	dev_hold(dev);
+#endif
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(init_dummy_netdev);
@@ -5246,6 +5256,15 @@  out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+static int netdev_refcnt_read(const struct net_device *dev)
+{
+	int i, refcnt = 0;
+
+	for_each_possible_cpu(i)
+		refcnt += *per_cpu_ptr(dev->pcpu_refcnt, i);
+	return refcnt;
+}
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -5260,11 +5279,14 @@  EXPORT_SYMBOL(register_netdev);
 static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	int refcnt;
 
 	linkwatch_forget_dev(dev);
 
 	rebroadcast_time = warning_time = jiffies;
-	while (atomic_read(&dev->refcnt) != 0) {
+	refcnt = netdev_refcnt_read(dev);
+
+	while (refcnt != 0) {
 		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
 			rtnl_lock();
 
@@ -5291,11 +5313,13 @@  static void netdev_wait_allrefs(struct net_device *dev)
 
 		msleep(250);
 
+		refcnt = netdev_refcnt_read(dev);
+
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "
 			       "waiting for %s to become free. Usage "
 			       "count = %d\n",
-			       dev->name, atomic_read(&dev->refcnt));
+			       dev->name, refcnt);
 			warning_time = jiffies;
 		}
 	}
@@ -5353,7 +5377,7 @@  void netdev_run_todo(void)
 		netdev_wait_allrefs(dev);
 
 		/* paranoia */
-		BUG_ON(atomic_read(&dev->refcnt));
+		BUG_ON(netdev_refcnt_read(dev));
 		WARN_ON(rcu_dereference_raw(dev->ip_ptr));
 		WARN_ON(dev->ip6_ptr);
 		WARN_ON(dev->dn_ptr);
@@ -5523,9 +5547,13 @@  struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
-	if (dev_addr_init(dev))
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
 		goto free_tx;
 
+	if (dev_addr_init(dev))
+		goto free_pcpu;
+
 	dev_mc_init(dev);
 	dev_uc_init(dev);
 
@@ -5556,6 +5584,8 @@  struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 
 free_tx:
 	kfree(tx);
+free_pcpu:
+	free_percpu(dev->pcpu_refcnt);
 free_p:
 	kfree(p);
 	return NULL;
@@ -5589,6 +5619,9 @@  void free_netdev(struct net_device *dev)
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
+	free_percpu(dev->pcpu_refcnt);
+	dev->pcpu_refcnt = NULL;
+
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
 		kfree((char *)dev - dev->padded);