[RFC] NAPI as kobject proposal

Message ID 20100129101839.36944ba5@nehalam
State RFC, archived
Delegated to: David Miller

Commit Message

Stephen Hemminger Jan. 29, 2010, 6:18 p.m. UTC
The NAPI interface structure in current kernels is managed by the driver.
As part of receive packet steering there is a requirement to add an
additional parameter to this for the CPU map. And this map needs to
have an API to set it.

The right way to do this in the kernel model is to make NAPI into
a kobject and associate it back with the network device (parent).
This isn't wildly difficult but does change some of the API for
network device drivers because:
  1. They need to handle another possible error on setup
  2. NAPI object needs to be dynamically allocated
     separately (not as part of netdev_priv)
  3. Driver should pass index that can be used as part of
     name (easier than scanning)

Eventually, there will be:
  /sys/class/net/eth0/napi0/
                            weight
                            cpumap
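
To make points 1-3 concrete, a hypothetical driver setup under the
proposed API might look like this (sketch only: my_priv, my_poll and
my_setup are invented for illustration and are not part of the patch):

struct my_priv {
	struct napi_struct *napi;	/* 2: a pointer now, not embedded */
};

static int my_poll(struct napi_struct *napi, int budget)
{
	/* ... driver receive processing ... */
	return 0;
}

static int my_setup(struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);
	int err;

	/* 2: allocated separately; freed by the kobject release hook */
	priv->napi = kzalloc(sizeof(*priv->napi), GFP_KERNEL);
	if (!priv->napi)
		return -ENOMEM;

	/* 3: index 0 gives the sysfs name "napi0" */
	err = netif_napi_init(dev, priv->napi, 0, my_poll, 64);
	if (err) {
		/* 1: new failure mode on setup.  After a failed
		 * kobject_init_and_add() the last reference must be
		 * dropped with kobject_put(), which frees the napi
		 * through the kobject's ->release. */
		kobject_put(&priv->napi->kobj);
		return err;
	}
	return 0;
}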


So here is a starting point patch that shows what the API might look like.


---
 include/linux/netdevice.h |   20 ++++++++++++++------
 net/core/dev.c            |   28 ++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 8 deletions(-)






Comments

Ben Hutchings Feb. 3, 2010, 9:23 p.m. UTC | #1
On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> The NAPI interface structure in current kernels is managed by the driver.
> As part of receive packet steering there is a requirement to add an
> additional parameter to this for the CPU map. And this map needs to
> have an API to set it.
> 
> The right way to do this in the kernel model is to make NAPI into
> a kobject and associate it back with the network device (parent).
> This isn't wildly difficult but does change some of the API for
> network device drivers because:
>   1. They need to handle another possible error on setup
>   2. NAPI object needs to be dynamically allocated
>      separately (not as part of netdev_priv)
>   3. Driver should pass index that can be used as part of
>      name (easier than scanning)
> 
> Eventually, there will be:
>   /sys/class/net/eth0/napi0/
>                             weight
>                             cpumap

I think the NAPI objects should be created as children of a bus device
and then linked from the net device directories, e.g.

/sys/devices/.../ net/eth0/
                           napi0 -> ../../napi/napi0
                      eth1/
                           napi0 -> ../../napi/napi0
                  napi/napi0/
                             weight
                             cpumap

I also think it might be helpful to use a more descriptive name here
than 'napi', e.g. 'channel'.  But maybe that can be left to management
tools.

> So here is a starting point patch that shows what the API might look like.
> 
> 
> ---
>  include/linux/netdevice.h |   20 ++++++++++++++------
>  net/core/dev.c            |   28 ++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 8 deletions(-)
> 
> --- a/include/linux/netdevice.h	2010-01-29 10:00:55.820739116 -0800
> +++ b/include/linux/netdevice.h	2010-01-29 10:15:33.098863437 -0800
> @@ -378,6 +378,8 @@ struct napi_struct {
>  	struct list_head	dev_list;
>  	struct sk_buff		*gro_list;
>  	struct sk_buff		*skb;
> +
> +	struct kobject		kobj;
>  };
>  
>  enum {
> @@ -1037,25 +1039,31 @@ static inline void *netdev_priv(const st
>  #define SET_NETDEV_DEVTYPE(net, devtype)	((net)->dev.type = (devtype))
>  
>  /**
> - *	netif_napi_add - initialize a napi context
> + *	netif_napi_init - initialize a napi context
>   *	@dev:  network device
>   *	@napi: napi context
> + *	@index: queue number
>   *	@poll: polling function
>   *	@weight: default weight
>   *
> - * netif_napi_add() must be used to initialize a napi context prior to calling
> + * netif_napi_init() must be used to create a napi context prior to calling
>   * *any* of the other napi related functions.
> + *
> + * In case of error, the context is left in napi_list so it can
> + * be cleaned up by free_netdev, but it is not valid for use.
>   */
> -void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> -		    int (*poll)(struct napi_struct *, int), int weight);
> +extern int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
> +			   unsigned index,
> +			   int (*poll)(struct napi_struct *, int), int weight);
[...]
> --- a/net/core/dev.c	2010-01-29 10:00:55.810739850 -0800
> +++ b/net/core/dev.c	2010-01-29 10:14:53.388864572 -0800
> @@ -2926,9 +2926,24 @@ void napi_complete(struct napi_struct *n
>  }
>  EXPORT_SYMBOL(napi_complete);
>  
> -void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> +static void release_napi(struct kobject *kobj)
> +{
> +	struct napi_struct *napi
> +		= container_of(kobj, struct napi_struct, kobj);
> +	kfree(napi);
> +}
[...]

This means the napi_struct can no longer be embedded in a larger struct.
So I think netif_napi_init() should actually allocate the napi_struct
and be named netif_napi_create().
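
A minimal sketch of that helper, layered on the patch's
netif_napi_init() (hypothetical, not part of the patch):

struct napi_struct *netif_napi_create(struct net_device *dev,
				      unsigned int index,
				      int (*poll)(struct napi_struct *, int),
				      int weight)
{
	struct napi_struct *napi = kzalloc(sizeof(*napi), GFP_KERNEL);

	if (!napi)
		return NULL;

	if (netif_napi_init(dev, napi, index, poll, weight)) {
		/* once kobject_init() has run, the memory belongs to
		 * the kobject; release_napi() does the kfree() */
		kobject_put(&napi->kobj);
		return NULL;
	}
	return napi;
}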

Ben.
Al Viro Feb. 3, 2010, 9:26 p.m. UTC | #2
On Wed, Feb 03, 2010 at 09:23:36PM +0000, Ben Hutchings wrote:
> On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> > The NAPI interface structure in current kernels is managed by the driver.
> > As part of receive packet steering there is a requirement to add an
> > additional parameter to this for the CPU map. And this map needs to
> > have an API to set it.
> > 
> > The right way to do this in the kernel model is to make NAPI into
> > a kobject and associate it back with the network device (parent).
> > This isn't wildly difficult but does change some of the API for
> > network device drivers because:
> >   1. They need to handle another possible error on setup
> >   2. NAPI object needs to be dynamically allocated
> >      separately (not as part of netdev_priv)
> >   3. Driver should pass index that can be used as part of
> >      name (easier than scanning)

4. Lifetime rules become oh-so-fscking-interesting?
David Daney Feb. 3, 2010, 9:38 p.m. UTC | #3
Ben Hutchings wrote:
> On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
>> The NAPI interface structure in current kernels is managed by the driver.
>> As part of receive packet steering there is a requirement to add an
>> additional parameter to this for the CPU map. And this map needs to
>> have an API to set it.
>>
>> The right way to do this in the kernel model is to make NAPI into
>> a kobject and associate it back with the network device (parent).
>> This isn't wildly difficult but does change some of the API for
>> network device drivers because:
>>   1. They need to handle another possible error on setup
>>   2. NAPI object needs to be dynamically allocated
>>      separately (not as part of netdev_priv)
>>   3. Driver should pass index that can be used as part of
>>      name (easier than scanning)
>>
>> Eventually, there will be:
>>   /sys/class/net/eth0/napi0/
>>                             weight
>>                             cpumap
> 
> I think the NAPI objects should be created as children of a bus device
> and then linked from the net device directories, e.g.
> 
> /sys/devices/.../ net/eth0/
>                            napi0 -> ../../napi/napi0
>                       eth1/
>                            napi0 -> ../../napi/napi0
>                   napi/napi0/
>                              weight
>                              cpumap
> 

This seems right.

Some drivers have a single napi object logically associated with 
multiple interfaces.  Although the current architecture forces us to 
associate the napi object with a single net_device, it would be nice to 
move towards something that allows sharing a napi object between 
multiple devices.

David Daney
Stephen Hemminger Feb. 3, 2010, 9:41 p.m. UTC | #4
On Wed, 3 Feb 2010 21:26:41 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Feb 03, 2010 at 09:23:36PM +0000, Ben Hutchings wrote:
> > On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> > > The NAPI interface structure in current kernels is managed by the driver.
> > > As part of receive packet steering there is a requirement to add an
> > > additional parameter to this for the CPU map. And this map needs to
> > > have an API to set it.
> > > 
> > > The right way to do this in the kernel model is to make NAPI into
> > > a kobject and associate it back with the network device (parent).
> > > This isn't wildly difficult but does change some of the API for
> > > network device drivers because:
> > >   1. They need to handle another possible error on setup
> > >   2. NAPI object needs to be dynamically allocated
> > >      separately (not as part of netdev_priv)
> > >   3. Driver should pass index that can be used as part of
> > >      name (easier than scanning)
> 
> 4. Lifetime rules become oh-so-fscking-interesting?

My original proposal doesn't go far enough. I am doing a bigger
version that changes the API to:
   napi_alloc / napi_release

The lifetime problem isn't that bad because the napi is a child
of the network device object.  Making it a child of a bus object is
a bigger problem because then the network device can come and go.
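
Guessed signatures for that shape of API (the bigger version was not
posted in this thread, so the following is only a sketch):

/* hypothetical: allocate the napi, register its kobject under the
 * device and add it to dev->napi_list */
extern struct napi_struct *napi_alloc(struct net_device *dev,
				      unsigned int index,
				      int (*poll)(struct napi_struct *, int),
				      int weight);

/* hypothetical: unlink and drop the kobject reference; the memory
 * goes away with the last reference */
extern void napi_release(struct napi_struct *napi);
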
David Miller Feb. 4, 2010, 1:33 a.m. UTC | #5
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 29 Jan 2010 10:18:39 -0800

> As part of receive packet steering there is a requirement to add an
> additional parameter to this for the CPU map.

Hmmm, where did this come from?

The RPS maps are per-device.

I think I vaguely recall you "suggesting" that the RPS maps become
per-NAPI.

But, firstly, I didn't see any movement in that part of the
discussion.

And, secondly, I don't think this makes any sense at all.

Things are already overly complicated as it is.  Having the user know
what traffic goes to a particular RX queue (i.e. NAPI instance) and set
the RPS map in some way specific to that RX queue is over the top.

If the issue is the case of sharing a NAPI instance between two
devices, there are a few other ways to deal with this.

One I would suggest is to simply clone the RPS map amongst the
devices sharing a NAPI instance.
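
Purely as illustration, that cloning could be a one-liner, assuming a
per-device RPS cpumask (the field name rps_cpus is invented here):

/* keep the RPS maps of devices sharing a NAPI instance identical;
 * "rps_cpus" is an invented per-device field, for illustration only */
static void rps_clone_map(const struct net_device *from,
			  struct net_device *to)
{
	cpumask_copy(to->rps_cpus, from->rps_cpus);
}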

I currently see NAPI kobjects as just an over-abstraction for a
perceived need rather than a real one.
Stephen Hemminger Feb. 4, 2010, 1:58 a.m. UTC | #6
On Wed, 03 Feb 2010 17:33:05 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Fri, 29 Jan 2010 10:18:39 -0800
> 
> > As part of receive packet steering there is a requirement to add an
> > additional parameter to this for the CPU map.
> 
> Hmmm, where did this come from?
> 
> The RPS maps are per-device.
> 
> I think I vaguely recall you "suggesting" that the RPS maps become
> per-NAPI.
> 
> But, firstly, I didn't see any movement in that part of the
> discussion.
> 
> And, secondly, I don't think this makes any sense at all.
> 
> Things are already overly complicated as it is.  Having the user know
> what traffic goes to a particular RX queue (i.e. NAPI instance) and set
> the RPS map in some way specific to that RX queue is over the top.
> 
> If the issue is the case of sharing a NAPI instance between two
> devices, there are a few other ways to deal with this.
> 
> One I would suggest is to simply clone the RPS map amongst the
> devices sharing a NAPI instance.
> 
> I currently see NAPI kobjects as just an over-abstraction for a
> perceived need rather than a real one.

It started with doing RPS, and not wanting to implement the proposed
sysfs interface (anything doing get_token is misuse of sysfs).

The usage model I see is wanting to have:
  1. only some cores being used for receive traffic
     on single Rx devices (NAPI)
  2. only some cores being used for receive traffic
     on legacy devices (non-NAPI)
  3. being able to configure a set of cpus with the same
     IRQ/cache when doing Rx multi-queue.  Assign an MSI-X
     IRQ per core and allow both hyperthreads on the core
     to split that RX traffic.

All this should be manageable by some user utility like irqbalance.

#1 and #2 argue for a per device map (like irq_affinity) but
#3 is harder; I'm not sure of the right API for that.
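
For #3, one building block might be the topology masks the kernel
already exposes; a sketch, assuming the queue's MSI-X interrupt has
been bound to a known CPU (topology_thread_cpumask() returns that
CPU's hyperthread siblings):

#include <linux/topology.h>

/* sketch for case 3: receive set = the CPU handling the queue's
 * MSI-X interrupt plus its hyperthread sibling(s) */
static void rx_queue_rps_mask(int irq_cpu, struct cpumask *mask)
{
	cpumask_copy(mask, topology_thread_cpumask(irq_cpu));
}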

David Miller Feb. 4, 2010, 2:17 a.m. UTC | #7
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 3 Feb 2010 17:58:46 -0800

> The usage model I see is wanting to have:
>   1. only some cores being used for receive traffic
>      on single Rx devices (NAPI)
>   2. only some cores being used for receive traffic
>      on legacy devices (non-NAPI)
> >   3. being able to configure a set of cpus with the same
> >      IRQ/cache when doing Rx multi-queue.  Assign an MSI-X
> >      IRQ per core and allow both hyperthreads on the core
> >      to split that RX traffic.
> 
> All this should be manageable by some user utility like irqbalance.
> 
> #1 and #2 argue for a per device map (like irq_affinity) but
> #3 is harder; I'm not sure of the right API for that.

Do you think the people setting RPS maps aren't capable of
configuring IRQ affinities? :-)
Patch

--- a/include/linux/netdevice.h	2010-01-29 10:00:55.820739116 -0800
+++ b/include/linux/netdevice.h	2010-01-29 10:15:33.098863437 -0800
@@ -378,6 +378,8 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct sk_buff		*gro_list;
 	struct sk_buff		*skb;
+
+	struct kobject		kobj;
 };
 
 enum {
@@ -1037,25 +1039,31 @@ static inline void *netdev_priv(const st
 #define SET_NETDEV_DEVTYPE(net, devtype)	((net)->dev.type = (devtype))
 
 /**
- *	netif_napi_add - initialize a napi context
+ *	netif_napi_init - initialize a napi context
  *	@dev:  network device
  *	@napi: napi context
+ *	@index: queue number
  *	@poll: polling function
  *	@weight: default weight
  *
- * netif_napi_add() must be used to initialize a napi context prior to calling
+ * netif_napi_init() must be used to create a napi context prior to calling
  * *any* of the other napi related functions.
+ *
+ * In case of error, the context is left in napi_list so it can
+ * be cleaned up by free_netdev, but it is not valid for use.
  */
-void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
-		    int (*poll)(struct napi_struct *, int), int weight);
+extern int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
+			   unsigned index,
+			   int (*poll)(struct napi_struct *, int), int weight);
 
 /**
- *  netif_napi_del - remove a napi context
+ *  netif_napi_del - free a napi context
  *  @napi: napi context
  *
  *  netif_napi_del() removes a napi context from the network device napi list
+ *                   and frees it.
  */
-void netif_napi_del(struct napi_struct *napi);
+extern void netif_napi_del(struct napi_struct *napi);
 
 struct napi_gro_cb {
 	/* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
--- a/net/core/dev.c	2010-01-29 10:00:55.810739850 -0800
+++ b/net/core/dev.c	2010-01-29 10:14:53.388864572 -0800
@@ -2926,9 +2926,24 @@ void napi_complete(struct napi_struct *n
 }
 EXPORT_SYMBOL(napi_complete);
 
-void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+static void release_napi(struct kobject *kobj)
+{
+	struct napi_struct *napi
+		= container_of(kobj, struct napi_struct, kobj);
+	kfree(napi);
+}
+
+static struct kobj_type napi_ktype = {
+	/* insert future sysfs hooks ... */
+	.release = release_napi,
+};
+
+int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
+		    unsigned index,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
+	int err;
+
 	INIT_LIST_HEAD(&napi->poll_list);
 	napi->gro_count = 0;
 	napi->gro_list = NULL;
@@ -2941,9 +2956,16 @@ void netif_napi_add(struct net_device *d
 	spin_lock_init(&napi->poll_lock);
 	napi->poll_owner = -1;
 #endif
+
+	err = kobject_init_and_add(&napi->kobj, &napi_ktype,
+				   &dev->dev.kobj, "napi%d", index);
+	if (err)
+		return err;
+
 	set_bit(NAPI_STATE_SCHED, &napi->state);
+	return 0;
 }
-EXPORT_SYMBOL(netif_napi_add);
+EXPORT_SYMBOL(netif_napi_init);
 
 void netif_napi_del(struct napi_struct *napi)
 {
@@ -2960,6 +2982,8 @@ void netif_napi_del(struct napi_struct *
 
 	napi->gro_list = NULL;
 	napi->gro_count = 0;
+
+	kobject_put(&napi->kobj);
 }
 EXPORT_SYMBOL(netif_napi_del);
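
For context, the "future sysfs hooks" placeholder in napi_ktype might
be filled in along these lines: a hypothetical sketch of a read-only
"weight" attribute, modeled on device_attribute (not part of the
patch; the store side and the "cpumap" attribute are omitted):

/* per-kobject attribute wrapper, analogous to device_attribute */
struct napi_attribute {
	struct attribute attr;
	ssize_t (*show)(struct napi_struct *napi, char *buf);
	ssize_t (*store)(struct napi_struct *napi, const char *buf,
			 size_t count);
};

static ssize_t weight_show(struct napi_struct *napi, char *buf)
{
	return sprintf(buf, "%d\n", napi->weight);
}

static struct napi_attribute napi_attr_weight =
	__ATTR(weight, S_IRUGO, weight_show, NULL);

static struct attribute *napi_default_attrs[] = {
	&napi_attr_weight.attr,
	NULL,
};

/* dispatch from the generic sysfs show to the typed one */
static ssize_t napi_attr_show(struct kobject *kobj,
			      struct attribute *attr, char *buf)
{
	struct napi_struct *napi =
		container_of(kobj, struct napi_struct, kobj);
	struct napi_attribute *na =
		container_of(attr, struct napi_attribute, attr);

	return na->show ? na->show(napi, buf) : -EIO;
}

static struct sysfs_ops napi_sysfs_ops = {
	.show	= napi_attr_show,
};

/* wired into the ktype as:
 *	.sysfs_ops	= &napi_sysfs_ops,
 *	.default_attrs	= napi_default_attrs,
 */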