Message ID: 20100129101839.36944ba5@nehalam
State: RFC, archived
Delegated to: David Miller
On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> The NAPI interface structure in current kernels is managed by the driver.
> As part of receive packet steering there is a requirement to add an
> additional parameter to this for the CPU map.  And this map needs to
> have an API to set it.
>
> The right way to do this in the kernel model is to make NAPI into
> a kobject and associate it back with the network device (parent).
> This isn't wildly difficult but does change some of the API for
> network device drivers because:
>    1. They need to handle another possible error on setup
>    2. NAPI object needs to be dynamically allocated
>       separately (not as part of netdev_priv)
>    3. Driver should pass index that can be used as part of
>       name (easier than scanning)
>
> Eventually, there will be:
>   /sys/class/net/eth0/napi0/
>       weight
>       cpumap

I think the NAPI objects should be created as children of a bus device
and then linked from the net device directories, e.g.

/sys/devices/.../
	net/eth0/
		napi0 -> ../../napi/napi0
	net/eth1/
		napi0 -> ../../napi/napi0
	napi/napi0/
		weight
		cpumap

I also think it might be helpful to use a more descriptive name here
than 'napi', e.g. 'channel'.  But maybe that can be left to management
tools.

> So here is a starting point patch that shows how the API might look.
>
> ---
>  include/linux/netdevice.h |   20 ++++++++++++++------
>  net/core/dev.c            |   28 ++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 8 deletions(-)
>
> --- a/include/linux/netdevice.h	2010-01-29 10:00:55.820739116 -0800
> +++ b/include/linux/netdevice.h	2010-01-29 10:15:33.098863437 -0800
> @@ -378,6 +378,8 @@ struct napi_struct {
>  	struct list_head	dev_list;
>  	struct sk_buff		*gro_list;
>  	struct sk_buff		*skb;
> +
> +	struct kobject		kobj;
>  };
>
>  enum {
> @@ -1037,25 +1039,31 @@ static inline void *netdev_priv(const st
>  #define SET_NETDEV_DEVTYPE(net, devtype)	((net)->dev.type = (devtype))
>
>  /**
> - *	netif_napi_add - initialize a napi context
> + *	netif_napi_init - initialize a napi context
>  *	@dev: network device
>  *	@napi: napi context
> + *	@index: queue number
>  *	@poll: polling function
>  *	@weight: default weight
>  *
> - * netif_napi_add() must be used to initialize a napi context prior to calling
> + * netif_napi_init() must be used to create a napi context prior to calling
>  *	*any* of the other napi related functions.
> + *
> + * in case of error, the context is not left in napi_list so it can
> + * be cleaned up by free_netdev, but is not valid for use.
>  */
> -void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> -		    int (*poll)(struct napi_struct *, int), int weight);
> +extern int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
> +			   unsigned index,
> +			   int (*poll)(struct napi_struct *, int), int weight);
[...]
> --- a/net/core/dev.c	2010-01-29 10:00:55.810739850 -0800
> +++ b/net/core/dev.c	2010-01-29 10:14:53.388864572 -0800
> @@ -2926,9 +2926,24 @@ void napi_complete(struct napi_struct *n
>  }
>  EXPORT_SYMBOL(napi_complete);
>
> -void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> +static void release_napi(struct kobject *kobj)
> +{
> +	struct napi_struct *napi
> +		= container_of(kobj, struct napi_struct, kobj);
> +	kfree(napi);
> +}
[...]
This means the napi_struct can no longer be embedded in a larger
struct.  So I think netif_napi_init() should actually allocate the
napi_struct and be named netif_napi_create().

Ben.
On Wed, Feb 03, 2010 at 09:23:36PM +0000, Ben Hutchings wrote:
> On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> > The NAPI interface structure in current kernels is managed by the driver.
> > As part of receive packet steering there is a requirement to add an
> > additional parameter to this for the CPU map.  And this map needs to
> > have an API to set it.
> >
> > The right way to do this in the kernel model is to make NAPI into
> > a kobject and associate it back with the network device (parent).
> > This isn't wildly difficult but does change some of the API for
> > network device drivers because:
> >    1. They need to handle another possible error on setup
> >    2. NAPI object needs to be dynamically allocated
> >       separately (not as part of netdev_priv)
> >    3. Driver should pass index that can be used as part of
> >       name (easier than scanning)

4. Lifetime rules become oh-so-fscking-interesting?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ben Hutchings wrote:
> On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
>> The NAPI interface structure in current kernels is managed by the driver.
>> As part of receive packet steering there is a requirement to add an
>> additional parameter to this for the CPU map.  And this map needs to
>> have an API to set it.
>>
>> The right way to do this in the kernel model is to make NAPI into
>> a kobject and associate it back with the network device (parent).
>> This isn't wildly difficult but does change some of the API for
>> network device drivers because:
>>    1. They need to handle another possible error on setup
>>    2. NAPI object needs to be dynamically allocated
>>       separately (not as part of netdev_priv)
>>    3. Driver should pass index that can be used as part of
>>       name (easier than scanning)
>>
>> Eventually, there will be:
>>   /sys/class/net/eth0/napi0/
>>       weight
>>       cpumap
>
> I think the NAPI objects should be created as children of a bus device
> and then linked from the net device directories, e.g.
>
> /sys/devices/.../
> 	net/eth0/
> 		napi0 -> ../../napi/napi0
> 	net/eth1/
> 		napi0 -> ../../napi/napi0
> 	napi/napi0/
> 		weight
> 		cpumap

This seems right.  Some drivers have a single napi object logically
associated with multiple interfaces.  Although the current architecture
forces us to associate the napi object with a single net_device, it
would be nice to move towards something that allows sharing a napi
object between multiple devices.

David Daney
On Wed, 3 Feb 2010 21:26:41 +0000
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Feb 03, 2010 at 09:23:36PM +0000, Ben Hutchings wrote:
> > On Fri, 2010-01-29 at 10:18 -0800, Stephen Hemminger wrote:
> > > The NAPI interface structure in current kernels is managed by the driver.
> > > As part of receive packet steering there is a requirement to add an
> > > additional parameter to this for the CPU map.  And this map needs to
> > > have an API to set it.
> > >
> > > The right way to do this in the kernel model is to make NAPI into
> > > a kobject and associate it back with the network device (parent).
> > > This isn't wildly difficult but does change some of the API for
> > > network device drivers because:
> > >    1. They need to handle another possible error on setup
> > >    2. NAPI object needs to be dynamically allocated
> > >       separately (not as part of netdev_priv)
> > >    3. Driver should pass index that can be used as part of
> > >       name (easier than scanning)
>
> 4. Lifetime rules become oh-so-fscking-interesting?

My original proposal doesn't go far enough.  I am doing a bigger
version that changes the API to:

	napi_alloc / napi_release

The lifetime problem isn't that bad because the napi object is a child
of the network device object.  Making it a child of a bus object is a
bigger problem, because then the network device can come and go.
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 29 Jan 2010 10:18:39 -0800

> As part of receive packet steering there is a requirement to add an
> additional parameter to this for the CPU map.

Hmmm, where did this come from?

The RPS maps are per-device.

I think I vaguely recall you "suggesting" that the RPS maps become
per-NAPI.

But, firstly, I didn't see any movement in that part of the
discussion.

And, secondly, I don't think this makes any sense at all.

Things are already overly complicated as it is.  Having the user know
what traffic goes to a particular RX queue (ie. NAPI instance) and set
the RPS map in some way specific to that RX queue is over the top.

If the issue is the case of sharing a NAPI instance between two
devices, there are a few other ways to deal with this.

One I would suggest is to simply clone the RPS map amongst the devices
sharing a NAPI instance.

I currently see NAPI kobjects as just an over-abstraction for a
perceived need rather than a real one.
On Wed, 03 Feb 2010 17:33:05 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Fri, 29 Jan 2010 10:18:39 -0800
>
> > As part of receive packet steering there is a requirement to add an
> > additional parameter to this for the CPU map.
>
> Hmmm, where did this come from?
>
> The RPS maps are per-device.
>
> I think I vaguely recall you "suggesting" that the RPS maps become
> per-NAPI.
>
> But, firstly, I didn't see any movement in that part of the
> discussion.
>
> And, secondly, I don't think this makes any sense at all.
>
> Things are already overly complicated as it is.  Having the user know
> what traffic goes to a particular RX queue (ie. NAPI instance) and set
> the RPS map in some way specific to that RX queue is over the top.
>
> If the issue is the case of sharing a NAPI instance between two
> devices, there are a few other ways to deal with this.
>
> One I would suggest is to simply clone the RPS map amongst the
> devices sharing a NAPI instance.
>
> I currently see NAPI kobjects as just an over-abstraction for a
> perceived need rather than a real one.

It started with doing RPS, and not wanting to implement the proposed
sysfs interface (anything doing get_token is a misuse of sysfs).

The usage model I see is wanting to have:
  1. only some cores being used for receive traffic
     on single Rx devices (NAPI)
  2. only some cores being used for receive traffic
     on legacy devices (non-NAPI)
  3. being able to configure a set of cpus with the same
     IRQ/cache when doing Rx multi-queue.  Assign an MSI-X
     IRQ per core and allow both HT siblings on a core to
     split that RX traffic.

All this should be manageable by some user utility like irqbalance.

#1 and #2 argue for a per-device map (like irq_affinity), but #3 is
harder; I am not sure of the right API for that.
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 3 Feb 2010 17:58:46 -0800

> The usage model I see is wanting to have:
>   1. only some cores being used for receive traffic
>      on single Rx devices (NAPI)
>   2. only some cores being used for receive traffic
>      on legacy devices (non-NAPI)
>   3. being able to configure a set of cpus with the same
>      IRQ/cache when doing Rx multi-queue.  Assign an MSI-X
>      IRQ per core and allow both HT siblings on a core to
>      split that RX traffic.
>
> All this should be manageable by some user utility like irqbalance.
>
> #1 and #2 argue for a per-device map (like irq_affinity), but #3 is
> harder; not sure of the right API for that.

Do you think the people setting RPS maps aren't capable of configuring
IRQ affinities?  :-)
--- a/include/linux/netdevice.h	2010-01-29 10:00:55.820739116 -0800
+++ b/include/linux/netdevice.h	2010-01-29 10:15:33.098863437 -0800
@@ -378,6 +378,8 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct sk_buff		*gro_list;
 	struct sk_buff		*skb;
+
+	struct kobject		kobj;
 };
 
 enum {
@@ -1037,25 +1039,31 @@ static inline void *netdev_priv(const st
 #define SET_NETDEV_DEVTYPE(net, devtype)	((net)->dev.type = (devtype))
 
 /**
- *	netif_napi_add - initialize a napi context
+ *	netif_napi_init - initialize a napi context
  *	@dev: network device
  *	@napi: napi context
+ *	@index: queue number
  *	@poll: polling function
  *	@weight: default weight
  *
- * netif_napi_add() must be used to initialize a napi context prior to calling
+ * netif_napi_init() must be used to create a napi context prior to calling
  *	*any* of the other napi related functions.
+ *
+ * in case of error, the context is not left in napi_list so it can
+ * be cleaned up by free_netdev, but is not valid for use.
 */
-void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
-		    int (*poll)(struct napi_struct *, int), int weight);
+extern int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
+			   unsigned index,
+			   int (*poll)(struct napi_struct *, int), int weight);
 
 /**
- *	netif_napi_del - remove a napi context
+ *	netif_napi_del - free a napi context
  *	@napi: napi context
  *
- * netif_napi_del() removes a napi context from the network device napi list
+ * netif_napi_del() removes a napi context from the network device napi list
+ * and frees it.
 */
-void netif_napi_del(struct napi_struct *napi);
+extern void netif_napi_del(struct napi_struct *napi);
 
 struct napi_gro_cb {
 	/* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
--- a/net/core/dev.c	2010-01-29 10:00:55.810739850 -0800
+++ b/net/core/dev.c	2010-01-29 10:14:53.388864572 -0800
@@ -2926,9 +2926,24 @@ void napi_complete(struct napi_struct *n
 }
 EXPORT_SYMBOL(napi_complete);
 
-void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+static void release_napi(struct kobject *kobj)
+{
+	struct napi_struct *napi
+		= container_of(kobj, struct napi_struct, kobj);
+	kfree(napi);
+}
+
+static struct kobj_type napi_ktype = {
+	/* insert future sysfs hooks ... */
+	.release	= release_napi,
+};
+
+int netif_napi_init(struct net_device *dev, struct napi_struct *napi,
+		    unsigned index,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
+	int err;
+
 	INIT_LIST_HEAD(&napi->poll_list);
 	napi->gro_count = 0;
 	napi->gro_list = NULL;
@@ -2941,9 +2956,16 @@ void netif_napi_add(struct net_device *d
 	spin_lock_init(&napi->poll_lock);
 	napi->poll_owner = -1;
 #endif
+
+	err = kobject_init_and_add(&napi->kobj, &napi_ktype,
+				   &dev->dev.kobj, "napi%d", index);
+	if (err)
+		return err;
+
 	set_bit(NAPI_STATE_SCHED, &napi->state);
+	return 0;
 }
-EXPORT_SYMBOL(netif_napi_add);
+EXPORT_SYMBOL(netif_napi_init);
 
 void netif_napi_del(struct napi_struct *napi)
 {
@@ -2960,6 +2982,8 @@ void netif_napi_del(struct napi_struct *
 	napi->gro_list = NULL;
 	napi->gro_count = 0;
+
+	kobject_put(&napi->kobj);
 }
 EXPORT_SYMBOL(netif_napi_del);