xps-mq: Transmit Packet Steering for multiqueue

Message ID alpine.DEB.1.00.1008222233530.31382@pokey.mtv.corp.google.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Tom Herbert Aug. 23, 2010, 5:39 a.m.
This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
used the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each array entry contains a bit map
of queues which that CPU can use.

We also allow the mapping of a socket to queue to be modified, for
instance if a thread is scheduled on a different CPU the desired queue
for transmitting packets would likely change.  To maintain in order
packet transmission a flag (ooo_okay) has been added to the sk_buf
structure.  If a transport layer sets this flag on a packet, the
transmit queue can be changed for this socket.  Presumably, the
transport would set this is there was no possbility of creating ooo
packets (for instance there are no packets in flight for the socket).
This patch includes the modification in TCP output for setting this
flag.

The benefits of XPS are improved locality in the per queue data
strutures.  Also, transmit completions are more likely to be done
nearer to the sending thread so this should promote locality back 
to the socket (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)	1015K at 99% CPU
   No XPS (16 queues)			1127K at 98% CPU

Signed-off-by: Tom Herbert <therbert@google.com>
---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ben Hutchings Aug. 23, 2010, 5:09 p.m. | #1
On Sun, 2010-08-22 at 22:39 -0700, Tom Herbert wrote:
[...]
> Each transmit queue can be associated with a number of CPUs which will
> used the queue to send packets.  This is configured as a CPU mask on a
> per queue basis in:
> 
> /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
> 
> The mappings are stored per device in an inverted data structure that
> maps CPUs to queues.  In the netdevice structure this is an array of
> num_possible_cpu structures where each array entry contains a bit map
> of queues which that CPU can use.
[...]

The mapping of TX queue to CPU should match the affinity of the
completion IRQ for that queue.  It should not be a separate setting.

Ben.
Ben Hutchings Aug. 23, 2010, 5:50 p.m. | #2
On Mon, 2010-08-23 at 10:30 -0700, Tom Herbert wrote:
> 
> 
> On Mon, Aug 23, 2010 at 10:09 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
>         On Sun, 2010-08-22 at 22:39 -0700, Tom Herbert wrote:
>         [...]
>         > Each transmit queue can be associated with a number of CPUs
>         which will
>         > used the queue to send packets.  This is configured as a CPU
>         mask on a
>         > per queue basis in:
>         >
>         > /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
>         >
>         > The mappings are stored per device in an inverted data
>         structure that
>         > maps CPUs to queues.  In the netdevice structure this is an
>         array of
>         > num_possible_cpu structures where each array entry contains
>         a bit map
>         > of queues which that CPU can use.
>         
>         [...]
>         
>         The mapping of TX queue to CPU should match the affinity of
>         the
>         completion IRQ for that queue.  It should not be a separate
>         setting.
>         
>         
>         
> That implies one possible configuration, but there are others.  For
> instance, there may be fewer queues than CPUs in which case a TX queue
> would be used by more than just the CPU handling the IRQ.

The affinity of IRQs is not restricted to a single CPU either, but I
take your point.

The IRQ affinity and the mapping of sender to queue do at least need to
be coordinated, and I think that continuing to add independent knobs for
CPU affinity of closely-related objects makes it too hard for
administrators to get this right.

Ben.
stephen hemminger Aug. 23, 2010, 5:59 p.m. | #3
On Sun, 22 Aug 2010 22:39:57 -0700 (PDT)
Tom Herbert <therbert@google.com> wrote:

> This patch implements transmit packet steering (XPS) for multiqueue
> devices.  XPS selects a transmit queue during packet transmission based
> on configuration.  This is done by mapping the CPU transmitting the
> packet to a queue.  This is the transmit side analogue to RPS-- where
> RPS is selecting a CPU based on receive queue, XPS selects a queue
> based on the CPU (previously there was an XPS patch from Eric
> Dumazet, but that might more appropriately be called transmit completion
> steering).
> 
> Each transmit queue can be associated with a number of CPUs which will
> used the queue to send packets.  This is configured as a CPU mask on a
> per queue basis in:
> 
> /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
> 
> The mappings are stored per device in an inverted data structure that
> maps CPUs to queues.  In the netdevice structure this is an array of
> num_possible_cpu structures where each array entry contains a bit map
> of queues which that CPU can use.
> 
> We also allow the mapping of a socket to queue to be modified, for
> instance if a thread is scheduled on a different CPU the desired queue
> for transmitting packets would likely change.  To maintain in order
> packet transmission a flag (ooo_okay) has been added to the sk_buf
> structure.  If a transport layer sets this flag on a packet, the
> transmit queue can be changed for this socket.  Presumably, the
> transport would set this is there was no possbility of creating ooo
> packets (for instance there are no packets in flight for the socket).
> This patch includes the modification in TCP output for setting this
> flag.
> 
> The benefits of XPS are improved locality in the per queue data
> strutures.  Also, transmit completions are more likely to be done
> nearer to the sending thread so this should promote locality back 
> to the socket (e.g. UDP).  The benefits of XPS are dependent on
> cache hierarchy, application load, and other factors.  XPS would
> nominally be configured so that a queue would only be shared by CPUs
> which are sharing a cache, the degenerative configuration woud be that
> each CPU has it's own queue.
> 
> Below are some benchmark results which show the potential benfit of
> this patch.  The netperf test has 500 instances of netperf TCP_RR test
> with 1 byte req. and resp.
> 
> bnx2x on 16 core AMD
>    XPS (16 queues, 1 TX queue per CPU)	1015K at 99% CPU
>    No XPS (16 queues)			1127K at 98% CPU
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 46c36ff..0ff6c9f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -497,6 +497,12 @@ struct netdev_queue {
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
>  	struct Qdisc		*qdisc_sleeping;
> +#ifdef CONFIG_RPS
> +	struct kobject		kobj;
> +	struct netdev_queue	*first;
> +	atomic_t		count;
> +#endif
> +
>  /*
>   * write mostly part
>   */
> @@ -524,6 +530,22 @@ struct rps_map {
>  #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
>  
>  /*
> + * This structure holds an XPS map which can be of variable length.  queues
> + * is an array of num_possible_cpus entries, where each entry is a mask of
> + * queues for that CPU (up to num_tx_queues bits for device).
> + */
> +struct xps_map {
> +	struct rcu_head rcu;
> +	unsigned long queues[0];
> +};
> +
> +#define QUEUE_MASK_SIZE(dev) (BITS_TO_LONGS(dev->num_tx_queues))
> +#define XPS_MAP_SIZE(dev) (sizeof(struct xps_map) + (num_possible_cpus() * \
> +    QUEUE_MASK_SIZE(dev) * sizeof(unsigned long)))
> +#define XPS_ENTRY(map, offset, dev) \
> +    (&map->queues[offset * QUEUE_MASK_SIZE(dev)])
> +
> +/*
>   * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
>   * tail pointer for that CPU's input queue at the time of last enqueue.
>   */
> @@ -978,6 +1000,9 @@ struct net_device {
>  	void			*rx_handler_data;
>  
>  	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
> +#ifdef CONFIG_RPS
> +	struct xps_map		*xps_maps;
> +#endif
>  
>  	/* Number of TX queues allocated at alloc_netdev_mq() time  */
>  	unsigned int		num_tx_queues;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f067c95..146df6f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -381,6 +381,7 @@ struct sk_buff {
>  #else
>  	__u8			deliver_no_wcard:1;
>  #endif
> +	__u8			ooo_okay:1;
>  	kmemcheck_bitfield_end(flags2);
>  
>  	/* 0/14 bit hole */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index da584f5..d23f9c4 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2054,6 +2054,60 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
>  	return queue_index;
>  }
>  
> +static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb,
> +				int queue_index)
> +{
> +	struct xps_map *maps;
> +	int cpu = smp_processor_id();
> +	u32 hash;
> +	unsigned long *queues;
> +	int weight, select;
> +
> +	rcu_read_lock();
> +	maps = rcu_dereference(dev->xps_maps);
> +
> +	if (!maps) {
> +		rcu_read_unlock();
> +		return queue_index;
> +	}
> +
> +	queues = XPS_ENTRY(maps, cpu, dev);
> +
> +	if (queue_index >= 0) {
> +		if (test_bit(queue_index, queues)) {
> +			rcu_read_unlock();
> +			return queue_index;
> +		}
> +	}
> +
> +	weight = bitmap_weight(queues, dev->real_num_tx_queues);
> +	switch (weight) {
> +	case 0:
> +		break;
> +	case 1:
> +		queue_index =
> +		    find_first_bit(queues, dev->real_num_tx_queues);
> +		break;
> +	default:
> +		if (skb->sk && skb->sk->sk_hash)
> +			hash = skb->sk->sk_hash;
> +		else
> +			hash = (__force u16) skb->protocol ^ skb->rxhash;
> +		hash = jhash_1word(hash, hashrnd);
> +
> +		select = ((u64) hash * weight) >> 32;
> +		queue_index =
> +		    find_first_bit(queues, dev->real_num_tx_queues);
> +		while (select--)
> +			queue_index = find_next_bit(queues,
> +			    dev->real_num_tx_queues, queue_index);
> +		break;
> +	}
> +
> +	rcu_read_unlock();
> +	return queue_index;
> +}
> +
>  static struct netdev_queue *dev_pick_tx(struct net_device *dev,
>  					struct sk_buff *skb)
>  {
> @@ -2061,23 +2115,30 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
>  	struct sock *sk = skb->sk;
>  
>  	queue_index = sk_tx_queue_get(sk);
> -	if (queue_index < 0) {
> +
> +	if (queue_index < 0 || (skb->ooo_okay && dev->real_num_tx_queues > 1)) {
>  		const struct net_device_ops *ops = dev->netdev_ops;
> +		int old_index = queue_index;
>  
>  		if (ops->ndo_select_queue) {
>  			queue_index = ops->ndo_select_queue(dev, skb);
>  			queue_index = dev_cap_txqueue(dev, queue_index);
>  		} else {
> -			queue_index = 0;
> -			if (dev->real_num_tx_queues > 1)
> -				queue_index = skb_tx_hash(dev, skb);
> +			if (dev->real_num_tx_queues > 1) {
> +				queue_index = get_xps_queue(dev,
> +				    skb, queue_index);
> +				if (queue_index < 0)
> +					queue_index = skb_tx_hash(dev, skb);
> +			} else
> +				queue_index = 0;
> +		}
>  
> -			if (sk) {
> -				struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1);
> +		if ((queue_index != old_index) && sk) {
> +			struct dst_entry *dst =
> +			    rcu_dereference_check(sk->sk_dst_cache, 1);
>  
> -				if (dst && skb_dst(skb) == dst)
> -					sk_tx_queue_set(sk, queue_index);
> -			}
> +			if (dst && skb_dst(skb) == dst)
> +				sk_tx_queue_set(sk, queue_index);
>  		}
>  	}
>  
> @@ -5429,6 +5490,15 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
>  	}
>  
>  #ifdef CONFIG_RPS
> +	atomic_set(&tx->count, queue_count);
> +
> +	/*
> +	 * Set a pointer to first element in the array which holds the
> +	 * reference count.
> +	 */
> +	for (i = 0; i < queue_count; i++)
> +		tx[i].first = tx;
> +
>  	rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
>  	if (!rx) {
>  		printk(KERN_ERR "alloc_netdev: Unable to allocate "
> @@ -5506,7 +5576,9 @@ void free_netdev(struct net_device *dev)
>  
>  	release_net(dev_net(dev));
>  
> +#ifndef CONFIG_RPS
>  	kfree(dev->_tx);
> +#endif
>  
>  	/* Flush device addresses */
>  	dev_addr_flush(dev);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index af4dfba..661c481 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -742,34 +742,295 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
>  	return error;
>  }
>  
> -static int rx_queue_register_kobjects(struct net_device *net)
> +/*
> + * netdev_queue sysfs structures and functions.
> + */
> +struct netdev_queue_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct netdev_queue *queue,
> +	    struct netdev_queue_attribute *attr, char *buf);
> +	ssize_t (*store)(struct netdev_queue *queue,
> +	    struct netdev_queue_attribute *attr, const char *buf, size_t len);
> +};
> +#define to_netdev_queue_attr(_attr) container_of(_attr,		\
> +    struct netdev_queue_attribute, attr)
> +
> +#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
> +
> +static ssize_t netdev_queue_attr_show(struct kobject *kobj,
> +				      struct attribute *attr, char *buf)
> +{
> +	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
> +	struct netdev_queue *queue = to_netdev_queue(kobj);
> +
> +	if (!attribute->show)
> +		return -EIO;
> +
> +	return attribute->show(queue, attribute, buf);
> +}
> +
> +static ssize_t netdev_queue_attr_store(struct kobject *kobj,
> +				       struct attribute *attr,
> +				       const char *buf, size_t count)
> +{
> +	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
> +	struct netdev_queue *queue = to_netdev_queue(kobj);
> +
> +	if (!attribute->store)
> +		return -EIO;
> +
> +	return attribute->store(queue, attribute, buf, count);
> +}
> +
> +static struct sysfs_ops netdev_queue_sysfs_ops = {
> +	.show = netdev_queue_attr_show,
> +	.store = netdev_queue_attr_store,
> +};
> +
> +static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
>  {
> +	struct net_device *dev = queue->dev;
> +	int i;
> +
> +	for (i = 0; i < dev->num_tx_queues; i++)
> +		if (queue == &dev->_tx[i])
> +			break;
> +
> +	BUG_ON(i >= dev->num_tx_queues);
> +
> +	return i;
> +}
> +
> +static ssize_t show_xps_map(struct netdev_queue *queue,
> +			    struct netdev_queue_attribute *attribute, char *buf)
> +{
> +	struct net_device *dev = queue->dev;
> +	struct xps_map *maps;
> +	cpumask_var_t mask;
> +	unsigned long *qmask, index;
> +	size_t len = 0;
>  	int i;
> +	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
> +
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	index = get_netdev_queue_index(queue);
> +
> +	rcu_read_lock();
> +	maps = rcu_dereference(dev->xps_maps);
> +	if (maps) {
> +		qmask = maps->queues;
> +		for (i = 0; i < num_possible_cpus(); i++) {
> +			if (test_bit(index, qmask))
> +				cpumask_set_cpu(i, mask);
> +			qmask += qmask_size;
> +		}
> +	}
> +	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
> +	if (PAGE_SIZE - len < 3) {
> +		rcu_read_unlock();
> +		free_cpumask_var(mask);
> +		return -EINVAL;
> +	}
> +	rcu_read_unlock();
> +
> +	free_cpumask_var(mask);
> +	len += sprintf(buf + len, "\n");
> +	return len;
> +}
> +
> +static void xps_map_release(struct rcu_head *rcu)
> +{
> +	struct xps_map *map = container_of(rcu, struct xps_map, rcu);
> +
> +	kfree(map);
> +}
> +
> +static DEFINE_MUTEX(xps_map_lock);
> +
> +static ssize_t store_xps_map(struct netdev_queue *queue,
> +		      struct netdev_queue_attribute *attribute,
> +		      const char *buf, size_t len)
> +{
> +	struct net_device *dev = queue->dev;
> +	struct xps_map *maps;
> +	cpumask_var_t mask;
> +	int err, i, nonempty = 0;
> +	unsigned long *qmask, index;
> +	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
> +	if (err) {
> +		free_cpumask_var(mask);
> +		return err;
> +	}
> +
> +	mutex_lock(&xps_map_lock);
> +
> +	maps = dev->xps_maps;
> +	if (!maps) {
> +		if (!cpumask_weight(mask)) {
> +			mutex_unlock(&xps_map_lock);
> +			free_cpumask_var(mask);
> +			return 0;
> +		}
> +		maps = kzalloc(XPS_MAP_SIZE(dev), GFP_KERNEL);
> +		if (!maps) {
> +			mutex_unlock(&xps_map_lock);
> +			free_cpumask_var(mask);
> +			return -ENOMEM;
> +		}
> +		rcu_assign_pointer(dev->xps_maps, maps);
> +	}
> +
> +	index = get_netdev_queue_index(queue);
> +
> +	qmask = maps->queues;
> +	for (i = 0; i < num_possible_cpus(); i++) {
> +		if (cpu_isset(i, *mask) && cpu_online(i)) {
> +			set_bit(index, qmask);
> +			nonempty = 1;
> +		} else
> +			clear_bit(index, qmask);
> +		if (!nonempty &&
> +		    bitmap_weight(qmask, dev->real_num_tx_queues))
> +			nonempty = 1;
> +		qmask += qmask_size;
> +	}
> +
> +	if (!nonempty) {
> +		rcu_assign_pointer(dev->xps_maps, NULL);
> +		call_rcu(&maps->rcu, xps_map_release);
> +	}
> +
> +	mutex_unlock(&xps_map_lock);
> +
> +	free_cpumask_var(mask);
> +	return len;
> +}
> +
> +static struct netdev_queue_attribute xps_cpus_attribute =
> +    __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
> +
> +static struct attribute *netdev_queue_default_attrs[] = {
> +	&xps_cpus_attribute.attr,
> +	NULL
> +};
> +
> +static void netdev_queue_release(struct kobject *kobj)
> +{
> +	struct netdev_queue *queue = to_netdev_queue(kobj);
> +	struct net_device *dev = queue->dev;
> +	struct netdev_queue *first = queue->first;
> +	struct xps_map *maps;
> +	unsigned long *qmask, index;
> +	int i, nonempty = 0;
> +	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
> +
> +	index = get_netdev_queue_index(queue);
> +
> +	mutex_lock(&xps_map_lock);
> +
> +	maps = dev->xps_maps;
> +
> +	if (maps) {
> +		qmask = maps->queues;
> +		for (i = 0; i < num_possible_cpus(); i++) {
> +			clear_bit(index, qmask);
> +			if (!nonempty &&
> +			    bitmap_weight(qmask, dev->real_num_tx_queues))
> +				nonempty = 1;
> +			qmask += qmask_size;
> +		}
> +
> +		if (!nonempty) {
> +			rcu_assign_pointer(dev->xps_maps, NULL);
> +			call_rcu(&maps->rcu, xps_map_release);
> +		}
> +	}
> +	mutex_unlock(&xps_map_lock);
> +
> +	if (atomic_dec_and_test(&first->count)) {
> +		kfree(first);
> +		dev_put(dev);
> +	}
> +}
> +
> +static struct kobj_type netdev_queue_ktype = {
> +	.sysfs_ops = &netdev_queue_sysfs_ops,
> +	.release = netdev_queue_release,
> +	.default_attrs = netdev_queue_default_attrs,
> +};
> +
> +static int netdev_queue_add_kobject(struct net_device *net, int index)
> +{
> +	struct netdev_queue *queue = net->_tx + index;
> +	struct kobject *kobj = &queue->kobj;
> +	int error = 0;
> +
> +	kobj->kset = net->queues_kset;
> +	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
> +	    "tx-%u", index);
> +	if (error) {
> +		kobject_put(kobj);
> +		return error;
> +	}
> +
> +	kobject_uevent(kobj, KOBJ_ADD);
> +
> +	return error;
> +}
> +
> +static int register_queue_kobjects(struct net_device *net)
> +{
> +	int rx = 0, tx = 0;
>  	int error = 0;
>  
>  	net->queues_kset = kset_create_and_add("queues",
>  	    NULL, &net->dev.kobj);
>  	if (!net->queues_kset)
>  		return -ENOMEM;
> -	for (i = 0; i < net->num_rx_queues; i++) {
> -		error = rx_queue_add_kobject(net, i);
> +
> +	for (rx = 0; rx < net->num_rx_queues; rx++) {
> +		error = rx_queue_add_kobject(net, rx);
>  		if (error)
> -			break;
> +			goto error;
>  	}
>  
> -	if (error)
> -		while (--i >= 0)
> -			kobject_put(&net->_rx[i].kobj);
> +	for (tx = 0; tx < net->num_tx_queues; tx++) {
> +		error = netdev_queue_add_kobject(net, tx);
> +		if (error)
> +			goto error;
> +	}
> +	dev_hold(net);
> +
> +	return error;
> +
> +error:
> +	while (--rx >= 0)
> +		kobject_put(&net->_rx[rx].kobj);
> +
> +	while (--tx >= 0)
> +		kobject_put(&net->_tx[tx].kobj);
>  
>  	return error;
>  }
>  
> -static void rx_queue_remove_kobjects(struct net_device *net)
> +static void remove_queue_kobjects(struct net_device *net)
>  {
>  	int i;
>  
>  	for (i = 0; i < net->num_rx_queues; i++)
>  		kobject_put(&net->_rx[i].kobj);
> +	for (i = 0; i < net->num_tx_queues; i++)
> +		kobject_put(&net->_tx[i].kobj);
>  	kset_unregister(net->queues_kset);
>  }
>  #endif /* CONFIG_RPS */
> @@ -871,7 +1132,7 @@ void netdev_unregister_kobject(struct net_device * net)
>  	kobject_get(&dev->kobj);
>  
>  #ifdef CONFIG_RPS
> -	rx_queue_remove_kobjects(net);
> +	remove_queue_kobjects(net);
>  #endif
>  
>  	device_del(dev);
> @@ -912,7 +1173,7 @@ int netdev_register_kobject(struct net_device *net)
>  		return error;
>  
>  #ifdef CONFIG_RPS
> -	error = rx_queue_register_kobjects(net);
> +	error = register_queue_kobjects(net);
>  	if (error) {
>  		device_del(dev);
>  		return error;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index de3bd84..80c1928 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -828,8 +828,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>  							   &md5);
>  	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
>  
> -	if (tcp_packets_in_flight(tp) == 0)
> +	if (tcp_packets_in_flight(tp) == 0) {
>  		tcp_ca_event(sk, CA_EVENT_TX_START);
> +		skb->ooo_okay = 1;
> +	}
>  
>  	skb_push(skb, tcp_header_size);
>  	skb_reset_transport_header(skb);

Why don't we do this in the normal transmit processing.
There is already so much policy mechanism filters/actions/qdisc that
doing it in higher level is fighting against these.
Bill Fink Aug. 24, 2010, 4:31 a.m. | #4
On Sun, 22 Aug 2010, Tom Herbert wrote:

> This patch implements transmit packet steering (XPS) for multiqueue
> devices.  XPS selects a transmit queue during packet transmission based
> on configuration.  This is done by mapping the CPU transmitting the
> packet to a queue.  This is the transmit side analogue to RPS-- where
> RPS is selecting a CPU based on receive queue, XPS selects a queue
> based on the CPU (previously there was an XPS patch from Eric
> Dumazet, but that might more appropriately be called transmit completion
> steering).
> 
> Each transmit queue can be associated with a number of CPUs which will
> used the queue to send packets.  This is configured as a CPU mask on a
> per queue basis in:
> 
> /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
> 
> The mappings are stored per device in an inverted data structure that
> maps CPUs to queues.  In the netdevice structure this is an array of
> num_possible_cpu structures where each array entry contains a bit map
> of queues which that CPU can use.
> 
> We also allow the mapping of a socket to queue to be modified, for
> instance if a thread is scheduled on a different CPU the desired queue
> for transmitting packets would likely change.  To maintain in order
> packet transmission a flag (ooo_okay) has been added to the sk_buf
> structure.  If a transport layer sets this flag on a packet, the
> transmit queue can be changed for this socket.  Presumably, the
> transport would set this is there was no possbility of creating ooo
> packets (for instance there are no packets in flight for the socket).
> This patch includes the modification in TCP output for setting this
> flag.
> 
> The benefits of XPS are improved locality in the per queue data
> strutures.  Also, transmit completions are more likely to be done
> nearer to the sending thread so this should promote locality back 
> to the socket (e.g. UDP).  The benefits of XPS are dependent on
> cache hierarchy, application load, and other factors.  XPS would
> nominally be configured so that a queue would only be shared by CPUs
> which are sharing a cache, the degenerative configuration woud be that
> each CPU has it's own queue.
> 
> Below are some benchmark results which show the potential benfit of
> this patch.  The netperf test has 500 instances of netperf TCP_RR test
> with 1 byte req. and resp.
> 
> bnx2x on 16 core AMD
>    XPS (16 queues, 1 TX queue per CPU)	1015K at 99% CPU
>    No XPS (16 queues)			1127K at 98% CPU

I don't grok your performance numbers.  What do the 1015K and 1127K
numbers represent?  I was originally guessing that they were basically
transactions per second, but that would seem to imply that the No XPS
case was better.  Please clarify.

						-Thanks

						-Bill
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Aug. 24, 2010, 4:37 a.m. | #5
1 byte req. and resp.
> >
> > bnx2x on 16 core AMD
> >    XPS (16 queues, 1 TX queue per CPU)        1015K at 99% CPU
> >    No XPS (16 queues)                 1127K at 98% CPU
>
> I don't grok your performance numbers.  What do the 1015K and 1127K
> numbers represent?  I was originally guessing that they were basically
> transactions per second, but that would seem to imply that the No XPS
> case was better.  Please clarify.
>
Yes, TPS and the numbers were switched... XPS case was better!

Thanks for pointing that out.

Tom

>                                                -Thanks
>
>                                                -Bill
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Sept. 1, 2010, 3:41 p.m. | #6
> Why don't we do this in the normal transmit processing.
> There is already so much policy mechanism filters/actions/qdisc that
> doing it in higher level is fighting against these.
>
Are you proposing that TX queue selection be done in the qdiscs?  The
queue has to be selected before taking the lock (cannot afford taking
a lock over the whole interface).  This would necessitate moving the
locking and probably rearranging a lot of the xmit code around that.

Tom
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 1, 2010, 3:54 p.m. | #7
Le mercredi 01 septembre 2010 à 08:41 -0700, Tom Herbert a écrit :
> > Why don't we do this in the normal transmit processing.
> > There is already so much policy mechanism filters/actions/qdisc that
> > doing it in higher level is fighting against these.
> >
> Are you proposing that TX queue selection be done in the qdiscs?  The
> queue has to be selected before taking the lock (cannot afford taking
> a lock over the whole interface).  This would necessitate moving the
> locking and probably rearranging a lot of the xmit code around that.

Stephen point is not adding yet another layer 'before' qdisc layer.

I would like something not as complex as your patch.

1) Why current selection fails ?

2) Could we change current selection to :

 - Use a lightweight selection, with no special configuration.

 - Use driver RX multiqueue information if available, in a one-to-one
relationship.

3) Eventually have a user selectable selection (socket option, or system
wide, but one sysctl, not many bitmasks ;) ).



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 1, 2010, 4:09 p.m. | #8
From: Tom Herbert <therbert@google.com>
Date: Wed, 1 Sep 2010 08:41:14 -0700

>> Why don't we do this in the normal transmit processing.
>> There is already so much policy mechanism filters/actions/qdisc that
>> doing it in higher level is fighting against these.
>>
> Are you proposing that TX queue selection be done in the qdiscs?  The
> queue has to be selected before taking the lock (cannot afford taking
> a lock over the whole interface).  This would necessitate moving the
> locking and probably rearranging a lot of the xmit code around that.

Right, we really have to pick queues before entering the qdisc for
the full benefit of lock seperation.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Sept. 1, 2010, 4:24 p.m. | #9
On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 01 septembre 2010 à 08:41 -0700, Tom Herbert a écrit :
>> > Why don't we do this in the normal transmit processing.
>> > There is already so much policy mechanism filters/actions/qdisc that
>> > doing it in higher level is fighting against these.
>> >
>> Are you proposing that TX queue selection be done in the qdiscs?  The
>> queue has to be selected before taking the lock (cannot afford taking
>> a lock over the whole interface).  This would necessitate moving the
>> locking and probably rearranging a lot of the xmit code around that.
>
> Stephen point is not adding yet another layer 'before' qdisc layer.
>
> I would like something not as complex as your patch.
>
> 1) Why current selection fails ?
>
Current selection does a hash on 4-tuple to map packets to queues.  So
any CPU can send on any queue which leads to cache line bouncing of
transmit structures.  Also when sending from one CPU to a queue whose
transmit interrupt is on a CPU in another cache domain cause more
cache line bouncing with transmit completion.  So while the current
scheme nicely distributes load across the queues, it does nothing  to
promote locality.  Getting some reasonable locality is where the
benefits come from that we are demonstrating.

> 2) Could we change current selection to :
>
>  - Use a lightweight selection, with no special configuration.
>
>  - Use driver RX multiqueue information if available, in a one-to-one
> relationship.
>
Not generally.  It's very possible that the only a subset of CPUs are
getting RX interrupts in multiqueue (consider when #queues < #CPUs),
so there's really not an obvious 1-1 relationship.  But each CPU can
send and should be mapped to at least one transmit queue; the most
obvious plan would be to send in a queue in the same cache domain.

> 3) Eventually have a user selectable selection (socket option, or system
> wide, but one sysctl, not many bitmasks ;) ).
>
Right, but it would also be nice if a single sysctl could optimally
set up multiqueue, RSS, RPS, and all my interrupt affinities for me
;-)

Tom
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 2, 2010, 1:32 a.m. | #10
From: Tom Herbert <therbert@google.com>
Date: Wed, 1 Sep 2010 09:24:18 -0700

> On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> 3) Eventually have a user selectable selection (socket option, or system
>> wide, but one sysctl, not many bitmasks ;) ).
>>
> Right, but it would also be nice if a single sysctl could optimally
> set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> ;-)

It's becomming increasingly obvious to me that we need (somewhere,
not necessarily the kernel) a complete datastructure representing
the NUMA, cache, cpu, device hierarchy.

And that can be used to tweak all of this stuff.

The policy should probably be in userspace, we just need to provide
the knobs in the kernel to tweak it however userspace wants.

Userspace should be able to, for example, move a TX queue into a
NUMA domain and have this invoke several side effects:

1) IRQs for that TX queue get rerouted to a cpu in the NUMA
   domain.

2) TX queue datastructures in the driver get reallocated using
   memory in that NUMA domain.

3) TX hashing is configured to use the set of cpus in the NUMA
   domain.

It's alot of tedious work and involves some delicate tasks figuring
out where each of these things go, but really then we'd solve all
of this crap one and for all.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger Sept. 2, 2010, 1:48 a.m. | #11
On Wed, 01 Sep 2010 18:32:51 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Wed, 1 Sep 2010 09:24:18 -0700
> 
> > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> 3) Eventually have a user selectable selection (socket option, or system
> >> wide, but one sysctl, not many bitmasks ;) ).
> >>
> > Right, but it would also be nice if a single sysctl could optimally
> > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > ;-)
> 
> It's becomming increasingly obvious to me that we need (somewhere,
> not necessarily the kernel) a complete datastructure representing
> the NUMA, cache, cpu, device hierarchy.
> 
> And that can be used to tweak all of this stuff.
> 
> The policy should probably be in userspace, we just need to provide
> the knobs in the kernel to tweak it however userspace wants.
> 
> Userspace should be able to, for example, move a TX queue into a
> NUMA domain and have this invoke several side effects:
> 
> 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
>    domain.
> 
> 2) TX queue datastructures in the driver get reallocated using
>    memory in that NUMA domain.
> 
> 3) TX hashing is configured to use the set of cpus in the NUMA
>    domain.
> 
> It's alot of tedious work and involves some delicate tasks figuring
> out where each of these things go, but really then we'd solve all
> of this crap one and for all.

Plus it needs to work with scheduler (not fight it).  All this doesn't
work very well if process keeps bouncing away from its resources.
stephen hemminger Sept. 2, 2010, 1:56 a.m. | #12
On Wed, 01 Sep 2010 18:32:51 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Wed, 1 Sep 2010 09:24:18 -0700
> 
> > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> 3) Eventually have a user selectable selection (socket option, or system
> >> wide, but one sysctl, not many bitmasks ;) ).
> >>
> > Right, but it would also be nice if a single sysctl could optimally
> > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > ;-)
> 
> It's becomming increasingly obvious to me that we need (somewhere,
> not necessarily the kernel) a complete datastructure representing
> the NUMA, cache, cpu, device hierarchy.
> 
> And that can be used to tweak all of this stuff.
> 
> The policy should probably be in userspace, we just need to provide
> the knobs in the kernel to tweak it however userspace wants.
> 
> Userspace should be able to, for example, move a TX queue into a
> NUMA domain and have this invoke several side effects:
> 
> 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
>    domain.
> 
> 2) TX queue datastructures in the driver get reallocated using
>    memory in that NUMA domain.
> 
> 3) TX hashing is configured to use the set of cpus in the NUMA
>    domain.
> 
> It's alot of tedious work and involves some delicate tasks figuring
> out where each of these things go, but really then we'd solve all
> of this crap one and for all.

Just to be contrarian :-) This same idea had started before when IBM
proposed a user-space NUMA API.  It never got any traction, the concept
of "lets make the applications NUMA aware" never got accepted because
it is so hard to do right and fragile that it was the wrong idea
to start with. The only people that can manage it are the engineers
tweeking a one off database benchmark.

I would rather see a "good enough" policy in the kernel that works
for everything from a single-core embedded system to a 100 core
server environment.  Forget the benchmarkers. The ideal solution
should work with a mix of traffic and adapt. Today the application
doesn't have to make a service level agreement with kernel
everytime it opens a TCP socket.

Doing it in userspace doesn't really help much. The API's keep changing
and the focus fades (see irqbalance).
Greg Lindahl Sept. 2, 2010, 6:41 a.m. | #13
On Wed, Sep 01, 2010 at 06:56:27PM -0700, Stephen Hemminger wrote:

> Just to be contrarian :-) This same idea had started before when IBM
> proposed a user-space NUMA API.  It never got any traction, the concept
> of "lets make the applications NUMA aware" never got accepted because
> it is so hard to do right and fragile that it was the wrong idea
> to start with. The only people that can manage it are the engineers
> tweeking a one off database benchmark.

As an non-database user-space example, there are many applications
which know about the typical 'first touch' locality policy for pages
and use that to be NUMA-aware. Just about every OpenMP program ever
written does that; it's even fairly portable among OSes.

A second user-level example is MPI implementations such as OpenMPI.
Those guys run 1 process per core and they don't need to move around,
so getting process locked to a core and all the pages in the right
place is a nice win without the MPI programmer doing anything.

For kernel (but non-Ethernet) networking examples, HPC interconnects
typically go out of their way to ensure locality of kernel pages
related to a given core's workload.  Examples include Myrinet's
OpenMX+MPI and the InfiniPath InfiniBand adapater, whatever QLogic
renamed it to this week (TrueScale, I suppose.) How can you get ~ 1
microsecond messages if you've got a buffer in the wrong place?  Or
achieve extremely high messaging rates when you're waiting for remote
memory all the time?

> I would rather see a "good enough" policy in the kernel that works
> for everything from a single-core embedded system to a 100 core
> server environment.

I'd like a pony. Yes, it's challenging to directly aapply the above
networking example to Ethernet networking, but there's a pony in there
somewhere.

-- greg


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loke, Chetan Sept. 2, 2010, 3:55 p.m. | #14
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of David Miller
> It's becomming increasingly obvious to me that we need (somewhere,
> not necessarily the kernel) a complete datastructure representing
> the NUMA, cache, cpu, device hierarchy.
> 
> And that can be used to tweak all of this stuff.
> 
> The policy should probably be in userspace, we just need to provide
> the knobs in the kernel to tweak it however userspace wants.
> 

I agree. But only if we have all the knobs in the kernel.

http://www.spinics.net/lists/linux-numa/msg00709.html

I had to do some changes manually. Thanks to Lee for pointing out the
dma-call.


> Userspace should be able to, for example, move a TX queue into a
> NUMA domain and have this invoke several side effects:
> 
> 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
>    domain.
> 
> 2) TX queue datastructures in the driver get reallocated using
>    memory in that NUMA domain.
> 
> 3) TX hashing is configured to use the set of cpus in the NUMA
>    domain.
> 
> It's alot of tedious work and involves some delicate tasks figuring
> out where each of these things go, but really then we'd solve all
> of this crap one and for all.
> --

The MSI-X+MQ combo is something that should be taken care off. We could
come up with an automatic-node-binding shim in the kernel and then all
the sub-systems can use that.

Basically, the userland should query 
i) the adapter-capabilities 
ii)get the node-binding info
iii)and then start affinitizing everything


Some older msi-x (non-ethernet)adapters assume the msi-x info to remain
static once the adapter is initialized. If we had an auto-node-binding
shim then that would mean re-initializing the adapter,correct?



Chetan Loke
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loke, Chetan Sept. 2, 2010, 4 p.m. | #15
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Stephen Hemminger
> Sent: September 01, 2010 9:48 PM
> To: David Miller
> Cc: therbert@google.com; eric.dumazet@gmail.com;
netdev@vger.kernel.org
> Subject: Re: [PATCH] xps-mq: Transmit Packet Steering for multiqueue
 
> Plus it needs to work with scheduler (not fight it).  All this doesn't
> work very well if process keeps bouncing away from its resources.
> 

userland folks who actually try to exploit the MQ/MSIX-ness will almost
always pin down their high-prio[or subset of] threads/processes.

Chetan Loke
 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loke, Chetan Sept. 2, 2010, 4:18 p.m. | #16
> Just to be contrarian :-) This same idea had started before when IBM
> proposed a user-space NUMA API.  It never got any traction, the
concept
> of "lets make the applications NUMA aware" never got accepted because
> it is so hard to do right and fragile that it was the wrong idea
> to start with. The only people that can manage it are the engineers
> tweeking a one off database benchmark.
> 

If you design an appliance then this would be one of the knobs that
people tinker with. To reap the actual benefits the whole stack(from
adapter's f/w to the userland-thread) needs to be astro-aligned. Almost
all the user-space guys that I've talked to never got it right because
they never understood the concepts behind NUMA-IOH-Mempolicy-MSI-X etc. 
So it's a little difficult for them to get it right the first time.
Also, Stoakley/Nehalem rigs got into mainstream ~2-3 years back? Before
that only a handful of engineers could experiment because the mobo's
were expensive. Plus, you didn't really have 10G fabric then.

Chetan




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert Sept. 2, 2010, 7:52 p.m. | #17
> userland folks who actually try to exploit the MQ/MSIX-ness will almost
> always pin down their high-prio[or subset of] threads/processes.
>
I don't really see that.  Pinning is a last resort and in this context
we could only do that on a dedicated server.  On a shared server, with
many different apps, pinning for MQ/MSIX is not an easy option;
meeting scheduler constraints will be the first priority and its up to
networking to work with the scheduler to to the right thing.
Scheduler aware networking (or vice versa) is important.

Tom
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loke, Chetan Sept. 2, 2010, 11:17 p.m. | #18
> From: Tom Herbert [mailto:therbert@google.com]
> Sent: September 02, 2010 3:53 PM
> To: Loke, Chetan
> Cc: Stephen Hemminger; David Miller; eric.dumazet@gmail.com;
> netdev@vger.kernel.org
> Subject: Re: [PATCH] xps-mq: Transmit Packet Steering for multiqueue
> 
> > userland folks who actually try to exploit the MQ/MSIX-ness will
> almost
> > always pin down their high-prio[or subset of] threads/processes.
> >
> I don't really see that.  Pinning is a last resort and in this context
> we could only do that on a dedicated server.  On a shared server, with
> many different apps, pinning for MQ/MSIX is not an easy option;
> meeting scheduler constraints will be the first priority and its up to
> networking to work with the scheduler to to the right thing.
> Scheduler aware networking (or vice versa) is important.
> 

For my use-case it's an appliance. Newer adapters might have like 64+(?)
h/w queues and not just f/w emulated queues. With these many queues you
could partition your threads/queues, no?
It's easier to get started that way. All you need is a shim(or just a
driver stub because you can then load it on any box that had older
kernels) in the kernel that will tell you which queue-set(for a
MQ-capable adapter) is still under the high-watermark. If all are full
then it should just round-robin(across queues and nodes).So make a
syscall(or shoot a mbx-cmd or pick your trick), find out which queue you
could use, get the binding info and then launch your threads. So once
you narrow down the scope, the scheduler will have less work to do.

If the worker threads are short lived then there's no point in this
binding. And for long-lived tasks, a couple of initial prep-calls will
not hurt performance that much.And if you still care about syscalls at
runtime then you could have a dedicated mgmt-thread that will receive
async-events from the shim. And all other user-land logic could consult
this mgmt-thread.


> Tom
Chetan Loke
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings Sept. 16, 2010, 9:52 p.m. | #19
On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Wed, 1 Sep 2010 09:24:18 -0700
> 
> > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> 3) Eventually have a user selectable selection (socket option, or system
> >> wide, but one sysctl, not many bitmasks ;) ).
> >>
> > Right, but it would also be nice if a single sysctl could optimally
> > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > ;-)
> 
> It's becomming increasingly obvious to me that we need (somewhere,
> not necessarily the kernel) a complete datastructure representing
> the NUMA, cache, cpu, device hierarchy.

And ideally a cheap way (not O(N^2)) to find the distance between 2 CPU
threads (not just nodes).

> And that can be used to tweak all of this stuff.
> 
> The policy should probably be in userspace, we just need to provide
> the knobs in the kernel to tweak it however userspace wants.
> 
> Userspace should be able to, for example, move a TX queue into a
> NUMA domain and have this invoke several side effects:
> 
> 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
>    domain.
> 
> 2) TX queue datastructures in the driver get reallocated using
>    memory in that NUMA domain.

I've actually done some work on an interface and implementation of this,
although I didn't include actually setting the IRQ affinity as there has
been pushback whenever people propose letting drivers set this.  If they
only do so as directed by the administrator this might be more
acceptable though.

Unfortunately in my limited testing on a 2-node system I didn't see a
whole lot of improvement in performance when the affinities were all
lined up.  I should try to get some time on a 4-node system.

> 3) TX hashing is configured to use the set of cpus in the NUMA
>    domain.
> 
> It's alot of tedious work and involves some delicate tasks figuring
> out where each of these things go, but really then we'd solve all
> of this crap one and for all.

Right.

The other thing I've been working on lately which sort of ties into this
is hardware acceleration of Receive Flow Steering.  Multiqueue NICs such
as ours tend to have RX flow filters as well as hashing.  So why not use
those to do a first level of steering?  We're going to do some more
internal testing and review but I hope to send out a first version of
this next week.

Ben.
Michael S. Tsirkin Sept. 19, 2010, 5:24 p.m. | #20
On Thu, Sep 16, 2010 at 10:52:41PM +0100, Ben Hutchings wrote:
> On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Wed, 1 Sep 2010 09:24:18 -0700
> > 
> > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >> 3) Eventually have a user selectable selection (socket option, or system
> > >> wide, but one sysctl, not many bitmasks ;) ).
> > >>
> > > Right, but it would also be nice if a single sysctl could optimally
> > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > > ;-)
> > 
> > It's becomming increasingly obvious to me that we need (somewhere,
> > not necessarily the kernel) a complete datastructure representing
> > the NUMA, cache, cpu, device hierarchy.
> 
> And ideally a cheap way (not O(N^2)) to find the distance between 2 CPU
> threads (not just nodes).
> 
> > And that can be used to tweak all of this stuff.
> > 
> > The policy should probably be in userspace, we just need to provide
> > the knobs in the kernel to tweak it however userspace wants.
> > 
> > Userspace should be able to, for example, move a TX queue into a
> > NUMA domain and have this invoke several side effects:
> > 
> > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
> >    domain.
> > 
> > 2) TX queue datastructures in the driver get reallocated using
> >    memory in that NUMA domain.
> 
> I've actually done some work on an interface and implementation of this,
> although I didn't include actually setting the IRQ affinity as there has
> been pushback whenever people propose letting drivers set this.  If they
> only do so as directed by the administrator this might be more
> acceptable though.
> 
> Unfortunately in my limited testing on a 2-node system I didn't see a
> whole lot of improvement in performance when the affinities were all
> lined up.  I should try to get some time on a 4-node system.

I've been trying to look into this as well.
It'd be very interesting to see the patches even if they don't show
good performance.  Could you post them?


> > 3) TX hashing is configured to use the set of cpus in the NUMA
> >    domain.
> > 
> > It's alot of tedious work and involves some delicate tasks figuring
> > out where each of these things go, but really then we'd solve all
> > of this crap one and for all.
> 
> Right.
> 
> The other thing I've been working on lately which sort of ties into this
> is hardware acceleration of Receive Flow Steering.  Multiqueue NICs such
> as ours tend to have RX flow filters as well as hashing.  So why not use
> those to do a first level of steering?  We're going to do some more
> internal testing and review but I hope to send out a first version of
> this next week.
> 
> Ben.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings Feb. 21, 2011, 6:19 p.m. | #21
On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Wed, 1 Sep 2010 09:24:18 -0700
> 
> > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> 3) Eventually have a user selectable selection (socket option, or system
> >> wide, but one sysctl, not many bitmasks ;) ).
> >>
> > Right, but it would also be nice if a single sysctl could optimally
> > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > ;-)
> 
> It's becomming increasingly obvious to me that we need (somewhere,
> not necessarily the kernel) a complete datastructure representing
> the NUMA, cache, cpu, device hierarchy.
> 
> And that can be used to tweak all of this stuff.
> 
> The policy should probably be in userspace, we just need to provide
> the knobs in the kernel to tweak it however userspace wants.
> 
> Userspace should be able to, for example, move a TX queue into a
> NUMA domain and have this invoke several side effects:

I think most of the pieces are now ready:

> 1) IRQs for that TX queue get rerouted to a cpu in the NUMA
>    domain.

There is a longstanding procfs interface for IRQ affinity, and userland
infrastructure built on it.  Adding a new interface would be contentious
and I have tried to build on it instead.

> 2) TX queue datastructures in the driver get reallocated using
>    memory in that NUMA domain.

I've previously sent patches to add an ethtool API for NUMA control,
which include the option to allocate on the same node where IRQs are
handled.  However, there is currently no function to allocate
DMA-coherent memory on a specified NUMA node (rather than the device's
node).  This is likely to be beneficial for event rings and might be
good for descriptor rings for some devices.  (The implementation I sent
for sfc mistakenly switched it to allocating non-coherent memory, for
which it *is* possible to specify the node.)

> 3) TX hashing is configured to use the set of cpus in the NUMA
>    domain.

I posted patches for automatic XPS configuration at the end of last
week.  And RFS acceleration covers the other direction.

Ben.

> It's alot of tedious work and involves some delicate tasks figuring
> out where each of these things go, but really then we'd solve all
> of this crap one and for all.
Jeremy Eder Feb. 21, 2011, 7:31 p.m. | #22
On Mon, 2011-02-21 at 18:19 +0000, Ben Hutchings wrote:
> On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
> > From: Tom Herbert <therbert@google.com>
> > Date: Wed, 1 Sep 2010 09:24:18 -0700
> > 
> > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >> 3) Eventually have a user selectable selection (socket option, or system
> > >> wide, but one sysctl, not many bitmasks ;) ).
> > >>
> > > Right, but it would also be nice if a single sysctl could optimally
> > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me
> > > ;-)

Are cgroups the right place to have the network stack pull "guidance"
from ?

If an app is bound to a socket/core/NUMA node; then the network stack
could inherit the cgroup's tuning and adjust kernel knobs accordingly.

This would not replace procfs tuning, but it seems like a natural
extension of how cgroups and RPS/RFS/XPS could be integrated to ease the
management burden that these new technologies might impose.

--jer
David Miller Feb. 26, 2011, 7:09 a.m. | #23
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 21 Feb 2011 18:19:55 +0000

> On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
>> 2) TX queue datastructures in the driver get reallocated using
>>    memory in that NUMA domain.
> 
> I've previously sent patches to add an ethtool API for NUMA control,
> which include the option to allocate on the same node where IRQs are
> handled.  However, there is currently no function to allocate
> DMA-coherent memory on a specified NUMA node (rather than the device's
> node).  This is likely to be beneficial for event rings and might be
> good for descriptor rings for some devices.  (The implementation I sent
> for sfc mistakenly switched it to allocating non-coherent memory, for
> which it *is* possible to specify the node.)

The thing to do is to work with someone like FUJITA Tomonori on this.

It's simply a matter of making new APIs that take the node specifier,
have the implementations either make use of or completely ignore the node,
and have the existing APIs pass in "-1" for the node or whatever the
CPP macro is for this :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 46c36ff..0ff6c9f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -497,6 +497,12 @@  struct netdev_queue {
 	struct Qdisc		*qdisc;
 	unsigned long		state;
 	struct Qdisc		*qdisc_sleeping;
+#ifdef CONFIG_RPS
+	struct kobject		kobj;
+	struct netdev_queue	*first;
+	atomic_t		count;
+#endif
+
 /*
  * write mostly part
  */
@@ -524,6 +530,22 @@  struct rps_map {
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
 /*
+ * This structure holds an XPS map which can be of variable length.  queues
+ * is an array of num_possible_cpus entries, where each entry is a mask of
+ * queues for that CPU (up to num_tx_queues bits for device).
+ */
+struct xps_map {
+	struct rcu_head rcu;
+	unsigned long queues[0];
+};
+
+#define QUEUE_MASK_SIZE(dev) (BITS_TO_LONGS(dev->num_tx_queues))
+#define XPS_MAP_SIZE(dev) (sizeof(struct xps_map) + (num_possible_cpus() * \
+    QUEUE_MASK_SIZE(dev) * sizeof(unsigned long)))
+#define XPS_ENTRY(map, offset, dev) \
+    (&map->queues[offset * QUEUE_MASK_SIZE(dev)])
+
+/*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
  * tail pointer for that CPU's input queue at the time of last enqueue.
  */
@@ -978,6 +1000,9 @@  struct net_device {
 	void			*rx_handler_data;
 
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
+#ifdef CONFIG_RPS
+	struct xps_map		*xps_maps;
+#endif
 
 	/* Number of TX queues allocated at alloc_netdev_mq() time  */
 	unsigned int		num_tx_queues;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f067c95..146df6f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -381,6 +381,7 @@  struct sk_buff {
 #else
 	__u8			deliver_no_wcard:1;
 #endif
+	__u8			ooo_okay:1;
 	kmemcheck_bitfield_end(flags2);
 
 	/* 0/14 bit hole */
diff --git a/net/core/dev.c b/net/core/dev.c
index da584f5..d23f9c4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2054,6 +2054,60 @@  static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 	return queue_index;
 }
 
+static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb,
+				int queue_index)
+{
+	struct xps_map *maps;
+	int cpu = smp_processor_id();
+	u32 hash;
+	unsigned long *queues;
+	int weight, select;
+
+	rcu_read_lock();
+	maps = rcu_dereference(dev->xps_maps);
+
+	if (!maps) {
+		rcu_read_unlock();
+		return queue_index;
+	}
+
+	queues = XPS_ENTRY(maps, cpu, dev);
+
+	if (queue_index >= 0) {
+		if (test_bit(queue_index, queues)) {
+			rcu_read_unlock();
+			return queue_index;
+		}
+	}
+
+	weight = bitmap_weight(queues, dev->real_num_tx_queues);
+	switch (weight) {
+	case 0:
+		break;
+	case 1:
+		queue_index =
+		    find_first_bit(queues, dev->real_num_tx_queues);
+		break;
+	default:
+		if (skb->sk && skb->sk->sk_hash)
+			hash = skb->sk->sk_hash;
+		else
+			hash = (__force u16) skb->protocol ^ skb->rxhash;
+		hash = jhash_1word(hash, hashrnd);
+
+		select = ((u64) hash * weight) >> 32;
+		queue_index =
+		    find_first_bit(queues, dev->real_num_tx_queues);
+		while (select--)
+			queue_index = find_next_bit(queues,
+			    dev->real_num_tx_queues, queue_index);
+		break;
+	}
+
+	rcu_read_unlock();
+	return queue_index;
+}
+
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
@@ -2061,23 +2115,30 @@  static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 	struct sock *sk = skb->sk;
 
 	queue_index = sk_tx_queue_get(sk);
-	if (queue_index < 0) {
+
+	if (queue_index < 0 || (skb->ooo_okay && dev->real_num_tx_queues > 1)) {
 		const struct net_device_ops *ops = dev->netdev_ops;
+		int old_index = queue_index;
 
 		if (ops->ndo_select_queue) {
 			queue_index = ops->ndo_select_queue(dev, skb);
 			queue_index = dev_cap_txqueue(dev, queue_index);
 		} else {
-			queue_index = 0;
-			if (dev->real_num_tx_queues > 1)
-				queue_index = skb_tx_hash(dev, skb);
+			if (dev->real_num_tx_queues > 1) {
+				queue_index = get_xps_queue(dev,
+				    skb, queue_index);
+				if (queue_index < 0)
+					queue_index = skb_tx_hash(dev, skb);
+			} else
+				queue_index = 0;
+		}
 
-			if (sk) {
-				struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1);
+		if ((queue_index != old_index) && sk) {
+			struct dst_entry *dst =
+			    rcu_dereference_check(sk->sk_dst_cache, 1);
 
-				if (dst && skb_dst(skb) == dst)
-					sk_tx_queue_set(sk, queue_index);
-			}
+			if (dst && skb_dst(skb) == dst)
+				sk_tx_queue_set(sk, queue_index);
 		}
 	}
 
@@ -5429,6 +5490,15 @@  struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	}
 
 #ifdef CONFIG_RPS
+	atomic_set(&tx->count, queue_count);
+
+	/*
+	 * Set a pointer to first element in the array which holds the
+	 * reference count.
+	 */
+	for (i = 0; i < queue_count; i++)
+		tx[i].first = tx;
+
 	rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
 	if (!rx) {
 		printk(KERN_ERR "alloc_netdev: Unable to allocate "
@@ -5506,7 +5576,9 @@  void free_netdev(struct net_device *dev)
 
 	release_net(dev_net(dev));
 
+#ifndef CONFIG_RPS
 	kfree(dev->_tx);
+#endif
 
 	/* Flush device addresses */
 	dev_addr_flush(dev);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index af4dfba..661c481 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -742,34 +742,295 @@  static int rx_queue_add_kobject(struct net_device *net, int index)
 	return error;
 }
 
-static int rx_queue_register_kobjects(struct net_device *net)
+/*
+ * netdev_queue sysfs structures and functions.
+ */
+struct netdev_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_queue *queue,
+	    struct netdev_queue_attribute *attr, const char *buf, size_t len);
+};
+#define to_netdev_queue_attr(_attr) container_of(_attr,		\
+    struct netdev_queue_attribute, attr)
+
+#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj)
+
+static ssize_t netdev_queue_attr_show(struct kobject *kobj,
+				      struct attribute *attr, char *buf)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	return attribute->show(queue, attribute, buf);
+}
+
+static ssize_t netdev_queue_attr_store(struct kobject *kobj,
+				       struct attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr);
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	return attribute->store(queue, attribute, buf, count);
+}
+
+static struct sysfs_ops netdev_queue_sysfs_ops = {
+	.show = netdev_queue_attr_show,
+	.store = netdev_queue_attr_store,
+};
+
+static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 {
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++)
+		if (queue == &dev->_tx[i])
+			break;
+
+	BUG_ON(i >= dev->num_tx_queues);
+
+	return i;
+}
+
+static ssize_t show_xps_map(struct netdev_queue *queue,
+			    struct netdev_queue_attribute *attribute, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_map *maps;
+	cpumask_var_t mask;
+	unsigned long *qmask, index;
+	size_t len = 0;
 	int i;
+	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	rcu_read_lock();
+	maps = rcu_dereference(dev->xps_maps);
+	if (maps) {
+		qmask = maps->queues;
+		for (i = 0; i < num_possible_cpus(); i++) {
+			if (test_bit(index, qmask))
+				cpumask_set_cpu(i, mask);
+			qmask += qmask_size;
+		}
+	}
+	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
+	if (PAGE_SIZE - len < 3) {
+		rcu_read_unlock();
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	free_cpumask_var(mask);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static void xps_map_release(struct rcu_head *rcu)
+{
+	struct xps_map *map = container_of(rcu, struct xps_map, rcu);
+
+	kfree(map);
+}
+
+static DEFINE_MUTEX(xps_map_lock);
+
+static ssize_t store_xps_map(struct netdev_queue *queue,
+		      struct netdev_queue_attribute *attribute,
+		      const char *buf, size_t len)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_map *maps;
+	cpumask_var_t mask;
+	int err, i, nonempty = 0;
+	unsigned long *qmask, index;
+	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
+	if (err) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	mutex_lock(&xps_map_lock);
+
+	maps = dev->xps_maps;
+	if (!maps) {
+		if (!cpumask_weight(mask)) {
+			mutex_unlock(&xps_map_lock);
+			free_cpumask_var(mask);
+			return 0;
+		}
+		maps = kzalloc(XPS_MAP_SIZE(dev), GFP_KERNEL);
+		if (!maps) {
+			mutex_unlock(&xps_map_lock);
+			free_cpumask_var(mask);
+			return -ENOMEM;
+		}
+		rcu_assign_pointer(dev->xps_maps, maps);
+	}
+
+	index = get_netdev_queue_index(queue);
+
+	qmask = maps->queues;
+	for (i = 0; i < num_possible_cpus(); i++) {
+		if (cpu_isset(i, *mask) && cpu_online(i)) {
+			set_bit(index, qmask);
+			nonempty = 1;
+		} else
+			clear_bit(index, qmask);
+		if (!nonempty &&
+		    bitmap_weight(qmask, dev->real_num_tx_queues))
+			nonempty = 1;
+		qmask += qmask_size;
+	}
+
+	if (!nonempty) {
+		rcu_assign_pointer(dev->xps_maps, NULL);
+		call_rcu(&maps->rcu, xps_map_release);
+	}
+
+	mutex_unlock(&xps_map_lock);
+
+	free_cpumask_var(mask);
+	return len;
+}
+
+static struct netdev_queue_attribute xps_cpus_attribute =
+    __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
+
+static struct attribute *netdev_queue_default_attrs[] = {
+	&xps_cpus_attribute.attr,
+	NULL
+};
+
+static void netdev_queue_release(struct kobject *kobj)
+{
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+	struct net_device *dev = queue->dev;
+	struct netdev_queue *first = queue->first;
+	struct xps_map *maps;
+	unsigned long *qmask, index;
+	int i, nonempty = 0;
+	unsigned int qmask_size = QUEUE_MASK_SIZE(dev);
+
+	index = get_netdev_queue_index(queue);
+
+	mutex_lock(&xps_map_lock);
+
+	maps = dev->xps_maps;
+
+	if (maps) {
+		qmask = maps->queues;
+		for (i = 0; i < num_possible_cpus(); i++) {
+			clear_bit(index, qmask);
+			if (!nonempty &&
+			    bitmap_weight(qmask, dev->real_num_tx_queues))
+				nonempty = 1;
+			qmask += qmask_size;
+		}
+
+		if (!nonempty) {
+			rcu_assign_pointer(dev->xps_maps, NULL);
+			call_rcu(&maps->rcu, xps_map_release);
+		}
+	}
+	mutex_unlock(&xps_map_lock);
+
+	if (atomic_dec_and_test(&first->count)) {
+		kfree(first);
+		dev_put(dev);
+	}
+}
+
+static struct kobj_type netdev_queue_ktype = {
+	.sysfs_ops = &netdev_queue_sysfs_ops,
+	.release = netdev_queue_release,
+	.default_attrs = netdev_queue_default_attrs,
+};
+
+static int netdev_queue_add_kobject(struct net_device *net, int index)
+{
+	struct netdev_queue *queue = net->_tx + index;
+	struct kobject *kobj = &queue->kobj;
+	int error = 0;
+
+	kobj->kset = net->queues_kset;
+	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
+	    "tx-%u", index);
+	if (error) {
+		kobject_put(kobj);
+		return error;
+	}
+
+	kobject_uevent(kobj, KOBJ_ADD);
+
+	return error;
+}
+
+static int register_queue_kobjects(struct net_device *net)
+{
+	int rx = 0, tx = 0;
 	int error = 0;
 
 	net->queues_kset = kset_create_and_add("queues",
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-	for (i = 0; i < net->num_rx_queues; i++) {
-		error = rx_queue_add_kobject(net, i);
+
+	for (rx = 0; rx < net->num_rx_queues; rx++) {
+		error = rx_queue_add_kobject(net, rx);
 		if (error)
-			break;
+			goto error;
 	}
 
-	if (error)
-		while (--i >= 0)
-			kobject_put(&net->_rx[i].kobj);
+	for (tx = 0; tx < net->num_tx_queues; tx++) {
+		error = netdev_queue_add_kobject(net, tx);
+		if (error)
+			goto error;
+	}
+	dev_hold(net);
+
+	return error;
+
+error:
+	while (--rx >= 0)
+		kobject_put(&net->_rx[rx].kobj);
+
+	while (--tx >= 0)
+		kobject_put(&net->_tx[tx].kobj);
 
 	return error;
 }
 
-static void rx_queue_remove_kobjects(struct net_device *net)
+static void remove_queue_kobjects(struct net_device *net)
 {
 	int i;
 
 	for (i = 0; i < net->num_rx_queues; i++)
 		kobject_put(&net->_rx[i].kobj);
+	for (i = 0; i < net->num_tx_queues; i++)
+		kobject_put(&net->_tx[i].kobj);
 	kset_unregister(net->queues_kset);
 }
 #endif /* CONFIG_RPS */
@@ -871,7 +1132,7 @@  void netdev_unregister_kobject(struct net_device * net)
 	kobject_get(&dev->kobj);
 
 #ifdef CONFIG_RPS
-	rx_queue_remove_kobjects(net);
+	remove_queue_kobjects(net);
 #endif
 
 	device_del(dev);
@@ -912,7 +1173,7 @@  int netdev_register_kobject(struct net_device *net)
 		return error;
 
 #ifdef CONFIG_RPS
-	error = rx_queue_register_kobjects(net);
+	error = register_queue_kobjects(net);
 	if (error) {
 		device_del(dev);
 		return error;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de3bd84..80c1928 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -828,8 +828,10 @@  static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
 
-	if (tcp_packets_in_flight(tp) == 0)
+	if (tcp_packets_in_flight(tp) == 0) {
 		tcp_ca_event(sk, CA_EVENT_TX_START);
+		skb->ooo_okay = 1;
+	}
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);