Message ID | alpine.DEB.1.00.1008222233530.31382@pokey.mtv.corp.google.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Sun, 2010-08-22 at 22:39 -0700, Tom Herbert wrote: [...] > Each transmit queue can be associated with a number of CPUs which will > used the queue to send packets. This is configured as a CPU mask on a > per queue basis in: > > /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus > > The mappings are stored per device in an inverted data structure that > maps CPUs to queues. In the netdevice structure this is an array of > num_possible_cpu structures where each array entry contains a bit map > of queues which that CPU can use. [...] The mapping of TX queue to CPU should match the affinity of the completion IRQ for that queue. It should not be a separate setting. Ben.
On Mon, 2010-08-23 at 10:30 -0700, Tom Herbert wrote: > > > On Mon, Aug 23, 2010 at 10:09 AM, Ben Hutchings > <bhutchings@solarflare.com> wrote: > On Sun, 2010-08-22 at 22:39 -0700, Tom Herbert wrote: > [...] > > Each transmit queue can be associated with a number of CPUs > which will > > used the queue to send packets. This is configured as a CPU > mask on a > > per queue basis in: > > > > /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus > > > > The mappings are stored per device in an inverted data > structure that > > maps CPUs to queues. In the netdevice structure this is an > array of > > num_possible_cpu structures where each array entry contains > a bit map > > of queues which that CPU can use. > > [...] > > The mapping of TX queue to CPU should match the affinity of > the > completion IRQ for that queue. It should not be a separate > setting. > > > > That implies one possible configuration, but there are others. For > instance, there may be fewer queues than CPUs in which case a TX queue > would be used by more than just the CPU handling the IRQ. The affinity of IRQs is not restricted to a single CPU either, but I take your point. The IRQ affinity and the mapping of sender to queue do at least need to be coordinated, and I think that continuing to add independent knobs for CPU affinity of closely-related objects makes it too hard for administrators to get this right. Ben.
On Sun, 22 Aug 2010 22:39:57 -0700 (PDT) Tom Herbert <therbert@google.com> wrote: > This patch implements transmit packet steering (XPS) for multiqueue > devices. XPS selects a transmit queue during packet transmission based > on configuration. This is done by mapping the CPU transmitting the > packet to a queue. This is the transmit side analogue to RPS-- where > RPS is selecting a CPU based on receive queue, XPS selects a queue > based on the CPU (previously there was an XPS patch from Eric > Dumazet, but that might more appropriately be called transmit completion > steering). > > Each transmit queue can be associated with a number of CPUs which will > used the queue to send packets. This is configured as a CPU mask on a > per queue basis in: > > /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus > > The mappings are stored per device in an inverted data structure that > maps CPUs to queues. In the netdevice structure this is an array of > num_possible_cpu structures where each array entry contains a bit map > of queues which that CPU can use. > > We also allow the mapping of a socket to queue to be modified, for > instance if a thread is scheduled on a different CPU the desired queue > for transmitting packets would likely change. To maintain in order > packet transmission a flag (ooo_okay) has been added to the sk_buf > structure. If a transport layer sets this flag on a packet, the > transmit queue can be changed for this socket. Presumably, the > transport would set this is there was no possbility of creating ooo > packets (for instance there are no packets in flight for the socket). > This patch includes the modification in TCP output for setting this > flag. > > The benefits of XPS are improved locality in the per queue data > strutures. Also, transmit completions are more likely to be done > nearer to the sending thread so this should promote locality back > to the socket (e.g. UDP). The benefits of XPS are dependent on > cache hierarchy, application load, and other factors. XPS would > nominally be configured so that a queue would only be shared by CPUs > which are sharing a cache, the degenerative configuration woud be that > each CPU has it's own queue. > > Below are some benchmark results which show the potential benfit of > this patch. The netperf test has 500 instances of netperf TCP_RR test > with 1 byte req. and resp. > > bnx2x on 16 core AMD > XPS (16 queues, 1 TX queue per CPU) 1015K at 99% CPU > No XPS (16 queues) 1127K at 98% CPU > > Signed-off-by: Tom Herbert <therbert@google.com> > --- > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 46c36ff..0ff6c9f 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -497,6 +497,12 @@ struct netdev_queue { > struct Qdisc *qdisc; > unsigned long state; > struct Qdisc *qdisc_sleeping; > +#ifdef CONFIG_RPS > + struct kobject kobj; > + struct netdev_queue *first; > + atomic_t count; > +#endif > + > /* > * write mostly part > */ > @@ -524,6 +530,22 @@ struct rps_map { > #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16))) > > /* > + * This structure holds an XPS map which can be of variable length. queues > + * is an array of num_possible_cpus entries, where each entry is a mask of > + * queues for that CPU (up to num_tx_queues bits for device). > + */ > +struct xps_map { > + struct rcu_head rcu; > + unsigned long queues[0]; > +}; > + > +#define QUEUE_MASK_SIZE(dev) (BITS_TO_LONGS(dev->num_tx_queues)) > +#define XPS_MAP_SIZE(dev) (sizeof(struct xps_map) + (num_possible_cpus() * \ > + QUEUE_MASK_SIZE(dev) * sizeof(unsigned long))) > +#define XPS_ENTRY(map, offset, dev) \ > + (&map->queues[offset * QUEUE_MASK_SIZE(dev)]) > + > +/* > * The rps_dev_flow structure contains the mapping of a flow to a CPU and the > * tail pointer for that CPU's input queue at the time of last enqueue. > */ > @@ -978,6 +1000,9 @@ struct net_device { > void *rx_handler_data; > > struct netdev_queue *_tx ____cacheline_aligned_in_smp; > +#ifdef CONFIG_RPS > + struct xps_map *xps_maps; > +#endif > > /* Number of TX queues allocated at alloc_netdev_mq() time */ > unsigned int num_tx_queues; > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index f067c95..146df6f 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -381,6 +381,7 @@ struct sk_buff { > #else > __u8 deliver_no_wcard:1; > #endif > + __u8 ooo_okay:1; > kmemcheck_bitfield_end(flags2); > > /* 0/14 bit hole */ > diff --git a/net/core/dev.c b/net/core/dev.c > index da584f5..d23f9c4 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2054,6 +2054,60 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index) > return queue_index; > } > > +static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb, > + int queue_index) > +{ > + struct xps_map *maps; > + int cpu = smp_processor_id(); > + u32 hash; > + unsigned long *queues; > + int weight, select; > + > + rcu_read_lock(); > + maps = rcu_dereference(dev->xps_maps); > + > + if (!maps) { > + rcu_read_unlock(); > + return queue_index; > + } > + > + queues = XPS_ENTRY(maps, cpu, dev); > + > + if (queue_index >= 0) { > + if (test_bit(queue_index, queues)) { > + rcu_read_unlock(); > + return queue_index; > + } > + } > + > + weight = bitmap_weight(queues, dev->real_num_tx_queues); > + switch (weight) { > + case 0: > + break; > + case 1: > + queue_index = > + find_first_bit(queues, dev->real_num_tx_queues); > + break; > + default: > + if (skb->sk && skb->sk->sk_hash) > + hash = skb->sk->sk_hash; > + else > + hash = (__force u16) skb->protocol ^ skb->rxhash; > + hash = jhash_1word(hash, hashrnd); > + > + select = ((u64) hash * weight) >> 32; > + queue_index = > + find_first_bit(queues, dev->real_num_tx_queues); > + while (select--) > + queue_index = find_next_bit(queues, > + dev->real_num_tx_queues, queue_index); > + break; > + } > + > + rcu_read_unlock(); > + return queue_index; > +} > + > static struct netdev_queue *dev_pick_tx(struct net_device *dev, > struct sk_buff *skb) > { > @@ -2061,23 +2115,30 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev, > struct sock *sk = skb->sk; > > queue_index = sk_tx_queue_get(sk); > - if (queue_index < 0) { > + > + if (queue_index < 0 || (skb->ooo_okay && dev->real_num_tx_queues > 1)) { > const struct net_device_ops *ops = dev->netdev_ops; > + int old_index = queue_index; > > if (ops->ndo_select_queue) { > queue_index = ops->ndo_select_queue(dev, skb); > queue_index = dev_cap_txqueue(dev, queue_index); > } else { > - queue_index = 0; > - if (dev->real_num_tx_queues > 1) > - queue_index = skb_tx_hash(dev, skb); > + if (dev->real_num_tx_queues > 1) { > + queue_index = get_xps_queue(dev, > + skb, queue_index); > + if (queue_index < 0) > + queue_index = skb_tx_hash(dev, skb); > + } else > + queue_index = 0; > + } > > - if (sk) { > - struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1); > + if ((queue_index != old_index) && sk) { > + struct dst_entry *dst = > + rcu_dereference_check(sk->sk_dst_cache, 1); > > - if (dst && skb_dst(skb) == dst) > - sk_tx_queue_set(sk, queue_index); > - } > + if (dst && skb_dst(skb) == dst) > + sk_tx_queue_set(sk, queue_index); > } > } > > @@ -5429,6 +5490,15 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, > } > > #ifdef CONFIG_RPS > + atomic_set(&tx->count, queue_count); > + > + /* > + * Set a pointer to first element in the array which holds the > + * reference count. > + */ > + for (i = 0; i < queue_count; i++) > + tx[i].first = tx; > + > rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL); > if (!rx) { > printk(KERN_ERR "alloc_netdev: Unable to allocate " > @@ -5506,7 +5576,9 @@ void free_netdev(struct net_device *dev) > > release_net(dev_net(dev)); > > +#ifndef CONFIG_RPS > kfree(dev->_tx); > +#endif > > /* Flush device addresses */ > dev_addr_flush(dev); > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index af4dfba..661c481 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -742,34 +742,295 @@ static int rx_queue_add_kobject(struct net_device *net, int index) > return error; > } > > -static int rx_queue_register_kobjects(struct net_device *net) > +/* > + * netdev_queue sysfs structures and functions. > + */ > +struct netdev_queue_attribute { > + struct attribute attr; > + ssize_t (*show)(struct netdev_queue *queue, > + struct netdev_queue_attribute *attr, char *buf); > + ssize_t (*store)(struct netdev_queue *queue, > + struct netdev_queue_attribute *attr, const char *buf, size_t len); > +}; > +#define to_netdev_queue_attr(_attr) container_of(_attr, \ > + struct netdev_queue_attribute, attr) > + > +#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj) > + > +static ssize_t netdev_queue_attr_show(struct kobject *kobj, > + struct attribute *attr, char *buf) > +{ > + struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr); > + struct netdev_queue *queue = to_netdev_queue(kobj); > + > + if (!attribute->show) > + return -EIO; > + > + return attribute->show(queue, attribute, buf); > +} > + > +static ssize_t netdev_queue_attr_store(struct kobject *kobj, > + struct attribute *attr, > + const char *buf, size_t count) > +{ > + struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr); > + struct netdev_queue *queue = to_netdev_queue(kobj); > + > + if (!attribute->store) > + return -EIO; > + > + return attribute->store(queue, attribute, buf, count); > +} > + > +static struct sysfs_ops netdev_queue_sysfs_ops = { > + .show = netdev_queue_attr_show, > + .store = netdev_queue_attr_store, > +}; > + > +static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue) > { > + struct net_device *dev = queue->dev; > + int i; > + > + for (i = 0; i < dev->num_tx_queues; i++) > + if (queue == &dev->_tx[i]) > + break; > + > + BUG_ON(i >= dev->num_tx_queues); > + > + return i; > +} > + > +static ssize_t show_xps_map(struct netdev_queue *queue, > + struct netdev_queue_attribute *attribute, char *buf) > +{ > + struct net_device *dev = queue->dev; > + struct xps_map *maps; > + cpumask_var_t mask; > + unsigned long *qmask, index; > + size_t len = 0; > int i; > + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); > + > + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) > + return -ENOMEM; > + > + index = get_netdev_queue_index(queue); > + > + rcu_read_lock(); > + maps = rcu_dereference(dev->xps_maps); > + if (maps) { > + qmask = maps->queues; > + for (i = 0; i < num_possible_cpus(); i++) { > + if (test_bit(index, qmask)) > + cpumask_set_cpu(i, mask); > + qmask += qmask_size; > + } > + } > + len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask); > + if (PAGE_SIZE - len < 3) { > + rcu_read_unlock(); > + free_cpumask_var(mask); > + return -EINVAL; > + } > + rcu_read_unlock(); > + > + free_cpumask_var(mask); > + len += sprintf(buf + len, "\n"); > + return len; > +} > + > +static void xps_map_release(struct rcu_head *rcu) > +{ > + struct xps_map *map = container_of(rcu, struct xps_map, rcu); > + > + kfree(map); > +} > + > +static DEFINE_MUTEX(xps_map_lock); > + > +static ssize_t store_xps_map(struct netdev_queue *queue, > + struct netdev_queue_attribute *attribute, > + const char *buf, size_t len) > +{ > + struct net_device *dev = queue->dev; > + struct xps_map *maps; > + cpumask_var_t mask; > + int err, i, nonempty = 0; > + unsigned long *qmask, index; > + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); > + > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; > + > + if (!alloc_cpumask_var(&mask, GFP_KERNEL)) > + return -ENOMEM; > + > + err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits); > + if (err) { > + free_cpumask_var(mask); > + return err; > + } > + > + mutex_lock(&xps_map_lock); > + > + maps = dev->xps_maps; > + if (!maps) { > + if (!cpumask_weight(mask)) { > + mutex_unlock(&xps_map_lock); > + free_cpumask_var(mask); > + return 0; > + } > + maps = kzalloc(XPS_MAP_SIZE(dev), GFP_KERNEL); > + if (!maps) { > + mutex_unlock(&xps_map_lock); > + free_cpumask_var(mask); > + return -ENOMEM; > + } > + rcu_assign_pointer(dev->xps_maps, maps); > + } > + > + index = get_netdev_queue_index(queue); > + > + qmask = maps->queues; > + for (i = 0; i < num_possible_cpus(); i++) { > + if (cpu_isset(i, *mask) && cpu_online(i)) { > + set_bit(index, qmask); > + nonempty = 1; > + } else > + clear_bit(index, qmask); > + if (!nonempty && > + bitmap_weight(qmask, dev->real_num_tx_queues)) > + nonempty = 1; > + qmask += qmask_size; > + } > + > + if (!nonempty) { > + rcu_assign_pointer(dev->xps_maps, NULL); > + call_rcu(&maps->rcu, xps_map_release); > + } > + > + mutex_unlock(&xps_map_lock); > + > + free_cpumask_var(mask); > + return len; > +} > + > +static struct netdev_queue_attribute xps_cpus_attribute = > + __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map); > + > +static struct attribute *netdev_queue_default_attrs[] = { > + &xps_cpus_attribute.attr, > + NULL > +}; > + > +static void netdev_queue_release(struct kobject *kobj) > +{ > + struct netdev_queue *queue = to_netdev_queue(kobj); > + struct net_device *dev = queue->dev; > + struct netdev_queue *first = queue->first; > + struct xps_map *maps; > + unsigned long *qmask, index; > + int i, nonempty = 0; > + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); > + > + index = get_netdev_queue_index(queue); > + > + mutex_lock(&xps_map_lock); > + > + maps = dev->xps_maps; > + > + if (maps) { > + qmask = maps->queues; > + for (i = 0; i < num_possible_cpus(); i++) { > + clear_bit(index, qmask); > + if (!nonempty && > + bitmap_weight(qmask, dev->real_num_tx_queues)) > + nonempty = 1; > + qmask += qmask_size; > + } > + > + if (!nonempty) { > + rcu_assign_pointer(dev->xps_maps, NULL); > + call_rcu(&maps->rcu, xps_map_release); > + } > + } > + mutex_unlock(&xps_map_lock); > + > + if (atomic_dec_and_test(&first->count)) { > + kfree(first); > + dev_put(dev); > + } > +} > + > +static struct kobj_type netdev_queue_ktype = { > + .sysfs_ops = &netdev_queue_sysfs_ops, > + .release = netdev_queue_release, > + .default_attrs = netdev_queue_default_attrs, > +}; > + > +static int netdev_queue_add_kobject(struct net_device *net, int index) > +{ > + struct netdev_queue *queue = net->_tx + index; > + struct kobject *kobj = &queue->kobj; > + int error = 0; > + > + kobj->kset = net->queues_kset; > + error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL, > + "tx-%u", index); > + if (error) { > + kobject_put(kobj); > + return error; > + } > + > + kobject_uevent(kobj, KOBJ_ADD); > + > + return error; > +} > + > +static int register_queue_kobjects(struct net_device *net) > +{ > + int rx = 0, tx = 0; > int error = 0; > > net->queues_kset = kset_create_and_add("queues", > NULL, &net->dev.kobj); > if (!net->queues_kset) > return -ENOMEM; > - for (i = 0; i < net->num_rx_queues; i++) { > - error = rx_queue_add_kobject(net, i); > + > + for (rx = 0; rx < net->num_rx_queues; rx++) { > + error = rx_queue_add_kobject(net, rx); > if (error) > - break; > + goto error; > } > > - if (error) > - while (--i >= 0) > - kobject_put(&net->_rx[i].kobj); > + for (tx = 0; tx < net->num_tx_queues; tx++) { > + error = netdev_queue_add_kobject(net, tx); > + if (error) > + goto error; > + } > + dev_hold(net); > + > + return error; > + > +error: > + while (--rx >= 0) > + kobject_put(&net->_rx[rx].kobj); > + > + while (--tx >= 0) > + kobject_put(&net->_tx[tx].kobj); > > return error; > } > > -static void rx_queue_remove_kobjects(struct net_device *net) > +static void remove_queue_kobjects(struct net_device *net) > { > int i; > > for (i = 0; i < net->num_rx_queues; i++) > kobject_put(&net->_rx[i].kobj); > + for (i = 0; i < net->num_tx_queues; i++) > + kobject_put(&net->_tx[i].kobj); > kset_unregister(net->queues_kset); > } > #endif /* CONFIG_RPS */ > @@ -871,7 +1132,7 @@ void netdev_unregister_kobject(struct net_device * net) > kobject_get(&dev->kobj); > > #ifdef CONFIG_RPS > - rx_queue_remove_kobjects(net); > + remove_queue_kobjects(net); > #endif > > device_del(dev); > @@ -912,7 +1173,7 @@ int netdev_register_kobject(struct net_device *net) > return error; > > #ifdef CONFIG_RPS > - error = rx_queue_register_kobjects(net); > + error = register_queue_kobjects(net); > if (error) { > device_del(dev); > return error; > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index de3bd84..80c1928 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -828,8 +828,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, > &md5); > tcp_header_size = tcp_options_size + sizeof(struct tcphdr); > > - if (tcp_packets_in_flight(tp) == 0) > + if (tcp_packets_in_flight(tp) == 0) { > tcp_ca_event(sk, CA_EVENT_TX_START); > + skb->ooo_okay = 1; > + } > > skb_push(skb, tcp_header_size); > skb_reset_transport_header(skb); Why don't we do this in the normal transmit processing. There is already so much policy mechanism filters/actions/qdisc that doing it in higher level is fighting against these.
On Sun, 22 Aug 2010, Tom Herbert wrote: > This patch implements transmit packet steering (XPS) for multiqueue > devices. XPS selects a transmit queue during packet transmission based > on configuration. This is done by mapping the CPU transmitting the > packet to a queue. This is the transmit side analogue to RPS-- where > RPS is selecting a CPU based on receive queue, XPS selects a queue > based on the CPU (previously there was an XPS patch from Eric > Dumazet, but that might more appropriately be called transmit completion > steering). > > Each transmit queue can be associated with a number of CPUs which will > used the queue to send packets. This is configured as a CPU mask on a > per queue basis in: > > /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus > > The mappings are stored per device in an inverted data structure that > maps CPUs to queues. In the netdevice structure this is an array of > num_possible_cpu structures where each array entry contains a bit map > of queues which that CPU can use. > > We also allow the mapping of a socket to queue to be modified, for > instance if a thread is scheduled on a different CPU the desired queue > for transmitting packets would likely change. To maintain in order > packet transmission a flag (ooo_okay) has been added to the sk_buf > structure. If a transport layer sets this flag on a packet, the > transmit queue can be changed for this socket. Presumably, the > transport would set this is there was no possbility of creating ooo > packets (for instance there are no packets in flight for the socket). > This patch includes the modification in TCP output for setting this > flag. > > The benefits of XPS are improved locality in the per queue data > strutures. Also, transmit completions are more likely to be done > nearer to the sending thread so this should promote locality back > to the socket (e.g. UDP). The benefits of XPS are dependent on > cache hierarchy, application load, and other factors. XPS would > nominally be configured so that a queue would only be shared by CPUs > which are sharing a cache, the degenerative configuration woud be that > each CPU has it's own queue. > > Below are some benchmark results which show the potential benfit of > this patch. The netperf test has 500 instances of netperf TCP_RR test > with 1 byte req. and resp. > > bnx2x on 16 core AMD > XPS (16 queues, 1 TX queue per CPU) 1015K at 99% CPU > No XPS (16 queues) 1127K at 98% CPU I don't grok your performance numbers. What do the 1015K and 1127K numbers represent? I was originally guessing that they were basically transactions per second, but that would seem to imply that the No XPS case was better. Please clarify. -Thanks -Bill -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
1 byte req. and resp. > > > > bnx2x on 16 core AMD > > XPS (16 queues, 1 TX queue per CPU) 1015K at 99% CPU > > No XPS (16 queues) 1127K at 98% CPU > > I don't grok your performance numbers. What do the 1015K and 1127K > numbers represent? I was originally guessing that they were basically > transactions per second, but that would seem to imply that the No XPS > case was better. Please clarify. > Yes, TPS and the numbers were switched... XPS case was better! Thanks for pointing that out. Tom > -Thanks > > -Bill -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Why don't we do this in the normal transmit processing. > There is already so much policy mechanism filters/actions/qdisc that > doing it in higher level is fighting against these. > Are you proposing that TX queue selection be done in the qdiscs? The queue has to be selected before taking the lock (cannot afford taking a lock over the whole interface). This would necessitate moving the locking and probably rearranging a lot of the xmit code around that. Tom -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mercredi 01 septembre 2010 à 08:41 -0700, Tom Herbert a écrit : > > Why don't we do this in the normal transmit processing. > > There is already so much policy mechanism filters/actions/qdisc that > > doing it in higher level is fighting against these. > > > Are you proposing that TX queue selection be done in the qdiscs? The > queue has to be selected before taking the lock (cannot afford taking > a lock over the whole interface). This would necessitate moving the > locking and probably rearranging a lot of the xmit code around that. Stephen point is not adding yet another layer 'before' qdisc layer. I would like something not as complex as your patch. 1) Why current selection fails ? 2) Could we change current selection to : - Use a lightweight selection, with no special configuration. - Use driver RX multiqueue information if available, in a one-to-one relationship. 3) Eventually have a user selectable selection (socket option, or system wide, but one sysctl, not many bitmasks ;) ). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Tom Herbert <therbert@google.com> Date: Wed, 1 Sep 2010 08:41:14 -0700 >> Why don't we do this in the normal transmit processing. >> There is already so much policy mechanism filters/actions/qdisc that >> doing it in higher level is fighting against these. >> > Are you proposing that TX queue selection be done in the qdiscs? The > queue has to be selected before taking the lock (cannot afford taking > a lock over the whole interface). This would necessitate moving the > locking and probably rearranging a lot of the xmit code around that. Right, we really have to pick queues before entering the qdisc for the full benefit of lock seperation. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mercredi 01 septembre 2010 à 08:41 -0700, Tom Herbert a écrit : >> > Why don't we do this in the normal transmit processing. >> > There is already so much policy mechanism filters/actions/qdisc that >> > doing it in higher level is fighting against these. >> > >> Are you proposing that TX queue selection be done in the qdiscs? The >> queue has to be selected before taking the lock (cannot afford taking >> a lock over the whole interface). This would necessitate moving the >> locking and probably rearranging a lot of the xmit code around that. > > Stephen point is not adding yet another layer 'before' qdisc layer. > > I would like something not as complex as your patch. > > 1) Why current selection fails ? > Current selection does a hash on 4-tuple to map packets to queues. So any CPU can send on any queue which leads to cache line bouncing of transmit structures. Also when sending from one CPU to a queue whose transmit interrupt is on a CPU in another cache domain cause more cache line bouncing with transmit completion. So while the current scheme nicely distributes load across the queues, it does nothing to promote locality. Getting some reasonable locality is where the benefits come from that we are demonstrating. > 2) Could we change current selection to : > > - Use a lightweight selection, with no special configuration. > > - Use driver RX multiqueue information if available, in a one-to-one > relationship. > Not generally. It's very possible that the only a subset of CPUs are getting RX interrupts in multiqueue (consider when #queues < #CPUs), so there's really not an obvious 1-1 relationship. But each CPU can send and should be mapped to at least one transmit queue; the most obvious plan would be to send in a queue in the same cache domain. > 3) Eventually have a user selectable selection (socket option, or system > wide, but one sysctl, not many bitmasks ;) ). > Right, but it would also be nice if a single sysctl could optimally set up multiqueue, RSS, RPS, and all my interrupt affinities for me ;-) Tom -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Tom Herbert <therbert@google.com> Date: Wed, 1 Sep 2010 09:24:18 -0700 > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> 3) Eventually have a user selectable selection (socket option, or system >> wide, but one sysctl, not many bitmasks ;) ). >> > Right, but it would also be nice if a single sysctl could optimally > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > ;-) It's becomming increasingly obvious to me that we need (somewhere, not necessarily the kernel) a complete datastructure representing the NUMA, cache, cpu, device hierarchy. And that can be used to tweak all of this stuff. The policy should probably be in userspace, we just need to provide the knobs in the kernel to tweak it however userspace wants. Userspace should be able to, for example, move a TX queue into a NUMA domain and have this invoke several side effects: 1) IRQs for that TX queue get rerouted to a cpu in the NUMA domain. 2) TX queue datastructures in the driver get reallocated using memory in that NUMA domain. 3) TX hashing is configured to use the set of cpus in the NUMA domain. It's alot of tedious work and involves some delicate tasks figuring out where each of these things go, but really then we'd solve all of this crap one and for all. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 01 Sep 2010 18:32:51 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <therbert@google.com> > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> 3) Eventually have a user selectable selection (socket option, or system > >> wide, but one sysctl, not many bitmasks ;) ). > >> > > Right, but it would also be nice if a single sysctl could optimally > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > ;-) > > It's becomming increasingly obvious to me that we need (somewhere, > not necessarily the kernel) a complete datastructure representing > the NUMA, cache, cpu, device hierarchy. > > And that can be used to tweak all of this stuff. > > The policy should probably be in userspace, we just need to provide > the knobs in the kernel to tweak it however userspace wants. > > Userspace should be able to, for example, move a TX queue into a > NUMA domain and have this invoke several side effects: > > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > domain. > > 2) TX queue datastructures in the driver get reallocated using > memory in that NUMA domain. > > 3) TX hashing is configured to use the set of cpus in the NUMA > domain. > > It's alot of tedious work and involves some delicate tasks figuring > out where each of these things go, but really then we'd solve all > of this crap one and for all. Plus it needs to work with scheduler (not fight it). All this doesn't work very well if process keeps bouncing away from its resources.
On Wed, 01 Sep 2010 18:32:51 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <therbert@google.com> > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> 3) Eventually have a user selectable selection (socket option, or system > >> wide, but one sysctl, not many bitmasks ;) ). > >> > > Right, but it would also be nice if a single sysctl could optimally > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > ;-) > > It's becomming increasingly obvious to me that we need (somewhere, > not necessarily the kernel) a complete datastructure representing > the NUMA, cache, cpu, device hierarchy. > > And that can be used to tweak all of this stuff. > > The policy should probably be in userspace, we just need to provide > the knobs in the kernel to tweak it however userspace wants. > > Userspace should be able to, for example, move a TX queue into a > NUMA domain and have this invoke several side effects: > > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > domain. > > 2) TX queue datastructures in the driver get reallocated using > memory in that NUMA domain. > > 3) TX hashing is configured to use the set of cpus in the NUMA > domain. > > It's alot of tedious work and involves some delicate tasks figuring > out where each of these things go, but really then we'd solve all > of this crap one and for all. Just to be contrarian :-) This same idea had started before when IBM proposed a user-space NUMA API. It never got any traction, the concept of "lets make the applications NUMA aware" never got accepted because it is so hard to do right and fragile that it was the wrong idea to start with. The only people that can manage it are the engineers tweeking a one off database benchmark. I would rather see a "good enough" policy in the kernel that works for everything from a single-core embedded system to a 100 core server environment. Forget the benchmarkers. The ideal solution should work with a mix of traffic and adapt. Today the application doesn't have to make a service level agreement with kernel everytime it opens a TCP socket. Doing it in userspace doesn't really help much. The API's keep changing and the focus fades (see irqbalance).
On Wed, Sep 01, 2010 at 06:56:27PM -0700, Stephen Hemminger wrote: > Just to be contrarian :-) This same idea had started before when IBM > proposed a user-space NUMA API. It never got any traction, the concept > of "lets make the applications NUMA aware" never got accepted because > it is so hard to do right and fragile that it was the wrong idea > to start with. The only people that can manage it are the engineers > tweeking a one off database benchmark. As an non-database user-space example, there are many applications which know about the typical 'first touch' locality policy for pages and use that to be NUMA-aware. Just about every OpenMP program ever written does that; it's even fairly portable among OSes. A second user-level example is MPI implementations such as OpenMPI. Those guys run 1 process per core and they don't need to move around, so getting process locked to a core and all the pages in the right place is a nice win without the MPI programmer doing anything. For kernel (but non-Ethernet) networking examples, HPC interconnects typically go out of their way to ensure locality of kernel pages related to a given core's workload. Examples include Myrinet's OpenMX+MPI and the InfiniPath InfiniBand adapater, whatever QLogic renamed it to this week (TrueScale, I suppose.) How can you get ~ 1 microsecond messages if you've got a buffer in the wrong place? Or achieve extremely high messaging rates when you're waiting for remote memory all the time? > I would rather see a "good enough" policy in the kernel that works > for everything from a single-core embedded system to a 100 core > server environment. I'd like a pony. Yes, it's challenging to directly aapply the above networking example to Ethernet networking, but there's a pony in there somewhere. -- greg -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: netdev-owner@vger.kernel.org [mailto:netdev- > owner@vger.kernel.org] On Behalf Of David Miller > It's becomming increasingly obvious to me that we need (somewhere, > not necessarily the kernel) a complete datastructure representing > the NUMA, cache, cpu, device hierarchy. > > And that can be used to tweak all of this stuff. > > The policy should probably be in userspace, we just need to provide > the knobs in the kernel to tweak it however userspace wants. > I agree. But only if we have all the knobs in the kernel. http://www.spinics.net/lists/linux-numa/msg00709.html I had to do some changes manually. Thanks to Lee for pointing out the dma-call. > Userspace should be able to, for example, move a TX queue into a > NUMA domain and have this invoke several side effects: > > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > domain. > > 2) TX queue datastructures in the driver get reallocated using > memory in that NUMA domain. > > 3) TX hashing is configured to use the set of cpus in the NUMA > domain. > > It's alot of tedious work and involves some delicate tasks figuring > out where each of these things go, but really then we'd solve all > of this crap one and for all. > -- The MSI-X+MQ combo is something that should be taken care off. We could come up with an automatic-node-binding shim in the kernel and then all the sub-systems can use that. Basically, the userland should query i) the adapter-capabilities ii)get the node-binding info iii)and then start affinitizing everything Some older msi-x (non-ethernet)adapters assume the msi-x info to remain static once the adapter is initialized. If we had an auto-node-binding shim then that would mean re-initializing the adapter,correct? Chetan Loke -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: netdev-owner@vger.kernel.org [mailto:netdev- > owner@vger.kernel.org] On Behalf Of Stephen Hemminger > Sent: September 01, 2010 9:48 PM > To: David Miller > Cc: therbert@google.com; eric.dumazet@gmail.com; netdev@vger.kernel.org > Subject: Re: [PATCH] xps-mq: Transmit Packet Steering for multiqueue > Plus it needs to work with scheduler (not fight it). All this doesn't > work very well if process keeps bouncing away from its resources. > userland folks who actually try to exploit the MQ/MSIX-ness will almost always pin down their high-prio[or subset of] threads/processes. Chetan Loke -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Just to be contrarian :-) This same idea had started before when IBM > proposed a user-space NUMA API. It never got any traction, the concept > of "lets make the applications NUMA aware" never got accepted because > it is so hard to do right and fragile that it was the wrong idea > to start with. The only people that can manage it are the engineers > tweeking a one off database benchmark. > If you design an appliance then this would be one of the knobs that people tinker with. To reap the actual benefits the whole stack(from adapter's f/w to the userland-thread) needs to be astro-aligned. Almost all the user-space guys that I've talked to never got it right because they never understood the concepts behind NUMA-IOH-Mempolicy-MSI-X etc. So it's a little difficult for them to get it right the first time. Also, Stoakley/Nehalem rigs got into mainstream ~2-3 years back? Before that only a handful of engineers could experiment because the mobo's were expensive. Plus, you didn't really have 10G fabric then. Chetan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> userland folks who actually try to exploit the MQ/MSIX-ness will almost > always pin down their high-prio[or subset of] threads/processes. > I don't really see that. Pinning is a last resort and in this context we could only do that on a dedicated server. On a shared server, with many different apps, pinning for MQ/MSIX is not an easy option; meeting scheduler constraints will be the first priority and its up to networking to work with the scheduler to to the right thing. Scheduler aware networking (or vice versa) is important. Tom -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Tom Herbert [mailto:therbert@google.com] > Sent: September 02, 2010 3:53 PM > To: Loke, Chetan > Cc: Stephen Hemminger; David Miller; eric.dumazet@gmail.com; > netdev@vger.kernel.org > Subject: Re: [PATCH] xps-mq: Transmit Packet Steering for multiqueue > > > userland folks who actually try to exploit the MQ/MSIX-ness will > almost > > always pin down their high-prio[or subset of] threads/processes. > > > I don't really see that. Pinning is a last resort and in this context > we could only do that on a dedicated server. On a shared server, with > many different apps, pinning for MQ/MSIX is not an easy option; > meeting scheduler constraints will be the first priority and its up to > networking to work with the scheduler to to the right thing. > Scheduler aware networking (or vice versa) is important. > For my use-case it's an appliance. Newer adapters might have like 64+(?) h/w queues and not just f/w emulated queues. With these many queues you could partition your threads/queues, no? It's easier to get started that way. All you need is a shim(or just a driver stub because you can then load it on any box that had older kernels) in the kernel that will tell you which queue-set(for a MQ-capable adapter) is still under the high-watermark. If all are full then it should just round-robin(across queues and nodes).So make a syscall(or shoot a mbx-cmd or pick your trick), find out which queue you could use, get the binding info and then launch your threads. So once you narrow down the scope, the scheduler will have less work to do. If the worker threads are short lived then there's no point in this binding. And for long-lived tasks, a couple of initial prep-calls will not hurt performance that much.And if you still care about syscalls at runtime then you could have a dedicated mgmt-thread that will receive async-events from the shim. And all other user-land logic could consult this mgmt-thread. > Tom Chetan Loke -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote: > From: Tom Herbert <therbert@google.com> > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> 3) Eventually have a user selectable selection (socket option, or system > >> wide, but one sysctl, not many bitmasks ;) ). > >> > > Right, but it would also be nice if a single sysctl could optimally > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > ;-) > > It's becomming increasingly obvious to me that we need (somewhere, > not necessarily the kernel) a complete datastructure representing > the NUMA, cache, cpu, device hierarchy. And ideally a cheap way (not O(N^2)) to find the distance between 2 CPU threads (not just nodes). > And that can be used to tweak all of this stuff. > > The policy should probably be in userspace, we just need to provide > the knobs in the kernel to tweak it however userspace wants. > > Userspace should be able to, for example, move a TX queue into a > NUMA domain and have this invoke several side effects: > > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > domain. > > 2) TX queue datastructures in the driver get reallocated using > memory in that NUMA domain. I've actually done some work on an interface and implementation of this, although I didn't include actually setting the IRQ affinity as there has been pushback whenever people propose letting drivers set this. If they only do so as directed by the administrator this might be more acceptable though. Unfortunately in my limited testing on a 2-node system I didn't see a whole lot of improvement in performance when the affinities were all lined up. I should try to get some time on a 4-node system. > 3) TX hashing is configured to use the set of cpus in the NUMA > domain. > > It's alot of tedious work and involves some delicate tasks figuring > out where each of these things go, but really then we'd solve all > of this crap one and for all. Right. The other thing I've been working on lately which sort of ties into this is hardware acceleration of Receive Flow Steering. Multiqueue NICs such as ours tend to have RX flow filters as well as hashing. So why not use those to do a first level of steering? We're going to do some more internal testing and review but I hope to send out a first version of this next week. Ben.
On Thu, Sep 16, 2010 at 10:52:41PM +0100, Ben Hutchings wrote: > On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote: > > From: Tom Herbert <therbert@google.com> > > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > >> 3) Eventually have a user selectable selection (socket option, or system > > >> wide, but one sysctl, not many bitmasks ;) ). > > >> > > > Right, but it would also be nice if a single sysctl could optimally > > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > > ;-) > > > > It's becomming increasingly obvious to me that we need (somewhere, > > not necessarily the kernel) a complete datastructure representing > > the NUMA, cache, cpu, device hierarchy. > > And ideally a cheap way (not O(N^2)) to find the distance between 2 CPU > threads (not just nodes). > > > And that can be used to tweak all of this stuff. > > > > The policy should probably be in userspace, we just need to provide > > the knobs in the kernel to tweak it however userspace wants. > > > > Userspace should be able to, for example, move a TX queue into a > > NUMA domain and have this invoke several side effects: > > > > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > > domain. > > > > 2) TX queue datastructures in the driver get reallocated using > > memory in that NUMA domain. > > I've actually done some work on an interface and implementation of this, > although I didn't include actually setting the IRQ affinity as there has > been pushback whenever people propose letting drivers set this. If they > only do so as directed by the administrator this might be more > acceptable though. > > Unfortunately in my limited testing on a 2-node system I didn't see a > whole lot of improvement in performance when the affinities were all > lined up. I should try to get some time on a 4-node system. I've been trying to look into this as well. It'd be very interesting to see the patches even if they don't show good performance. Could you post them? > > 3) TX hashing is configured to use the set of cpus in the NUMA > > domain. > > > > It's alot of tedious work and involves some delicate tasks figuring > > out where each of these things go, but really then we'd solve all > > of this crap one and for all. > > Right. > > The other thing I've been working on lately which sort of ties into this > is hardware acceleration of Receive Flow Steering. Multiqueue NICs such > as ours tend to have RX flow filters as well as hashing. So why not use > those to do a first level of steering? We're going to do some more > internal testing and review but I hope to send out a first version of > this next week. > > Ben. > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote: > From: Tom Herbert <therbert@google.com> > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> 3) Eventually have a user selectable selection (socket option, or system > >> wide, but one sysctl, not many bitmasks ;) ). > >> > > Right, but it would also be nice if a single sysctl could optimally > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > ;-) > > It's becomming increasingly obvious to me that we need (somewhere, > not necessarily the kernel) a complete datastructure representing > the NUMA, cache, cpu, device hierarchy. > > And that can be used to tweak all of this stuff. > > The policy should probably be in userspace, we just need to provide > the knobs in the kernel to tweak it however userspace wants. > > Userspace should be able to, for example, move a TX queue into a > NUMA domain and have this invoke several side effects: I think most of the pieces are now ready: > 1) IRQs for that TX queue get rerouted to a cpu in the NUMA > domain. There is a longstanding procfs interface for IRQ affinity, and userland infrastructure built on it. Adding a new interface would be contentious and I have tried to build on it instead. > 2) TX queue datastructures in the driver get reallocated using > memory in that NUMA domain. I've previously sent patches to add an ethtool API for NUMA control, which include the option to allocate on the same node where IRQs are handled. However, there is currently no function to allocate DMA-coherent memory on a specified NUMA node (rather than the device's node). This is likely to be beneficial for event rings and might be good for descriptor rings for some devices. (The implementation I sent for sfc mistakenly switched it to allocating non-coherent memory, for which it *is* possible to specify the node.) > 3) TX hashing is configured to use the set of cpus in the NUMA > domain. I posted patches for automatic XPS configuration at the end of last week. And RFS acceleration covers the other direction. Ben. > It's alot of tedious work and involves some delicate tasks figuring > out where each of these things go, but really then we'd solve all > of this crap one and for all.
On Mon, 2011-02-21 at 18:19 +0000, Ben Hutchings wrote: > On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote: > > From: Tom Herbert <therbert@google.com> > > Date: Wed, 1 Sep 2010 09:24:18 -0700 > > > > > On Wed, Sep 1, 2010 at 8:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > >> 3) Eventually have a user selectable selection (socket option, or system > > >> wide, but one sysctl, not many bitmasks ;) ). > > >> > > > Right, but it would also be nice if a single sysctl could optimally > > > set up multiqueue, RSS, RPS, and all my interrupt affinities for me > > > ;-) Are cgroups the right place to have the network stack pull "guidance" from ? If an app is bound to a socket/core/NUMA node; then the network stack could inherit the cgroup's tuning and adjust kernel knobs accordingly. This would not replace procfs tuning, but it seems like a natural extension of how cgroups and RPS/RFS/XPS could be integrated to ease the management burden that these new technologies might impose. --jer
From: Ben Hutchings <bhutchings@solarflare.com> Date: Mon, 21 Feb 2011 18:19:55 +0000 > On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote: >> 2) TX queue datastructures in the driver get reallocated using >> memory in that NUMA domain. > > I've previously sent patches to add an ethtool API for NUMA control, > which include the option to allocate on the same node where IRQs are > handled. However, there is currently no function to allocate > DMA-coherent memory on a specified NUMA node (rather than the device's > node). This is likely to be beneficial for event rings and might be > good for descriptor rings for some devices. (The implementation I sent > for sfc mistakenly switched it to allocating non-coherent memory, for > which it *is* possible to specify the node.) The thing to do is to work with someone like FUJITA Tomonori on this. It's simply a matter of making new APIs that take the node specifier, have the implementations either make use of or completely ignore the node, and have the existing APIs pass in "-1" for the node or whatever the CPP macro is for this :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 46c36ff..0ff6c9f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -497,6 +497,12 @@ struct netdev_queue { struct Qdisc *qdisc; unsigned long state; struct Qdisc *qdisc_sleeping; +#ifdef CONFIG_RPS + struct kobject kobj; + struct netdev_queue *first; + atomic_t count; +#endif + /* * write mostly part */ @@ -524,6 +530,22 @@ struct rps_map { #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16))) /* + * This structure holds an XPS map which can be of variable length. queues + * is an array of num_possible_cpus entries, where each entry is a mask of + * queues for that CPU (up to num_tx_queues bits for device). + */ +struct xps_map { + struct rcu_head rcu; + unsigned long queues[0]; +}; + +#define QUEUE_MASK_SIZE(dev) (BITS_TO_LONGS(dev->num_tx_queues)) +#define XPS_MAP_SIZE(dev) (sizeof(struct xps_map) + (num_possible_cpus() * \ + QUEUE_MASK_SIZE(dev) * sizeof(unsigned long))) +#define XPS_ENTRY(map, offset, dev) \ + (&map->queues[offset * QUEUE_MASK_SIZE(dev)]) + +/* * The rps_dev_flow structure contains the mapping of a flow to a CPU and the * tail pointer for that CPU's input queue at the time of last enqueue. */ @@ -978,6 +1000,9 @@ struct net_device { void *rx_handler_data; struct netdev_queue *_tx ____cacheline_aligned_in_smp; +#ifdef CONFIG_RPS + struct xps_map *xps_maps; +#endif /* Number of TX queues allocated at alloc_netdev_mq() time */ unsigned int num_tx_queues; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f067c95..146df6f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -381,6 +381,7 @@ struct sk_buff { #else __u8 deliver_no_wcard:1; #endif + __u8 ooo_okay:1; kmemcheck_bitfield_end(flags2); /* 0/14 bit hole */ diff --git a/net/core/dev.c b/net/core/dev.c index da584f5..d23f9c4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2054,6 +2054,60 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index) return queue_index; } +static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb, + int queue_index) +{ + struct xps_map *maps; + int cpu = smp_processor_id(); + u32 hash; + unsigned long *queues; + int weight, select; + + rcu_read_lock(); + maps = rcu_dereference(dev->xps_maps); + + if (!maps) { + rcu_read_unlock(); + return queue_index; + } + + queues = XPS_ENTRY(maps, cpu, dev); + + if (queue_index >= 0) { + if (test_bit(queue_index, queues)) { + rcu_read_unlock(); + return queue_index; + } + } + + weight = bitmap_weight(queues, dev->real_num_tx_queues); + switch (weight) { + case 0: + break; + case 1: + queue_index = + find_first_bit(queues, dev->real_num_tx_queues); + break; + default: + if (skb->sk && skb->sk->sk_hash) + hash = skb->sk->sk_hash; + else + hash = (__force u16) skb->protocol ^ skb->rxhash; + hash = jhash_1word(hash, hashrnd); + + select = ((u64) hash * weight) >> 32; + queue_index = + find_first_bit(queues, dev->real_num_tx_queues); + while (select--) + queue_index = find_next_bit(queues, + dev->real_num_tx_queues, queue_index); + break; + } + + rcu_read_unlock(); + return queue_index; +} + static struct netdev_queue *dev_pick_tx(struct net_device *dev, struct sk_buff *skb) { @@ -2061,23 +2115,30 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev, struct sock *sk = skb->sk; queue_index = sk_tx_queue_get(sk); - if (queue_index < 0) { + + if (queue_index < 0 || (skb->ooo_okay && dev->real_num_tx_queues > 1)) { const struct net_device_ops *ops = dev->netdev_ops; + int old_index = queue_index; if (ops->ndo_select_queue) { queue_index = ops->ndo_select_queue(dev, skb); queue_index = dev_cap_txqueue(dev, queue_index); } else { - queue_index = 0; - if (dev->real_num_tx_queues > 1) - queue_index = skb_tx_hash(dev, skb); + if (dev->real_num_tx_queues > 1) { + queue_index = get_xps_queue(dev, + skb, queue_index); + if (queue_index < 0) + queue_index = skb_tx_hash(dev, skb); + } else + queue_index = 0; + } - if (sk) { - struct dst_entry *dst = rcu_dereference_check(sk->sk_dst_cache, 1); + if ((queue_index != old_index) && sk) { + struct dst_entry *dst = + rcu_dereference_check(sk->sk_dst_cache, 1); - if (dst && skb_dst(skb) == dst) - sk_tx_queue_set(sk, queue_index); - } + if (dst && skb_dst(skb) == dst) + sk_tx_queue_set(sk, queue_index); } } @@ -5429,6 +5490,15 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, } #ifdef CONFIG_RPS + atomic_set(&tx->count, queue_count); + + /* + * Set a pointer to first element in the array which holds the + * reference count. + */ + for (i = 0; i < queue_count; i++) + tx[i].first = tx; + rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL); if (!rx) { printk(KERN_ERR "alloc_netdev: Unable to allocate " @@ -5506,7 +5576,9 @@ void free_netdev(struct net_device *dev) release_net(dev_net(dev)); +#ifndef CONFIG_RPS kfree(dev->_tx); +#endif /* Flush device addresses */ dev_addr_flush(dev); diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index af4dfba..661c481 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -742,34 +742,295 @@ static int rx_queue_add_kobject(struct net_device *net, int index) return error; } -static int rx_queue_register_kobjects(struct net_device *net) +/* + * netdev_queue sysfs structures and functions. + */ +struct netdev_queue_attribute { + struct attribute attr; + ssize_t (*show)(struct netdev_queue *queue, + struct netdev_queue_attribute *attr, char *buf); + ssize_t (*store)(struct netdev_queue *queue, + struct netdev_queue_attribute *attr, const char *buf, size_t len); +}; +#define to_netdev_queue_attr(_attr) container_of(_attr, \ + struct netdev_queue_attribute, attr) + +#define to_netdev_queue(obj) container_of(obj, struct netdev_queue, kobj) + +static ssize_t netdev_queue_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr); + struct netdev_queue *queue = to_netdev_queue(kobj); + + if (!attribute->show) + return -EIO; + + return attribute->show(queue, attribute, buf); +} + +static ssize_t netdev_queue_attr_store(struct kobject *kobj, + struct attribute *attr, + const char *buf, size_t count) +{ + struct netdev_queue_attribute *attribute = to_netdev_queue_attr(attr); + struct netdev_queue *queue = to_netdev_queue(kobj); + + if (!attribute->store) + return -EIO; + + return attribute->store(queue, attribute, buf, count); +} + +static struct sysfs_ops netdev_queue_sysfs_ops = { + .show = netdev_queue_attr_show, + .store = netdev_queue_attr_store, +}; + +static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue) { + struct net_device *dev = queue->dev; + int i; + + for (i = 0; i < dev->num_tx_queues; i++) + if (queue == &dev->_tx[i]) + break; + + BUG_ON(i >= dev->num_tx_queues); + + return i; +} + +static ssize_t show_xps_map(struct netdev_queue *queue, + struct netdev_queue_attribute *attribute, char *buf) +{ + struct net_device *dev = queue->dev; + struct xps_map *maps; + cpumask_var_t mask; + unsigned long *qmask, index; + size_t len = 0; int i; + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); + + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + rcu_read_lock(); + maps = rcu_dereference(dev->xps_maps); + if (maps) { + qmask = maps->queues; + for (i = 0; i < num_possible_cpus(); i++) { + if (test_bit(index, qmask)) + cpumask_set_cpu(i, mask); + qmask += qmask_size; + } + } + len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask); + if (PAGE_SIZE - len < 3) { + rcu_read_unlock(); + free_cpumask_var(mask); + return -EINVAL; + } + rcu_read_unlock(); + + free_cpumask_var(mask); + len += sprintf(buf + len, "\n"); + return len; +} + +static void xps_map_release(struct rcu_head *rcu) +{ + struct xps_map *map = container_of(rcu, struct xps_map, rcu); + + kfree(map); +} + +static DEFINE_MUTEX(xps_map_lock); + +static ssize_t store_xps_map(struct netdev_queue *queue, + struct netdev_queue_attribute *attribute, + const char *buf, size_t len) +{ + struct net_device *dev = queue->dev; + struct xps_map *maps; + cpumask_var_t mask; + int err, i, nonempty = 0; + unsigned long *qmask, index; + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (!alloc_cpumask_var(&mask, GFP_KERNEL)) + return -ENOMEM; + + err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits); + if (err) { + free_cpumask_var(mask); + return err; + } + + mutex_lock(&xps_map_lock); + + maps = dev->xps_maps; + if (!maps) { + if (!cpumask_weight(mask)) { + mutex_unlock(&xps_map_lock); + free_cpumask_var(mask); + return 0; + } + maps = kzalloc(XPS_MAP_SIZE(dev), GFP_KERNEL); + if (!maps) { + mutex_unlock(&xps_map_lock); + free_cpumask_var(mask); + return -ENOMEM; + } + rcu_assign_pointer(dev->xps_maps, maps); + } + + index = get_netdev_queue_index(queue); + + qmask = maps->queues; + for (i = 0; i < num_possible_cpus(); i++) { + if (cpu_isset(i, *mask) && cpu_online(i)) { + set_bit(index, qmask); + nonempty = 1; + } else + clear_bit(index, qmask); + if (!nonempty && + bitmap_weight(qmask, dev->real_num_tx_queues)) + nonempty = 1; + qmask += qmask_size; + } + + if (!nonempty) { + rcu_assign_pointer(dev->xps_maps, NULL); + call_rcu(&maps->rcu, xps_map_release); + } + + mutex_unlock(&xps_map_lock); + + free_cpumask_var(mask); + return len; +} + +static struct netdev_queue_attribute xps_cpus_attribute = + __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map); + +static struct attribute *netdev_queue_default_attrs[] = { + &xps_cpus_attribute.attr, + NULL +}; + +static void netdev_queue_release(struct kobject *kobj) +{ + struct netdev_queue *queue = to_netdev_queue(kobj); + struct net_device *dev = queue->dev; + struct netdev_queue *first = queue->first; + struct xps_map *maps; + unsigned long *qmask, index; + int i, nonempty = 0; + unsigned int qmask_size = QUEUE_MASK_SIZE(dev); + + index = get_netdev_queue_index(queue); + + mutex_lock(&xps_map_lock); + + maps = dev->xps_maps; + + if (maps) { + qmask = maps->queues; + for (i = 0; i < num_possible_cpus(); i++) { + clear_bit(index, qmask); + if (!nonempty && + bitmap_weight(qmask, dev->real_num_tx_queues)) + nonempty = 1; + qmask += qmask_size; + } + + if (!nonempty) { + rcu_assign_pointer(dev->xps_maps, NULL); + call_rcu(&maps->rcu, xps_map_release); + } + } + mutex_unlock(&xps_map_lock); + + if (atomic_dec_and_test(&first->count)) { + kfree(first); + dev_put(dev); + } +} + +static struct kobj_type netdev_queue_ktype = { + .sysfs_ops = &netdev_queue_sysfs_ops, + .release = netdev_queue_release, + .default_attrs = netdev_queue_default_attrs, +}; + +static int netdev_queue_add_kobject(struct net_device *net, int index) +{ + struct netdev_queue *queue = net->_tx + index; + struct kobject *kobj = &queue->kobj; + int error = 0; + + kobj->kset = net->queues_kset; + error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL, + "tx-%u", index); + if (error) { + kobject_put(kobj); + return error; + } + + kobject_uevent(kobj, KOBJ_ADD); + + return error; +} + +static int register_queue_kobjects(struct net_device *net) +{ + int rx = 0, tx = 0; int error = 0; net->queues_kset = kset_create_and_add("queues", NULL, &net->dev.kobj); if (!net->queues_kset) return -ENOMEM; - for (i = 0; i < net->num_rx_queues; i++) { - error = rx_queue_add_kobject(net, i); + + for (rx = 0; rx < net->num_rx_queues; rx++) { + error = rx_queue_add_kobject(net, rx); if (error) - break; + goto error; } - if (error) - while (--i >= 0) - kobject_put(&net->_rx[i].kobj); + for (tx = 0; tx < net->num_tx_queues; tx++) { + error = netdev_queue_add_kobject(net, tx); + if (error) + goto error; + } + dev_hold(net); + + return error; + +error: + while (--rx >= 0) + kobject_put(&net->_rx[rx].kobj); + + while (--tx >= 0) + kobject_put(&net->_tx[tx].kobj); return error; } -static void rx_queue_remove_kobjects(struct net_device *net) +static void remove_queue_kobjects(struct net_device *net) { int i; for (i = 0; i < net->num_rx_queues; i++) kobject_put(&net->_rx[i].kobj); + for (i = 0; i < net->num_tx_queues; i++) + kobject_put(&net->_tx[i].kobj); kset_unregister(net->queues_kset); } #endif /* CONFIG_RPS */ @@ -871,7 +1132,7 @@ void netdev_unregister_kobject(struct net_device * net) kobject_get(&dev->kobj); #ifdef CONFIG_RPS - rx_queue_remove_kobjects(net); + remove_queue_kobjects(net); #endif device_del(dev); @@ -912,7 +1173,7 @@ int netdev_register_kobject(struct net_device *net) return error; #ifdef CONFIG_RPS - error = rx_queue_register_kobjects(net); + error = register_queue_kobjects(net); if (error) { device_del(dev); return error; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index de3bd84..80c1928 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -828,8 +828,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, &md5); tcp_header_size = tcp_options_size + sizeof(struct tcphdr); - if (tcp_packets_in_flight(tp) == 0) + if (tcp_packets_in_flight(tp) == 0) { tcp_ca_event(sk, CA_EVENT_TX_START); + skb->ooo_okay = 1; + } skb_push(skb, tcp_header_size); skb_reset_transport_header(skb);
This patch implements transmit packet steering (XPS) for multiqueue devices. XPS selects a transmit queue during packet transmission based on configuration. This is done by mapping the CPU transmitting the packet to a queue. This is the transmit side analogue to RPS-- where RPS is selecting a CPU based on receive queue, XPS selects a queue based on the CPU (previously there was an XPS patch from Eric Dumazet, but that might more appropriately be called transmit completion steering). Each transmit queue can be associated with a number of CPUs which will used the queue to send packets. This is configured as a CPU mask on a per queue basis in: /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus The mappings are stored per device in an inverted data structure that maps CPUs to queues. In the netdevice structure this is an array of num_possible_cpu structures where each array entry contains a bit map of queues which that CPU can use. We also allow the mapping of a socket to queue to be modified, for instance if a thread is scheduled on a different CPU the desired queue for transmitting packets would likely change. To maintain in order packet transmission a flag (ooo_okay) has been added to the sk_buf structure. If a transport layer sets this flag on a packet, the transmit queue can be changed for this socket. Presumably, the transport would set this is there was no possbility of creating ooo packets (for instance there are no packets in flight for the socket). This patch includes the modification in TCP output for setting this flag. The benefits of XPS are improved locality in the per queue data strutures. Also, transmit completions are more likely to be done nearer to the sending thread so this should promote locality back to the socket (e.g. UDP). The benefits of XPS are dependent on cache hierarchy, application load, and other factors. XPS would nominally be configured so that a queue would only be shared by CPUs which are sharing a cache, the degenerative configuration woud be that each CPU has it's own queue. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. bnx2x on 16 core AMD XPS (16 queues, 1 TX queue per CPU) 1015K at 99% CPU No XPS (16 queues) 1127K at 98% CPU Signed-off-by: Tom Herbert <therbert@google.com> --- -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html