Message ID: 1415235320.13896.51.camel@edumazet-glaptop2.roam.corp.google.com
State: Superseded, archived
Delegated to: David Miller
On 11/05/2014 04:55 PM, Eric Dumazet wrote:
> Tested:
> Ran 200 netperf TCP_STREAM from A to B (10GbE link, 8 RX queues)
>
> Without this feature, we send back about 305,000 ACKs per second.
>
> GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
>
> Setting a timer of 2000 nsec is enough to increase GRO packet sizes
> and reduce the number of ACK packets. (811/19.2 = 42)
>
> The receiver performs fewer calls to upper stacks and fewer wakeups.
> This also reduces cpu usage on the sender, as it receives fewer ACK
> packets.
>
> Note that reducing the number of wakeups increases cpu efficiency, but
> can decrease QPS, as applications won't have the chance to warm up cpu
> caches doing a partial read of RPC requests/answers if they fit in one
> skb.

Speaking of QPS, what happens to 200 TCP_RR tests when the feature is
enabled?

rick jones
On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:
> Speaking of QPS, what happens to 200 TCP_RR tests when the feature is
> enabled?

Nothing at all (just the usual noise, I guess).

The 200 TCP_RR flows send packets with 1 byte of payload and the Push
flag set, so no packet ever sits in napi->gro_list.

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20
3.13827e+06

real	0m32.170s
user	0m32.885s
sys	7m38.868s

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20
3.19013e+06

real	0m37.152s
user	0m33.477s
sys	7m30.586s

Now let's try TCP_RR with -- -r 4000,4000 ;)

Reducing the number of ACK packets lets us use more of the 10GbE
bandwidth for payload, so QPS actually increases.

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20 -- -r 4000,4000
379645

real	0m32.201s
user	0m4.390s
sys	0m59.501s

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20 -- -r 4000,4000
400610

real	0m37.159s
user	0m4.501s
sys	0m59.665s
On Wed, 2014-11-05 at 18:14 -0800, Eric Dumazet wrote:
> On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:
>
> > Speaking of QPS, what happens to 200 TCP_RR tests when the feature is
> > enabled?

The possible reduction of QPS happens when you have a single flow, like
TCP_RR -- -r 40000,40000

(Because we then have one single TCP packet with 40000 bytes of payload,
the application is woken up only once, when the Push flag is received.)

So cpu efficiency is way better, but the application has to copy 40000
bytes in one go _after_ the Push flag, instead of being able to copy
part of the data _before_ receiving the Push flag.

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# ./netperf -H lpaa6 -t TCP_RR -l 20 -Cc -- -r 40000,40000
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa6.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  40000   40000  20.00   9023.86  2.02   1.70   107.513 90.561
16384  87380

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# ./netperf -H lpaa6 -t TCP_RR -l 20 -Cc -- -r 40000,40000
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa6.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  40000   40000  20.00   8651.26  0.66   1.02   36.502  56.710
16384  87380
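To make the wakeup pattern concrete, here is a hedged userspace sketch
(not part of the patch; read_response() and its callers are
illustrative names):

#include <sys/types.h>
#include <sys/socket.h>

/* A typical RPC receive loop on the receiver side.  Without
 * gro_flush_timeout, a 40000-byte response tends to arrive as several
 * GRO packets, so recv() returns several times and each copy runs
 * while the data is still warm in cache.  With a 2000 nsec timeout,
 * the response may coalesce into a single skb, giving one wakeup and
 * one large copy after the Push flag.
 */
static int read_response(int fd, char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = recv(fd, buf + off, len - off, 0);

		if (n <= 0)
			return -1;	/* error or EOF */
		off += n;
	}
	return 0;
}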
On 11/05/2014 06:39 PM, Eric Dumazet wrote:
> On Wed, 2014-11-05 at 18:14 -0800, Eric Dumazet wrote:
>> On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:
>>
>>> Speaking of QPS, what happens to 200 TCP_RR tests when the feature is
>>> enabled?
>
> The possible reduction of QPS happens when you have a single flow, like
> TCP_RR -- -r 40000,40000
>
> (Because we then have one single TCP packet with 40000 bytes of
> payload, the application is woken up only once, when the Push flag is
> received.)
>
> So cpu efficiency is way better, but the application has to copy 40000
> bytes in one go _after_ the Push flag, instead of being able to copy
> part of the data _before_ receiving the Push flag.

Thanks.  That isn't too unlike what I've seen happen in the past with,
say, an 8K request size and switching back and forth between a 1500 and
9000 byte MTU.

happy benchmarking,

rick
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 Nov 2014 16:55:20 -0800

> @@ -4430,8 +4432,19 @@ void napi_complete(struct napi_struct *n)
>  	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
>  		return;
>
> -	napi_gro_flush(n, false);
> +	if (n->gro_list) {
> +		unsigned long timeout = 0;
> +
> +		if (n->napi_rx_count)
> +			timeout = n->dev->gro_flush_timeout;

Under what circumstances would we see n->gro_list non-NULL yet
n->napi_rx_count == 0?

I'm not so sure it can happen.

Said another way, it looks to me like you could implement this using
less state.
On Thu, 2014-11-06 at 16:25 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 05 Nov 2014 16:55:20 -0800
>
> > @@ -4430,8 +4432,19 @@ void napi_complete(struct napi_struct *n)
> >  	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
> >  		return;
> >
> > -	napi_gro_flush(n, false);
> > +	if (n->gro_list) {
> > +		unsigned long timeout = 0;
> > +
> > +		if (n->napi_rx_count)
> > +			timeout = n->dev->gro_flush_timeout;
>
> Under what circumstances would we see n->gro_list non-NULL yet
> n->napi_rx_count == 0?
>
> I'm not so sure it can happen.
>
> Said another way, it looks to me like you could implement this
> using less state.

My goal was to not change any driver, keeping the change generic.

Drivers call napi_complete() in their rx napi handler without giving us
the 'work_done' value, which tells us whether a packet was processed.

So I added a counter that is incremented for every packet given to the
GRO engine (napi_rx_count), so that napi_complete() has a clue whether a
packet was received in _this_ NAPI run.

If at least one packet was received (and we still have packets in
gro_list) -> we re-arm the 'timer'.
If not, we flush the packets held in the GRO engine.

In order to avoid this state, I would have to add a new method, like
napi_complete_done(napi, work_done), and change drivers. I am not sure
it's worth the effort?
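For reference, a minimal sketch of what such a napi_complete_done()
could look like, reusing the napi_complete() logic from this patch; the
name comes from the message above, but the body here is illustrative,
not a submitted implementation:

/* Sketch only: the driver-supplied work_done replaces the
 * napi_rx_count counter maintained in dev_gro_receive().
 */
void napi_complete_done(struct napi_struct *n, int work_done)
{
	unsigned long flags;

	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
		return;

	if (n->gro_list) {
		unsigned long timeout = 0;

		/* Only defer the flush if this poll actually saw traffic. */
		if (work_done)
			timeout = n->dev->gro_flush_timeout;
		if (timeout)
			hrtimer_start(&n->timer, ns_to_ktime(timeout),
				      HRTIMER_MODE_REL_PINNED);
		else
			napi_gro_flush(n, false);
	}
	if (likely(list_empty(&n->poll_list))) {
		WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
	} else {
		local_irq_save(flags);
		__napi_complete(n);
		local_irq_restore(flags);
	}
}

/* Unconverted drivers could keep calling napi_complete(), which would
 * conservatively assume at least one packet was processed:
 */
static inline void napi_complete(struct napi_struct *n)
{
	napi_complete_done(n, 1);
}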
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Nov 2014 14:11:20 -0800

> My goal was to not change any driver, keeping the change generic.
>
> Drivers call napi_complete() in their rx napi handler without giving us
> the 'work_done' value, which tells us whether a packet was processed.
>
> So I added a counter that is incremented for every packet given to the
> GRO engine (napi_rx_count), so that napi_complete() has a clue whether
> a packet was received in _this_ NAPI run.
>
> If at least one packet was received (and we still have packets in
> gro_list) -> we re-arm the 'timer'.
> If not, we flush the packets held in the GRO engine.
>
> In order to avoid this state, I would have to add a new method, like
> napi_complete_done(napi, work_done), and change drivers. I am not sure
> it's worth the effort?

I think for such a critical path in the kernel it's worth it to avoid
these increments for every packet, just to compute a value that's
sitting in a register already in the driver's poll routine.

Eric, you've been trimming cpu IRQ disables from these exact code
paths, you should know better than me :-)

I'm willing to do some of the monkey work of converting as many
drivers as can be trivially done, if you want.  Almost all of the
ones I looked at have the work_done variable right there at the
napi_complete() call site.

The rest can stay unconverted and not get access to this new facility.
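The conversion David describes is typically a one-line change in a
driver's poll routine. A hedged sketch, assuming the
napi_complete_done() signature discussed above; foo_poll(),
foo_clean_rx_irq() and foo_enable_irqs() are hypothetical stand-ins for
whatever a given driver actually uses:

/* Sketch of a typical driver rx poll routine after conversion.
 * work_done is already computed here, so passing it along costs
 * nothing extra on the hot path.
 */
static int foo_poll(struct napi_struct *napi, int budget)
{
	int work_done = foo_clean_rx_irq(napi, budget);

	if (work_done < budget) {
		/* was: napi_complete(napi); */
		napi_complete_done(napi, work_done);
		foo_enable_irqs(napi);
	}
	return work_done;
}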
On Thu, 2014-11-06 at 22:36 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 06 Nov 2014 14:11:20 -0800
>
> > My goal was to not change any driver, keeping the change generic.
> >
> > Drivers call napi_complete() in their rx napi handler without giving
> > us the 'work_done' value, which tells us whether a packet was
> > processed.
> >
> > So I added a counter that is incremented for every packet given to
> > the GRO engine (napi_rx_count), so that napi_complete() has a clue
> > whether a packet was received in _this_ NAPI run.
> >
> > If at least one packet was received (and we still have packets in
> > gro_list) -> we re-arm the 'timer'.
> > If not, we flush the packets held in the GRO engine.
> >
> > In order to avoid this state, I would have to add a new method, like
> > napi_complete_done(napi, work_done), and change drivers. I am not
> > sure it's worth the effort?
>
> I think for such a critical path in the kernel it's worth it to avoid
> these increments for every packet, just to compute a value that's
> sitting in a register already in the driver's poll routine.
>
> Eric, you've been trimming cpu IRQ disables from these exact code
> paths, you should know better than me :-)
>
> I'm willing to do some of the monkey work of converting as many
> drivers as can be trivially done, if you want.  Almost all of the
> ones I looked at have the work_done variable right there at the
> napi_complete() call site.
>
> The rest can stay unconverted and not get access to this new facility.

This is your call. I actually did this in my first implementation,
because I was not using a timer, but rescheduling NAPI without
re-enabling interrupts:

http://www.spinics.net/lists/netdev/msg302474.html

(This is why I had to cook "d75b1ade567 net: less interrupt masking in
NAPI")

Given the cost of GRO processing, setting a bool (which is all I
actually need) is really pure noise.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4767f546d7c0..8474fcfadc7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -314,6 +314,8 @@ struct napi_struct {
 	struct net_device	*dev;
 	struct sk_buff		*gro_list;
 	struct sk_buff		*skb;
+	unsigned long		napi_rx_count;
+	struct hrtimer		timer;
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
@@ -485,14 +487,7 @@ void napi_hash_del(struct napi_struct *napi);
  *	Stop NAPI from being scheduled on this context.
  *	Waits till any outstanding processing completes.
  */
-static inline void napi_disable(struct napi_struct *n)
-{
-	might_sleep();
-	set_bit(NAPI_STATE_DISABLE, &n->state);
-	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
-		msleep(1);
-	clear_bit(NAPI_STATE_DISABLE, &n->state);
-}
+void napi_disable(struct napi_struct *n);
 
 /**
  *	napi_enable - enable NAPI scheduling
@@ -1603,6 +1598,7 @@ struct net_device {
 #endif
 
+	unsigned long		gro_flush_timeout;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 40be481268de..c88651bd8ada 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -133,6 +133,7 @@
 #include <linux/vmalloc.h>
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
+#include <linux/hrtimer.h>
 
 #include "net-sysfs.h"
 
@@ -4000,6 +4001,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	if (skb_is_gso(skb) || skb_has_frag_list(skb) || skb->csum_bad)
 		goto normal;
 
+	napi->napi_rx_count++;
+
 	gro_list_prepare(napi, skb);
 
 	rcu_read_lock();
@@ -4411,7 +4414,6 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 void __napi_complete(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	BUG_ON(n->gro_list);
 
 	list_del_init(&n->poll_list);
 	smp_mb__before_atomic();
@@ -4430,8 +4432,19 @@ void napi_complete(struct napi_struct *n)
 	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
 		return;
 
-	napi_gro_flush(n, false);
+	if (n->gro_list) {
+		unsigned long timeout = 0;
+
+		if (n->napi_rx_count)
+			timeout = n->dev->gro_flush_timeout;
+		if (timeout)
+			hrtimer_start(&n->timer, ns_to_ktime(timeout),
+				      HRTIMER_MODE_REL_PINNED);
+		else
+			napi_gro_flush(n, false);
+	}
+	n->napi_rx_count = 0;
 	if (likely(list_empty(&n->poll_list))) {
 		WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
 	} else {
@@ -4495,10 +4508,23 @@ void napi_hash_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL_GPL(napi_hash_del);
 
+static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
+{
+	struct napi_struct *napi;
+
+	napi = container_of(timer, struct napi_struct, timer);
+	if (napi->gro_list)
+		napi_schedule(napi);
+
+	return HRTIMER_NORESTART;
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
 	INIT_LIST_HEAD(&napi->poll_list);
+	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	napi->timer.function = napi_watchdog;
 	napi->gro_count = 0;
 	napi->gro_list = NULL;
 	napi->skb = NULL;
@@ -4517,6 +4543,20 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 }
 EXPORT_SYMBOL(netif_napi_add);
 
+void napi_disable(struct napi_struct *n)
+{
+	might_sleep();
+	set_bit(NAPI_STATE_DISABLE, &n->state);
+
+	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
+		msleep(1);
+
+	hrtimer_cancel(&n->timer);
+
+	clear_bit(NAPI_STATE_DISABLE, &n->state);
+}
+EXPORT_SYMBOL(napi_disable);
+
 void netif_napi_del(struct napi_struct *napi)
 {
 	list_del_init(&napi->dev_list);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9dd06699b09c..1a24602cd54e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -325,6 +325,23 @@ static ssize_t tx_queue_len_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong);
 
+static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
+{
+	dev->gro_flush_timeout = val;
+	return 0;
+}
+
+static ssize_t gro_flush_timeout_store(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_gro_flush_timeout);
+}
+NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -422,6 +439,7 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_mtu.attr,
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
+	&dev_attr_gro_flush_timeout.attr,
 	&dev_attr_phys_port_id.attr,
 	NULL,
 };