[net-next] net: gro: add a per device gro flush timer

Message ID 1415235320.13896.51.camel@edumazet-glaptop2.roam.corp.google.com
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 6, 2014, 12:55 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

Tuning coalescing parameters on a NIC can be really hard.

Servers can handle both bulk and RPC-like traffic, with conflicting
goals: bulk flows want GRO packets as big as possible, RPC traffic
wants minimal latency.

To reach big GRO packets on a 10GbE NIC, one can use:

ethtool -C eth0 rx-usecs 4 rx-frames 44

But this penalizes RPC sessions, increasing latencies by up to 50% in
some cases, as NICs generally do not force an interrupt when a packet
with the TCP Push flag is received.

Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.

This patch uses a different strategy: let the GRO stack decide what to
do, based on traffic patterns.

Packets with the Push flag won't be delayed.
Packets without the Push flag might be held in the GRO engine, if we
keep receiving data.

This new mechanism is off by default, and can be enabled by setting
/sys/class/net/eth0/gro_flush_timeout to a value in nanoseconds.
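Push packets are never delayed because the existing TCP GRO receive
path already completes a held packet as soon as PSH is seen; an
abridged sketch of that check in tcp_gro_receive()
(net/ipv4/tcp_offload.c), for illustration only:

	/* force a final flush for short segments, or for any segment
	 * carrying URG/PSH/RST/SYN/FIN
	 */
	flush = len < mss;
	flush |= (__force int)(flags & (TCP_FLAG_URG | TCP_FLAG_PSH |
					TCP_FLAG_RST | TCP_FLAG_SYN |
					TCP_FLAG_FIN));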

Tested:
 Ran 200 netperf TCP_STREAM from A to B (10GbE link, 8 RX queues)

Without this feature, we send back about 305,000 ACKs per second.

GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)

Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce the number of ACK packets. (811/19.2 = 42)

The receiver performs fewer calls to upper stacks and fewer wakeups.
This also reduces CPU usage on the sender, as it receives fewer ACK
packets.

Note that reducing the number of wakeups increases CPU efficiency, but
can decrease QPS, as applications won't have the chance to warm up CPU
caches by doing a partial read of RPC requests/answers if they fit in
one skb.

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50

B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

lpaa6:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   12 +++------
 net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
 net/core/net-sysfs.c      |   18 ++++++++++++++
 3 files changed, 64 insertions(+), 10 deletions(-)




Comments

Rick Jones Nov. 6, 2014, 1:38 a.m. UTC | #1
On 11/05/2014 04:55 PM, Eric Dumazet wrote:
> Tested:
>   Ran 200 netperf TCP_STREAM from A to B (10GbE link, 8 RX queues)
>
> Without this feature, we send back about 305,000 ACKs per second.
>
> GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
>
> Setting a timer of 2000 nsec is enough to increase GRO packet sizes
> and reduce the number of ACK packets. (811/19.2 = 42)
>
> The receiver performs fewer calls to upper stacks and fewer wakeups.
> This also reduces CPU usage on the sender, as it receives fewer ACK
> packets.
>
> Note that reducing the number of wakeups increases CPU efficiency, but
> can decrease QPS, as applications won't have the chance to warm up CPU
> caches by doing a partial read of RPC requests/answers if they fit in
> one skb.

Speaking of QPS, what happens to 200 TCP_RR tests when the feature is 
enabled?

rick jones
Eric Dumazet Nov. 6, 2014, 2:14 a.m. UTC | #2
On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:

> Speaking of QPS, what happens to 200 TCP_RR tests when the feature is 
> enabled?

Nothing at all (but the usual noise, I guess).

The 200 TCP_RR flows send packets with 1 byte of payload and the Push
flag set, so no packet ever sits in napi->gro_list

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20
3.13827e+06

real	0m32.170s
user	0m32.885s
sys	7m38.868s

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20
3.19013e+06

real	0m37.152s
user	0m33.477s
sys	7m30.586s

Now let's try TCP_RR with -- -r 4000,4000 ;)

Reducing ACK packets allows us to better use the 10GbE bandwidth for
payload, so QPS actually increases.
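(Back-of-envelope, using the TCP_STREAM numbers above and assuming a
bare ACK costs roughly 80 bytes on the wire: 305,000 ACKs/s is about
0.2 Gbit/s, plus per-packet interrupt and DMA overhead on both hosts.)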

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20 -- -r 4000,4000
379645

real	0m32.201s
user	0m4.390s
sys	0m59.501s

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# time ./super_netperf 200 -H lpaa6 -t TCP_RR -l 20 -- -r 4000,4000
400610

real	0m37.159s
user	0m4.501s
sys	0m59.665s




Eric Dumazet Nov. 6, 2014, 2:39 a.m. UTC | #3
On Wed, 2014-11-05 at 18:14 -0800, Eric Dumazet wrote:
> On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:
> 
> > Speaking of QPS, what happens to 200 TCP_RR tests when the feature is 
> > enabled?

The possible reduction of QPS happens when you have a single flow, like
TCP_RR  -- -r 40000,40000

(Because we have one single TCP packet with 40000 bytes of payload, the
application is woken up once, when the Push flag is received.)

So CPU efficiency is way better, but the application has to copy 40000
bytes in one go _after_ the Push flag, instead of being able to copy
part of the data _before_ receiving the Push flag.

lpaa5:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 0 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# ./netperf -H lpaa6 -t TCP_RR -l 20 -Cc -- -r 40000,40000
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa6.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  40000   40000  20.00   9023.86  2.02   1.70   107.513  90.561 
16384  87380 

lpaa5:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa6:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
lpaa5:~# ./netperf -H lpaa6 -t TCP_RR -l 20 -Cc -- -r 40000,40000
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa6.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  40000   40000  20.00   8651.26  0.66   1.02   36.502  56.710 
16384  87380 



Rick Jones Nov. 6, 2014, 4:42 p.m. UTC | #4
On 11/05/2014 06:39 PM, Eric Dumazet wrote:
> On Wed, 2014-11-05 at 18:14 -0800, Eric Dumazet wrote:
>> On Wed, 2014-11-05 at 17:38 -0800, Rick Jones wrote:
>>
>>> Speaking of QPS, what happens to 200 TCP_RR tests when the feature is
>>> enabled?
>
> The possible reduction of QPS happens when you have a single flow, like
> TCP_RR  -- -r 40000,40000
>
> (Because we have one single TCP packet with 40000 bytes of payload, the
> application is woken up once, when the Push flag is received.)
>
> So CPU efficiency is way better, but the application has to copy 40000
> bytes in one go _after_ the Push flag, instead of being able to copy
> part of the data _before_ receiving the Push flag.

Thanks.  That isn't too unlike what I've seen happen in the past with,
say, an 8K request size and switching back and forth between a 1500 and
9000 byte MTU.

happy benchmarking,

rick
David Miller Nov. 6, 2014, 9:25 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 Nov 2014 16:55:20 -0800

> @@ -4430,8 +4432,19 @@ void napi_complete(struct napi_struct *n)
>  	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
>  		return;
>  
> -	napi_gro_flush(n, false);
> +	if (n->gro_list) {
> +		unsigned long timeout = 0;
> +
> +		if (n->napi_rx_count)
> +			timeout = n->dev->gro_flush_timeout;

Under what circumstances would we see n->gro_list non-NULL yet
n->napi_rx_count == 0?

I'm not so sure it can happen.

Said another way, it looks to me like you could implement this
using less state.
Eric Dumazet Nov. 6, 2014, 10:11 p.m. UTC | #6
On Thu, 2014-11-06 at 16:25 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 05 Nov 2014 16:55:20 -0800
> 
> > @@ -4430,8 +4432,19 @@ void napi_complete(struct napi_struct *n)
> >  	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
> >  		return;
> >  
> > -	napi_gro_flush(n, false);
> > +	if (n->gro_list) {
> > +		unsigned long timeout = 0;
> > +
> > +		if (n->napi_rx_count)
> > +			timeout = n->dev->gro_flush_timeout;
> 
> Under what circumstances would we see n->gro_list non-NULL yet
> n->napi_rx_count == 0?
> 
> I'm not so sure it can happen.
> 
> Said another way, it looks to me like you could implement this
> using less state.

My goal was not to change any driver, keeping this a generic change.

Drivers call napi_complete() in their RX napi handler without giving us
the 'work_done' value, which tells us whether a packet was processed.

So I added a counter that is incremented for every packet given to the
GRO engine (napi_rx_count), so that napi_complete() has a clue whether a
packet was received in _this_ NAPI run.

If at least one packet was received (and we still have packets in
gro_list) -> we rearm the timer.
If not, we flush packets in the GRO engine.

In order to avoid this state, I would have to add a new method, like
napi_complete_done(napi, work_done), and change drivers. I am not sure
it's worth the effort?
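A sketch of what that helper could look like, simply reusing the logic
from this patch (illustrative only, not submitted code):

void napi_complete_done(struct napi_struct *n, int work_done)
{
	if (n->gro_list) {
		unsigned long timeout = 0;

		/* A packet seen in this poll means the flow is still
		 * active: rearm the timer instead of flushing now.
		 */
		if (work_done)
			timeout = n->dev->gro_flush_timeout;

		if (timeout)
			hrtimer_start(&n->timer, ns_to_ktime(timeout),
				      HRTIMER_MODE_REL_PINNED);
		else
			napi_gro_flush(n, false);
	}
	/* ... then the same completion steps as napi_complete() ... */
}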




David Miller Nov. 7, 2014, 3:36 a.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Nov 2014 14:11:20 -0800

> My goal was not to change any driver, keeping this a generic change.
> 
> Drivers call napi_complete() in their RX napi handler without giving us
> the 'work_done' value, which tells us whether a packet was processed.
> 
> So I added a counter that is incremented for every packet given to the
> GRO engine (napi_rx_count), so that napi_complete() has a clue whether a
> packet was received in _this_ NAPI run.
> 
> If at least one packet was received (and we still have packets in
> gro_list) -> we rearm the timer.
> If not, we flush packets in the GRO engine.
> 
> In order to avoid this state, I would have to add a new method, like
> napi_complete_done(napi, work_done), and change drivers. I am not sure
> it's worth the effort?

I think for such a critical path in the kernel it's worth it to avoid
these increments for every packet, just to compute a value that's
sitting in a register already in the driver's poll routine.

Eric, you've been trimming CPU IRQ disables from these exact code
paths, you should know better than me :-)

I'm willing to do some of the monkey work of converting as many
drivers as can be trivially done if you want.  Almost all of the
ones I looked at have the work_done variable right there at the
napi_complete() call site.

The rest can stay unconverted and not get access to this new facility.
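The driver-side change would typically be a one-liner in the poll
routine, something like this (hypothetical driver, assuming the
napi_complete_done() helper sketched above):

-	if (work_done < budget)
-		napi_complete(napi);
+	if (work_done < budget)
+		napi_complete_done(napi, work_done);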
Eric Dumazet Nov. 7, 2014, 4:15 a.m. UTC | #8
On Thu, 2014-11-06 at 22:36 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 06 Nov 2014 14:11:20 -0800
> 
> > My goal was not to change any driver, keeping this a generic change.
> > 
> > Drivers call napi_complete() in their RX napi handler without giving us
> > the 'work_done' value, which tells us whether a packet was processed.
> > 
> > So I added a counter that is incremented for every packet given to the
> > GRO engine (napi_rx_count), so that napi_complete() has a clue whether a
> > packet was received in _this_ NAPI run.
> > 
> > If at least one packet was received (and we still have packets in
> > gro_list) -> we rearm the timer.
> > If not, we flush packets in the GRO engine.
> > 
> > In order to avoid this state, I would have to add a new method, like
> > napi_complete_done(napi, work_done), and change drivers. I am not sure
> > it's worth the effort?
> 
> I think for such a critical path in the kernel it's worth it to avoid
> these increments for every packet, just to compute a value that's
> sitting in a register already in the driver's poll routine.
> 
> Eric, you've been trimming CPU IRQ disables from these exact code
> paths, you should know better than me :-)
> 
> I'm willing to do some of the monkey work of converting as many
> drivers as can be trivially done if you want.  Almost all of the
> ones I looked at have the work_done variable right there at the
> napi_complete() call site.
> 
> The rest can stay unconverted and not get access to this new facility.

This is your call. I actually did this in my first implementation,
because I was not using a timer but rescheduling NAPI and not
re-enabling interrupts.

http://www.spinics.net/lists/netdev/msg302474.html

(This is why I had to cook "d75b1ade567 net: less interrupt masking in
NAPI")

Given the cost of GRO processing, setting a bool (this is all I need,
actually) is really pure noise.
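Something like this, against the napi_struct hunk below (the field name
is made up for illustration):

-	unsigned long		napi_rx_count;
+	bool			rx_seen;	/* packet fed to GRO in this poll */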




Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4767f546d7c0..8474fcfadc7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -314,6 +314,8 @@  struct napi_struct {
 	struct net_device	*dev;
 	struct sk_buff		*gro_list;
 	struct sk_buff		*skb;
+	unsigned long		napi_rx_count;
+	struct hrtimer		timer;
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
@@ -485,14 +487,7 @@  void napi_hash_del(struct napi_struct *napi);
  * Stop NAPI from being scheduled on this context.
  * Waits till any outstanding processing completes.
  */
-static inline void napi_disable(struct napi_struct *n)
-{
-	might_sleep();
-	set_bit(NAPI_STATE_DISABLE, &n->state);
-	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
-		msleep(1);
-	clear_bit(NAPI_STATE_DISABLE, &n->state);
-}
+void napi_disable(struct napi_struct *n);
 
 /**
  *	napi_enable - enable NAPI scheduling
@@ -1603,6 +1598,7 @@  struct net_device {
 
 #endif
 
+	unsigned long		gro_flush_timeout;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 40be481268de..c88651bd8ada 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -133,6 +133,7 @@ 
 #include <linux/vmalloc.h>
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
+#include <linux/hrtimer.h>
 
 #include "net-sysfs.h"
 
@@ -4000,6 +4001,8 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	if (skb_is_gso(skb) || skb_has_frag_list(skb) || skb->csum_bad)
 		goto normal;
 
+	napi->napi_rx_count++;
+
 	gro_list_prepare(napi, skb);
 
 	rcu_read_lock();
@@ -4411,7 +4414,6 @@  EXPORT_SYMBOL(__napi_schedule_irqoff);
 void __napi_complete(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	BUG_ON(n->gro_list);
 
 	list_del_init(&n->poll_list);
 	smp_mb__before_atomic();
@@ -4430,8 +4432,19 @@  void napi_complete(struct napi_struct *n)
 	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
 		return;
 
-	napi_gro_flush(n, false);
+	if (n->gro_list) {
+		unsigned long timeout = 0;
+
+		if (n->napi_rx_count)
+			timeout = n->dev->gro_flush_timeout;
 
+		if (timeout)
+			hrtimer_start(&n->timer, ns_to_ktime(timeout),
+				      HRTIMER_MODE_REL_PINNED);
+		else
+			napi_gro_flush(n, false);
+	}
+	n->napi_rx_count = 0;
 	if (likely(list_empty(&n->poll_list))) {
 		WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
 	} else {
@@ -4495,10 +4508,23 @@  void napi_hash_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL_GPL(napi_hash_del);
 
+static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
+{
+	struct napi_struct *napi;
+
+	napi = container_of(timer, struct napi_struct, timer);
+	if (napi->gro_list)
+		napi_schedule(napi);
+
+	return HRTIMER_NORESTART;
+}
+
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
 	INIT_LIST_HEAD(&napi->poll_list);
+	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	napi->timer.function = napi_watchdog;
 	napi->gro_count = 0;
 	napi->gro_list = NULL;
 	napi->skb = NULL;
@@ -4517,6 +4543,20 @@  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 }
 EXPORT_SYMBOL(netif_napi_add);
 
+void napi_disable(struct napi_struct *n)
+{
+	might_sleep();
+	set_bit(NAPI_STATE_DISABLE, &n->state);
+
+	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
+		msleep(1);
+
+	hrtimer_cancel(&n->timer);
+
+	clear_bit(NAPI_STATE_DISABLE, &n->state);
+}
+EXPORT_SYMBOL(napi_disable);
+
 void netif_napi_del(struct napi_struct *napi)
 {
 	list_del_init(&napi->dev_list);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9dd06699b09c..1a24602cd54e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -325,6 +325,23 @@  static ssize_t tx_queue_len_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_ulong);
 
+static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
+{
+	dev->gro_flush_timeout = val;
+	return 0;
+}
+
+static ssize_t gro_flush_timeout_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_gro_flush_timeout);
+}
+NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -422,6 +439,7 @@  static struct attribute *net_class_attrs[] = {
 	&dev_attr_mtu.attr,
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
+	&dev_attr_gro_flush_timeout.attr,
 	&dev_attr_phys_port_id.attr,
 	NULL,
 };