
net: gro: selective flush of packets

Message ID 1349546929.21172.1598.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 6, 2012, 6:08 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Current GRO can hold packets in gro_list for an almost unlimited
time, if the napi->poll() handler consumes its budget over and over.

In this case, napi_complete()/napi_gro_flush() are not called.

Another problem is that gro_list is flushed in an unfriendly way:
we scan the list and complete packets in reverse order
(youngest packets first, oldest packets last).
This defeats any prioritization the sender may have set up.

Since GRO currently only stores TCP packets, we don't really notice the
bug thanks to retransmits, but this behavior can add unexpected
latencies, particularly for mice flows clamped by elephant flows.

This patch makes sure no packet stays queued for more than 1 ms, a
case that only arises under stress.

It also completes packets in the right order (oldest first) to minimize
latencies; a toy model of the new flush walk is sketched after the
diffstat below.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 drivers/net/ethernet/marvell/skge.c   |    2 -
 drivers/net/ethernet/realtek/8139cp.c |    2 -
 include/linux/netdevice.h             |   15 +++++----
 net/core/dev.c                        |   38 +++++++++++++++++++-----
 4 files changed, 42 insertions(+), 15 deletions(-)
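
For illustration, here is a minimal userspace model of the new flush
walk. The pkt struct, complete() and flush() are toy stand-ins invented
here for sk_buff, napi_gro_complete() and napi_gro_flush(); integer
timestamps stand in for jiffies. This is a sketch of the technique, not
the kernel code.

#include <stdio.h>
#include <stdlib.h>

struct pkt {
	struct pkt *next;	/* list is youngest-first, like napi->gro_list */
	struct pkt *prev;	/* reverse chain, built during the flush */
	long age;		/* enqueue timestamp, stands in for jiffies */
};

static void complete(struct pkt *p)
{
	printf("completed packet queued at t=%ld\n", p->age);
	free(p);
}

/* Flush oldest packets first. If flush_old is set, keep packets queued
 * at the current tick, like napi_gro_flush(napi, flush_old).
 * Returns the new list head (NULL if everything was flushed).
 */
static struct pkt *flush(struct pkt *list, int flush_old, long now)
{
	struct pkt *p, *prev = NULL;

	for (p = list; p; p = p->next) {	/* scan list, build reverse chain */
		p->prev = prev;
		prev = p;
	}
	for (p = prev; p; p = prev) {		/* walk it back: oldest first */
		p->next = NULL;
		if (flush_old && p->age == now)
			return list;		/* the rest is younger, keep it */
		prev = p->prev;
		complete(p);
	}
	return NULL;
}

int main(void)
{
	struct pkt *list = NULL;
	long now = 4;

	for (long t = 1; t <= now; t++) {	/* push: head ends up youngest */
		struct pkt *p = malloc(sizeof(*p));
		p->age = t;
		p->next = list;
		list = p;
	}
	list = flush(list, 1, now);	/* completes t=1,2,3; keeps t=4 */
	list = flush(list, 0, now);	/* completes the remaining packet */
	return 0;
}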




Comments

Herbert Xu Oct. 7, 2012, 12:32 a.m. UTC | #1
On Sat, Oct 06, 2012 at 08:08:49PM +0200, Eric Dumazet wrote:
>
> @@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
>  				local_irq_enable();
>  				napi_complete(n);
>  				local_irq_disable();
> -			} else
> +			} else {
> +				if (n->gro_list) {
> +					/* flush too old packets
> +					 * If HZ < 1000, flush all packets.
> +					 */
> +					local_irq_enable();
> +					napi_gro_flush(n, HZ >= 1000);
> +					local_irq_disable();
> +				}
>  				list_move_tail(&n->poll_list, &sd->poll_list);
> +			}

Why don't we just always flush everything?

Thanks,
Eric Dumazet Oct. 7, 2012, 5:29 a.m. UTC | #2
On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 08:08:49PM +0200, Eric Dumazet wrote:
> >
> > @@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
> >  				local_irq_enable();
> >  				napi_complete(n);
> >  				local_irq_disable();
> > -			} else
> > +			} else {
> > +				if (n->gro_list) {
> > +					/* flush too old packets
> > +					 * If HZ < 1000, flush all packets.
> > +					 */
> > +					local_irq_enable();
> > +					napi_gro_flush(n, HZ >= 1000);
> > +					local_irq_disable();
> > +				}
> >  				list_move_tail(&n->poll_list, &sd->poll_list);
> > +			}
> 
> Why don't we just always flush everything?

This is what I tried first, but it lowered performance on several
typical workloads.

Using this simple heuristic increases performance.



Eric Dumazet Oct. 8, 2012, 7:39 a.m. UTC | #3
On Sun, 2012-10-07 at 07:29 +0200, Eric Dumazet wrote:
> On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
> > Why don't we just always flush everything?
>
> This is what I tried first, but it lowered performance on several
> typical workloads.
>
> Using this simple heuristic increases performance.

By the way, one of the beauties of GRO is that it helps aggregate
packets under load and reduce cpu load. People wanting very low
latencies should probably not use GRO, and whether they use it or not,
receiving a full 64-packet batch on a particular NIC makes latencies
very unpredictable.

So if we consumed all budget in a napi->poll() handler, it's because we
are under load, and we don't really want to cancel GRO aggregation.

The next napi->poll() invocation will have more chances to coalesce
frames.

If there is only one flow, it's OK because a 64-packet window allows ~4
GRO super-packets to be built even with an unconditional flush, but
with 8 flows an unconditional flush would roughly double the number of
GRO packets sent to upper layers.

The only safety measure needed is to make sure we don't hold packets
for too long in case we never complete the napi; this is what this
patch does, with an average added latency of 0.5 ms (for slow flows).
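
The 0.5 ms average falls out of jiffies granularity: with HZ >= 1000 a
jiffy lasts at most 1 ms, and the selective flush keeps only packets
whose age equals the current jiffy, so nothing waits longer than one
tick. Restated as a hypothetical helper (the name is invented; the test
is the one from the patch):

/* A packet queued during the current jiffy (age == jiffies) is kept,
 * anything older is flushed, bounding queueing time to one jiffy
 * (1 ms at HZ=1000, ~0.5 ms on average). With HZ < 1000 a jiffy is
 * longer than 1 ms, which is why net_rx_action() passes
 * flush_old = (HZ >= 1000) and flushes everything on low-HZ kernels.
 */
static bool gro_skb_is_old(struct sk_buff *skb)
{
	return NAPI_GRO_CB(skb)->age != jiffies;
}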



Rick Jones Oct. 8, 2012, 4:42 p.m. UTC | #4
On 10/08/2012 12:39 AM, Eric Dumazet wrote:
> On Sun, 2012-10-07 at 07:29 +0200, Eric Dumazet wrote:
>> On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
>>> Why don't we just always flush everything?
>>
>> This is what I tried first, but it lowered performance on several
>> typical workloads.
>>
>> Using this simple heuristic increases performance.
>
> By the way, one of the beauties of GRO is that it helps aggregate
> packets under load and reduce cpu load. People wanting very low
> latencies should probably not use GRO, and whether they use it or not,
> receiving a full 64-packet batch on a particular NIC makes latencies
> very unpredictable.
>
> So if we consumed all budget in a napi->poll() handler, it's because we
> are under load, and we don't really want to cancel GRO aggregation.

Is that actually absolute, or does it depend on GRO aggregation actually
aggregating?  In your opening message you talked about how, with enough
flows, GRO is defeated but its overhead remains.

rick jones
Eric Dumazet Oct. 8, 2012, 5:10 p.m. UTC | #5
On Mon, 2012-10-08 at 09:42 -0700, Rick Jones wrote:

> > By the way, one of the beauties of GRO is that it helps aggregate
> > packets under load and reduce cpu load. People wanting very low
> > latencies should probably not use GRO, and whether they use it or not,
> > receiving a full 64-packet batch on a particular NIC makes latencies
> > very unpredictable.
> >
> > So if we consumed all budget in a napi->poll() handler, it's because we
> > are under load, and we don't really want to cancel GRO aggregation.
>
> Is that actually absolute, or does it depend on GRO aggregation actually
> aggregating?  In your opening message you talked about how, with enough
> flows, GRO is defeated but its overhead remains.

Sorry, I don't understand the question.

We consume all budget when 64 packets are fetched from the NIC.

This has nothing to do with GRO, but with NAPI behavior.
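
For context, the generic shape of a NAPI poll handler looks like this
(the foo_* names are hypothetical; napi_gro_receive() and
napi_complete() are the real APIs):

/* Sketch of a generic NAPI poll handler. The budget (64 by default)
 * caps how many packets one invocation may fetch; only when the RX
 * ring drains before the budget is spent does the driver complete
 * napi, which is also where the full GRO flush happens.
 */
struct sk_buff *foo_fetch_rx(struct napi_struct *napi); /* hypothetical */

static int foo_poll(struct napi_struct *napi, int budget)
{
	int work_done = 0;
	struct sk_buff *skb;

	while (work_done < budget && (skb = foo_fetch_rx(napi)) != NULL) {
		napi_gro_receive(napi, skb);
		work_done++;
	}
	if (work_done < budget)
		napi_complete(napi);	/* flushes gro_list via napi_gro_flush() */
	return work_done;
}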

Sure, if these packets are UDP messages and cross the GRO stack for
nothing, it's pure overhead.

The current situation is:

You receive a burst of packets, with one (or a few) TCP message(s), and
the other frames are UDP only.

This TCP message is held in the GRO queue, and it stays there as long as
we don't receive another packet for the same flow, or until the burst
ends.

Note that I don't really care about these few TCP messages right now,
but when/if we use a hash table and allow XXX packets in the GRO stack,
things are different ;)



David Miller Oct. 8, 2012, 6:52 p.m. UTC | #6
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 06 Oct 2012 20:08:49 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Current GRO can hold packets in gro_list for an almost unlimited
> time, if the napi->poll() handler consumes its budget over and over.
> 
> In this case, napi_complete()/napi_gro_flush() are not called.
> 
> Another problem is that gro_list is flushed in an unfriendly way:
> we scan the list and complete packets in reverse order
> (youngest packets first, oldest packets last).
> This defeats any prioritization the sender may have set up.
> 
> Since GRO currently only stores TCP packets, we don't really notice the
> bug thanks to retransmits, but this behavior can add unexpected
> latencies, particularly for mice flows clamped by elephant flows.
> 
> This patch makes sure no packet stays queued for more than 1 ms, a
> case that only arises under stress.
> 
> It also completes packets in the right order (oldest first) to
> minimize latencies.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

Patch

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 5a30bf8..eb3cba0 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3189,7 +3189,7 @@  static int skge_poll(struct napi_struct *napi, int to_do)
 	if (work_done < to_do) {
 		unsigned long flags;
 
-		napi_gro_flush(napi);
+		napi_gro_flush(napi, false);
 		spin_lock_irqsave(&hw->hw_lock, flags);
 		__napi_complete(napi);
 		hw->intr_mask |= napimask[skge->port];
diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 995d0cf..1c81825 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -563,7 +563,7 @@  rx_next:
 		if (cpr16(IntrStatus) & cp_rx_intr_mask)
 			goto rx_status_loop;
 
-		napi_gro_flush(napi);
+		napi_gro_flush(napi, false);
 		spin_lock_irqsave(&cp->lock, flags);
 		__napi_complete(napi);
 		cpw16_f(IntrMask, cp_intr_mask);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 01646aa..63efc28 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1497,19 +1497,22 @@  struct napi_gro_cb {
 	/* This indicates where we are processing relative to skb->data. */
 	int data_offset;
 
-	/* This is non-zero if the packet may be of the same flow. */
-	int same_flow;
-
 	/* This is non-zero if the packet cannot be merged with the new skb. */
 	int flush;
 
 	/* Number of segments aggregated. */
-	int count;
+	u16	count;
+
+	/* This is non-zero if the packet may be of the same flow. */
+	u8	same_flow;
 
 	/* Free the skb? */
-	int free;
+	u8	free;
 #define NAPI_GRO_FREE		  1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
+
+	/* jiffies when first packet was created/queued */
+	unsigned long age;
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
@@ -2157,7 +2160,7 @@  extern gro_result_t	dev_gro_receive(struct napi_struct *napi,
 extern gro_result_t	napi_skb_finish(gro_result_t ret, struct sk_buff *skb);
 extern gro_result_t	napi_gro_receive(struct napi_struct *napi,
 					 struct sk_buff *skb);
-extern void		napi_gro_flush(struct napi_struct *napi);
+extern void		napi_gro_flush(struct napi_struct *napi, bool flush_old);
 extern struct sk_buff *	napi_get_frags(struct napi_struct *napi);
 extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 1e0a184..8c960e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3471,17 +3471,31 @@  out:
 	return netif_receive_skb(skb);
 }
 
-inline void napi_gro_flush(struct napi_struct *napi)
+/* napi->gro_list contains packets ordered by age.
+ * youngest packets at the head of it.
+ * Complete skbs in reverse order to reduce latencies.
+ */
+void napi_gro_flush(struct napi_struct *napi, bool flush_old)
 {
-	struct sk_buff *skb, *next;
+	struct sk_buff *skb, *prev = NULL;
 
-	for (skb = napi->gro_list; skb; skb = next) {
-		next = skb->next;
+	/* scan list and build reverse chain */
+	for (skb = napi->gro_list; skb != NULL; skb = skb->next) {
+		skb->prev = prev;
+		prev = skb;
+	}
+
+	for (skb = prev; skb; skb = prev) {
 		skb->next = NULL;
+
+		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
+			return;
+
+		prev = skb->prev;
 		napi_gro_complete(skb);
+		napi->gro_count--;
 	}
 
-	napi->gro_count = 0;
 	napi->gro_list = NULL;
 }
 EXPORT_SYMBOL(napi_gro_flush);
@@ -3542,6 +3556,7 @@  enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 
 	napi->gro_count++;
 	NAPI_GRO_CB(skb)->count = 1;
+	NAPI_GRO_CB(skb)->age = jiffies;
 	skb_shinfo(skb)->gso_size = skb_gro_len(skb);
 	skb->next = napi->gro_list;
 	napi->gro_list = skb;
@@ -3876,7 +3891,7 @@  void napi_complete(struct napi_struct *n)
 	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
 		return;
 
-	napi_gro_flush(n);
+	napi_gro_flush(n, false);
 	local_irq_save(flags);
 	__napi_complete(n);
 	local_irq_restore(flags);
@@ -3981,8 +3996,17 @@  static void net_rx_action(struct softirq_action *h)
 				local_irq_enable();
 				napi_complete(n);
 				local_irq_disable();
-			} else
+			} else {
+				if (n->gro_list) {
+					/* flush too old packets
+					 * If HZ < 1000, flush all packets.
+					 */
+					local_irq_enable();
+					napi_gro_flush(n, HZ >= 1000);
+					local_irq_disable();
+				}
 				list_move_tail(&n->poll_list, &sd->poll_list);
+			}
 		}
 
 		netpoll_poll_unlock(have);