
Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)

Message ID 1412694080.11091.131.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 7, 2014, 3:01 p.m. UTC
On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:

> This needs to be:
> 
> do
>    ...
> while ((iskb = iskb->next))

I do not feel the need to break the bulk dequeue at a precise quota
boundary. These quotas are advisory, and bql prefers to get its full
budget for appropriate feedback from TX completion.

Quota was a packet quota, which was quite irrelevant if segmentation had
to be done, so I would just let the dequeue be done so that we benefit
from optimal xmit_more.
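
For reference, the loop in __qdisc_run() currently charges one unit of
quota per qdisc_restart() call, no matter how many skbs the bulk dequeue
pulled out in that call (this is the "--quota" line touched by the patch
below). Trimmed down, the existing loop is roughly:

void __qdisc_run(struct Qdisc *q)
{
        int quota = weight_p;

        while (qdisc_restart(q)) {
                /* Postpone processing if we've exceeded the packet quota
                 * or another process needs the CPU.
                 */
                if (--quota <= 0 || need_resched()) {
                        __netif_schedule(q);
                        break;
                }
        }

        qdisc_run_end(q);
}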




Comments

Eric Dumazet Oct. 7, 2014, 3:06 p.m. UTC | #1
On Tue, 2014-10-07 at 08:01 -0700, Eric Dumazet wrote:

> Quota was a packet quota, which was quite irrelevant if segmentation had
> to be done, so I would just let the dequeue be done so that we benefit
> from optimal xmit_more.

And it is also better to allow the receiver to get full LRO/GRO
aggregation if we do not break the GSO train into multiple parts.




David Miller Oct. 7, 2014, 5:19 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Oct 2014 08:01:20 -0700

> On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:
> 
>> This needs to be:
>> 
>> do
>>    ...
>> while ((iskb = iskb->next))
> 
> I do not feel the need to break the bulk dequeue at a precise quota
> boundary. These quotas are advisory, and bql prefers to get its full
> budget for appropriate feedback from TX completion.
> 
> Quota was a packet quota, which was quite irrelevant if segmentation had
> to be done, so I would just let the dequeue be done so that we benefit
> from optimal xmit_more.

Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
then check how much quota was used afterwards to guard the outermost loop.
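
I.e. something along these lines (just a sketch; the actual packet
accounting happens via dequeue_skb() in the patch below):

void __qdisc_run(struct Qdisc *q)
{
        int quota = weight_p;
        int packets;

        while (qdisc_restart(q, &packets)) {
                /* Charge the whole bulk against the quota, then decide
                 * whether to keep going or reschedule.
                 */
                quota -= packets;
                if (quota <= 0 || need_resched()) {
                        __netif_schedule(q);
                        break;
                }
        }

        qdisc_run_end(q);
}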
Eric Dumazet Oct. 7, 2014, 5:32 p.m. UTC | #3
On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:

> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

I am testing this, and also am testing the xmit_more patch for I40E.

Will send patches today.

Thanks


Jesper Dangaard Brouer Oct. 7, 2014, 6:03 p.m. UTC | #4
On Tue, 07 Oct 2014 13:19:38 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 07 Oct 2014 08:01:20 -0700
> 
> > On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:
> > 
> >> This needs to be:
> >> 
> >> do
> >>    ...
> >> while ((iskb = iskb->next))
> > 
> > I do not feel the need to break the bulk dequeue at a precise quota
> > boundary. These quotas are advisory, and bql prefers to get its full
> > budget for appropriate feedback from TX completion.
> > 
> > Quota was a packet quota, which was quite irrelevant if segmentation had
> > to be done, so I would just let the dequeue be done so that we benefit
> > from optimal xmit_more.
> 
> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

According to my measurements, in a 10Gbit/s TCP_STREAM test the BQL limit
is 381528 bytes / 1514 = 252 packets, which will (potentially) be bulk
dequeued at once (with your version of the patch).

It seems to have the potential to exceed the weight_p(64) quite a lot.
And with e.g. TX ring size 512, we also challenge the drivers at
this early adoption phase of tailptr writes.  Just saying...
Jesper Dangaard Brouer Oct. 7, 2014, 6:37 p.m. UTC | #5
On Tue, 07 Oct 2014 10:32:12 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:
> 
> > Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> > then check how much quota was used afterwards to guard the outermost loop.
> 
> I am testing this, and also am testing the xmit_more patch for I40E.

Check, I'm also testing both your and Hannes' patches.

Results at:
 http://people.netfilter.org/hawk/qdisc/measure18_restore_quota_fairness/
 http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/
 http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/
Eric Dumazet Oct. 7, 2014, 7:10 p.m. UTC | #6
On Tue, 2014-10-07 at 20:03 +0200, Jesper Dangaard Brouer wrote:

> According to my measurements, in a 10Gbit/s TCP_STREAM test the BQL limit
> is 381528 bytes / 1514 = 252 packets, which will (potentially) be bulk
> dequeued at once (with your version of the patch).
> 

That's because you use a single queue maybe ?

In reality, 10GbE NICs are used in multiqueue mode ...

Here we have limits around 2 TSO packets.

Even with only 4 tx queues I have :

# sar -n DEV 3 3 |grep eth1
12:05:19 PM      eth1 147217.67 809066.67   9488.71 1196207.78      0.00      0.00      0.00
12:05:22 PM      eth1 145958.00 807822.33   9407.48 1194366.73      0.00      0.00      0.00
12:05:25 PM      eth1 147502.33 804739.33   9507.26 1189804.23      0.00      0.00      0.33
Average:         eth1 146892.67 807209.44   9467.82 1193459.58      0.00      0.00      0.11


grep . /sys/class/net/eth1/queues/tx*/byte_queue_limits/{inflight,limit}
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:115064
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:124148
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:136260


> It seems to have the potential to exceed the weight_p(64) quite a lot.
> And with e.g. TX ring size 512, we also challenge the drivers at
> this early adoption phase of tailptr writes.  Just saying...
> 

Yep, but remember we want to squeeze bugs out of the drivers first, then
add additional knobs later.

Whatever limit we choose in the core networking stack (64 packets, for
example), hardware might have different constraints that need to be
taken care of in the driver.
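
For example, a driver's xmit path can honor skb->xmit_more to defer its
tailptr/doorbell write, but it still has to flush (and stop the queue)
based on its own ring space, independent of whatever bulk size the qdisc
layer chose.  A rough sketch, with the foo_* names being made-up
placeholders rather than any real driver:

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct foo_tx_ring *ring = foo_pick_ring(skb, dev); /* hypothetical */

        foo_post_descriptors(ring, skb);                    /* hypothetical */

        /* Stop the queue if the next packet might not fit in the ring. */
        if (foo_ring_unused(ring) < MAX_SKB_FRAGS + 1)
                netif_tx_stop_queue(ring->txq);

        /* Only write the tail pointer when the stack says the burst is
         * over, or when we had to stop the queue anyway.
         */
        if (!skb->xmit_more || netif_tx_queue_stopped(ring->txq))
                writel(ring->next_to_use, ring->tail);

        return NETDEV_TX_OK;
}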


Jesper Dangaard Brouer Oct. 7, 2014, 7:34 p.m. UTC | #7
On Tue, 07 Oct 2014 12:10:15 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 20:03 +0200, Jesper Dangaard Brouer wrote:
> 
> > According to my measurements, in a 10Gbit/s TCP_STREAM test the BQL limit
> > is 381528 bytes / 1514 = 252 packets, which will (potentially) be bulk
> > dequeued at once (with your version of the patch).
> > 
> 
> That's because you use a single queue maybe ?

No, I just double checked.

I was using a single netperf TCP_STREAM with GRO=off TSO=off.

> Here we have limits around 2 TSO packets.

Which should be approx 128k

Just tested with GRO=on TSO=on and a single TCP_STREAM, which is weird:
I should hit your 2xTSO limit, right?  Yet inflight shows 408780.

And I do have (which I guess is the 2xTSO):
 /proc/sys/net/ipv4/tcp_limit_output_bytes:131072

$ grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}
/sys/class/net/eth4/queues/tx-0/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-10/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-11/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-1/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-2/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-3/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-4/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-5/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-6/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-7/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-8/byte_queue_limits/inflight:408780
/sys/class/net/eth4/queues/tx-9/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-0/byte_queue_limits/limit:340848
/sys/class/net/eth4/queues/tx-10/byte_queue_limits/limit:340650
/sys/class/net/eth4/queues/tx-11/byte_queue_limits/limit:367902
/sys/class/net/eth4/queues/tx-1/byte_queue_limits/limit:272520
/sys/class/net/eth4/queues/tx-2/byte_queue_limits/limit:204390
/sys/class/net/eth4/queues/tx-3/byte_queue_limits/limit:162856
/sys/class/net/eth4/queues/tx-4/byte_queue_limits/limit:158314
/sys/class/net/eth4/queues/tx-5/byte_queue_limits/limit:136260
/sys/class/net/eth4/queues/tx-6/byte_queue_limits/limit:140802
/sys/class/net/eth4/queues/tx-7/byte_queue_limits/limit:152258
/sys/class/net/eth4/queues/tx-8/byte_queue_limits/limit:340650
/sys/class/net/eth4/queues/tx-9/byte_queue_limits/limit:340650

Strange...

> Even with only 4 tx queues I have :
> 
> # sar -n DEV 3 3 |grep eth1
> 12:05:19 PM      eth1 147217.67 809066.67   9488.71 1196207.78      0.00      0.00      0.00
> 12:05:22 PM      eth1 145958.00 807822.33   9407.48 1194366.73      0.00      0.00      0.00
> 12:05:25 PM      eth1 147502.33 804739.33   9507.26 1189804.23      0.00      0.00      0.33
> Average:         eth1 146892.67 807209.44   9467.82 1193459.58      0.00      0.00      0.11
> 
> 
> grep . /sys/class/net/eth1/queues/tx*/byte_queue_limits/{inflight,limit}
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:115064
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:102952
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:124148
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:102952
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:136260

Guess this is okay: 115064 / 1514 = 76 pkts, which is closer to the weight_p of 64.


> > It seems to have the potential to exceed the weight_p(64) quite a lot.
> > And with e.g. TX ring size 512, we also challenge the drivers at
> > this early adoption phase of tailptr writes.  Just saying...
> > 
> 
> Yep, but remember we want to squeeze bugs out of the drivers first, then
> add additional knobs later.

Okay, for squeezing out bugs I understand this more aggressive bulking
strategy.  I'm all in then!


> Whatever limit we choose in the core networking stack (64 packets, for
> example), hardware might have different constraints that need to be
> taken care of in the driver.
Jesper Dangaard Brouer Oct. 7, 2014, 8:07 p.m. UTC | #8
On Tue, 7 Oct 2014 20:37:00 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 07 Oct 2014 10:32:12 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:
> > 
> > > Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> > > then check how much quota was used afterwards to guard the outermost loop.
> > 
> > I am testing this, and also am testing the xmit_more patch for I40E.
> 
> Check, I'm also testing both your and Hannes' patches.
> 
> Results at:
>  http://people.netfilter.org/hawk/qdisc/measure18_restore_quota_fairness/
>  http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/
>  http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/

These tests are with ixgbe with a single TXQ, in order to measure the
effect of HoL blocking by taking advantage of the high-prio queue of pfifo_fast.

 (Cmdline trick for a single TXQ: "ethtool -L eth4 combined 1")

In the GSO=off TSO=off case, Hannes "wins" by 0.04ms:
 http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_NoneXSO.png
which I guess we should not be concerned about.

In the GSO=on TSO=off case, the diff is at most 0.01ms (to Hannes' advantage ;-)).
Notice the extreme zoom level:
 http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_GSO.png

In the GSO=on TSO=on case, there are some spikes, and Eric's version has
the highest spike.
 http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_TSO.png
Again nothing we should worry about.

Thus, I guess we can safely go with Eric's solution, even though Hannes'
version consistently shows less HoL blocking and less severe spikes, as
the difference is so small.  I'm ACKing Eric's version...


We do need this patch, as can be seen from the baseline test at git
commit 02c0fc1, where some bandwidth unfairness to the UDP flows
happens, but only in the GSO=off TSO=off case (the others are fine):
  http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/NoneXSO_10Gbit_base_02c0fc1_bandwidth_totals_unfairness.png
Kind of strange, but it went away in the two quota tests.

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2b349a4de3c8e3491fad210a9400d26bda5b52fe..581ba0dcc2474f325d1c0b3e1dc957648f11992f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -58,7 +58,8 @@  static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
-				 const struct netdev_queue *txq)
+				 const struct netdev_queue *txq,
+				 int *packets)
 {
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
 
@@ -71,6 +72,7 @@  static void try_bulk_dequeue_skb(struct Qdisc *q,
 		bytelimit -= nskb->len; /* covers GSO len */
 		skb->next = nskb;
 		skb = nskb;
+		(*packets)++;
 	}
 	skb->next = NULL;
 }
@@ -78,11 +80,13 @@  static void try_bulk_dequeue_skb(struct Qdisc *q,
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
-static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
+static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
+				   int *packets)
 {
 	struct sk_buff *skb = q->gso_skb;
 	const struct netdev_queue *txq = q->dev_queue;
 
+	*packets = 1; /* yes, this might be not accurate, only if BQL is wrong */
 	*validate = true;
 	if (unlikely(skb)) {
 		/* check the reason of requeuing without tx lock first */
@@ -99,7 +103,7 @@  static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
 		    !netif_xmit_frozen_or_stopped(txq)) {
 			skb = q->dequeue(q);
 			if (skb && qdisc_may_bulk(q))
-				try_bulk_dequeue_skb(q, skb, txq);
+				try_bulk_dequeue_skb(q, skb, txq, packets);
 		}
 	}
 	return skb;
@@ -205,7 +209,7 @@  int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q)
+static inline int qdisc_restart(struct Qdisc *q, int *packets)
 {
 	struct netdev_queue *txq;
 	struct net_device *dev;
@@ -214,7 +218,7 @@  static inline int qdisc_restart(struct Qdisc *q)
 	bool validate;
 
 	/* Dequeue packet */
-	skb = dequeue_skb(q, &validate);
+	skb = dequeue_skb(q, &validate, packets);
 	if (unlikely(!skb))
 		return 0;
 
@@ -230,14 +234,16 @@  static inline int qdisc_restart(struct Qdisc *q)
 void __qdisc_run(struct Qdisc *q)
 {
 	int quota = weight_p;
+	int packets;
 
-	while (qdisc_restart(q)) {
+	while (qdisc_restart(q, &packets)) {
 		/*
 		 * Ordered by possible occurrence: Postpone processing if
 		 * 1. we've exceeded packet quota
 		 * 2. another process needs the CPU;
 		 */
-		if (--quota <= 0 || need_resched()) {
+		quota -= packets;
+		if (quota <= 0 || need_resched()) {
 			__netif_schedule(q);
 			break;
 		}