Message ID | 1412694080.11091.131.camel@edumazet-glaptop2.roam.corp.google.com
---|---
State | RFC, archived
Delegated to: | David Miller
On Tue, 2014-10-07 at 08:01 -0700, Eric Dumazet wrote:

> Quota was a packet quota, which was quite irrelevant if segmentation had
> to be done, so I would just let the dequeue be done so that we benefit
> from optimal xmit_more.

And it also is better to allow receiver to get full LRO/GRO aggregation,
if we do not break GSO train in multiple parts.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Oct 2014 08:01:20 -0700

> On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:
>
> > This needs to be:
> >
> >   do
> >     ...
> >   while ((iskb = iskb->next))
>
> I do not feel needed to break the bulk dequeue at precise quota
> boundary. These quotas are advisory, and bql prefers to get its full
> budget for appropriate feedback from TX completion.
>
> Quota was a packet quota, which was quite irrelevant if segmentation had
> to be done, so I would just let the dequeue be done so that we benefit
> from optimal xmit_more.

Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
then check how much quota was used afterwards to guard the outermost loop.
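[Editor's note: a minimal userspace model of the do/while walk Hannes suggests above. The struct and function names are hypothetical stand-ins, not the kernel's sk_buff API; the point is only the list-walk/count idiom.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal model of counting the packets in a bulk-dequeued
 * list chained via ->next, i.e. what a precise per-packet quota check
 * would have to walk. */
struct fake_skb {
	struct fake_skb *next;
};

int skb_list_len(struct fake_skb *iskb)
{
	int n = 0;

	do {			/* do/while: caller guarantees iskb != NULL */
		n++;
	} while ((iskb = iskb->next));
	return n;
}
```

Note the do/while form naturally covers the single-skb case (next == NULL counts as a list of one), but requires a non-NULL head.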
On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:

> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

I am testing this, and also am testing the xmit_more patch for I40E.

Will send patches today.

Thanks
On Tue, 07 Oct 2014 13:19:38 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 07 Oct 2014 08:01:20 -0700
>
> > On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:
> >
> > > This needs to be:
> > >
> > >   do
> > >     ...
> > >   while ((iskb = iskb->next))
> >
> > I do not feel needed to break the bulk dequeue at precise quota
> > boundary. These quotas are advisory, and bql prefers to get its full
> > budget for appropriate feedback from TX completion.
> >
> > Quota was a packet quota, which was quite irrelevant if segmentation had
> > to be done, so I would just let the dequeue be done so that we benefit
> > from optimal xmit_more.
>
> Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> then check how much quota was used afterwards to guard the outermost loop.

According to my measurements, at 10Gbit/s TCP_STREAM test the BQL limit
is 381528 bytes / 1514 = 252 packets, that will (potentially) be bulk
dequeued at once (with your version of the patch).

It seems to have the potential to exceed the weight_p(64) quite a lot.
And with e.g. TX ring size 512, we also challenge the drivers at
this early adoption phase of tailptr writes. Just saying...
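[Editor's note: a userspace sketch of the arithmetic behind Jesper's 252-packet figure. This mirrors the byte-budget loop in try_bulk_dequeue_skb() (first packet charged before the check, further packets while the budget stays positive); `bulk_packets` is a hypothetical helper, not kernel code.]

```c
#include <assert.h>

/* How many equal-sized packets one bulk dequeue can pull from a BQL
 * byte budget: the first packet is dequeued unconditionally, then more
 * follow while the remaining byte limit is still positive. */
int bulk_packets(int bulklimit, int pkt_len)
{
	int bytelimit = bulklimit - pkt_len;	/* first packet charged up front */
	int packets = 1;

	while (bytelimit > 0) {			/* same test as the kernel loop */
		bytelimit -= pkt_len;
		packets++;
	}
	return packets;
}
```

With the measured BQL limit of 381528 bytes and 1514-byte MTU-sized frames, this yields the 252 packets quoted above.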
On Tue, 07 Oct 2014 10:32:12 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:
>
> > Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> > then check how much quota was used afterwards to guard the outermost loop.
>
> I am testing this, and also am testing the xmit_more patch for I40E.

Check, I'm also testing both your and Hannes' patches.

Results at:
http://people.netfilter.org/hawk/qdisc/measure18_restore_quota_fairness/
http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/
http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/
On Tue, 2014-10-07 at 20:03 +0200, Jesper Dangaard Brouer wrote:

> According to my measurements, at 10Gbit/s TCP_STREAM test the BQL limit
> is 381528 bytes / 1514 = 252 packets, that will (potentially) be bulk
> dequeued at once (with your version of the patch).

That's because you use a single queue maybe ?

In reality, 10GbE NICs are used in multiqueue mode, and here we have
limits around 2 TSO packets. Even with only 4 tx queues I have:

# sar -n DEV 3 3 | grep eth1
12:05:19 PM  eth1  147217.67  809066.67  9488.71  1196207.78  0.00  0.00  0.00
12:05:22 PM  eth1  145958.00  807822.33  9407.48  1194366.73  0.00  0.00  0.00
12:05:25 PM  eth1  147502.33  804739.33  9507.26  1189804.23  0.00  0.00  0.33
Average:     eth1  146892.67  807209.44  9467.82  1193459.58  0.00  0.00  0.11

grep . /sys/class/net/eth1/queues/tx*/byte_queue_limits/{inflight,limit}
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:115064
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:124148
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:102952
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:136260

> It seems to have the potential to exceed the weight_p(64) quite a lot.
> And with e.g. TX ring size 512, we also challenge the drivers at
> this early adoption phase of tailptr writes. Just saying...

Yep, but remind we want to squeeze bugs out of the drivers, then add
additional knobs later.

Whatever limit we choose in the core networking stack (being 64 packets
for example), hardware might have different constraints that need to be
taken care of in the driver.
On Tue, 07 Oct 2014 12:10:15 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 20:03 +0200, Jesper Dangaard Brouer wrote:
>
> > According to my measurements, at 10Gbit/s TCP_STREAM test the BQL limit
> > is 381528 bytes / 1514 = 252 packets, that will (potentially) be bulk
> > dequeued at once (with your version of the patch).
>
> That's because you use a single queue maybe ?

No, I just double checked. I was using a single netperf TCP_STREAM with
GRO=off TSO=off.

> Here we have limits around 2 TSO packets.

Which should be approx 128k.

Just tested with GRO=on TSO=on and a single TCP_STREAM, which is weird, as
I should hit your 2xTSO limit, right? Yet inflight shows 408780.

And I do have (which I guess is the 2xTSO):
/proc/sys/net/ipv4/tcp_limit_output_bytes:131072

$ grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}
/sys/class/net/eth4/queues/tx-0/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-10/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-11/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-1/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-2/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-3/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-4/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-5/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-6/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-7/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-8/byte_queue_limits/inflight:408780
/sys/class/net/eth4/queues/tx-9/byte_queue_limits/inflight:0
/sys/class/net/eth4/queues/tx-0/byte_queue_limits/limit:340848
/sys/class/net/eth4/queues/tx-10/byte_queue_limits/limit:340650
/sys/class/net/eth4/queues/tx-11/byte_queue_limits/limit:367902
/sys/class/net/eth4/queues/tx-1/byte_queue_limits/limit:272520
/sys/class/net/eth4/queues/tx-2/byte_queue_limits/limit:204390
/sys/class/net/eth4/queues/tx-3/byte_queue_limits/limit:162856
/sys/class/net/eth4/queues/tx-4/byte_queue_limits/limit:158314
/sys/class/net/eth4/queues/tx-5/byte_queue_limits/limit:136260
/sys/class/net/eth4/queues/tx-6/byte_queue_limits/limit:140802
/sys/class/net/eth4/queues/tx-7/byte_queue_limits/limit:152258
/sys/class/net/eth4/queues/tx-8/byte_queue_limits/limit:340650
/sys/class/net/eth4/queues/tx-9/byte_queue_limits/limit:340650

Strange...

> Even with only 4 tx queues I have :
>
> # sar -n DEV 3 3 | grep eth1
> 12:05:19 PM  eth1  147217.67  809066.67  9488.71  1196207.78  0.00  0.00  0.00
> 12:05:22 PM  eth1  145958.00  807822.33  9407.48  1194366.73  0.00  0.00  0.00
> 12:05:25 PM  eth1  147502.33  804739.33  9507.26  1189804.23  0.00  0.00  0.33
> Average:     eth1  146892.67  807209.44  9467.82  1193459.58  0.00  0.00  0.11
>
> grep . /sys/class/net/eth1/queues/tx*/byte_queue_limits/{inflight,limit}
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:115064
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:102952
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:124148
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:102952
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:136260

Guess this is okay: 115064 / 1514 = 76 pkts, which is closer to the 64
weight_p.

> > It seems to have the potential to exceed the weight_p(64) quite a lot.
> > And with e.g. TX ring size 512, we also challenge the drivers at
> > this early adoption phase of tailptr writes. Just saying...
>
> Yep, but remind we want to squeeze bugs out of the drivers, then add
> additional knobs later.

Okay, for squeezing out bugs, I understand this more aggressive bulking
strategy. I'm all in then!

> Whatever limit we choose in the core networking stack (being 64 packets
> for example), hardware might have different constraints that need to be
> taken care of in the driver.
On Tue, 7 Oct 2014 20:37:00 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 07 Oct 2014 10:32:12 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Tue, 2014-10-07 at 13:19 -0400, David Miller wrote:
> >
> > > Yes, this makes sense, do a full qdisc_restart() cycle without boundaries,
> > > then check how much quota was used afterwards to guard the outermost loop.
> >
> > I am testing this, and also am testing the xmit_more patch for I40E.
>
> Check, I'm also testing both your and Hannes' patches.
>
> Results at:
> http://people.netfilter.org/hawk/qdisc/measure18_restore_quota_fairness/
> http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/
> http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/

These tests are with ixgbe with a single TXQ, in order to measure the effect
of HoL blocking by taking advantage of the high-prio queue of pfifo_fast.
(Cmdline trick for a single TXQ: "ethtool -L eth4 combined 1")

In the case GSO=off TSO=off, Hannes "wins" by 0.04ms:
http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_NoneXSO.png
which I guess we should not be concerned with.

In the case GSO=on TSO=off, the diff is max 0.01ms (to Hannes' advantage ;-)).
Notice the extreme zoom level:
http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_GSO.png

In the case GSO=on TSO=on, there are some spikes, where Eric's version has
the highest spike:
http://people.netfilter.org/hawk/qdisc/measure19_restore_quota_erics/compare_rr_latency_eric_vs_hannes_TSO.png
Again nothing we should worry about.

Thus, I guess we can safely go with Eric's solution, even though Hannes'
version consistently shows less HoL blocking and less severe spikes, as
the difference is so small. I'm ACKing Eric's version...

We do need this patch, as can be seen from the baseline test at git commit
02c0fc1, where some bandwidth unfairness to the UDP flows happens, but only
in the case GSO=off TSO=off (others are fine):
http://people.netfilter.org/hawk/qdisc/measure20_no_quota_baseline_at_git_02c0fc1/NoneXSO_10Gbit_base_02c0fc1_bandwidth_totals_unfairness.png
Kind of strange, but it went away in the two quota tests.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2b349a4de3c8e3491fad210a9400d26bda5b52fe..581ba0dcc2474f325d1c0b3e1dc957648f11992f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -58,7 +58,8 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
-				 const struct netdev_queue *txq)
+				 const struct netdev_queue *txq,
+				 int *packets)
 {
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
 
@@ -71,6 +72,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 		bytelimit -= nskb->len; /* covers GSO len */
 		skb->next = nskb;
 		skb = nskb;
+		(*packets)++;
 	}
 	skb->next = NULL;
 }
@@ -78,11 +80,13 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
-static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
+static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
+				   int *packets)
 {
 	struct sk_buff *skb = q->gso_skb;
 	const struct netdev_queue *txq = q->dev_queue;
 
+	*packets = 1; /* yes, this might be not accurate, only if BQL is wrong */
 	*validate = true;
 	if (unlikely(skb)) {
 		/* check the reason of requeuing without tx lock first */
@@ -99,7 +103,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
 		    !netif_xmit_frozen_or_stopped(txq)) {
 			skb = q->dequeue(q);
 			if (skb && qdisc_may_bulk(q))
-				try_bulk_dequeue_skb(q, skb, txq);
+				try_bulk_dequeue_skb(q, skb, txq, packets);
 		}
 	}
 	return skb;
@@ -205,7 +209,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q)
+static inline int qdisc_restart(struct Qdisc *q, int *packets)
 {
 	struct netdev_queue *txq;
 	struct net_device *dev;
@@ -214,7 +218,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 	bool validate;
 
 	/* Dequeue packet */
-	skb = dequeue_skb(q, &validate);
+	skb = dequeue_skb(q, &validate, packets);
 	if (unlikely(!skb))
 		return 0;
 
@@ -230,14 +234,16 @@ static inline int qdisc_restart(struct Qdisc *q)
 void __qdisc_run(struct Qdisc *q)
 {
 	int quota = weight_p;
+	int packets;
 
-	while (qdisc_restart(q)) {
+	while (qdisc_restart(q, &packets)) {
 		/*
 		 * Ordered by possible occurrence: Postpone processing if
 		 * 1. we've exceeded packet quota
 		 * 2. another process needs the CPU;
 		 */
-		if (--quota <= 0 || need_resched()) {
+		quota -= packets;
+		if (quota <= 0 || need_resched()) {
 			__netif_schedule(q);
 			break;
 		}
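[Editor's note: a userspace sketch of the quota accounting the patch introduces. The names here (fake_qdisc_restart, run_until_quota, WEIGHT_P) are stand-ins, not kernel API: each qdisc_restart() round reports how many packets it bulk-dequeued via *packets, and __qdisc_run() charges that whole count against the weight_p budget instead of decrementing by one per round.]

```c
#include <assert.h>

enum { WEIGHT_P = 64 };		/* default dev_weight, as in the patch */

/* Stand-in for qdisc_restart(): pretend each round bulk-dequeues up to
 * 'burst' packets from a queue of 'queue_len' packets, reporting the
 * count via *packets; returns 0 once the (fake) queue is empty. */
int fake_qdisc_restart(int *queue_len, int burst, int *packets)
{
	if (*queue_len == 0)
		return 0;
	*packets = burst < *queue_len ? burst : *queue_len;
	*queue_len -= *packets;
	return 1;
}

/* Returns how many packets are sent before the quota forces a reschedule. */
int run_until_quota(int queue_len, int burst)
{
	int quota = WEIGHT_P;
	int packets, sent = 0;

	while (fake_qdisc_restart(&queue_len, burst, &packets)) {
		sent += packets;
		quota -= packets;	/* charge the whole bulk at once */
		if (quota <= 0)
			break;		/* kernel would __netif_schedule(q) here */
	}
	return sent;
}
```

A single round may overshoot the budget (a 252-packet GSO bulk still completes, preserving xmit_more benefits), but the overshoot is then charged so the outer loop stops, matching David's "guard the outermost loop" remark.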