From patchwork Tue Nov 20 17:45:13 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Valente X-Patchwork-Id: 200491 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id A3BF32C0097 for ; Wed, 21 Nov 2012 04:46:16 +1100 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752637Ab2KTRp4 (ORCPT ); Tue, 20 Nov 2012 12:45:56 -0500 Received: from smtp7.sms.unimo.it ([155.185.44.150]:34830 "EHLO smtp7.sms.unimo.it" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751886Ab2KTRpy (ORCPT ); Tue, 20 Nov 2012 12:45:54 -0500 Received: from [212.84.39.29] (port=55285 helo=paolo-ThinkPad-W520) by smtp7.sms.unimo.it with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.72) (envelope-from ) id 1Tarst-0004dr-A0; Tue, 20 Nov 2012 18:45:38 +0100 Date: Tue, 20 Nov 2012 18:45:13 +0100 From: Paolo Valente To: davem@davemloft.net, jhs@mojatatu.com, shemminger@vyatta.com Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, rizzo@iet.unipi.it, fchecconi@gmail.com, paolo.valente@unimore.it Subject: [PATCH] pkt_sched: QFQ Plus: fair-queueing service at DRR cost Message-ID: <20121120174513.GA21618@paolo-ThinkPad-W520> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) UNIMORE-X-SA-Score: -2.9 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch turns QFQ into QFQ+, a variant of QFQ that provides the following two benefits: 1) QFQ+ is faster than QFQ, 2) differently from QFQ, QFQ+ correctly schedules also non-leaves classes in a hierarchical setting. The way how QFQ+ achieves these goals is discussed briefly below. I also report some performance measurements. A detailed description of QFQ+ and of these results can be found in [1]. QFQ+ achieves a higher speed than QFQ by grouping classes into aggregates, and uses the original QFQ scheduling algorithm to schedule aggregates instead of single classes. An aggregate is made of at most M classes, all with the same weight and maximum packet size. M is equal to the minimum between tx_queue_len+1 and 8 (value chosen to get a good trade-off between execution time and service guarantees). QFQ+ associates each aggregate with a budget equal to the maximum packet size for the classes in the aggregate, multiplied by the number of classes of the aggregate. Once selected an aggregate for service, QFQ+ dequeues only the packets of its classes, until the aggregate finishes its budget. Finally, within an aggregate, classes are scheduled with DRR. In my measurements, described below, the execution time of QFQ+ without TSO/GSO and with M=8 was from 16% to 31% lower than that of QFQ, and close to that of DRR. QFQ+ does not use packet lengths for computing aggregate timestamps, but budgets. Hence it does not need to modify any timestamp if the head packet of a class changes. As a consequence, differently from QFQ, which uses head-packet lengths to compute class timestamps, QFQ+ does not need further modifications to correctly schedule also non-leaf classes and classes with non-FIFO qdiscs. As for service guarantees, thanks to the way how M is computed, the service of QFQ+ is close to the one of QFQ. For example, as proved in [1], under QFQ+ every packet of a given class is guaranteed the same worst-case completion time as under QFQ, plus an additional delay equal to the transmission time, at the rate reserved to the class, of three maximum-size packet. See [1, Section 7.1] for a numerical comparison among the packet delays guaranteed by QFQ+, QFQ and DRR. I measured the execution time of QFQ+, DRR and QFQ using the testing environment [2]. In particular, for each scheduler I measured the average total execution time of a packet enqueue plus a packet dequeue, without TSO/GSO. What would happen with TSO/GSO is discussed at the end of this description. For practical reasons, in the testing environment each enqueue&dequeue is also charged for the cost of generating and discarding an empty, fixed-size packet (using a free list). The following table reports the results with an i7-2760QM, against four different class sets. Time is measured in nanoseconds, while each set or subset of classes is denoted as -w, where and are, respectively, the number of classes and the weight of every class in the set/subset (for example, 250-w1 stands for 250 classes with weight 1). For QFQ+, the table shows the results for the two extremes for M: 1 and 8 (see [1, Section 7.2] for results with other values of M and for more information). ----------------------------------------------- | Set of | QFQ+ (M) | DRR QFQ | | classes | 1 8 | | |-----------------------------------------------| | 1k-w1 | 89 63 | 56 81 | |-----------------------------------------------| | 500-w1, | | | | 250-w2, | 102 71 | 87 103 | | 250-w4 | | | |-----------------------------------------------| | 32k-w1 | 267 225 | 173 257 | |-----------------------------------------------| | 16k-w1, | | | | 8k-w2, | 253 187 | 252 257 | | 8k-w4 | | | ----------------------------------------------- About DRR, it achieves its best performance when all the classes have the same weight. This is fortunate, because in such scenarios it is actually pointless to use a fair-queueing scheduler, as the latter would provide the same quality of service as DRR. In contrast, when classes have differentiated weights and the better service properties of QFQ+ make a difference, QFQ+ has better performance than DRR. It happens mainly because QFQ+ dequeues packets in an order that causes about 8% less cache misses than DRR. As for the number of instructions, QFQ+ executes instead about 7% more instructions than DRR, whereas QFQ executes from 25% to 34% more instructions than DRR. With TSO/GSO, the number of instructions per packet enqueue/dequeue executed by QFQ and QFQ+ is the same as without TSO/GSO. In contrast, for each dequeue, the number of iterations executed by the main loop of DRR is 64K/quantum times as high as without TSO/GSO. Paolo [1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers" http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf [2] http://algo.ing.unimo.it/people/paolo/agg-sched/test-env.tgz Signed-off-by: Paolo Valente --- net/sched/sch_qfq.c | 834 +++++++++++++++++++++++++++++++++++---------------- 1 file changed, 571 insertions(+), 263 deletions(-) diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c index 9687fa1..fe1823a 100644 --- a/net/sched/sch_qfq.c +++ b/net/sched/sch_qfq.c @@ -1,7 +1,8 @@ /* - * net/sched/sch_qfq.c Quick Fair Queueing Scheduler. + * net/sched/sch_qfq.c Quick Fair Queueing Plus Scheduler. * * Copyright (c) 2009 Fabio Checconi, Luigi Rizzo, and Paolo Valente. + * Copyright (c) 2012 Paolo Valente. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -19,12 +20,18 @@ #include -/* Quick Fair Queueing - =================== +/* Quick Fair Queueing Plus + ======================== Sources: - Fabio Checconi, Luigi Rizzo, and Paolo Valente: "QFQ: Efficient + [1] Paolo Valente, + "Reducing the Execution Time of Fair-Queueing Schedulers." + http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf + + Sources for QFQ: + + [2] Fabio Checconi, Luigi Rizzo, and Paolo Valente: "QFQ: Efficient Packet Scheduling with Tight Bandwidth Distribution Guarantees." See also: @@ -33,6 +40,20 @@ /* + QFQ+ divides classes into aggregates of at most MAX_AGG_CLASSES + classes. Each aggregate is timestamped with a virtual start time S + and a virtual finish time F, and scheduled according to its + timestamps. S and F are computed as a function of a system virtual + time function V. The classes within each aggregate are instead + scheduled with DRR. + + To speed up operations, QFQ+ divides also aggregates into a limited + number of groups. Which group a class belongs to depends on the + ratio between the maximum packet length for the class and the weight + of the class. Groups have their own S and F. In the end, QFQ+ + schedules groups, then aggregates within groups, then classes within + aggregates. See [1] and [2] for a full description. + Virtual time computations. S, F and V are all computed in fixed point arithmetic with @@ -76,27 +97,28 @@ #define QFQ_MAX_SLOTS 32 /* - * Shifts used for class<->group mapping. We allow class weights that are - * in the range [1, 2^MAX_WSHIFT], and we try to map each class i to the + * Shifts used for aggregate<->group mapping. We allow class weights that are + * in the range [1, 2^MAX_WSHIFT], and we try to map each aggregate i to the * group with the smallest index that can support the L_i / r_i configured - * for the class. + * for the classes in the aggregate. * * grp->index is the index of the group; and grp->slot_shift * is the shift for the corresponding (scaled) sigma_i. */ #define QFQ_MAX_INDEX 24 -#define QFQ_MAX_WSHIFT 12 +#define QFQ_MAX_WSHIFT 10 -#define QFQ_MAX_WEIGHT (1<> QFQ_MIN_SLOT_SHIFT; + size_map = slot_size >> min_slot_shift; if (!size_map) goto out; index = __fls(size_map) + 1; /* basically a log_2 */ - index -= !(slot_size - (1ULL << (index + QFQ_MIN_SLOT_SHIFT - 1))); + index -= !(slot_size - (1ULL << (index + min_slot_shift - 1))); if (index < 0) index = 0; @@ -204,66 +261,150 @@ out: return index; } -/* Length of the next packet (0 if the queue is empty). */ -static unsigned int qdisc_peek_len(struct Qdisc *sch) +static void qfq_deactivate_agg(struct qfq_sched *, struct qfq_aggregate *); +static void qfq_activate_agg(struct qfq_sched *, struct qfq_aggregate *, + enum update_reason); + +static void qfq_init_agg(struct qfq_sched *q, struct qfq_aggregate *agg, + u32 lmax, u32 weight) { - struct sk_buff *skb; + INIT_LIST_HEAD(&agg->active); + hlist_add_head(&agg->nonfull_next, &q->nonfull_aggs); - skb = sch->ops->peek(sch); - return skb ? qdisc_pkt_len(skb) : 0; + agg->lmax = lmax; + agg->class_weight = weight; } -static void qfq_deactivate_class(struct qfq_sched *, struct qfq_class *); -static void qfq_activate_class(struct qfq_sched *q, struct qfq_class *cl, - unsigned int len); +static struct qfq_aggregate *qfq_find_agg(struct qfq_sched *q, + u32 lmax, u32 weight) +{ + struct qfq_aggregate *agg; + struct hlist_node *n; -static void qfq_update_class_params(struct qfq_sched *q, struct qfq_class *cl, - u32 lmax, u32 inv_w, int delta_w) + hlist_for_each_entry(agg, n, &q->nonfull_aggs, nonfull_next) + if (agg->lmax == lmax && agg->class_weight == weight) + return agg; + + return NULL; +} + + +/* Update aggregate as a function of the new number of classes. */ +static void qfq_update_agg(struct qfq_sched *q, struct qfq_aggregate *agg, + int new_num_classes) { - int i; + u32 new_agg_weight; + + if (new_num_classes == q->max_agg_classes) + hlist_del_init(&agg->nonfull_next); - /* update qfq-specific data */ - cl->lmax = lmax; - cl->inv_w = inv_w; - i = qfq_calc_index(cl->inv_w, cl->lmax); + if (agg->num_classes > new_num_classes && + new_num_classes == q->max_agg_classes - 1) /* agg no more full */ + hlist_add_head(&agg->nonfull_next, &q->nonfull_aggs); - cl->grp = &q->groups[i]; + agg->budgetmax = new_num_classes * agg->lmax; + new_agg_weight = agg->class_weight * new_num_classes; + agg->inv_w = ONE_FP/new_agg_weight; - q->wsum += delta_w; + if (agg->grp == NULL) { + int i = qfq_calc_index(agg->inv_w, agg->budgetmax, + q->min_slot_shift); + agg->grp = &q->groups[i]; + } + + q->wsum += + (int) agg->class_weight * (new_num_classes - agg->num_classes); + + agg->num_classes = new_num_classes; } -static void qfq_update_reactivate_class(struct qfq_sched *q, - struct qfq_class *cl, - u32 inv_w, u32 lmax, int delta_w) +/* Add class to aggregate. */ +static void qfq_add_to_agg(struct qfq_sched *q, + struct qfq_aggregate *agg, + struct qfq_class *cl) { - bool need_reactivation = false; - int i = qfq_calc_index(inv_w, lmax); + cl->agg = agg; + + qfq_update_agg(q, agg, agg->num_classes+1); + if (cl->qdisc->q.qlen > 0) { /* adding an active class */ + list_add_tail(&cl->alist, &agg->active); + if (list_first_entry(&agg->active, struct qfq_class, alist) == + cl && q->in_serv_agg != agg) /* agg was inactive */ + qfq_activate_agg(q, agg, enqueue); /* schedule agg */ + } +} - if (&q->groups[i] != cl->grp && cl->qdisc->q.qlen > 0) { - /* - * shift cl->F back, to not charge the - * class for the not-yet-served head - * packet - */ - cl->F = cl->S; - /* remove class from its slot in the old group */ - qfq_deactivate_class(q, cl); - need_reactivation = true; +static struct qfq_aggregate *qfq_choose_next_agg(struct qfq_sched *); + +static void qfq_destroy_agg(struct qfq_sched *q, struct qfq_aggregate *agg) +{ + if (!hlist_unhashed(&agg->nonfull_next)) + hlist_del_init(&agg->nonfull_next); + if (q->in_serv_agg == agg) + q->in_serv_agg = qfq_choose_next_agg(q); + kfree(agg); +} + +/* Deschedule class from within its parent aggregate. */ +static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl) +{ + struct qfq_aggregate *agg = cl->agg; + + + list_del(&cl->alist); /* remove from RR queue of the aggregate */ + if (list_empty(&agg->active)) /* agg is now inactive */ + qfq_deactivate_agg(q, agg); +} + +/* Remove class from its parent aggregate. */ +static void qfq_rm_from_agg(struct qfq_sched *q, struct qfq_class *cl) +{ + struct qfq_aggregate *agg = cl->agg; + + cl->agg = NULL; + if (agg->num_classes == 1) { /* agg being emptied, destroy it */ + qfq_destroy_agg(q, agg); + return; } + qfq_update_agg(q, agg, agg->num_classes-1); +} - qfq_update_class_params(q, cl, lmax, inv_w, delta_w); +/* Deschedule class and remove it from its parent aggregate. */ +static void qfq_deact_rm_from_agg(struct qfq_sched *q, struct qfq_class *cl) +{ + if (cl->qdisc->q.qlen > 0) /* class is active */ + qfq_deactivate_class(q, cl); - if (need_reactivation) /* activate in new group */ - qfq_activate_class(q, cl, qdisc_peek_len(cl->qdisc)); + qfq_rm_from_agg(q, cl); } +/* Move class to a new aggregate, matching the new class weight and/or lmax */ +static int qfq_change_agg(struct Qdisc *sch, struct qfq_class *cl, u32 weight, + u32 lmax) +{ + struct qfq_sched *q = qdisc_priv(sch); + struct qfq_aggregate *new_agg = qfq_find_agg(q, lmax, weight); + + if (new_agg == NULL) { /* create new aggregate */ + new_agg = kzalloc(sizeof(*new_agg), GFP_ATOMIC); + if (new_agg == NULL) + return -ENOBUFS; + qfq_init_agg(q, new_agg, lmax, weight); + } + qfq_deact_rm_from_agg(q, cl); + qfq_add_to_agg(q, new_agg, cl); + + return 0; +} static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, struct nlattr **tca, unsigned long *arg) { struct qfq_sched *q = qdisc_priv(sch); struct qfq_class *cl = (struct qfq_class *)*arg; + bool existing = false; struct nlattr *tb[TCA_QFQ_MAX + 1]; + struct qfq_aggregate *new_agg = NULL; u32 weight, lmax, inv_w; int err; int delta_w; @@ -286,15 +427,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, } else weight = 1; - inv_w = ONE_FP / weight; - weight = ONE_FP / inv_w; - delta_w = weight - (cl ? ONE_FP / cl->inv_w : 0); - if (q->wsum + delta_w > QFQ_MAX_WSUM) { - pr_notice("qfq: total weight out of range (%u + %u)\n", - delta_w, q->wsum); - return -EINVAL; - } - if (tb[TCA_QFQ_LMAX]) { lmax = nla_get_u32(tb[TCA_QFQ_LMAX]); if (lmax < QFQ_MIN_LMAX || lmax > (1UL << QFQ_MTU_SHIFT)) { @@ -304,7 +436,23 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, } else lmax = psched_mtu(qdisc_dev(sch)); - if (cl != NULL) { + inv_w = ONE_FP / weight; + weight = ONE_FP / inv_w; + + if (cl != NULL && + lmax == cl->agg->lmax && + weight == cl->agg->class_weight) + return 0; /* nothing to change */ + + delta_w = weight - (cl ? cl->agg->class_weight : 0); + + if (q->wsum + delta_w > QFQ_MAX_WSUM) { + pr_notice("qfq: total weight out of range (%d + %u)\n", + delta_w, q->wsum); + return -EINVAL; + } + + if (cl != NULL) { /* modify existing class */ if (tca[TCA_RATE]) { err = gen_replace_estimator(&cl->bstats, &cl->rate_est, qdisc_root_sleeping_lock(sch), @@ -312,25 +460,18 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, if (err) return err; } - - if (lmax == cl->lmax && inv_w == cl->inv_w) - return 0; /* nothing to update */ - - sch_tree_lock(sch); - qfq_update_reactivate_class(q, cl, inv_w, lmax, delta_w); - sch_tree_unlock(sch); - - return 0; + existing = true; + goto set_change_agg; } + /* create and init new class */ cl = kzalloc(sizeof(struct qfq_class), GFP_KERNEL); if (cl == NULL) return -ENOBUFS; cl->refcnt = 1; cl->common.classid = classid; - - qfq_update_class_params(q, cl, lmax, inv_w, delta_w); + cl->deficit = lmax; cl->qdisc = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops, classid); @@ -341,11 +482,8 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, err = gen_new_estimator(&cl->bstats, &cl->rate_est, qdisc_root_sleeping_lock(sch), tca[TCA_RATE]); - if (err) { - qdisc_destroy(cl->qdisc); - kfree(cl); - return err; - } + if (err) + goto destroy_class; } sch_tree_lock(sch); @@ -354,19 +492,39 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid, qdisc_class_hash_grow(sch, &q->clhash); +set_change_agg: + sch_tree_lock(sch); + new_agg = qfq_find_agg(q, lmax, weight); + if (new_agg == NULL) { /* create new aggregate */ + sch_tree_unlock(sch); + new_agg = kzalloc(sizeof(*new_agg), GFP_KERNEL); + if (new_agg == NULL) { + err = -ENOBUFS; + gen_kill_estimator(&cl->bstats, &cl->rate_est); + goto destroy_class; + } + sch_tree_lock(sch); + qfq_init_agg(q, new_agg, lmax, weight); + } + if (existing) + qfq_deact_rm_from_agg(q, cl); + qfq_add_to_agg(q, new_agg, cl); + sch_tree_unlock(sch); + *arg = (unsigned long)cl; return 0; + +destroy_class: + qdisc_destroy(cl->qdisc); + kfree(cl); + return err; } static void qfq_destroy_class(struct Qdisc *sch, struct qfq_class *cl) { struct qfq_sched *q = qdisc_priv(sch); - if (cl->inv_w) { - q->wsum -= ONE_FP / cl->inv_w; - cl->inv_w = 0; - } - + qfq_rm_from_agg(q, cl); gen_kill_estimator(&cl->bstats, &cl->rate_est); qdisc_destroy(cl->qdisc); kfree(cl); @@ -481,8 +639,8 @@ static int qfq_dump_class(struct Qdisc *sch, unsigned long arg, nest = nla_nest_start(skb, TCA_OPTIONS); if (nest == NULL) goto nla_put_failure; - if (nla_put_u32(skb, TCA_QFQ_WEIGHT, ONE_FP/cl->inv_w) || - nla_put_u32(skb, TCA_QFQ_LMAX, cl->lmax)) + if (nla_put_u32(skb, TCA_QFQ_WEIGHT, cl->agg->class_weight) || + nla_put_u32(skb, TCA_QFQ_LMAX, cl->agg->lmax)) goto nla_put_failure; return nla_nest_end(skb, nest); @@ -500,8 +658,8 @@ static int qfq_dump_class_stats(struct Qdisc *sch, unsigned long arg, memset(&xstats, 0, sizeof(xstats)); cl->qdisc->qstats.qlen = cl->qdisc->q.qlen; - xstats.weight = ONE_FP/cl->inv_w; - xstats.lmax = cl->lmax; + xstats.weight = cl->agg->class_weight; + xstats.lmax = cl->agg->lmax; if (gnet_stats_copy_basic(d, &cl->bstats) < 0 || gnet_stats_copy_rate_est(d, &cl->bstats, &cl->rate_est) < 0 || @@ -652,16 +810,16 @@ static void qfq_unblock_groups(struct qfq_sched *q, int index, u64 old_F) * perhaps * old_V ^= q->V; - old_V >>= QFQ_MIN_SLOT_SHIFT; + old_V >>= q->min_slot_shift; if (old_V) { ... } * */ -static void qfq_make_eligible(struct qfq_sched *q, u64 old_V) +static void qfq_make_eligible(struct qfq_sched *q) { - unsigned long vslot = q->V >> QFQ_MIN_SLOT_SHIFT; - unsigned long old_vslot = old_V >> QFQ_MIN_SLOT_SHIFT; + unsigned long vslot = q->V >> q->min_slot_shift; + unsigned long old_vslot = q->oldV >> q->min_slot_shift; if (vslot != old_vslot) { unsigned long mask = (1UL << fls(vslot ^ old_vslot)) - 1; @@ -672,34 +830,38 @@ static void qfq_make_eligible(struct qfq_sched *q, u64 old_V) /* - * If the weight and lmax (max_pkt_size) of the classes do not change, - * then QFQ guarantees that the slot index is never higher than - * 2 + ((1<S) >> grp->slot_shift; @@ -708,22 +870,22 @@ static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl, if (unlikely(slot > QFQ_MAX_SLOTS - 2)) { u64 deltaS = roundedS - grp->S - ((u64)(QFQ_MAX_SLOTS - 2)<slot_shift); - cl->S -= deltaS; - cl->F -= deltaS; + agg->S -= deltaS; + agg->F -= deltaS; slot = QFQ_MAX_SLOTS - 2; } i = (grp->front + slot) % QFQ_MAX_SLOTS; - hlist_add_head(&cl->next, &grp->slots[i]); + hlist_add_head(&agg->next, &grp->slots[i]); __set_bit(slot, &grp->full_slots); } /* Maybe introduce hlist_first_entry?? */ -static struct qfq_class *qfq_slot_head(struct qfq_group *grp) +static struct qfq_aggregate *qfq_slot_head(struct qfq_group *grp) { return hlist_entry(grp->slots[grp->front].first, - struct qfq_class, next); + struct qfq_aggregate, next); } /* @@ -731,20 +893,20 @@ static struct qfq_class *qfq_slot_head(struct qfq_group *grp) */ static void qfq_front_slot_remove(struct qfq_group *grp) { - struct qfq_class *cl = qfq_slot_head(grp); + struct qfq_aggregate *agg = qfq_slot_head(grp); - BUG_ON(!cl); - hlist_del(&cl->next); + BUG_ON(!agg); + hlist_del(&agg->next); if (hlist_empty(&grp->slots[grp->front])) __clear_bit(0, &grp->full_slots); } /* - * Returns the first full queue in a group. As a side effect, - * adjust the bucket list so the first non-empty bucket is at - * position 0 in full_slots. + * Returns the first aggregate in the first non-empty bucket of the + * group. As a side effect, adjusts the bucket list so the first + * non-empty bucket is at position 0 in full_slots. */ -static struct qfq_class *qfq_slot_scan(struct qfq_group *grp) +static struct qfq_aggregate *qfq_slot_scan(struct qfq_group *grp) { unsigned int i; @@ -780,7 +942,7 @@ static void qfq_slot_rotate(struct qfq_group *grp, u64 roundedS) grp->front = (grp->front - i) % QFQ_MAX_SLOTS; } -static void qfq_update_eligible(struct qfq_sched *q, u64 old_V) +static void qfq_update_eligible(struct qfq_sched *q) { struct qfq_group *grp; unsigned long ineligible; @@ -792,137 +954,229 @@ static void qfq_update_eligible(struct qfq_sched *q, u64 old_V) if (qfq_gt(grp->S, q->V)) q->V = grp->S; } - qfq_make_eligible(q, old_V); + qfq_make_eligible(q); } } -/* - * Updates the class, returns true if also the group needs to be updated. - */ -static bool qfq_update_class(struct qfq_group *grp, struct qfq_class *cl) +/* Dequeue head packet of the head class in the DRR queue of the aggregate. */ +static void agg_dequeue(struct qfq_aggregate *agg, + struct qfq_class *cl, unsigned int len) { - unsigned int len = qdisc_peek_len(cl->qdisc); + qdisc_dequeue_peeked(cl->qdisc); - cl->S = cl->F; - if (!len) - qfq_front_slot_remove(grp); /* queue is empty */ - else { - u64 roundedS; - - cl->F = cl->S + (u64)len * cl->inv_w; - roundedS = qfq_round_down(cl->S, grp->slot_shift); - if (roundedS == grp->S) - return false; + cl->deficit -= (int) len; - qfq_front_slot_remove(grp); - qfq_slot_insert(grp, cl, roundedS); + if (cl->qdisc->q.qlen == 0) /* no more packets, remove from list */ + list_del(&cl->alist); + else if (cl->deficit < qdisc_pkt_len(cl->qdisc->ops->peek(cl->qdisc))) { + cl->deficit += agg->lmax; + list_move_tail(&cl->alist, &agg->active); } +} + +static inline struct sk_buff *qfq_peek_skb(struct qfq_aggregate *agg, + struct qfq_class **cl, + unsigned int *len) +{ + struct sk_buff *skb; - return true; + *cl = list_first_entry(&agg->active, struct qfq_class, alist); + skb = (*cl)->qdisc->ops->peek((*cl)->qdisc); + if (skb == NULL) + WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n"); + else + *len = qdisc_pkt_len(skb); + + return skb; +} + +/* Update F according to the actual service received by the aggregate. */ +static inline void charge_actual_service(struct qfq_aggregate *agg) +{ + /* compute the service received by the aggregate */ + u32 service_received = agg->initial_budget - agg->budget; + + agg->F = agg->S + (u64)service_received * agg->inv_w; } static struct sk_buff *qfq_dequeue(struct Qdisc *sch) { struct qfq_sched *q = qdisc_priv(sch); - struct qfq_group *grp; + struct qfq_aggregate *in_serv_agg = q->in_serv_agg; struct qfq_class *cl; - struct sk_buff *skb; + struct sk_buff *skb = NULL; unsigned int len; - u64 old_V; - if (!q->bitmaps[ER]) + if (in_serv_agg == NULL) return NULL; - grp = qfq_ffs(q, q->bitmaps[ER]); + if (!list_empty(&in_serv_agg->active)) { + skb = qfq_peek_skb(in_serv_agg, &cl, &len); + if (!skb) + return NULL; + } else + len = 0; /* no more active classes in the in-service agg */ - cl = qfq_slot_head(grp); - skb = qdisc_dequeue_peeked(cl->qdisc); - if (!skb) { - WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n"); - return NULL; + /* + * If there are no active classes in the in-service aggregate, + * or if the aggregate has not enough budget to serve its next + * class, then choose the next aggregate to serve. + */ + if (len == 0 || in_serv_agg->budget < len) { + charge_actual_service(in_serv_agg); + + /* recharge the budget of the aggregate */ + in_serv_agg->initial_budget = in_serv_agg->budget = + in_serv_agg->budgetmax; + + if (!list_empty(&in_serv_agg->active)) + /* + * Still active: reschedule for + * service. Possible optimization: if no other + * aggregate is active, then there is no point + * in rescheduling this aggregate, and we can + * just keep it as the in-service one. This + * should be however a corner case, and to + * handle it, we would need to maintain an + * extra num_active_aggs field. + */ + qfq_activate_agg(q, in_serv_agg, requeue); + else if (sch->q.qlen == 0) { /* no aggregate to serve */ + q->in_serv_agg = NULL; + return NULL; + } + + /* + * If we get here, there are other aggregates queued: + * choose the new aggregate to serve. + */ + in_serv_agg = q->in_serv_agg = qfq_choose_next_agg(q); + skb = qfq_peek_skb(in_serv_agg, &cl, &len); + if (!skb) + return NULL; } sch->q.qlen--; qdisc_bstats_update(sch, skb); - old_V = q->V; - len = qdisc_pkt_len(skb); + agg_dequeue(in_serv_agg, cl, len); + in_serv_agg->budget -= len; q->V += (u64)len * IWSUM; pr_debug("qfq dequeue: len %u F %lld now %lld\n", - len, (unsigned long long) cl->F, (unsigned long long) q->V); + len, (unsigned long long) in_serv_agg->F, + (unsigned long long) q->V); - if (qfq_update_class(grp, cl)) { - u64 old_F = grp->F; + return skb; +} - cl = qfq_slot_scan(grp); - if (!cl) - __clear_bit(grp->index, &q->bitmaps[ER]); - else { - u64 roundedS = qfq_round_down(cl->S, grp->slot_shift); - unsigned int s; +static struct qfq_aggregate *qfq_choose_next_agg(struct qfq_sched *q) +{ + struct qfq_group *grp; + struct qfq_aggregate *agg, *new_front_agg; + u64 old_F; - if (grp->S == roundedS) - goto skip_unblock; - grp->S = roundedS; - grp->F = roundedS + (2ULL << grp->slot_shift); - __clear_bit(grp->index, &q->bitmaps[ER]); - s = qfq_calc_state(q, grp); - __set_bit(grp->index, &q->bitmaps[s]); - } + qfq_update_eligible(q); + q->oldV = q->V; + + if (!q->bitmaps[ER]) + return NULL; + + grp = qfq_ffs(q, q->bitmaps[ER]); + old_F = grp->F; - qfq_unblock_groups(q, grp->index, old_F); + agg = qfq_slot_head(grp); + + /* agg starts to be served, remove it from schedule */ + qfq_front_slot_remove(grp); + + new_front_agg = qfq_slot_scan(grp); + + if (new_front_agg == NULL) /* group is now inactive, remove from ER */ + __clear_bit(grp->index, &q->bitmaps[ER]); + else { + u64 roundedS = qfq_round_down(new_front_agg->S, + grp->slot_shift); + unsigned int s; + + if (grp->S == roundedS) + return agg; + grp->S = roundedS; + grp->F = roundedS + (2ULL << grp->slot_shift); + __clear_bit(grp->index, &q->bitmaps[ER]); + s = qfq_calc_state(q, grp); + __set_bit(grp->index, &q->bitmaps[s]); } -skip_unblock: - qfq_update_eligible(q, old_V); + qfq_unblock_groups(q, grp->index, old_F); - return skb; + return agg; } /* - * Assign a reasonable start time for a new flow k in group i. + * Assign a reasonable start time for a new aggregate in group i. * Admissible values for \hat(F) are multiples of \sigma_i * no greater than V+\sigma_i . Larger values mean that * we had a wraparound so we consider the timestamp to be stale. * * If F is not stale and F >= V then we set S = F. * Otherwise we should assign S = V, but this may violate - * the ordering in ER. So, if we have groups in ER, set S to - * the F_j of the first group j which would be blocking us. + * the ordering in EB (see [2]). So, if we have groups in ER, + * set S to the F_j of the first group j which would be blocking us. * We are guaranteed not to move S backward because * otherwise our group i would still be blocked. */ -static void qfq_update_start(struct qfq_sched *q, struct qfq_class *cl) +static void qfq_update_start(struct qfq_sched *q, struct qfq_aggregate *agg) { unsigned long mask; u64 limit, roundedF; - int slot_shift = cl->grp->slot_shift; + int slot_shift = agg->grp->slot_shift; - roundedF = qfq_round_down(cl->F, slot_shift); + roundedF = qfq_round_down(agg->F, slot_shift); limit = qfq_round_down(q->V, slot_shift) + (1ULL << slot_shift); - if (!qfq_gt(cl->F, q->V) || qfq_gt(roundedF, limit)) { + if (!qfq_gt(agg->F, q->V) || qfq_gt(roundedF, limit)) { /* timestamp was stale */ - mask = mask_from(q->bitmaps[ER], cl->grp->index); + mask = mask_from(q->bitmaps[ER], agg->grp->index); if (mask) { struct qfq_group *next = qfq_ffs(q, mask); if (qfq_gt(roundedF, next->F)) { if (qfq_gt(limit, next->F)) - cl->S = next->F; + agg->S = next->F; else /* preserve timestamp correctness */ - cl->S = limit; + agg->S = limit; return; } } - cl->S = q->V; + agg->S = q->V; } else /* timestamp is not stale */ - cl->S = cl->F; + agg->S = agg->F; +} + +/* + * Update the timestamps of agg before scheduling/rescheduling it for + * service. In particular, assign to agg->F its maximum possible + * value, i.e., the virtual finish time with which the aggregate + * should be labeled if it used all its budget once in service. + */ +static inline void +qfq_update_agg_ts(struct qfq_sched *q, + struct qfq_aggregate *agg, enum update_reason reason) +{ + if (reason != requeue) + qfq_update_start(q, agg); + else /* just charge agg for the service received */ + agg->S = agg->F; + + agg->F = agg->S + (u64)agg->budgetmax * agg->inv_w; } +static void qfq_schedule_agg(struct qfq_sched *, struct qfq_aggregate *); + static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch) { struct qfq_sched *q = qdisc_priv(sch); struct qfq_class *cl; + struct qfq_aggregate *agg; int err = 0; cl = qfq_classify(skb, sch, &err); @@ -934,11 +1188,13 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch) } pr_debug("qfq_enqueue: cl = %x\n", cl->common.classid); - if (unlikely(cl->lmax < qdisc_pkt_len(skb))) { + if (unlikely(cl->agg->lmax < qdisc_pkt_len(skb))) { pr_debug("qfq: increasing maxpkt from %u to %u for class %u", - cl->lmax, qdisc_pkt_len(skb), cl->common.classid); - qfq_update_reactivate_class(q, cl, cl->inv_w, - qdisc_pkt_len(skb), 0); + cl->agg->lmax, qdisc_pkt_len(skb), cl->common.classid); + err = qfq_change_agg(sch, cl, cl->agg->class_weight, + qdisc_pkt_len(skb)); + if (err) + return err; } err = qdisc_enqueue(skb, cl->qdisc); @@ -954,35 +1210,50 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch) bstats_update(&cl->bstats, skb); ++sch->q.qlen; - /* If the new skb is not the head of queue, then done here. */ - if (cl->qdisc->q.qlen != 1) + agg = cl->agg; + /* if the queue was not empty, then done here */ + if (cl->qdisc->q.qlen != 1) { + if (unlikely(skb == cl->qdisc->ops->peek(cl->qdisc)) && + list_first_entry(&agg->active, struct qfq_class, alist) + == cl && cl->deficit < qdisc_pkt_len(skb)) + list_move_tail(&cl->alist, &agg->active); + return err; + } + + /* schedule class for service within the aggregate */ + cl->deficit = agg->lmax; + list_add_tail(&cl->alist, &agg->active); + + if (list_first_entry(&agg->active, struct qfq_class, alist) != cl) + return err; /* aggregate was not empty, nothing else to do */ + + /* recharge budget */ + agg->initial_budget = agg->budget = agg->budgetmax; - /* If reach this point, queue q was idle */ - qfq_activate_class(q, cl, qdisc_pkt_len(skb)); + qfq_update_agg_ts(q, agg, enqueue); + if (q->in_serv_agg == NULL) + q->in_serv_agg = agg; + else if (agg != q->in_serv_agg) + qfq_schedule_agg(q, agg); return err; } /* - * Handle class switch from idle to backlogged. + * Schedule aggregate according to its timestamps. */ -static void qfq_activate_class(struct qfq_sched *q, struct qfq_class *cl, - unsigned int pkt_len) +static void qfq_schedule_agg(struct qfq_sched *q, struct qfq_aggregate *agg) { - struct qfq_group *grp = cl->grp; + struct qfq_group *grp = agg->grp; u64 roundedS; int s; - qfq_update_start(q, cl); - - /* compute new finish time and rounded start. */ - cl->F = cl->S + (u64)pkt_len * cl->inv_w; - roundedS = qfq_round_down(cl->S, grp->slot_shift); + roundedS = qfq_round_down(agg->S, grp->slot_shift); /* - * insert cl in the correct bucket. - * If cl->S >= grp->S we don't need to adjust the + * Insert agg in the correct bucket. + * If agg->S >= grp->S we don't need to adjust the * bucket list and simply go to the insertion phase. * Otherwise grp->S is decreasing, we must make room * in the bucket list, and also recompute the group state. @@ -990,10 +1261,10 @@ static void qfq_activate_class(struct qfq_sched *q, struct qfq_class *cl, * was in ER make sure to adjust V. */ if (grp->full_slots) { - if (!qfq_gt(grp->S, cl->S)) + if (!qfq_gt(grp->S, agg->S)) goto skip_update; - /* create a slot for this cl->S */ + /* create a slot for this agg->S */ qfq_slot_rotate(grp, roundedS); /* group was surely ineligible, remove */ __clear_bit(grp->index, &q->bitmaps[IR]); @@ -1008,46 +1279,61 @@ static void qfq_activate_class(struct qfq_sched *q, struct qfq_class *cl, pr_debug("qfq enqueue: new state %d %#lx S %lld F %lld V %lld\n", s, q->bitmaps[s], - (unsigned long long) cl->S, - (unsigned long long) cl->F, + (unsigned long long) agg->S, + (unsigned long long) agg->F, (unsigned long long) q->V); skip_update: - qfq_slot_insert(grp, cl, roundedS); + qfq_slot_insert(grp, agg, roundedS); } +/* Update agg ts and schedule agg for service */ +static void qfq_activate_agg(struct qfq_sched *q, struct qfq_aggregate *agg, + enum update_reason reason) +{ + qfq_update_agg_ts(q, agg, reason); + qfq_schedule_agg(q, agg); +} + static void qfq_slot_remove(struct qfq_sched *q, struct qfq_group *grp, - struct qfq_class *cl) + struct qfq_aggregate *agg) { unsigned int i, offset; u64 roundedS; - roundedS = qfq_round_down(cl->S, grp->slot_shift); + roundedS = qfq_round_down(agg->S, grp->slot_shift); offset = (roundedS - grp->S) >> grp->slot_shift; + i = (grp->front + offset) % QFQ_MAX_SLOTS; - hlist_del(&cl->next); + hlist_del(&agg->next); if (hlist_empty(&grp->slots[i])) __clear_bit(offset, &grp->full_slots); } /* - * called to forcibly destroy a queue. - * If the queue is not in the front bucket, or if it has - * other queues in the front bucket, we can simply remove - * the queue with no other side effects. + * Called to forcibly deschedule an aggregate. If the aggregate is + * not in the front bucket, or if the latter has other aggregates in + * the front bucket, we can simply remove the aggregate with no other + * side effects. * Otherwise we must propagate the event up. */ -static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl) +static void qfq_deactivate_agg(struct qfq_sched *q, struct qfq_aggregate *agg) { - struct qfq_group *grp = cl->grp; + struct qfq_group *grp = agg->grp; unsigned long mask; u64 roundedS; int s; - cl->F = cl->S; - qfq_slot_remove(q, grp, cl); + if (agg == q->in_serv_agg) { + charge_actual_service(agg); + q->in_serv_agg = qfq_choose_next_agg(q); + return; + } + + agg->F = agg->S; + qfq_slot_remove(q, grp, agg); if (!grp->full_slots) { __clear_bit(grp->index, &q->bitmaps[IR]); @@ -1066,8 +1352,8 @@ static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl) } __clear_bit(grp->index, &q->bitmaps[ER]); } else if (hlist_empty(&grp->slots[grp->front])) { - cl = qfq_slot_scan(grp); - roundedS = qfq_round_down(cl->S, grp->slot_shift); + agg = qfq_slot_scan(grp); + roundedS = qfq_round_down(agg->S, grp->slot_shift); if (grp->S != roundedS) { __clear_bit(grp->index, &q->bitmaps[ER]); __clear_bit(grp->index, &q->bitmaps[IR]); @@ -1080,7 +1366,7 @@ static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl) } } - qfq_update_eligible(q, q->V); + qfq_update_eligible(q); } static void qfq_qlen_notify(struct Qdisc *sch, unsigned long arg) @@ -1092,6 +1378,32 @@ static void qfq_qlen_notify(struct Qdisc *sch, unsigned long arg) qfq_deactivate_class(q, cl); } +static unsigned int qfq_drop_from_slot(struct qfq_sched *q, + struct hlist_head *slot) +{ + struct qfq_aggregate *agg; + struct hlist_node *n; + struct qfq_class *cl; + unsigned int len; + + hlist_for_each_entry(agg, n, slot, next) { + list_for_each_entry(cl, &agg->active, alist) { + + if (!cl->qdisc->ops->drop) + continue; + + len = cl->qdisc->ops->drop(cl->qdisc); + if (len > 0) { + if (cl->qdisc->q.qlen == 0) + qfq_deactivate_class(q, cl); + + return len; + } + } + } + return 0; +} + static unsigned int qfq_drop(struct Qdisc *sch) { struct qfq_sched *q = qdisc_priv(sch); @@ -1101,24 +1413,13 @@ static unsigned int qfq_drop(struct Qdisc *sch) for (i = 0; i <= QFQ_MAX_INDEX; i++) { grp = &q->groups[i]; for (j = 0; j < QFQ_MAX_SLOTS; j++) { - struct qfq_class *cl; - struct hlist_node *n; - - hlist_for_each_entry(cl, n, &grp->slots[j], next) { - - if (!cl->qdisc->ops->drop) - continue; - - len = cl->qdisc->ops->drop(cl->qdisc); - if (len > 0) { - sch->q.qlen--; - if (!cl->qdisc->q.qlen) - qfq_deactivate_class(q, cl); - - return len; - } + len = qfq_drop_from_slot(q, &grp->slots[j]); + if (len > 0) { + sch->q.qlen--; + return len; } } + } return 0; @@ -1129,44 +1430,51 @@ static int qfq_init_qdisc(struct Qdisc *sch, struct nlattr *opt) struct qfq_sched *q = qdisc_priv(sch); struct qfq_group *grp; int i, j, err; + u32 max_cl_shift, maxbudg_shift, max_classes; err = qdisc_class_hash_init(&q->clhash); if (err < 0) return err; + if (qdisc_dev(sch)->tx_queue_len + 1 > QFQ_MAX_AGG_CLASSES) + max_classes = QFQ_MAX_AGG_CLASSES; + else + max_classes = qdisc_dev(sch)->tx_queue_len + 1; + /* max_cl_shift = floor(log_2(max_classes)) */ + max_cl_shift = __fls(max_classes); + q->max_agg_classes = 1<min_slot_shift = FRAC_BITS + maxbudg_shift - QFQ_MAX_INDEX; + for (i = 0; i <= QFQ_MAX_INDEX; i++) { grp = &q->groups[i]; grp->index = i; - grp->slot_shift = QFQ_MTU_SHIFT + FRAC_BITS - - (QFQ_MAX_INDEX - i); + grp->slot_shift = q->min_slot_shift + i; for (j = 0; j < QFQ_MAX_SLOTS; j++) INIT_HLIST_HEAD(&grp->slots[j]); } + INIT_HLIST_HEAD(&q->nonfull_aggs); + return 0; } static void qfq_reset_qdisc(struct Qdisc *sch) { struct qfq_sched *q = qdisc_priv(sch); - struct qfq_group *grp; struct qfq_class *cl; - struct hlist_node *n, *tmp; - unsigned int i, j; + struct hlist_node *n; + unsigned int i; - for (i = 0; i <= QFQ_MAX_INDEX; i++) { - grp = &q->groups[i]; - for (j = 0; j < QFQ_MAX_SLOTS; j++) { - hlist_for_each_entry_safe(cl, n, tmp, - &grp->slots[j], next) { + for (i = 0; i < q->clhash.hashsize; i++) { + hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode) { + if (cl->qdisc->q.qlen > 0) qfq_deactivate_class(q, cl); - } - } - } - for (i = 0; i < q->clhash.hashsize; i++) { - hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode) qdisc_reset(cl->qdisc); + } } sch->q.qlen = 0; }