diff mbox

sched: QFQ - quick fair queue scheduler (v2)

Message ID 20110228171738.2cc8c9a0@nehalam
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

stephen hemminger March 1, 2011, 1:17 a.m. UTC
This is an implementation of the Quick Fair Queue scheduler developed
by Fabio Checconi. The same algorithm is already implemented in ipfw
in FreeBSD. Fabio had an earlier version developed on Linux, I just
cleaned it up and tested it. All bugs are mine.

This version is still experimental, do not use for production.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
v2 - Did some debugging, and rebased against net-next tree.

 include/linux/pkt_sched.h |   14 
 net/sched/Kconfig         |   11 
 net/sched/Makefile        |    1 
 net/sched/sch_qfq.c       | 1124 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1150 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet March 1, 2011, 5:28 a.m. UTC | #1
On Monday, February 28, 2011 at 17:17 -0800, Stephen Hemminger wrote:
> This is an implementation of the Quick Fair Queue scheduler developed
> by Fabio Checconi. The same algorithm is already implemented in ipfw
> in FreeBSD. Fabio had an earlier version developed on Linux, I just
> cleaned it up and tested it. All bugs are mine.
> 
> This version is still experimental, do not use for production.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Seems very complex stuff ;)

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>

I'll perform some tests today with QFQ, thanks!



Fabio Checconi March 2, 2011, 2:06 a.m. UTC | #2
Hi,

> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Mon, Feb 28, 2011 05:17:38PM -0800
>
> This is an implementation of the Quick Fair Queue scheduler developed
> by Fabio Checconi. The same algorithm is already implemented in ipfw
> in FreeBSD. Fabio had an earlier version developed on Linux, I just
> cleaned it up and tested it. All bugs are mine.
> 

  thanks for posting, I'm pretty sure that bugs are more likely to be
ours than yours :)

During the development of the algorithm we used a simple traffic
simulator to verify (informally) the timestamping and the guarantees
provided.  I've tested this version of the code with the simulator and
so far it worked fine, so I think that timestamping and guarantees should
be OK.

Please let us know if we can be of any help,
fabio
Eric Dumazet March 2, 2011, 11:17 a.m. UTC | #3
On Wednesday, March 2, 2011 at 03:06 +0100, Fabio Checconi wrote:
> Hi,
> 
> > From: Stephen Hemminger <shemminger@vyatta.com>
> > Date: Mon, Feb 28, 2011 05:17:38PM -0800
> >
> > This is an implementation of the Quick Fair Queue scheduler developed
> > by Fabio Checconi. The same algorithm is already implemented in ipfw
> > in FreeBSD. Fabio had an earlier version developed on Linux, I just
> > cleaned it up and tested it. All bugs are mine.
> > 
> 
>   thanks for posting, I'm pretty sure that bugs are more likely to be
> ours than yours :)
> 
> During the development of the algorithm we used a simple traffic
> simulator to verify (informally) the timestamping and the guarantees
> provided.  I've tested this version of the code with the simulator and
> so far it worked fine, so I think that timestamping and guarantees should
> be OK.
> 
> Please let us know if we can be of any help,
> fabio

Hmm, I tried to set up QFQ (using the iproute2 patch
http://patchwork.ozlabs.org/patch/77902/) but failed.

Do you have a link to a script/sample?

Thanks


Patrick McHardy March 2, 2011, 11:50 a.m. UTC | #4
On 01.03.2011 02:17, Stephen Hemminger wrote:
> This is an implementation of the Quick Fair Queue scheduler developed
> by Fabio Checconi. The same algorithm is already implemented in ipfw
> in FreeBSD. Fabio had an earlier version developed on Linux, I just
> cleaned it up and tested it. All bugs are mine.
> 
> This version is still experimental, do not use for production.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Looks good to me from a qdisc API POV.
Eric Dumazet March 2, 2011, 3:41 p.m. UTC | #5
On Monday, February 28, 2011 at 17:17 -0800, Stephen Hemminger wrote:


> --- a/include/linux/pkt_sched.h	2011-02-28 13:28:57.763177314 -0800
> +++ b/include/linux/pkt_sched.h	2011-02-28 13:29:10.466792117 -0800
> @@ -588,4 +588,18 @@ struct tc_sfb_xstats {
>  
>  #define SFB_MAX_PROB 0xFFFF
>  
> +/* QFQ */
> +enum {

+	TCA_QFQ_UNSPEC,

> +	TCA_QFQ_WEIGHT,
> +	TCA_QFQ_LMAX,
> +	__TCA_QFQ_MAX
> +};
> +
> +#define TCA_QFQ_MAX	(__TCA_QFQ_MAX - 1)
> +

With the TCA_QFQ_UNSPEC bit added (and the change mirrored in
iproute2), it seems to work better.

Oh wait, I had a crash in qfq_reset_qdisc() when re-running my setup
script.



Eric Dumazet March 2, 2011, 3:53 p.m. UTC | #6
On Monday, February 28, 2011 at 17:17 -0800, Stephen Hemminger wrote:
> This is an implementation of the Quick Fair Queue scheduler developed
> by Fabio Checconi. The same algorithm is already implemented in ipfw
> in FreeBSD. Fabio had an earlier version developed on Linux, I just
> cleaned it up and tested it. All bugs are mine.



Here is my crash analysis:

> +static void qfq_reset_qdisc(struct Qdisc *sch)
> +{
> +	struct qfq_sched *q = qdisc_priv(sch);
> +	struct qfq_group *grp;
> +	struct qfq_class *cl, **pp;
> +	struct hlist_node *n;
> +	unsigned int i, j;
> +
> +	for (i = 0; i <= QFQ_MAX_INDEX; i++) {
> +		grp = &q->groups[i];
> +		for (j = 0; j < QFQ_MAX_SLOTS; j++) {
> +			for (pp = &grp->slots[j]; *pp; pp = &(*pp)->next) {
> +				cl = *pp;
> +				if (cl->qdisc->q.qlen)
> +					qfq_deactivate_class(q, cl, pp);

Here, if we deactivated the last class in the chain, *pp is NULL, but

pp = &(*pp)->next then puts a small bogus offset (0x50 on 64-bit
arches) in pp, so we crash ...





> +			}
> +		}
> +	}
> +
> +	for (i = 0; i < q->clhash.hashsize; i++) {
> +		hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode)
> +			qdisc_reset(cl->qdisc);
> +	}
> +	sch->q.qlen = 0;
> +}
> +



stephen hemminger March 2, 2011, 4:11 p.m. UTC | #7
On Wed, 02 Mar 2011 12:17:47 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wednesday, March 2, 2011 at 03:06 +0100, Fabio Checconi wrote:
> > Hi,
> > 
> > > From: Stephen Hemminger <shemminger@vyatta.com>
> > > Date: Mon, Feb 28, 2011 05:17:38PM -0800
> > >
> > > This is an implementation of the Quick Fair Queue scheduler developed
> > > by Fabio Checconi. The same algorithm is already implemented in ipfw
> > > in FreeBSD. Fabio had an earlier version developed on Linux, I just
> > > cleaned it up and tested it. All bugs are mine.
> > > 
> > 
> >   thanks for posting, I'm pretty sure that bugs are more likely to be
> > ours than yours :)
> > 
> > During the development of the algorithm we used a simple traffic
> > simulator to verify (informally) the timestamping and the guarantees
> > provided.  I've tested this version of the code with the simulator and
> > so far it worked fine, so I think that timestamping and guarantees should
> > be OK.
> > 
> > Please let us know if we can be of any help,
> > fabio
> 
> Hmm, I tried to set up QFQ (using the iproute2 patch
> http://patchwork.ozlabs.org/patch/77902/) but failed.
> 
> Do you have a link to a script/sample?
> 
> Thanks

I put the iproute2 code into the repository in the experimental branch.
Eric Dumazet March 2, 2011, 4:18 p.m. UTC | #8
On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:

> I put the iproute2 code into the repository in the experimental branch.
> 

Thanks

It seems that as soon as packets are dropped, the qdisc is frozen (no
more packets are dequeued).

Hmm...



Eric Dumazet March 2, 2011, 5:31 p.m. UTC | #9
On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
> On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
> 
> > I put the iproute2 code into the repository in the experimental branch.
> > 
> 
> Thanks
> 
> It seems as soon as packets are dropped, qdisc is frozen (no more
> packets dequeued)
> 
> Hmm...
> 

It seems class deletes are buggy.

After one "tc class del dev $ETH classid 11:1 ..."

a "tc -s -d qdisc show dev $ETH" triggers an Oops

[  414.517709] general protection fault: 0000 [#1] SMP 
[  414.517894] last sysfs file: /sys/devices/virtual/net/gre34/ifindex
[  414.517956] CPU 3 
[  414.517995] Modules linked in: sch_qfq sch_cbq ip_gre gre dummy ipmi_devintf ipmi_si ipmi_msghandler dm_mod video tg3 libphy sg [last unloaded: ip_tables]
[  414.518663] 
[  414.518717] Pid: 4692, comm: tc Not tainted 2.6.38-rc5-02726-gd486b8c-dirty #554 HP ProLiant BL460c G6
[  414.518905] RIP: 0010:[<ffffffff8145118e>]  [<ffffffff8145118e>] tc_fill_qdisc+0xbe/0x300
[  414.519025] RSP: 0018:ffff88011a123878  EFLAGS: 00010283
[  414.519085] RAX: a00e6660ffffffff RBX: 0000000000000002 RCX: 0000000000000024
[  414.519149] RDX: ffff880078683334 RSI: 0000000000000000 RDI: ffff88007867b400
[  414.519212] RBP: ffff88011a123928 R08: ffff880078683000 R09: 0000000000000002
[  414.519276] R10: 00000000000001f0 R11: 00000000000001e0 R12: ffff88007867b400
[  414.519340] R13: ffff88011cb1779c R14: ffff880078683324 R15: 0000000000001254
[  414.519405] FS:  0000000000000000(0000) GS:ffff88007fc40000(0063) knlGS:00000000f77778e0
[  414.519486] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  414.519553] CR2: 00000000005ffc80 CR3: 0000000078087000 CR4: 00000000000006e0
[  414.519616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  414.519680] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  414.519744] Process tc (pid: 4692, threadinfo ffff88011a122000, task ffff88011a1715d0)
[  414.519823] Stack:
[  414.519876]  ffff88011bbfb570 000000004d6e7c8e ffff880078683324 0000000000000000
[  414.520095]  ffff88011cb1789c ffff88007867b400 ffff8800786832bc 0000000400000003
[  414.520312]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  414.520531] Call Trace:
[  414.520602]  [<ffffffff81451469>] tc_dump_qdisc_root+0x99/0x100
[  414.520666]  [<ffffffff81452680>] tc_dump_qdisc+0x80/0x100
[  414.520729]  [<ffffffff8146a50b>] netlink_dump+0x6b/0x1e0
[  414.520791]  [<ffffffff810f5d16>] ? kmem_cache_alloc_trace+0xb6/0x100
[  414.520855]  [<ffffffff8146d2ef>] netlink_dump_start+0x16f/0x190
[  414.520918]  [<ffffffff81452600>] ? tc_dump_qdisc+0x0/0x100
[  414.520981]  [<ffffffff81443b36>] rtnetlink_rcv_msg+0xb6/0x270
[  414.521043]  [<ffffffff81443a80>] ? rtnetlink_rcv_msg+0x0/0x270
[  414.521106]  [<ffffffff8146b609>] netlink_rcv_skb+0x99/0xc0
[  414.521167]  [<ffffffff81443a65>] rtnetlink_rcv+0x25/0x40
[  414.521229]  [<ffffffff8146b379>] netlink_unicast+0x2a9/0x2b0
[  414.521292]  [<ffffffff8142c3e3>] ? memcpy_fromiovec+0x63/0x80
[  414.521354]  [<ffffffff8146c30d>] netlink_sendmsg+0x24d/0x390
[  414.521418]  [<ffffffff81421510>] sock_sendmsg+0xc0/0xf0
[  414.521482]  [<ffffffff810b699a>] ? unlock_page+0x2a/0x40
[  414.521545]  [<ffffffff81421262>] ? move_addr_to_kernel+0x62/0x70
[  414.521608]  [<ffffffff8144d59f>] ? verify_compat_iovec+0x8f/0x100
[  414.521685]  [<ffffffff814222b0>] sys_sendmsg+0x180/0x300
[  414.521747]  [<ffffffff810d682c>] ? __pte_alloc+0xdc/0x100
[  414.521809]  [<ffffffff810d6992>] ? handle_mm_fault+0x142/0x1c0
[  414.521872]  [<ffffffff81569594>] ? do_page_fault+0x274/0x490
[  414.521935]  [<ffffffff81421f41>] ? sys_getsockname+0xa1/0xb0
[  414.521997]  [<ffffffff81421909>] ? sys_recvmsg+0x49/0x80
[  414.522059]  [<ffffffff8144d164>] compat_sys_sendmsg+0x14/0x20
[  414.522121]  [<ffffffff8144e08d>] compat_sys_socketcall+0x19d/0x1f0
[  414.522185]  [<ffffffff8102d640>] sysenter_dispatch+0x7/0x2e
[  414.522246] Code: 06 49 8d 56 10 45 89 7e 0c 66 41 89 46 04 8b 85 58 ff ff ff 41 c6 46 10 00 41 89 46 08 c6 42 01 00 66 c7 42 02 00 00 49 8b 45 68 <48> 8b 00 8b 80 c0 00 00 00 89 42 04 8b 85 5c ff ff ff 89 42 0c 
[  414.524636] RIP  [<ffffffff8145118e>] tc_fill_qdisc+0xbe/0x300
[  414.524734]  RSP <ffff88011a123878>
[  414.524799] ---[ end trace 513a4307e5c34d00 ]---


Patrick McHardy March 2, 2011, 6:16 p.m. UTC | #10
On 02.03.2011 18:31, Eric Dumazet wrote:
> On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
>> On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
>>
>>> I put the iproute2 code into the repository in the experimental branch.
>>>
>>
>> Thanks
>>
>> It seems as soon as packets are dropped, qdisc is frozen (no more
>> packets dequeued)
>>
>> Hmm...
>>
> 
> It seems class deletes are buggy.
> 
> After one "tc class del dev $ETH classid 11:1 ..."
> 
> a "tc -s -d qdisc show dev $ETH" triggers an Oops

What classes and filters did you have before issuing that command?
stephen hemminger March 2, 2011, 11:55 p.m. UTC | #11
On Wed, 02 Mar 2011 18:31:27 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
> > On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
> > 
> > > I put the iproute2 code into the repository in the experimental branch.
> > > 
> > 
> > Thanks
> > 
> > It seems as soon as packets are dropped, qdisc is frozen (no more
> > packets dequeued)
> > 
> > Hmm...
> > 
> 
> It seems class deletes are buggy.
> 
> After one "tc class del dev $ETH classid 11:1 ..."
> 
> a "tc -s -d qdisc show dev $ETH" triggers an Oops
> 
> [  414.517709] general protection fault: 0000 [#1] SMP 
> [  414.517894] last sysfs file: /sys/devices/virtual/net/gre34/ifindex
> [  414.517956] CPU 3 
> [  414.517995] Modules linked in: sch_qfq sch_cbq ip_gre gre dummy ipmi_devintf ipmi_si ipmi_msghandler dm_mod video tg3 libphy sg [last unloaded: ip_tables]
> [  414.518663] 
> [  414.518717] Pid: 4692, comm: tc Not tainted 2.6.38-rc5-02726-gd486b8c-dirty #554 HP ProLiant BL460c G6
> [  414.518905] RIP: 0010:[<ffffffff8145118e>]  [<ffffffff8145118e>] tc_fill_qdisc+0xbe/0x300
> [  414.519025] RSP: 0018:ffff88011a123878  EFLAGS: 00010283
> [  414.519085] RAX: a00e6660ffffffff RBX: 0000000000000002 RCX: 0000000000000024
> [  414.519149] RDX: ffff880078683334 RSI: 0000000000000000 RDI: ffff88007867b400
> [  414.519212] RBP: ffff88011a123928 R08: ffff880078683000 R09: 0000000000000002
> [  414.519276] R10: 00000000000001f0 R11: 00000000000001e0 R12: ffff88007867b400
> [  414.519340] R13: ffff88011cb1779c R14: ffff880078683324 R15: 0000000000001254
> [  414.519405] FS:  0000000000000000(0000) GS:ffff88007fc40000(0063) knlGS:00000000f77778e0
> [  414.519486] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> [  414.519553] CR2: 00000000005ffc80 CR3: 0000000078087000 CR4: 00000000000006e0
> [  414.519616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  414.519680] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  414.519744] Process tc (pid: 4692, threadinfo ffff88011a122000, task ffff88011a1715d0)
> [  414.519823] Stack:
> [  414.519876]  ffff88011bbfb570 000000004d6e7c8e ffff880078683324 0000000000000000
> [  414.520095]  ffff88011cb1789c ffff88007867b400 ffff8800786832bc 0000000400000003
> [  414.520312]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  414.520531] Call Trace:
> [  414.520602]  [<ffffffff81451469>] tc_dump_qdisc_root+0x99/0x100
> [  414.520666]  [<ffffffff81452680>] tc_dump_qdisc+0x80/0x100
> [  414.520729]  [<ffffffff8146a50b>] netlink_dump+0x6b/0x1e0
> [  414.520791]  [<ffffffff810f5d16>] ? kmem_cache_alloc_trace+0xb6/0x100
> [  414.520855]  [<ffffffff8146d2ef>] netlink_dump_start+0x16f/0x190
> [  414.520918]  [<ffffffff81452600>] ? tc_dump_qdisc+0x0/0x100
> [  414.520981]  [<ffffffff81443b36>] rtnetlink_rcv_msg+0xb6/0x270
> [  414.521043]  [<ffffffff81443a80>] ? rtnetlink_rcv_msg+0x0/0x270
> [  414.521106]  [<ffffffff8146b609>] netlink_rcv_skb+0x99/0xc0
> [  414.521167]  [<ffffffff81443a65>] rtnetlink_rcv+0x25/0x40
> [  414.521229]  [<ffffffff8146b379>] netlink_unicast+0x2a9/0x2b0
> [  414.521292]  [<ffffffff8142c3e3>] ? memcpy_fromiovec+0x63/0x80
> [  414.521354]  [<ffffffff8146c30d>] netlink_sendmsg+0x24d/0x390
> [  414.521418]  [<ffffffff81421510>] sock_sendmsg+0xc0/0xf0
> [  414.521482]  [<ffffffff810b699a>] ? unlock_page+0x2a/0x40
> [  414.521545]  [<ffffffff81421262>] ? move_addr_to_kernel+0x62/0x70
> [  414.521608]  [<ffffffff8144d59f>] ? verify_compat_iovec+0x8f/0x100
> [  414.521685]  [<ffffffff814222b0>] sys_sendmsg+0x180/0x300
> [  414.521747]  [<ffffffff810d682c>] ? __pte_alloc+0xdc/0x100
> [  414.521809]  [<ffffffff810d6992>] ? handle_mm_fault+0x142/0x1c0
> [  414.521872]  [<ffffffff81569594>] ? do_page_fault+0x274/0x490
> [  414.521935]  [<ffffffff81421f41>] ? sys_getsockname+0xa1/0xb0
> [  414.521997]  [<ffffffff81421909>] ? sys_recvmsg+0x49/0x80
> [  414.522059]  [<ffffffff8144d164>] compat_sys_sendmsg+0x14/0x20
> [  414.522121]  [<ffffffff8144e08d>] compat_sys_socketcall+0x19d/0x1f0
> [  414.522185]  [<ffffffff8102d640>] sysenter_dispatch+0x7/0x2e
> [  414.522246] Code: 06 49 8d 56 10 45 89 7e 0c 66 41 89 46 04 8b 85 58 ff ff ff 41 c6 46 10 00 41 89 46 08 c6 42 01 00 66 c7 42 02 00 00 49 8b 45 68 <48> 8b 00 8b 80 c0 00 00 00 89 42 04 8b 85 5c ff ff ff 89 42 0c 
> [  414.524636] RIP  [<ffffffff8145118e>] tc_fill_qdisc+0xbe/0x300
> [  414.524734]  RSP <ffff88011a123878>
> [  414.524799] ---[ end trace 513a4307e5c34d00 ]---
> 
> 

Most of the class code was copied from DRR. Does it have the
same problem?
Fabio Checconi March 3, 2011, 8:19 a.m. UTC | #12
Hi,

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, Mar 02, 2011 06:31:27PM +0100
>
> On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
> > On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
> > 
> > > I put the iproute2 code into the repository in the experimental branch.
> > > 
> > 
> > Thanks
> > 
> > It seems as soon as packets are dropped, qdisc is frozen (no more
> > packets dequeued)

  I've been able to reproduce this problem using netem, and it seems to
be due to this check:

> +	/* If the new skb is not the head of queue, then done here. */
> +	if (skb != qdisc_peek_head(cl->qdisc))
> +		return err;

changing it to:

	if (cl->qdisc->q.qlen != 1)
		return err;

seems to make things work.  I think this is because qdisc_peek_head()
looks at the wrong list, as the skb is not queued directly in q, but
ends up in the child qdisc attached to cl->qdisc.


> > 
> > Hmm...
> > 
> 
> It seems class deletes are buggy.
> 
> After one "tc class del dev $ETH classid 11:1 ..."
> 
> a "tc -s -d qdisc show dev $ETH" triggers an Oops
> 

This seems to be due to:

> +static void qfq_destroy_class(struct Qdisc *sch, struct qfq_class *cl)
> +{
> +	struct qfq_sched *q = (struct qfq_sched *)sch;

it should be:

	struct qfq_sched *q = qdisc_priv(sch);

The same bug you identified in qfq_reset_qdisc() is also present in
qfq_drop(); both loops need to be corrected...

It should also be noted that this scheduler (like HFSC, IIRC) depends
on the child qdisc not reordering packets for the guarantees to be met,
as the timestamps need to be in sync with the length of the packet at
the head of the queue.  If this can't be guaranteed, then to preserve
formal correctness the code should be changed to always use the maximum
packet size when calculating the timestamps.

@Stephen: not that I'm proud of that, but all the bugs found so far are mine...
Eric Dumazet March 3, 2011, 8:27 a.m. UTC | #13
On Thursday, March 3, 2011 at 09:19 +0100, Fabio Checconi wrote:
> Hi,
> 
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, Mar 02, 2011 06:31:27PM +0100
> >
> > On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
> > > On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
> > > 
> > > > I put the iproute2 code into the repository in the experimental branch.
> > > > 
> > > 
> > > Thanks
> > > 
> > > It seems as soon as packets are dropped, qdisc is frozen (no more
> > > packets dequeued)
> 
>   I've been able to reproduce this problem using netem, and it seems to
> be due to this check:
> 
> > +	/* If the new skb is not the head of queue, then done here. */
> > +	if (skb != qdisc_peek_head(cl->qdisc))
> > +		return err;
> 
> changing it to:
> 
> 	if (cl->qdisc->q.qlen != 1)
> 		return err;
> 
> seems to make things work.  I think this is because qdisc_peek_head()
> looks at the wrong list, as the skb is not queued directly in q, but
> ends up in the child qdisc attached to cl->qdisc.
> 

Strange; I used the default qdisc (pfifo) created in the qfq classes.

> 
> > > 
> > > Hmm...
> > > 
> > 
> > It seems class deletes are buggy.
> > 
> > After one "tc class del dev $ETH classid 11:1 ..."
> > 
> > a "tc -s -d qdisc show dev $ETH" triggers an Oops
> > 
> 
> This seems to be due to:
> 
> > +static void qfq_destroy_class(struct Qdisc *sch, struct qfq_class *cl)
> > +{
> > +	struct qfq_sched *q = (struct qfq_sched *)sch;
> 
> it should be:
> 
> 	struct qfq_sched *q = qdisc_priv(sch);
> 
> The same bug you identified in qfq_reset_qdisc() is present also in
> qfq_drop(), both loops need to be corrected...
> 
> It should also be noted that this scheduler (like HFSC, IIRC) depends
> on the child qdisc not to reorder packets for the guarantees to be met,
> as the timestamps need to be in sync with the length of the packet at the
> head of the queue.  If this can't be guaranteed, to preserve the formal
> correctness it should be changed to always use the maximum packet size
> to calculate the timestamps.
> 
> @Stephen: not that I'm proud of that, but all the bugs found so far are mine...

I am going to test an updated version, thanks for all these hints!



Eric Dumazet March 3, 2011, 9:07 a.m. UTC | #14
On Thursday, March 3, 2011 at 09:19 +0100, Fabio Checconi wrote:

> The same bug you identified in qfq_reset_qdisc() is present also in
> qfq_drop(), both loops need to be corrected...

I don't think so, because in qfq_drop() we exit right after the call to
qfq_deactivate_class().

But I agree the code should be the same ;)



Eric Dumazet March 3, 2011, 9:35 a.m. UTC | #15
On Wednesday, March 2, 2011 at 19:16 +0100, Patrick McHardy wrote:
> On 02.03.2011 18:31, Eric Dumazet wrote:
> > On Wednesday, March 2, 2011 at 17:18 +0100, Eric Dumazet wrote:
> >> On Wednesday, March 2, 2011 at 08:11 -0800, Stephen Hemminger wrote:
> >>
> >>> I put the iproute2 code into the repository in the experimental branch.
> >>>
> >>
> >> Thanks
> >>
> >> It seems as soon as packets are dropped, qdisc is frozen (no more
> >> packets dequeued)
> >>
> >> Hmm...
> >>
> > 
> > It seems class deletes are buggy.
> > 
> > After one "tc class del dev $ETH classid 11:1 ..."
> > 
> > a "tc -s -d qdisc show dev $ETH" triggers an Oops
> 
> What classes and filters did you have before issuing that command?

(Fabio found the origin of the problem in qfq_destroy_class())

Here is the script I wrote to use QFQ on the dummy0 device
(I then use a normal udpflood workload to stress the thing).

It's more or less copied from other stuff I had, so it's probably very
stupid ;)

# cat egress_dummy0.sh
ETH=dummy0
RATE="rate 40Mbit"
PRIVNETS="172.16.0.0/12 192.168.0.0/16 10.0.0.0/8"
ALLOT="allot 20000"


ethtool -K $ETH tso off

tc qdisc del dev $ETH root 2>/dev/null


tc qdisc add dev $ETH root handle 1: cbq avpkt 1000 rate 1000Mbit \
	bandwidth 1000Mbit
tc class add dev $ETH parent 1: classid 1:1 \
	est 1sec 8sec cbq allot 10000 mpu 64 \
	rate 1000Mbit prio 1 avpkt 1500 bounded

# output to private nets :  40 Mbit limit
tc class add dev $ETH parent 1:1 classid 1:11 \
	est 1sec 8sec cbq $ALLOT mpu 64      \
	$RATE prio 2 avpkt 1400 bounded

tc qdisc add dev $ETH parent 1:11 handle 11:  \
	est 1sec 8sec qfq

tc filter add dev $ETH protocol ip parent 11: handle 3 \
	flow hash keys rxhash divisor 16

tc class add dev $ETH classid 11:1 est 0.5sec 2sec qfq weight 100 maxpkt 1100
tc class add dev $ETH classid 11:2 est 0.5sec 2sec qfq weight 200 maxpkt 1200
tc class add dev $ETH classid 11:3 est 0.5sec 2sec qfq weight 300 maxpkt 1300
tc class add dev $ETH classid 11:4 est 0.5sec 2sec qfq weight 400 maxpkt 1400
tc class add dev $ETH classid 11:5 est 0.5sec 2sec qfq weight 500 maxpkt 1500
tc class add dev $ETH classid 11:6 est 0.5sec 2sec qfq weight 600 maxpkt 1600
tc class add dev $ETH classid 11:7 est 0.5sec 2sec qfq weight 700 maxpkt 1700
tc class add dev $ETH classid 11:8 est 0.5sec 2sec qfq weight 800 maxpkt 1800
tc class add dev $ETH classid 11:9 est 0.5sec 2sec qfq weight 900 maxpkt 1900
tc class add dev $ETH classid 11:a est 0.5sec 2sec qfq weight 1000 maxpkt 2000
tc class add dev $ETH classid 11:b est 0.5sec 2sec qfq weight 1100 maxpkt 2048
tc class add dev $ETH classid 11:c est 0.5sec 2sec qfq weight 1200 maxpkt 2048
tc class add dev $ETH classid 11:d est 0.5sec 2sec qfq weight 1300 maxpkt 2048
tc class add dev $ETH classid 11:e est 0.5sec 2sec qfq weight 1400 maxpkt 2048
tc class add dev $ETH classid 11:f est 0.5sec 2sec qfq weight 1500 maxpkt 2048
tc class add dev $ETH classid 11:10 est 0.5sec 2sec qfq weight 1600 maxpkt 2048


for privnet in $PRIVNETS
do
	tc filter add dev $ETH parent 1: protocol ip prio 100 u32 \
		match ip dst $privnet flowid 1:11
done

tc filter add dev $ETH parent 1: protocol ip prio 100 u32 \
	match ip protocol 0 0x00 flowid 1:1



Eric Dumazet March 3, 2011, 10:40 a.m. UTC | #16
On Thursday, March 3, 2011 at 09:27 +0100, Eric Dumazet wrote:

> I am going to test an updated version, thanks for all these hints !
> 

Same problem in qfq_qlen_notify().

It should use:

struct qfq_sched *q = qdisc_priv(sch);


After a stress test (and with no more traffic) I still have some
packets in the backlog:

(8 packets here)

# tc -s -d qdisc show dev dummy0
qdisc cbq 1: root refcnt 2 rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
level 2 ewma 5 avpkt 1000b maxidle 0us 
 Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 1180608 requeues 0) 
 backlog 0b 8p requeues 0 
  borrowed 0 overactions 0 avgidle 125 undertime 0
qdisc qfq 11: parent 1:11 
 Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 0 requeues 0) 
 rate 2848bit 2pps backlog 0b 8p requeues 0 
# tc -s -d class show dev dummy0
class cbq 1:11 parent 1:1 leaf 11: rate 40000Kbit cell 128b mpu 64b (bounded) prio 2/2 weight 40000Kbit allot 20000b 
level 0 ewma 5 avpkt 1400b maxidle 0us 
 Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 675712 requeues 0) 
 rate 1516Kbit 947pps backlog 0b 8p requeues 0 
  borrowed 0 overactions 285254 avgidle -22 undertime -24
class cbq 1: root rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
level 2 ewma 5 avpkt 1000b maxidle 0us 
 Sent 109916000 bytes 549580 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  borrowed 0 overactions 0 avgidle 125 undertime 0
class cbq 1:1 parent 1: rate 1000Mbit cell 64b mpu 64b (bounded) prio 1/1 weight 1000Mbit allot 10000b 
level 1 ewma 5 avpkt 1500b maxidle 0us 
 Sent 109916000 bytes 549580 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 1516Kbit 947pps backlog 0b 0p requeues 0 
  borrowed 0 overactions 0 avgidle 125 undertime 0
class qfq 11:10 root weight 1600 maxpkt 2048 
 Sent 200 bytes 1 pkt (dropped 18181, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:1 root weight 100 maxpkt 1100 
 Sent 2731200 bytes 13656 pkt (dropped 40889, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:2 root weight 200 maxpkt 1200 
 Sent 10899800 bytes 54499 pkt (dropped 47, overlimits 0 requeues 0) 
 rate 24bit 0pps backlog 0b 0p requeues 0 
class qfq 11:3 root weight 300 maxpkt 1300 
 Sent 12714400 bytes 63572 pkt (dropped 65, overlimits 0 requeues 0) 
 rate 32bit 0pps backlog 0b 0p requeues 0 
class qfq 11:4 root weight 400 maxpkt 1400 
 Sent 400 bytes 2 pkt (dropped 36362, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:5 root weight 500 maxpkt 1500 
 Sent 400 bytes 2 pkt (dropped 45452, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:6 root weight 600 maxpkt 1600 
 Sent 800 bytes 4 pkt (dropped 27269, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:7 root weight 700 maxpkt 1700 
 Sent 600 bytes 3 pkt (dropped 45452, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:8 root weight 800 maxpkt 1800 
 Sent 1000 bytes 5 pkt (dropped 72722, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:9 root weight 900 maxpkt 1900 
 Sent 1200 bytes 6 pkt (dropped 63631, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 200b 0p requeues 0 
class qfq 11:a root weight 1000 maxpkt 2000 
 Sent 14535200 bytes 72676 pkt (dropped 51, overlimits 0 requeues 0) 
 rate 32bit 0pps backlog 0b 0p requeues 0 
class qfq 11:b root weight 1100 maxpkt 2048 
 Sent 14531200 bytes 72656 pkt (dropped 72, overlimits 0 requeues 0) 
 rate 32bit 0pps backlog 0b 0p requeues 0 
class qfq 11:c root weight 1200 maxpkt 2048 
 Sent 10905800 bytes 54529 pkt (dropped 16, overlimits 0 requeues 0) 
 rate 24bit 0pps backlog 0b 0p requeues 0 
class qfq 11:d root weight 1300 maxpkt 2048 
 Sent 14538200 bytes 72691 pkt (dropped 36, overlimits 0 requeues 0) 
 rate 32bit 0pps backlog 0b 0p requeues 0 
class qfq 11:e root weight 1400 maxpkt 2048 
 Sent 18162200 bytes 90811 pkt (dropped 96, overlimits 0 requeues 0) 
 rate 40bit 0pps backlog 0b 0p requeues 0 
class qfq 11:f root weight 1500 maxpkt 2048 
 Sent 10895000 bytes 54475 pkt (dropped 71, overlimits 0 requeues 0) 
 rate 24bit 0pps backlog 0b 0p requeues 0 




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fabio Checconi March 3, 2011, 3:07 p.m. UTC | #17
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, Mar 03, 2011 11:40:20AM +0100
>
> Le jeudi 03 mars 2011 à 09:27 +0100, Eric Dumazet a écrit :
> 
> > I am going to test an updated version, thanks for all these hints !
> > 
> 
> Same problem in qfq_qlen_notify()
> 
> Should use :
> 
> struct qfq_sched *q = qdisc_priv(sch);
> 

Absolutely right, as you're right about qfq_drop().  I'll be able to take
a look when I'm back from work this evening.

A question on the data below:

> After a stress (and no more traffic) I still have some packets in
> backlog:
> 
> (8 packets here)
> 
...
> class qfq 11:10 root weight 1600 maxpkt 2048 
>  Sent 200 bytes 1 pkt (dropped 18181, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 

Any other reason, apart from a bug in QFQ, why there could be 200b and
0p of backlog?  A packet count out of sync could explain also why you
were experiencing problems with pfifo in presence of drops...

Thanks for your help, fixes, and testing,
fabio
Eric Dumazet March 3, 2011, 3:18 p.m. UTC | #18
Le jeudi 03 mars 2011 à 11:40 +0100, Eric Dumazet a écrit :
> Le jeudi 03 mars 2011 à 09:27 +0100, Eric Dumazet a écrit :
> 
> > I am going to test an updated version, thanks for all these hints !
> > 
> 
> Same problem in qfq_qlen_notify()
> 
> Should use :
> 
> struct qfq_sched *q = qdisc_priv(sch);
> 
> 
> After a stress (and no more traffic) I still have some packets in
> backlog:
> 
> (8 packets here)
> 
> # tc -s -d qdisc show dev dummy0
> qdisc cbq 1: root refcnt 2 rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
> level 2 ewma 5 avpkt 1000b maxidle 0us 
>  Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 1180608 requeues 0) 
>  backlog 0b 8p requeues 0 
>   borrowed 0 overactions 0 avgidle 125 undertime 0
> qdisc qfq 11: parent 1:11 
>  Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 0 requeues 0) 
>  rate 2848bit 2pps backlog 0b 8p requeues 0 
> # tc -s -d class show dev dummy0
> class cbq 1:11 parent 1:1 leaf 11: rate 40000Kbit cell 128b mpu 64b (bounded) prio 2/2 weight 40000Kbit allot 20000b 
> level 0 ewma 5 avpkt 1400b maxidle 0us 
>  Sent 109916000 bytes 549580 pkt (dropped 350412, overlimits 675712 requeues 0) 
>  rate 1516Kbit 947pps backlog 0b 8p requeues 0 
>   borrowed 0 overactions 285254 avgidle -22 undertime -24
> class cbq 1: root rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
> level 2 ewma 5 avpkt 1000b maxidle 0us 
>  Sent 109916000 bytes 549580 pkt (dropped 0, overlimits 0 requeues 0) 
>  backlog 0b 0p requeues 0 
>   borrowed 0 overactions 0 avgidle 125 undertime 0
> class cbq 1:1 parent 1: rate 1000Mbit cell 64b mpu 64b (bounded) prio 1/1 weight 1000Mbit allot 10000b 
> level 1 ewma 5 avpkt 1500b maxidle 0us 
>  Sent 109916000 bytes 549580 pkt (dropped 0, overlimits 0 requeues 0) 
>  rate 1516Kbit 947pps backlog 0b 0p requeues 0 
>   borrowed 0 overactions 0 avgidle 125 undertime 0
> class qfq 11:10 root weight 1600 maxpkt 2048 
>  Sent 200 bytes 1 pkt (dropped 18181, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:1 root weight 100 maxpkt 1100 
>  Sent 2731200 bytes 13656 pkt (dropped 40889, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:2 root weight 200 maxpkt 1200 
>  Sent 10899800 bytes 54499 pkt (dropped 47, overlimits 0 requeues 0) 
>  rate 24bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:3 root weight 300 maxpkt 1300 
>  Sent 12714400 bytes 63572 pkt (dropped 65, overlimits 0 requeues 0) 
>  rate 32bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:4 root weight 400 maxpkt 1400 
>  Sent 400 bytes 2 pkt (dropped 36362, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:5 root weight 500 maxpkt 1500 
>  Sent 400 bytes 2 pkt (dropped 45452, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:6 root weight 600 maxpkt 1600 
>  Sent 800 bytes 4 pkt (dropped 27269, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:7 root weight 700 maxpkt 1700 
>  Sent 600 bytes 3 pkt (dropped 45452, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:8 root weight 800 maxpkt 1800 
>  Sent 1000 bytes 5 pkt (dropped 72722, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:9 root weight 900 maxpkt 1900 
>  Sent 1200 bytes 6 pkt (dropped 63631, overlimits 0 requeues 0) 
>  rate 0bit 0pps backlog 200b 0p requeues 0 
> class qfq 11:a root weight 1000 maxpkt 2000 
>  Sent 14535200 bytes 72676 pkt (dropped 51, overlimits 0 requeues 0) 
>  rate 32bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:b root weight 1100 maxpkt 2048 
>  Sent 14531200 bytes 72656 pkt (dropped 72, overlimits 0 requeues 0) 
>  rate 32bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:c root weight 1200 maxpkt 2048 
>  Sent 10905800 bytes 54529 pkt (dropped 16, overlimits 0 requeues 0) 
>  rate 24bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:d root weight 1300 maxpkt 2048 
>  Sent 14538200 bytes 72691 pkt (dropped 36, overlimits 0 requeues 0) 
>  rate 32bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:e root weight 1400 maxpkt 2048 
>  Sent 18162200 bytes 90811 pkt (dropped 96, overlimits 0 requeues 0) 
>  rate 40bit 0pps backlog 0b 0p requeues 0 
> class qfq 11:f root weight 1500 maxpkt 2048 
>  Sent 10895000 bytes 54475 pkt (dropped 71, overlimits 0 requeues 0) 
>  rate 24bit 0pps backlog 0b 0p requeues 0 
> 
> 
> 

With debugging, I see dequeue() returns early because bitmaps[ER] is
null


	if (!q->bitmaps[ER]) {
		pr_err("qfq_dequeue bitmaps[ER] is null, return NULL\n");
		return NULL;
	}


[ 3319.540984] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.544098] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.547184] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.550272] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.553354] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.556435] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.559533] qfq calc_index: W = 1, L = 2048, I = 19
[ 3319.562627] qfq calc_index: W = 1, L = 2048, I = 19
[ 3322.993412] qfq_enqueue: cl = 110004
[ 3322.993457] qfq enqueue: new state 0 0x80000 S 0 F 214748364800 V 0
[ 3322.993493] qfq dequeue: cl ffff880077d45380 len 200 F 214748364800 now 1638400
[ 3322.993543] qfq slot_scan: grp 19 full 0x0
[ 3322.993575] qfq_enqueue: cl = 110007
[ 3322.993603] qfq enqueue: new state 0 0x80000 S 1638400 F 214750003200 V 1638400
[ 3322.993653] qfq_enqueue: cl = 110006
[ 3322.993684] qfq dequeue: cl ffff880077d45100 len 200 F 214750003200 now 3276800
[ 3322.993735] qfq slot_scan: grp 19 full 0x1
[ 3322.993766] qfq_enqueue: cl = 110003
[ 3322.993797] qfq dequeue: cl ffff88011d10e080 len 200 F 214751641600 now 4915200
[ 3322.993847] qfq slot_scan: grp 19 full 0x1
[ 3322.993877] qfq_enqueue: cl = 110002
[ 3322.993906] qfq dequeue: cl ffff880077d45c00 len 200 F 214753280000 now 6553600
[ 3322.997895] qfq slot_scan: grp 19 full 0x1
[ 3322.997925] qfq_enqueue: cl = 110003
[ 3322.997956] qfq_enqueue: cl = 110004
[ 3322.997989] qfq dequeue: cl ffff880077d45380 len 200 F 429496729600 now 8192000
[ 3322.998040] qfq slot_scan: grp 19 full 0x1
[ 3322.998070] qfq_enqueue: cl = 110006
[ 3322.998099] qfq_enqueue: cl = 110008
[ 3322.998130] qfq_enqueue: cl = 110002
[ 3322.998160] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998192] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998224] qfq_enqueue: cl = 110002
[ 3322.998249] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998276] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998305] qfq_enqueue: cl = 110007
[ 3322.998333] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998363] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998395] qfq_enqueue: cl = 110001
[ 3322.998421] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998449] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998478] qfq_enqueue: cl = 110004
[ 3322.998502] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998530] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998558] qfq_enqueue: cl = 110006
[ 3322.998585] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998616] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998649] qfq_enqueue: cl = 110004
[ 3322.998677] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998708] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998739] qfq_enqueue: cl = 110007
[ 3322.998763] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998790] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998832] qfq_enqueue: cl = 110004
[ 3322.998875] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998907] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998937] qfq_enqueue: cl = 110001
[ 3322.998964] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.998997] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999029] qfq_enqueue: cl = 110003
[ 3322.999057] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999087] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999120] qfq_enqueue: cl = 110007
[ 3322.999147] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999178] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999210] qfq_enqueue: cl = 110001
[ 3322.999234] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999261] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999290] qfq_enqueue: cl = 110005
[ 3322.999316] qfq_dequeue bitmaps[ER] is null, return NULL
[ 3322.999343] qfq_dequeue bitmaps[ER] is null, return NULL

[...]

Only 5 packets sent ...

qdisc cbq 1: root refcnt 2 rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
level 2 ewma 5 avpkt 1000b maxidle 0us 
 Sent 1000 bytes 5 pkt (dropped 0, overlimits 82 requeues 0) 
 backlog 0b 85p requeues 0 
  borrowed 0 overactions 0 avgidle 125 undertime 0
qdisc qfq 11: parent 1:11 
 Sent 1000 bytes 5 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 85p requeues 0 
qdisc pfifo 8011: parent 11:1 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2600b 13p requeues 0 
qdisc pfifo 8012: parent 11:2 limit 30p
 Sent 200 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2400b 12p requeues 0 
qdisc pfifo 8013: parent 11:3 limit 30p
 Sent 200 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2200b 11p requeues 0 
qdisc pfifo 8014: parent 11:4 limit 30p
 Sent 400 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2400b 12p requeues 0 
qdisc pfifo 8015: parent 11:5 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2000b 10p requeues 0 
qdisc pfifo 8016: parent 11:6 limit 30p
 Sent 200 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2000b 10p requeues 0 
qdisc pfifo 8017: parent 11:7 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2400b 12p requeues 0 
qdisc pfifo 8018: parent 11:8 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 1000b 5p requeues 0 


Eric Dumazet March 3, 2011, 3:25 p.m. UTC | #19
Le jeudi 03 mars 2011 à 16:07 +0100, Fabio Checconi a écrit :

> A question on the data below:
> 
> > After a stress (and no more traffic) I still have some packets in
> > backlog:
> > 
> > (8 packets here)
> > 
> ...
> > class qfq 11:10 root weight 1600 maxpkt 2048 
> >  Sent 200 bytes 1 pkt (dropped 18181, overlimits 0 requeues 0) 
> >  rate 0bit 0pps backlog 200b 0p requeues 0 
> 
> Any other reason, apart from a bug in QFQ, why there could be 200b and
> 0p of backlog?  A packet count out of sync could explain also why you
> were experiencing problems with pfifo in presence of drops...

My packets are all 200 bytes (from my packet generator)

qfq_dump_class_stats() probably miss one line :

scl->bstats.qlen = cl->qdisc->q.qlen;



Eric Dumazet March 3, 2011, 3:31 p.m. UTC | #20
Le jeudi 03 mars 2011 à 16:25 +0100, Eric Dumazet a écrit :

> qfq_dump_class_stats() probably miss one line :
> 
> scl->bstats.qlen = cl->qdisc->q.qlen;
> 
> 

Hmm, 

cl->qdisc->qstats.qlen = cl->qdisc->q.qlen;

It should be better ;)



Eric Dumazet March 3, 2011, 3:39 p.m. UTC | #21
Le jeudi 03 mars 2011 à 16:31 +0100, Eric Dumazet a écrit :
> Le jeudi 03 mars 2011 à 16:25 +0100, Eric Dumazet a écrit :
> 
> > qfq_dump_class_stats() probably miss one line :
> > 
> > scl->bstats.qlen = cl->qdisc->q.qlen;
> > 
> > 
> 
> Hmm, 
> 
> cl->qdisc->qstats.qlen = cl->qdisc->q.qlen;
> 
> It should be better ;)
> 
> 

Indeed, 'packet count' is now correct in qfq classes

# tc -s -d qdisc show dev dummy0
qdisc cbq 1: root refcnt 2 rate 1000Mbit cell 8b (bounded,isolated) prio no-transmit/8 weight 1000Mbit allot 1514b 
level 2 ewma 5 avpkt 1000b maxidle 0us 
 Sent 800 bytes 4 pkt (dropped 0, overlimits 83 requeues 0) 
 backlog 0b 86p requeues 0 
  borrowed 0 overactions 0 avgidle 125 undertime 0
qdisc qfq 11: parent 1:11 
 Sent 800 bytes 4 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 360bit 0pps backlog 0b 86p requeues 0 
qdisc pfifo 8001: parent 11:1 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2400b 12p requeues 0 
qdisc pfifo 8002: parent 11:2 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2000b 10p requeues 0 
qdisc pfifo 8003: parent 11:3 limit 30p
 Sent 200 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 3400b 17p requeues 0 
qdisc pfifo 8004: parent 11:4 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 1200b 6p requeues 0 
qdisc pfifo 8005: parent 11:5 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 2400b 12p requeues 0 
qdisc pfifo 8006: parent 11:6 limit 30p
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 1400b 7p requeues 0 
qdisc pfifo 8007: parent 11:7 limit 30p
 Sent 400 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 1400b 7p requeues 0 
qdisc pfifo 8008: parent 11:8 limit 30p
 Sent 200 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 3000b 15p requeues 0 


Eric Dumazet March 3, 2011, 4:03 p.m. UTC | #22
Le lundi 28 février 2011 à 17:17 -0800, Stephen Hemminger a écrit :


> +static int qfq_dump_class_stats(struct Qdisc *sch, unsigned long arg,
> +				struct gnet_dump *d)
> +{
> +	struct qfq_class *cl = (struct qfq_class *)arg;
> +	struct tc_qfq_stats xstats;
> +
> +	memset(&xstats, 0, sizeof(xstats));
> +
> +	xstats.weight = ONE_FP/cl->inv_w;
> +	xstats.lmax = cl->lmax;
> +

cl->qdisc->qstats.qlen = cl->qdisc->q.qlen;

> +	if (gnet_stats_copy_basic(d, &cl->bstats) < 0 ||
> +	    gnet_stats_copy_rate_est(d, NULL, &cl->rate_est) < 0 ||
> +	    gnet_stats_copy_queue(d, &cl->qdisc->qstats) < 0)
> +		return -1;
> +
> +	return gnet_stats_copy_app(d, &xstats, sizeof(xstats));
> +}
> +

using 
	gnet_stats_copy_rate_est(d, &cl->bstats, &cl->rate_est) < 0 

is better: we avoid dumping a null rate estimate if the admin never
asked for rate estimation...



Eric Dumazet March 23, 2011, 6:45 a.m. UTC | #23
Le dimanche 13 mars 2011 à 17:44 +0100, Eric Dumazet a écrit :
> Le dimanche 13 mars 2011 à 12:34 +0100, Eric Dumazet a écrit :
> > Le mercredi 09 mars 2011 à 11:02 -0800, Stephen Hemminger a écrit :
> > > This is an implementation of the Quick Fair Queue scheduler developed
> > > by Fabio Checconi. The same algorithm is already implemented in ipfw
> > > in FreeBSD. Fabio had an earlier version developed on Linux, I just
> > > cleaned it up.  Thanks to Eric Dumazet for doing the testing and
> > > finding bugs.
> > > 
> > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > 
> > Hi Stephen
> > 
> > My stress tests run fine on this version, please send it to David and
> > netdev and feel free to add my :
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > 
> 
> Ah well, no... sorry
> 
> After a while, I restarted some udpflood tests and no packet was going
> through my dummy0 device. So I wanted to restart my qfq_setup script and
> it froze :
> 
> root      9265 97.0  0.0  2056  440 pts/1    R+   17:41   0:29 tc qdisc del dev dummy0 root
> root      9268  0.0  0.0  2452  760 pts/0    R+   17:42   0:00 ps auxw
> 
> a bit later :
> 
> root      9265 99.3  0.0  2056  440 pts/1    R+   17:41   2:43 tc qdisc del dev dummy0 root
> 
> 

Hi guys

QFQ scheduler update :

While polishing QFQ scheduler, and tracking a bug in it, I finally
replaced in my tc scripts "QFQ experimental" by "SFQ rock solid" and
found I could have a hang in some situations :(

I made a bisection and found :

# git bisect good
7a6362800cb7d1d618a697a650c7aaed3eb39320 is the first bad commit

It seems I am stuck...

# git bisect log
git bisect start
# bad: [c360d5b53a7fec44ae4402e1f13fc888f57ddc3b] Merge branch 'master'
of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
git bisect bad c360d5b53a7fec44ae4402e1f13fc888f57ddc3b
# good: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect good 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# good: [2ec8c6bb5d8f3a62a79f463525054bae1e3d4487] Merge branch 'master'
of /home/davem/src/GIT/linux-2.6/
git bisect good 2ec8c6bb5d8f3a62a79f463525054bae1e3d4487
# good: [e0e170bd7ded2ec16e2813d63c0faff43193fde8] Merge branch 'next'
of git://git.monstr.eu/linux-2.6-microblaze
git bisect good e0e170bd7ded2ec16e2813d63c0faff43193fde8
# good: [40c73abbb37e399eba274fe49e520ffa3dd65bdb] Merge branch
'for_linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
git bisect good 40c73abbb37e399eba274fe49e520ffa3dd65bdb
# good: [4b66fef9b591b95f447aea12242a1133deb0bd22] mcast: net_device dev
not used
git bisect good 4b66fef9b591b95f447aea12242a1133deb0bd22
# bad: [7a6362800cb7d1d618a697a650c7aaed3eb39320] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git bisect bad 7a6362800cb7d1d618a697a650c7aaed3eb39320
# good: [971f115a50afbe409825c9f3399d5a3b9aca4381] Merge branch
'usb-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
git bisect good 971f115a50afbe409825c9f3399d5a3b9aca4381
# good: [5917def58ab9f5848f2d1da835a33a490d0c8c69] staging/easycap:
reduce code nesting in easycap_sound.c
git bisect good 5917def58ab9f5848f2d1da835a33a490d0c8c69
# good: [6445ced8670f37cfc2c5e24a9de9b413dbfc788d] Merge branch
'staging-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
git bisect good 6445ced8670f37cfc2c5e24a9de9b413dbfc788d
# good: [da91981bee8de20bcd06ee0dbddd53d62d23b1bd] ipv4: Use flowi4 in
ipmr code.
git bisect good da91981bee8de20bcd06ee0dbddd53d62d23b1bd
# good: [106af2c99a5249b809aaed45b8353ac087821f4a] Merge branch 'master'
of
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6
into for-davem
git bisect good 106af2c99a5249b809aaed45b8353ac087821f4a
# good: [638be344593b66ccca6802c6076a5b3d9200829d] Phonet: fix
aligned-mode pipe socket buffer header reserve
git bisect good 638be344593b66ccca6802c6076a5b3d9200829d
# good: [c337ffb68e1e71bad069b14d2246fa1e0c31699c] Merge branch 'master'
of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
git bisect good c337ffb68e1e71bad069b14d2246fa1e0c31699c
# good: [ee0caa79569a9c44febc18480beef4847aa8cecd] Merge branch 'master'
of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
git bisect good ee0caa79569a9c44febc18480beef4847aa8cecd
# good: [0bd80dad57d82676ee484fb1f9aa4c5e8b5bc469] net: get rid of
multiple bond-related netdevice->priv_flags
git bisect good 0bd80dad57d82676ee484fb1f9aa4c5e8b5bc469
# good: [8a4eb5734e8d1dc60a8c28576bbbdfdcc643626d] net: introduce
rx_handler results and logic around that
git bisect good 8a4eb5734e8d1dc60a8c28576bbbdfdcc643626d
# good: [ceda86a108671294052cbf51660097b6534672f5] bonding: enable
netpoll without checking link status
git bisect good ceda86a108671294052cbf51660097b6534672f5

Any idea how we can find the problem ?

Script to reproduce the problem :


modprobe dummy

ifconfig dummy0 10.2.2.254 netmask 255.255.255.0 up

for i in `seq 1 240`
do
 arp -H ether -i dummy0 -s 10.2.2.$i 00:00:0c:07:ac:$(printf %02x $i)
done


DEV=dummy0
RATE="rate 40Mbit"
TNETS="10.2.2.0/25"
ALLOT="allot 20000"


tc qdisc del dev dummy0 root 2>/dev/null


tc qdisc add dev $DEV root handle 1: est 1sec 8sec cbq avpkt 1000 rate 100Mbit \
	bandwidth 100Mbit
tc class add dev $DEV parent 1: classid 1:1 \
	est 1sec 8sec cbq allot 10000 mpu 64 \
	rate 100Mbit prio 1 avpkt 1500 bounded

# output to test nets :  40 Mbit limit
tc class add dev $DEV parent 1:1 classid 1:11 \
	est 1sec 8sec cbq $ALLOT mpu 64      \
	$RATE prio 2 avpkt 1400 bounded

tc qdisc add dev $DEV parent 1:11 handle 11:  \
	est 1sec 8sec sfq


for privnet in $TNETS
do
	tc filter add dev $DEV parent 1: protocol ip prio 100 u32 \
		match ip dst $privnet flowid 1:11
done

tc filter add dev $DEV parent 1: protocol ip prio 100 u32 \
	match ip protocol 0 0x00 flowid 1:1


iperf -u -c 10.2.2.1 -P 32 -l 50
iperf -u -c 10.2.2.1 -P 32 -l 50
iperf -u -c 10.2.2.1 -P 32 -l 50
tc -s -d qdisc show dev dummy0


diff mbox

Patch

--- a/include/linux/pkt_sched.h	2011-02-28 13:28:57.763177314 -0800
+++ b/include/linux/pkt_sched.h	2011-02-28 13:29:10.466792117 -0800
@@ -588,4 +588,18 @@  struct tc_sfb_xstats {
 
 #define SFB_MAX_PROB 0xFFFF
 
+/* QFQ */
+enum {
+	TCA_QFQ_WEIGHT,
+	TCA_QFQ_LMAX,
+	__TCA_QFQ_MAX
+};
+
+#define TCA_QFQ_MAX	(__TCA_QFQ_MAX - 1)
+
+struct tc_qfq_stats {
+	__u32 weight;
+	__u32 lmax;
+};
+
 #endif
--- a/net/sched/Kconfig	2011-02-28 13:28:57.771177070 -0800
+++ b/net/sched/Kconfig	2011-02-28 17:13:01.119718271 -0800
@@ -239,6 +239,17 @@  config NET_SCH_CHOKE
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_choke.
 
+config NET_SCH_QFQ
+	tristate "Quick Fair Queueing scheduler (QFQ)"
+	help
+	  Say Y here if you want to use the Quick Fair Queueing Scheduler (QFQ)
+	  packet scheduling algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_qfq.
+
+	  If unsure, say N.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-02-28 13:28:57.783176704 -0800
+++ b/net/sched/Makefile	2011-02-28 13:29:10.470791996 -0800
@@ -35,6 +35,7 @@  obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
 obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
+obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_qfq.c	2011-02-28 17:10:48.961445894 -0800
@@ -0,0 +1,1124 @@ 
+/*
+ * net/sched/sch_qfq.c         Quick Fair Queueing Scheduler.
+ *
+ * Copyright (c) 2009 Fabio Checconi, Luigi Rizzo, and Paolo Valente.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+
+/*  Quick Fair Queueing
+    ===================
+
+    Sources:
+    Fabio Checconi, Paolo Valente, and Luigi Rizzo (Scuola Superiore
+    S. Anna), "QFQ: Efficient Packet Scheduling
+    with Tight Bandwidth Distribution Guarantees"
+
+    See also:
+    http://retis.sssup.it/~fabio/linux/qfq/
+ */
+
+/*
+
+  Virtual time computations.
+
+  S, F and V are all computed in fixed point arithmetic with
+  FRAC_BITS decimal bits.
+
+  QFQ_MAX_INDEX is the maximum index allowed for a group. We need
+	one bit per index.
+  QFQ_MAX_WSHIFT is the maximum power of two supported as a weight.
+
+  The layout of the bits is as below:
+
+                   [ MTU_SHIFT ][      FRAC_BITS    ]
+                   [ MAX_INDEX    ][ MIN_SLOT_SHIFT ]
+				 ^.__grp->index = 0
+				 *.__grp->slot_shift
+
+  where MIN_SLOT_SHIFT is derived by difference from the others.
+
+  The max group index corresponds to Lmax/w_min, where
+  Lmax=1<<MTU_SHIFT, w_min = 1 .
+  From this, and knowing how many groups (MAX_INDEX) we want,
+  we can derive the shift corresponding to each group.
+
+  Because we often need to compute
+	F = S + len/w_i  and V = V + len/wsum
+  instead of storing w_i store the value
+	inv_w = (1<<FRAC_BITS)/w_i
+  so we can do F = S + len * inv_w * wsum.
+  We use W_TOT in the formulas so we can easily move between
+  static and adaptive weight sum.
+
+  The per-scheduler-instance data contain all the data structures
+  for the scheduler: bitmaps and bucket lists.
+
+ */
+
+/*
+ * Maximum number of consecutive slots occupied by backlogged classes
+ * inside a group.
+ */
+#define QFQ_MAX_SLOTS	32
+
+/*
+ * Shifts used for class<->group mapping.  We allow class weights that are
+ * in the range [1, 2^MAX_WSHIFT], and we try to map each class i to the
+ * group with the smallest index that can support the L_i / r_i configured
+ * for the class.
+ *
+ * grp->index is the index of the group; and grp->slot_shift
+ * is the shift for the corresponding (scaled) sigma_i.
+ */
+#define QFQ_MAX_INDEX		19
+#define QFQ_MAX_WSHIFT		16
+
+#define	QFQ_MAX_WEIGHT		(1<<QFQ_MAX_WSHIFT)
+#define QFQ_MAX_WSUM		(2*QFQ_MAX_WEIGHT)
+
+#define FRAC_BITS		30	/* fixed point arithmetic */
+#define ONE_FP			(1UL << FRAC_BITS)
+#define IWSUM			(ONE_FP/QFQ_MAX_WSUM)
+
+#define QFQ_MTU_SHIFT		11
+#define QFQ_MIN_SLOT_SHIFT	(FRAC_BITS + QFQ_MTU_SHIFT - QFQ_MAX_INDEX)
+
+/*
+ * Possible group states.  These values are used as indexes for the bitmaps
+ * array of struct qfq_queue.
+ */
+enum qfq_state { ER, IR, EB, IB, QFQ_MAX_STATE };
+
+struct qfq_group;
+
+struct qfq_class {
+	struct Qdisc_class_common common;
+
+	unsigned int refcnt;
+	unsigned int filter_cnt;
+
+	struct gnet_stats_basic_packed bstats;
+	struct gnet_stats_queue qstats;
+	struct gnet_stats_rate_est rate_est;
+	struct Qdisc *qdisc;
+
+	struct qfq_class *next;	/* Link for the slot list. */
+	u64 S, F;		/* flow timestamps (exact) */
+
+	/* group we belong to. In principle we would need the index,
+	 * which is log_2(lmax/weight), but we never reference it
+	 * directly, only the group.
+	 */
+	struct qfq_group *grp;
+
+	/* these are copied from the flowset. */
+	u32	inv_w;		/* ONE_FP/weight */
+	u32	lmax;		/* Max packet size for this flow. */
+};
+
+struct qfq_group {
+	u64 S, F;			/* group timestamps (approx). */
+	unsigned int slot_shift;	/* Slot shift. */
+	unsigned int index;		/* Group index. */
+	unsigned int front;		/* Index of the front slot. */
+	unsigned long full_slots;	/* non-empty slots */
+
+	/* Array of RR lists of active classes. */
+	struct qfq_class *slots[QFQ_MAX_SLOTS];
+};
+
+struct qfq_sched {
+	struct tcf_proto *filter_list;
+	struct Qdisc_class_hash clhash;
+
+	u64		V;		/* Precise virtual time. */
+	u32		wsum;		/* weight sum */
+
+	unsigned long bitmaps[QFQ_MAX_STATE];	    /* Group bitmaps. */
+	struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
+};
+
+static struct qfq_class *qfq_find_class(struct Qdisc *sch, u32 classid)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (clc == NULL)
+		return NULL;
+	return container_of(clc, struct qfq_class, common);
+}
+
+static void qfq_purge_queue(struct qfq_class *cl)
+{
+	unsigned int len = cl->qdisc->q.qlen;
+
+	qdisc_reset(cl->qdisc);
+	qdisc_tree_decrease_qlen(cl->qdisc, len);
+}
+
+static const struct nla_policy qfq_policy[TCA_QFQ_MAX + 1] = {
+	[TCA_QFQ_WEIGHT] = { .type = NLA_U32 },
+	[TCA_QFQ_LMAX] = { .type = NLA_U32 },
+};
+
+/*
+ * Calculate a flow index, given its weight and maximum packet length.
+ * index = log_2(maxlen/weight) but we need to apply the scaling.
+ * This is used only once at flow creation.
+ */
+static int qfq_calc_index(u32 inv_w, unsigned int maxlen)
+{
+	u64 slot_size = (u64)maxlen * inv_w;
+	unsigned long size_map;
+	int index = 0;
+
+	size_map = slot_size >> QFQ_MIN_SLOT_SHIFT;
+	if (!size_map)
+		goto out;
+
+	index = __fls(size_map) + 1;	/* basically a log_2 */
+	index -= !(slot_size - (1ULL << (index + QFQ_MIN_SLOT_SHIFT - 1)));
+
+	if (index < 0)
+		index = 0;
+out:
+	pr_debug("qfq calc_index: W = %lu, L = %u, I = %d\n",
+		 (unsigned long) ONE_FP/inv_w, maxlen, index);
+
+	return index;
+}
+
+static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+			    struct nlattr **tca, unsigned long *arg)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl = (struct qfq_class *)*arg;
+	struct nlattr *tb[TCA_QFQ_MAX + 1];
+	u32 weight, lmax, inv_w;
+	int i, err;
+
+	if (tca[TCA_OPTIONS] == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_QFQ_MAX, tca[TCA_OPTIONS], qfq_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_QFQ_WEIGHT]) {
+		weight = nla_get_u32(tb[TCA_QFQ_WEIGHT]);
+		if (!weight || weight > (1UL << QFQ_MAX_WSHIFT)) {
+			pr_notice("qfq: invalid weight %u\n", weight);
+			return -EINVAL;
+		}
+	} else
+		weight = 1;
+
+	inv_w = ONE_FP / weight;
+	weight = ONE_FP / inv_w;
+	if (q->wsum + weight > QFQ_MAX_WSUM) {
+		pr_notice("qfq: total weight out of range (%u + %u)\n",
+			  weight, q->wsum);
+		return -EINVAL;
+	}
+
+	if (tb[TCA_QFQ_LMAX]) {
+		lmax = nla_get_u32(tb[TCA_QFQ_LMAX]);
+		if (!lmax || lmax > (1UL << QFQ_MTU_SHIFT)) {
+			pr_notice("qfq: invalid max length %u\n", lmax);
+			return -EINVAL;
+		}
+	} else
+		lmax = 1UL << QFQ_MTU_SHIFT;
+
+	if (cl != NULL) {
+		if (tca[TCA_RATE]) {
+			err = gen_replace_estimator(&cl->bstats, &cl->rate_est,
+						    qdisc_root_sleeping_lock(sch),
+						    tca[TCA_RATE]);
+			if (err)
+				return err;
+		}
+
+		sch_tree_lock(sch);
+		if (tb[TCA_QFQ_WEIGHT]) {
+			q->wsum = weight - ONE_FP / cl->inv_w;
+			cl->inv_w = inv_w;
+		}
+		sch_tree_unlock(sch);
+
+		return 0;
+	}
+
+	cl = kzalloc(sizeof(struct qfq_class), GFP_KERNEL);
+	if (cl == NULL)
+		return -ENOBUFS;
+
+	cl->refcnt = 1;
+	cl->common.classid = classid;
+	cl->lmax = lmax;
+	cl->inv_w = inv_w;
+	i = qfq_calc_index(cl->inv_w, cl->lmax);
+
+	cl->grp = &q->groups[i];
+	q->wsum += weight;
+
+	cl->qdisc = qdisc_create_dflt(sch->dev_queue,
+				      &pfifo_qdisc_ops, classid);
+	if (cl->qdisc == NULL)
+		cl->qdisc = &noop_qdisc;
+
+	if (tca[TCA_RATE]) {
+		err = gen_new_estimator(&cl->bstats, &cl->rate_est,
+					qdisc_root_sleeping_lock(sch),
+					tca[TCA_RATE]);
+		if (err) {
+			qdisc_destroy(cl->qdisc);
+			kfree(cl);
+			return err;
+		}
+	}
+
+	sch_tree_lock(sch);
+	qdisc_class_hash_insert(&q->clhash, &cl->common);
+	sch_tree_unlock(sch);
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static void qfq_destroy_class(struct Qdisc *sch, struct qfq_class *cl)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+
+	if (cl->inv_w) {
+		q->wsum -= ONE_FP / cl->inv_w;
+		cl->inv_w = 0;
+	}
+
+	gen_kill_estimator(&cl->bstats, &cl->rate_est);
+	qdisc_destroy(cl->qdisc);
+	kfree(cl);
+}
+
+static int qfq_delete_class(struct Qdisc *sch, unsigned long arg)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	if (cl->filter_cnt > 0)
+		return -EBUSY;
+
+	sch_tree_lock(sch);
+
+	qfq_purge_queue(cl);
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+
+	if (--cl->refcnt == 0)
+		qfq_destroy_class(sch, cl);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static unsigned long qfq_get_class(struct Qdisc *sch, u32 classid)
+{
+	struct qfq_class *cl = qfq_find_class(sch, classid);
+
+	if (cl != NULL)
+		cl->refcnt++;
+
+	return (unsigned long)cl;
+}
+
+static void qfq_put_class(struct Qdisc *sch, unsigned long arg)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	if (--cl->refcnt == 0)
+		qfq_destroy_class(sch, cl);
+}
+
+static struct tcf_proto **qfq_tcf_chain(struct Qdisc *sch, unsigned long cl)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+
+	return &q->filter_list;
+}
+
+static unsigned long qfq_bind_tcf(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	struct qfq_class *cl = qfq_find_class(sch, classid);
+
+	if (cl != NULL)
+		cl->filter_cnt++;
+
+	return (unsigned long)cl;
+}
+
+static void qfq_unbind_tcf(struct Qdisc *sch, unsigned long arg)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	cl->filter_cnt--;
+}
+
+static int qfq_graft_class(struct Qdisc *sch, unsigned long arg,
+			   struct Qdisc *new, struct Qdisc **old)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	if (new == NULL) {
+		new = qdisc_create_dflt(sch->dev_queue,
+					&pfifo_qdisc_ops, cl->common.classid);
+		if (new == NULL)
+			new = &noop_qdisc;
+	}
+
+	sch_tree_lock(sch);
+	qfq_purge_queue(cl);
+	*old = cl->qdisc;
+	cl->qdisc = new;
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static struct Qdisc *qfq_class_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int qfq_dump_class(struct Qdisc *sch, unsigned long arg,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+	struct nlattr *nest;
+
+	tcm->tcm_parent	= TC_H_ROOT;
+	tcm->tcm_handle	= cl->common.classid;
+	tcm->tcm_info	= cl->qdisc->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (nest == NULL)
+		goto nla_put_failure;
+	NLA_PUT_U32(skb, TCA_QFQ_WEIGHT, ONE_FP/cl->inv_w);
+	NLA_PUT_U32(skb, TCA_QFQ_LMAX, cl->lmax);
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int qfq_dump_class_stats(struct Qdisc *sch, unsigned long arg,
+				struct gnet_dump *d)
+{
+	struct qfq_class *cl = (struct qfq_class *)arg;
+	struct tc_qfq_stats xstats;
+
+	memset(&xstats, 0, sizeof(xstats));
+
+	xstats.weight = ONE_FP/cl->inv_w;
+	xstats.lmax = cl->lmax;
+
+	if (gnet_stats_copy_basic(d, &cl->bstats) < 0 ||
+	    gnet_stats_copy_rate_est(d, NULL, &cl->rate_est) < 0 ||
+	    gnet_stats_copy_queue(d, &cl->qdisc->qstats) < 0)
+		return -1;
+
+	return gnet_stats_copy_app(d, &xstats, sizeof(xstats));
+}
+
+static void qfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl;
+	struct hlist_node *n;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static struct qfq_class *qfq_classify(struct sk_buff *skb, struct Qdisc *sch,
+				      int *qerr)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl;
+	struct tcf_result res;
+	int result;
+
+	if (TC_H_MAJ(skb->priority ^ sch->handle) == 0) {
+		pr_debug("qfq_classify: found %d\n", skb->priority);
+		cl = qfq_find_class(sch, skb->priority);
+		if (cl != NULL)
+			return cl;
+	}
+
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	result = tc_classify(skb, q->filter_list, &res);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_QUEUED:
+		case TC_ACT_STOLEN:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+			/* fall through */
+		case TC_ACT_SHOT:
+			return NULL;
+		}
+#endif
+		cl = (struct qfq_class *)res.class;
+		if (cl == NULL)
+			cl = qfq_find_class(sch, res.classid);
+		return cl;
+	}
+
+	return NULL;
+}
+
+/* Generic comparison function, handling wraparound. */
+static inline int qfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/* Round a precise timestamp to its slotted value. */
+static inline u64 qfq_round_down(u64 ts, unsigned int shift)
+{
+	return ts & ~((1ULL << shift) - 1);
+}
+
+/* return the pointer to the group with lowest index in the bitmap */
+static inline struct qfq_group *qfq_ffs(struct qfq_sched *q,
+					unsigned long bitmap)
+{
+	int index = __ffs(bitmap);
+	return &q->groups[index];
+}
+
+/* Calculate a mask to mimic what would be ffs_from(). */
+static inline unsigned long mask_from(unsigned long bitmap, int from)
+{
+	return bitmap & ~((1UL << from) - 1);
+}
+
+/*
+ * The state computation relies on ER=0, IR=1, EB=2, IB=3
+ * First compute eligibility comparing grp->S, q->V,
+ * then check if someone is blocking us and possibly add EB
+ */
+static int qfq_calc_state(struct qfq_sched *q, const struct qfq_group *grp)
+{
+	/* if S > V we are not eligible */
+	unsigned int state = qfq_gt(grp->S, q->V);
+	unsigned long mask = mask_from(q->bitmaps[ER], grp->index);
+	struct qfq_group *next;
+
+	if (mask) {
+		next = qfq_ffs(q, mask);
+		if (qfq_gt(grp->F, next->F))
+			state |= EB;
+	}
+
+	return state;
+}
+
+
+/*
+ * In principle
+ *	q->bitmaps[dst] |= q->bitmaps[src] & mask;
+ *	q->bitmaps[src] &= ~mask;
+ * but we should make sure that src != dst
+ */
+static inline void qfq_move_groups(struct qfq_sched *q, unsigned long mask,
+				   int src, int dst)
+{
+	q->bitmaps[dst] |= q->bitmaps[src] & mask;
+	q->bitmaps[src] &= ~mask;
+}
+
+static void qfq_unblock_groups(struct qfq_sched *q, int index, u64 old_F)
+{
+	unsigned long mask = mask_from(q->bitmaps[ER], index + 1);
+	struct qfq_group *next;
+
+	if (mask) {
+		next = qfq_ffs(q, mask);
+		if (!qfq_gt(next->F, old_F))
+			return;
+	}
+
+	mask = (1UL << index) - 1;
+	qfq_move_groups(q, mask, EB, ER);
+	qfq_move_groups(q, mask, IB, IR);
+}
+
+/*
+ * perhaps
+ *
+	old_V ^= q->V;
+	old_V >>= QFQ_MIN_SLOT_SHIFT;
+	if (old_V) {
+		...
+	}
+ *
+ */
+static void qfq_make_eligible(struct qfq_sched *q, u64 old_V)
+{
+	unsigned long vslot = q->V >> QFQ_MIN_SLOT_SHIFT;
+	unsigned long old_vslot = old_V >> QFQ_MIN_SLOT_SHIFT;
+
+	if (vslot != old_vslot) {
+		unsigned long mask = (1UL << fls(vslot ^ old_vslot)) - 1;
+		qfq_move_groups(q, mask, IR, ER);
+		qfq_move_groups(q, mask, IB, EB);
+	}
+}
+
+/*
+ * The slot index must stay below QFQ_MAX_SLOTS (32); the input
+ * values guarantee this.
+ * roundedS is always cl->S rounded down to grp->slot_shift bits.
+ */
+static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
+				   u64 roundedS)
+{
+	u64 slot = (roundedS - grp->S) >> grp->slot_shift;
+	unsigned int i = (grp->front + slot) % QFQ_MAX_SLOTS;
+
+	cl->next = grp->slots[i];
+	grp->slots[i] = cl;
+	__set_bit(slot, &grp->full_slots);
+}
+
+/*
+ * remove the entry from the slot
+ */
+static void qfq_front_slot_remove(struct qfq_group *grp)
+{
+	struct qfq_class **h = &grp->slots[grp->front];
+
+	*h = (*h)->next;
+	if (!*h)
+		__clear_bit(0, &grp->full_slots);
+}
+
+/*
+ * Returns the first full queue in a group. As a side effect,
+ * adjust the bucket list so the first non-empty bucket is at
+ * position 0 in full_slots.
+ */
+static struct qfq_class *qfq_slot_scan(struct qfq_group *grp)
+{
+	unsigned int i;
+
+	pr_debug("qfq slot_scan: grp %u full %#lx\n",
+		 grp->index, grp->full_slots);
+
+	if (grp->full_slots == 0)
+		return NULL;
+
+	i = __ffs(grp->full_slots);  /* zero based */
+	if (i > 0) {
+		grp->front = (grp->front + i) % QFQ_MAX_SLOTS;
+		grp->full_slots >>= i;
+	}
+
+	return grp->slots[grp->front];
+}
+
+/*
+ * Adjust the bucket list. When the start time of a group decreases,
+ * we move the front index down (modulo QFQ_MAX_SLOTS) so we don't
+ * need to move the objects. The mask of occupied slots must be
+ * shifted accordingly, because we use ffs() to find the first
+ * non-empty slot. This covers decreases of the group's start time;
+ * what about increases of the start time?
+ * Here too the shift i must stay below 32.
+ */
+static void qfq_slot_rotate(struct qfq_group *grp, u64 roundedS)
+{
+	unsigned int i = (grp->S - roundedS) >> grp->slot_shift;
+
+	grp->full_slots <<= i;
+	grp->front = (grp->front - i) % QFQ_MAX_SLOTS;
+}
+
+static void qfq_update_eligible(struct qfq_sched *q, u64 old_V)
+{
+	struct qfq_group *grp;
+	unsigned long ineligible;
+
+	ineligible = q->bitmaps[IR] | q->bitmaps[IB];
+	if (ineligible) {
+		if (!q->bitmaps[ER]) {
+			grp = qfq_ffs(q, ineligible);
+			if (qfq_gt(grp->S, q->V))
+				q->V = grp->S;
+		}
+		qfq_make_eligible(q, old_V);
+	}
+}
+
+/* Return the length of the next packet in the queue (0 if the queue is empty). */
+static unsigned int qdisc_peek_len(struct Qdisc *sch)
+{
+	struct sk_buff *skb;
+
+	skb = sch->ops->peek(sch);
+	return skb ? qdisc_pkt_len(skb) : 0;
+}
+
+/*
+ * Update the class; return true if the group also needs to be updated.
+ */
+static bool qfq_update_class(struct qfq_group *grp, struct qfq_class *cl)
+{
+	unsigned int len = qdisc_peek_len(cl->qdisc);
+
+	cl->S = cl->F;
+	if (!len)
+		qfq_front_slot_remove(grp);	/* queue is empty */
+	else {
+		u64 roundedS;
+
+		cl->F = cl->S + (u64)len * cl->inv_w;
+		roundedS = qfq_round_down(cl->S, grp->slot_shift);
+		if (roundedS == grp->S)
+			return false;
+
+		qfq_front_slot_remove(grp);
+		qfq_slot_insert(grp, cl, roundedS);
+	}
+
+	return true;
+}
+
+static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_group *grp;
+	struct qfq_class *cl;
+	struct sk_buff *skb;
+	unsigned int len;
+	u64 old_V;
+
+	if (!q->bitmaps[ER])
+		return NULL;
+
+	grp = qfq_ffs(q, q->bitmaps[ER]);
+
+	cl = grp->slots[grp->front];
+	skb = qdisc_dequeue_peeked(cl->qdisc);
+	if (!skb) {
+		WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n");
+		return NULL;
+	}
+
+	sch->q.qlen--;
+	qdisc_bstats_update(sch, skb);
+
+	old_V = q->V;
+	len = qdisc_pkt_len(skb);
+	q->V += (u64)len * IWSUM;
+	pr_debug("qfq dequeue: len %u F %lld now %lld\n",
+		 len, (unsigned long long) cl->F, (unsigned long long) q->V);
+
+	if (qfq_update_class(grp, cl)) {
+		u64 old_F = grp->F;
+
+		cl = qfq_slot_scan(grp);
+		if (!cl)
+			__clear_bit(grp->index, &q->bitmaps[ER]);
+		else {
+			u64 roundedS = qfq_round_down(cl->S, grp->slot_shift);
+			unsigned int s;
+
+			if (grp->S == roundedS)
+				goto skip_unblock;
+			grp->S = roundedS;
+			grp->F = roundedS + (2ULL << grp->slot_shift);
+			__clear_bit(grp->index, &q->bitmaps[ER]);
+			s = qfq_calc_state(q, grp);
+			__set_bit(grp->index, &q->bitmaps[s]);
+		}
+
+		qfq_unblock_groups(q, grp->index, old_F);
+	}
+
+skip_unblock:
+	qfq_update_eligible(q, old_V);
+
+	return skb;
+}
+
+/*
+ * Assign a reasonable start time for a new flow k in group i.
+ * Admissible values for \hat(F) are multiples of \sigma_i
+ * no greater than V+\sigma_i . Larger values mean that
+ * we had a wraparound so we consider the timestamp to be stale.
+ *
+ * If F is not stale and F >= V then we set S = F.
+ * Otherwise we should assign S = V, but this may violate
+ * the ordering in ER. So, if we have groups in ER, set S to
+ * the F_j of the first group j which would be blocking us.
+ * We are guaranteed not to move S backward because
+ * otherwise our group i would still be blocked.
+ */
+static void qfq_update_start(struct qfq_sched *q, struct qfq_class *cl)
+{
+	unsigned long mask;
+	uint32_t limit, roundedF;
+	int slot_shift = cl->grp->slot_shift;
+
+	roundedF = qfq_round_down(cl->F, slot_shift);
+	limit = qfq_round_down(q->V, slot_shift) + (1UL << slot_shift);
+
+	if (!qfq_gt(cl->F, q->V) || qfq_gt(roundedF, limit)) {
+		/* timestamp was stale */
+		mask = mask_from(q->bitmaps[ER], cl->grp->index);
+		if (mask) {
+			struct qfq_group *next = qfq_ffs(q, mask);
+			if (qfq_gt(roundedF, next->F)) {
+				cl->S = next->F;
+				return;
+			}
+		}
+		cl->S = q->V;
+	} else  /* timestamp is not stale */
+		cl->S = cl->F;
+}
+
+static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_group *grp;
+	struct qfq_class *cl;
+	int err;
+	u64 roundedS;
+	int s;
+
+	cl = qfq_classify(skb, sch, &err);
+	if (cl == NULL) {
+		if (err & __NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return err;
+	}
+	pr_debug("qfq_enqueue: cl = %x\n", cl->common.classid);
+
+	err = qdisc_enqueue(skb, cl->qdisc);
+	if (unlikely(err != NET_XMIT_SUCCESS)) {
+		pr_debug("qfq_enqueue: enqueue failed %d\n", err);
+		if (net_xmit_drop_count(err)) {
+			cl->qstats.drops++;
+			sch->qstats.drops++;
+		}
+		return err;
+	}
+
+	bstats_update(&cl->bstats, skb);
+	++sch->q.qlen;
+
+	/* If the new skb is not the head of the queue, we are done. */
+	if (skb != qdisc_peek_head(cl->qdisc))
+		return err;
+
+	/* If we reach this point, the class queue was idle. */
+	grp = cl->grp;
+	qfq_update_start(q, cl);
+
+	/* compute new finish time and rounded start. */
+	cl->F = cl->S + (u64)qdisc_pkt_len(skb) * cl->inv_w;
+	roundedS = qfq_round_down(cl->S, grp->slot_shift);
+
+	/*
+	 * insert cl in the correct bucket.
+	 * If cl->S >= grp->S we don't need to adjust the
+	 * bucket list and simply go to the insertion phase.
+	 * Otherwise grp->S is decreasing, we must make room
+	 * in the bucket list, and also recompute the group state.
+	 * Finally, if there were no flows in this group and nobody
+	 * was in ER make sure to adjust V.
+	 */
+	if (grp->full_slots) {
+		if (!qfq_gt(grp->S, cl->S))
+			goto skip_update;
+
+		/* create a slot for this cl->S */
+		qfq_slot_rotate(grp, roundedS);
+		/* group was surely ineligible, remove */
+		__clear_bit(grp->index, &q->bitmaps[IR]);
+		__clear_bit(grp->index, &q->bitmaps[IB]);
+	} else if (!q->bitmaps[ER] && qfq_gt(roundedS, q->V))
+		q->V = roundedS;
+
+	grp->S = roundedS;
+	grp->F = roundedS + (2ULL << grp->slot_shift);
+	s = qfq_calc_state(q, grp);
+	__set_bit(grp->index, &q->bitmaps[s]);
+
+	pr_debug("qfq enqueue: new state %d %#lx S %lld F %lld V %lld\n",
+		 s, q->bitmaps[s],
+		 (unsigned long long) cl->S,
+		 (unsigned long long) cl->F,
+		 (unsigned long long) q->V);
+
+skip_update:
+	qfq_slot_insert(grp, cl, roundedS);
+
+	return err;
+}
+
+
+static void qfq_slot_remove(struct qfq_sched *q, struct qfq_group *grp,
+			    struct qfq_class *cl, struct qfq_class **pprev)
+{
+	unsigned int i, offset;
+	u64 roundedS;
+
+	roundedS = qfq_round_down(cl->S, grp->slot_shift);
+	offset = (roundedS - grp->S) >> grp->slot_shift;
+	i = (grp->front + offset) % QFQ_MAX_SLOTS;
+
+	if (!pprev) {
+		pprev = &grp->slots[i];
+		while (*pprev && *pprev != cl)
+			pprev = &(*pprev)->next;
+	}
+
+	*pprev = cl->next;
+	if (!grp->slots[i])
+		__clear_bit(offset, &grp->full_slots);
+}
+
+/*
+ * called to forcibly destroy a queue.
+ * If the queue is not in the front bucket, or if it has
+ * other queues in the front bucket, we can simply remove
+ * the queue with no other side effects.
+ * Otherwise we must propagate the event up.
+ */
+static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl,
+				 struct qfq_class **pprev)
+{
+	struct qfq_group *grp = cl->grp;
+	unsigned long mask;
+	u64 roundedS;
+	int s;
+
+	cl->F = cl->S;
+	qfq_slot_remove(q, grp, cl, pprev);
+
+	if (!grp->full_slots) {
+		__clear_bit(grp->index, &q->bitmaps[IR]);
+		__clear_bit(grp->index, &q->bitmaps[EB]);
+		__clear_bit(grp->index, &q->bitmaps[IB]);
+
+		if (test_bit(grp->index, &q->bitmaps[ER]) &&
+		    !(q->bitmaps[ER] & ~((1UL << grp->index) - 1))) {
+			mask = q->bitmaps[ER] & ((1UL << grp->index) - 1);
+			if (mask)
+				mask = ~((1UL << __fls(mask)) - 1);
+			else
+				mask = ~0UL;
+			qfq_move_groups(q, mask, EB, ER);
+			qfq_move_groups(q, mask, IB, IR);
+		}
+		__clear_bit(grp->index, &q->bitmaps[ER]);
+	} else if (!grp->slots[grp->front]) {
+		cl = qfq_slot_scan(grp);
+		roundedS = qfq_round_down(cl->S, grp->slot_shift);
+		if (grp->S != roundedS) {
+			__clear_bit(grp->index, &q->bitmaps[ER]);
+			__clear_bit(grp->index, &q->bitmaps[IR]);
+			__clear_bit(grp->index, &q->bitmaps[EB]);
+			__clear_bit(grp->index, &q->bitmaps[IB]);
+			grp->S = roundedS;
+			grp->F = roundedS + (2ULL << grp->slot_shift);
+			s = qfq_calc_state(q, grp);
+			__set_bit(grp->index, &q->bitmaps[s]);
+		}
+	}
+
+	qfq_update_eligible(q, q->V);
+}
+
+static void qfq_qlen_notify(struct Qdisc *sch, unsigned long arg)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl = (struct qfq_class *)arg;
+
+	if (cl->qdisc->q.qlen == 0)
+		qfq_deactivate_class(q, cl, NULL);
+}
+
+static unsigned int qfq_drop(struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_group *grp;
+	struct qfq_class *cl, **pp;
+	unsigned int i, j, len;
+
+	for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+		grp = &q->groups[i];
+		for (j = 0; j < QFQ_MAX_SLOTS; j++) {
+			for (pp = &grp->slots[j]; *pp; pp = &(*pp)->next) {
+				cl = *pp;
+				if (!cl->qdisc->ops->drop)
+					continue;
+
+				len = cl->qdisc->ops->drop(cl->qdisc);
+				if (len > 0) {
+					sch->q.qlen--;
+					if (!cl->qdisc->q.qlen)
+						qfq_deactivate_class(q, cl, pp);
+
+					return len;
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int qfq_init_qdisc(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_group *grp;
+	int i, err;
+
+	err = qdisc_class_hash_init(&q->clhash);
+	if (err < 0)
+		return err;
+
+	for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+		grp = &q->groups[i];
+		grp->index = i;
+	}
+
+	return 0;
+}
+
+static void qfq_reset_qdisc(struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_group *grp;
+	struct qfq_class *cl, **pp;
+	struct hlist_node *n;
+	unsigned int i, j;
+
+	for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+		grp = &q->groups[i];
+		for (j = 0; j < QFQ_MAX_SLOTS; j++) {
+			for (pp = &grp->slots[j]; *pp; pp = &(*pp)->next) {
+				cl = *pp;
+				if (cl->qdisc->q.qlen)
+					qfq_deactivate_class(q, cl, pp);
+			}
+		}
+	}
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode)
+			qdisc_reset(cl->qdisc);
+	}
+	sch->q.qlen = 0;
+}
+
+static void qfq_destroy_qdisc(struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl;
+	struct hlist_node *n, *next;
+	unsigned int i;
+
+	tcf_destroy_chain(&q->filter_list);
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry_safe(cl, n, next, &q->clhash.hash[i],
+					  common.hnode)
+			qfq_destroy_class(sch, cl);
+	}
+	qdisc_class_hash_destroy(&q->clhash);
+}
+
+static const struct Qdisc_class_ops qfq_class_ops = {
+	.change		= qfq_change_class,
+	.delete		= qfq_delete_class,
+	.get		= qfq_get_class,
+	.put		= qfq_put_class,
+	.tcf_chain	= qfq_tcf_chain,
+	.bind_tcf	= qfq_bind_tcf,
+	.unbind_tcf	= qfq_unbind_tcf,
+	.graft		= qfq_graft_class,
+	.leaf		= qfq_class_leaf,
+	.qlen_notify	= qfq_qlen_notify,
+	.dump		= qfq_dump_class,
+	.dump_stats	= qfq_dump_class_stats,
+	.walk		= qfq_walk,
+};
+
+static struct Qdisc_ops qfq_qdisc_ops __read_mostly = {
+	.cl_ops		= &qfq_class_ops,
+	.id		= "qfq",
+	.priv_size	= sizeof(struct qfq_sched),
+	.enqueue	= qfq_enqueue,
+	.dequeue	= qfq_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.drop		= qfq_drop,
+	.init		= qfq_init_qdisc,
+	.reset		= qfq_reset_qdisc,
+	.destroy	= qfq_destroy_qdisc,
+	.owner		= THIS_MODULE,
+};
+
+static int __init qfq_init(void)
+{
+	return register_qdisc(&qfq_qdisc_ops);
+}
+
+static void __exit qfq_exit(void)
+{
+	unregister_qdisc(&qfq_qdisc_ops);
+}
+
+module_init(qfq_init);
+module_exit(qfq_exit);
+MODULE_LICENSE("GPL");