
Possible regression in HTB

Message ID 20081007045145.GA23883@verge.net.au
State RFC, archived
Delegated to: David Miller

Commit Message

Simon Horman Oct. 7, 2008, 4:51 a.m. UTC
On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> Hi Dave, Hi Jarek,
> 
> I know that you guys were/are playing around a lot in here, but
> unfortunately I think that "pkt_sched: Always use q->requeue in
> dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> introduced a performance regression for HTB.
> 
> My tc rules are below, but in a nutshell I have 3 leaf classes.
> One with a rate of 500Mbit/s and the other two with 100Mbit/s.
> The ceiling for all classes is 1Gb/s and that is also both
> the rate and ceiling for the parent class.
> 
>                           [ rate=1Gbit/s ]
>                           [ ceil=1Gbit/s ]
>                                  |
>             +--------------------+--------------------+
>             |                    |                    |
>      [ rate=500Mbit/s ]   [ rate=100Mbit/s ]   [ rate=100Mbit/s ]
>      [ ceil=  1Gbit/s ]   [ ceil=100Mbit/s ]   [ ceil=  1Gbit/s ]
> 
> The tc rules have an extra class for all other traffic,
> but its idle, so I left it out of the diagram.
> 
> In order to test this I set up filters so that traffic to
> each of port 10194, 10196 and 10197 is directed to one of the leaf-classes.
> I then set up a process on the same host for each port sending
> UDP as fast as it could in a while() { send(); } loop. On another
> host I set up processes listening for the UDP traffic in a
> while () { recv(); } loop. And I measured the results.
> 
> ( I should be able to provide the code used for testing,
>   but its not mine and my colleague who wrote it is off
>   with the flu today. )
> 
> Prior to this patch the result looks like this:
> 
> 10194: 545134589bits/s 545Mbits/s
> 10197: 205358520bits/s 205Mbits/s
> 10196: 205311416bits/s 205Mbits/s
> -----------------------------------
> total: 955804525bits/s 955Mbits/s
> 
> And after the patch the result looks like this:
> 10194: 384248522bits/s 384Mbits/s
> 10197: 284706778bits/s 284Mbits/s
> 10196: 288119464bits/s 288Mbits/s
> -----------------------------------
> total: 957074765bits/s 957Mbits/s
> 
> There is some noise in these results, but I think that its clear
> that before the patch all leaf-classes received at least their rate,
> and after the patch the rate=500Mbit/s class received much less than
> its rate. This I believe is a regression.
> 
> I do not believe that this happens at lower bit rates, for instance
> if you reduce the ceiling and rate of all classes by a factor of 10.
> I can produce some numbers on that if you want them.
> 
> The test machine with the tc rules and udp-sending processes
> has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> is SMP x86_64.

With the following patch (basically a reversal of "pkt_sched: Always use
q->requeue in dev_requeue_skb()", forward ported to the current
net-next-2.6 tree at "tcp: Respect SO_RCVLOWAT in tcp_poll()"), I get
some rather nice numbers (IMHO).

10194: 666780666bits/s 666Mbits/s
10197: 141154197bits/s 141Mbits/s
10196: 141023090bits/s 141Mbits/s
-----------------------------------
total: 948957954bits/s 948Mbits/s

I'm not sure what evil things this patch does to other aspects
of the qdisc code.


Comments

Jarek Poplawski Oct. 7, 2008, 7:44 a.m. UTC | #1
On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > Hi Dave, Hi Jarek,

Hi Simon,

> > I know that you guys were/are playing around a lot in here, but
> > unfortunately I think that "pkt_sched: Always use q->requeue in
> > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > introduced a performance regression for HTB.
...
> > The test machine with the tc rules and udp-sending processes
> > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > is SMP x86_64.
> 
> With the following patch (basically a reversal of ""pkt_sched: Always use
> q->requeue in dev_requeue_skb()" forward ported to the current
> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> rather nice numbers (IMHO).
> 
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
> 
> I'm not sure what evil things this patch does to other aspects
> of the qdisc code.

I'd like to establish this too. This patch was meant to remove some
other problems in possibly the simplest way. Maybe it's too simple.
Anyway, it's kind of an RFC, so the rest of the requeuing code is left
unchanged, just to allow easy reverting like below. But first we should
try to understand this more.

So, thanks for testing and reporting this. (BTW, what network card
do you use, and is multiqueuing enabled?)

Jarek P.

> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 31f6b61..d2e0da6 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -44,7 +44,10 @@ static inline int qdisc_qlen(struct Qdisc *q)
>  
>  static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  {
> -	q->gso_skb = skb;
> +	if (unlikely(skb->next))
> +		q->gso_skb = skb;
> +	else
> +		q->ops->requeue(skb, q);
>  	__netif_schedule(q);
>  
>  	return 0;
Patrick McHardy Oct. 7, 2008, 12:03 p.m. UTC | #2
Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
>> With the following patch (basically a reversal of ""pkt_sched: Always use
>> q->requeue in dev_requeue_skb()" forward ported to the current
>> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
>> rather nice numbers (IMHO).
>>
>> 10194: 666780666bits/s 666Mbits/s
>> 10197: 141154197bits/s 141Mbits/s
>> 10196: 141023090bits/s 141Mbits/s
>> -----------------------------------
>> total: 948957954bits/s 948Mbits/s
>>
>> I'm not sure what evil things this patch does to other aspects
>> of the qdisc code.
> 
> I'd like to establish this too. This patch was meant to remove some
> other problems possibly the simplest way. Maybe it's too simple.
> Anyway, it's kind of RFC, so the rest of the requeuing code is left
> unchanged, just for easy revoking like below. But first we should
> try to understand this more.

Shooting in the dark: I don't see how this change could affect
the bandwidth except by introducing higher dequeue latency due
to using netif_schedule instead of qdisc_watchdog. Does anyone know
how device scheduling latencies compare to hrtimers?

Jarek Poplawski Oct. 7, 2008, 12:20 p.m. UTC | #3
On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > Hi Dave, Hi Jarek,
> > 
> > I know that you guys were/are playing around a lot in here, but
> > unfortunately I think that "pkt_sched: Always use q->requeue in
> > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > introduced a performance regression for HTB.

Since this looks to me like a possibly theoretical problem, I added
Martin and Patrick to Cc.

> > 
> > My tc rules are below, but in a nutshell I have 3 leaf classes.
> > One with a rate of 500Mbit/s and the other two with 100Mbit/s.
> > The ceiling for all classes is 1Gb/s and that is also both
> > the rate and ceiling for the parent class.
> > 
> >                           [ rate=1Gbit/s ]
> >                           [ ceil=1Gbit/s ]
> >                                  |
> >             +--------------------+--------------------+
> >             |                    |                    |
> >      [ rate=500Mbit/s ]   [ rate=100Mbit/s ]   [ rate=100Mbit/s ]
> >      [ ceil=  1Gbit/s ]   [ ceil=100Mbit/s ]   [ ceil=  1Gbit/s ]

?!       [ ceil=  1Gbit/s ]   [ ceil=1Gbit/s ]     [ ceil=  1Gbit/s ]

> > 
> > The tc rules have an extra class for all other traffic,
> > but its idle, so I left it out of the diagram.
> > 
> > In order to test this I set up filters so that traffic to
> > each of port 10194, 10196 and 10197 is directed to one of the leaf-classes.
> > I then set up a process on the same host for each port sending
> > UDP as fast as it could in a while() { send(); } loop. On another
> > host I set up processes listening for the UDP traffic in a
> > while () { recv(); } loop. And I measured the results.
> > 
> > ( I should be able to provide the code used for testing,
> >   but its not mine and my colleague who wrote it is off
> >   with the flu today. )
> > 
> > Prior to this patch the result looks like this:
> > 
> > 10194: 545134589bits/s 545Mbits/s
> > 10197: 205358520bits/s 205Mbits/s
> > 10196: 205311416bits/s 205Mbits/s
> > -----------------------------------
> > total: 955804525bits/s 955Mbits/s
> > 
> > And after the patch the result looks like this:
> > 10194: 384248522bits/s 384Mbits/s
> > 10197: 284706778bits/s 284Mbits/s
> > 10196: 288119464bits/s 288Mbits/s
> > -----------------------------------
> > total: 957074765bits/s 957Mbits/s
> > 

So, in short, the results with requeuing off show that the first class
doesn't get its rate while the others can borrow.

My first (maybe wrong) idea is that requeuing could be used here for
something it probably wasn't meant for. The scenario could be like this:
the first (and most privileged) class sends until it hits the card limit,
and when xmit is stopped and requeuing is on, it slows the others down
(while it has to wait anyway) with the requeuing procedures, plus it gets
an "additional" packet in its queue.

In the "requeuing off" case there should be a bit more time for the
others, and each packet is seen only once.

Since it looks like HTB was lending unused rate, it had to try the first
class first; if that class didn't use its turn, there were probably not
enough packets in its queue, and as mentioned above, the requeuing code
could help it get them, and so prevent lending to the others when there
is not enough enqueuing in the meantime.

So, maybe my diagnosis is totally wrong, but here are the questions:

1) Is HTB or other similar scheduling code expected to limit correctly
   while we are substantially over the hardware limit (so that requeuing
   gets used so much)?
2) Should requeuing be considered such an important factor in
   controlling the rates?

I have some doubts that it should work like this.

Jarek P.


> > There is some noise in these results, but I think that its clear
> > that before the patch all leaf-classes received at least their rate,
> > and after the patch the rate=500Mbit/s class received much less than
> > its rate. This I believe is a regression.
> > 
> > I do not believe that this happens at lower bit rates, for instance
> > if you reduce the ceiling and rate of all classes by a factor of 10.
> > I can produce some numbers on that if you want them.
> > 
> > The test machine with the tc rules and udp-sending processes
> > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > is SMP x86_64.
> 
> With the following patch (basically a reversal of ""pkt_sched: Always use
> q->requeue in dev_requeue_skb()" forward ported to the current
> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> rather nice numbers (IMHO).
> 
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
> 
> I'm not sure what evil things this patch does to other aspects
> of the qdisc code.
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 31f6b61..d2e0da6 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -44,7 +44,10 @@ static inline int qdisc_qlen(struct Qdisc *q)
>  
>  static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  {
> -	q->gso_skb = skb;
> +	if (unlikely(skb->next))
> +		q->gso_skb = skb;
> +	else
> +		q->ops->requeue(skb, q);
>  	__netif_schedule(q);
>  
>  	return 0;


-------------------------

tc qdisc del dev eth0 root

tc qdisc add dev eth0 root handle 1: htb default 10 r2q 10000

tc class add dev eth0 parent 1:  classid 1:1   htb \
	rate 1Gbit ceil 1Gbit

tc class add dev eth0 parent 1:1 classid 1:10 htb \
	rate 1Gbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:11 htb \
	rate 500Mbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:12 htb \
	rate 100Mbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:13 htb \
	rate 100Mbit ceil 1Gbit

tc filter add dev eth0 protocol ip parent 1: \
	u32 match ip dport 10194 0xffff flowid 1:11
tc filter add dev eth0 protocol ip parent 1: \
	u32 match ip dport 10196 0xffff flowid 1:12
tc filter add dev eth0 protocol ip parent 1: \
	u32 match ip dport 10197 0xffff flowid 1:13

Patrick McHardy Oct. 7, 2008, 12:48 p.m. UTC | #4
Jarek Poplawski wrote:
>>> Prior to this patch the result looks like this:
>>>
>>> 10194: 545134589bits/s 545Mbits/s
>>> 10197: 205358520bits/s 205Mbits/s
>>> 10196: 205311416bits/s 205Mbits/s
>>> -----------------------------------
>>> total: 955804525bits/s 955Mbits/s
>>>
>>> And after the patch the result looks like this:
>>> 10194: 384248522bits/s 384Mbits/s
>>> 10197: 284706778bits/s 284Mbits/s
>>> 10196: 288119464bits/s 288Mbits/s
>>> -----------------------------------
>>> total: 957074765bits/s 957Mbits/s

I've misinterpreted the numbers, please disregard my previous mail.

I'm wondering though, even before this patch, the sharing doesn't
seem to be proportional to the allocated rates. Assuming the upper
limit is somewhere around 950mbit, we have 250 mbit for sharing
above the allocated rates, so it should be:

500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class: 100mbit + 250mbit/1*5 == 150mbit
100mbit class: 100mbit + 250mbit/1*5 == 150mbit

But maybe my understanding of how excess bandwidth is distributed
with HTB is wrong.

> So, in short, the results with requeuing off show the first class
> doesn't get its rate while the others can borrow.
> 
> My first (maybe wrong) idea is that requeuing could be used here for
> something it wasn't probably meant to. The scenario could be like this:
> the first (and most privileged) class is sending until the card limit,
> and when the xmit is stopped and requeuing on, it slows the others
> (while it has to wait anyway) with requeuing procedures plus gets
> "additional" packet in its queue.
> 
> In the "requeuing off" case there should be a bit more time for others
> and each packet seen only once.
> 
> Since it looks like HTB was lending unused rate, it had to try the first
> class first, so if it didn't use this, probably there were not enough
> packets in its queue, and as mentioned above, requeuing code could help
> to get them, and so to prevent lending to others, when there is not
> enough enqueuing in the meantime.
> 
> So, maybe my diagnose is totally wrong, but there are the questions:
> 
> 1) Is HTB or other similar scheduling code expected to limit correctly
>    while we substantially overlimit (since requeuing should be used so
>    much)?
> 2) Should requeuing be considered as such important factor of
>    controlling the rates?
> 
> I've some doubts it should work like this.

I still can't really make anything of this bug, but the only two
visible differences to HTB resulting from requeueing on an upper level
should be that

1) it doesn't reactivate classes that went passive on the last dequeue
2) the time checkpoint from the last dequeue event is different

I guess it's in fact the second thing: if a lower priority packet
is requeued and dequeued again, HTB doesn't notice and might allow
the class to send again earlier than it would have previously.

Simon, if you set the ceiling to something around the real limit
you're able to reach (maybe try 940mbit), do the proportions change
significantly?
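
For reference, a minimal sketch of such a change against the tc rules
quoted earlier in the thread (assuming the same classids; the parent and
default class rates are lowered along with the ceilings so that rate
never exceeds ceil - alternatively, just re-run the setup script with
the new values):

tc class change dev eth0 parent 1:  classid 1:1  htb rate 940Mbit ceil 940Mbit
tc class change dev eth0 parent 1:1 classid 1:10 htb rate 940Mbit ceil 940Mbit
tc class change dev eth0 parent 1:1 classid 1:11 htb rate 500Mbit ceil 940Mbit
tc class change dev eth0 parent 1:1 classid 1:12 htb rate 100Mbit ceil 940Mbit
tc class change dev eth0 parent 1:1 classid 1:13 htb rate 100Mbit ceil 940Mbit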
Jarek Poplawski Oct. 7, 2008, 10 p.m. UTC | #5
Patrick McHardy wrote, On 10/07/2008 02:48 PM:

> Jarek Poplawski wrote:
>>>> Prior to this patch the result looks like this:
>>>>
>>>> 10194: 545134589bits/s 545Mbits/s
>>>> 10197: 205358520bits/s 205Mbits/s
>>>> 10196: 205311416bits/s 205Mbits/s
>>>> -----------------------------------
>>>> total: 955804525bits/s 955Mbits/s
>>>>
>>>> And after the patch the result looks like this:
>>>> 10194: 384248522bits/s 384Mbits/s
>>>> 10197: 284706778bits/s 284Mbits/s
>>>> 10196: 288119464bits/s 288Mbits/s
>>>> -----------------------------------
>>>> total: 957074765bits/s 957Mbits/s
> 
> I've misinterpreted the numbers, please disregard my previous mail.
> 
> I'm wondering though, even before this patch, the sharing doesn't
> seem to be proportional to the allocated rates. Assuming the upper
> limit is somewhere around 950mbit, we have 250 mbit for sharing
> above the allocated rates, so it should be:
> 
> 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> 
> But maybe my understanding of how excess bandwidth is distributed
> with HTB is wrong.

Good point, but the numbers are a bit wrong:

500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
					==========
					950.00mbit

> 
> I still can't really make anything of this bug, but the only two
> visible differences to HTB resulting from requeing on an upper level
> should be that
> 
> 1) it doesn't reactivate classes that went passive by the last dequeue
> 2) the time checkpoint from the last dequeue event is different
> 
> I guess its in fact the second thing, if a lower priority packet
> is requeued and dequeued again, HTB doesn't notice and might allow
> the class to send earlier again than it would have previously.

With heavy requeuing the timing has to be wrong, but I'm not sure why
just the lower priority classes should gain here.

Anyway, IMHO this regression is really doubtful: since the numbers are
wrong in both cases, I can only agree that the old method gives better
wrong results...

Jarek P.
Simon Horman Oct. 8, 2008, 12:09 a.m. UTC | #6
On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > > Hi Dave, Hi Jarek,
> 
> Hi Simon,
> 
> > > I know that you guys were/are playing around a lot in here, but
> > > unfortunately I think that "pkt_sched: Always use q->requeue in
> > > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > > introduced a performance regression for HTB.
> ...
> > > The test machine with the tc rules and udp-sending processes
> > > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > > is SMP x86_64.
> > 
> > With the following patch (basically a reversal of ""pkt_sched: Always use
> > q->requeue in dev_requeue_skb()" forward ported to the current
> > net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> > rather nice numbers (IMHO).
> > 
> > 10194: 666780666bits/s 666Mbits/s
> > 10197: 141154197bits/s 141Mbits/s
> > 10196: 141023090bits/s 141Mbits/s
> > -----------------------------------
> > total: 948957954bits/s 948Mbits/s
> > 
> > I'm not sure what evil things this patch does to other aspects
> > of the qdisc code.
> 
> I'd like to establish this too. This patch was meant to remove some
> other problems possibly the simplest way. Maybe it's too simple.
> Anyway, it's kind of RFC, so the rest of the requeuing code is left
> unchanged, just for easy revoking like below. But first we should
> try to understand this more.

Agreed.

> So, thanks for testing and reporting this. (BTW, what network card
> do you use and is there multiqueuing on?)

The network card is an e1000. Actually there are two, but I
am only using the first one. I am not sure what the Broadcom
controllers are (I can get someone to go and inspect the
machine if it's important).

# lspci | grep Ether
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

And from dmesg

e1000e 0000:0b:00.0: setting latency timer to 64
0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
e1000e 0000:0b:00.1: setting latency timer to 64
0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003

It is eth0 that I am using.

My kernel is compiled with CONFIG_NET_SCH_MULTIQ disabled.
Simon Horman Oct. 8, 2008, 12:10 a.m. UTC | #7
On Tue, Oct 07, 2008 at 12:20:52PM +0000, Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > > Hi Dave, Hi Jarek,
> > > 
> > > I know that you guys were/are playing around a lot in here, but
> > > unfortunately I think that "pkt_sched: Always use q->requeue in
> > > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > > introduced a performance regression for HTB.
> 
> Since this looks to me as possibly theoretical problem I added
> Martin and Patrick to Cc.
> 
> > > 
> > > My tc rules are below, but in a nutshell I have 3 leaf classes.
> > > One with a rate of 500Mbit/s and the other two with 100Mbit/s.
> > > The ceiling for all classes is 1Gb/s and that is also both
> > > the rate and ceiling for the parent class.
> > > 
> > >                           [ rate=1Gbit/s ]
> > >                           [ ceil=1Gbit/s ]
> > >                                  |
> > >             +--------------------+--------------------+
> > >             |                    |                    |
> > >      [ rate=500Mbit/s ]   [ rate=100Mbit/s ]   [ rate=100Mbit/s ]
> > >      [ ceil=  1Gbit/s ]   [ ceil=100Mbit/s ]   [ ceil=  1Gbit/s ]
> 
> ?!       [ ceil=  1Gbit/s ]   [ ceil=1Gbit/s ]     [ ceil=  1Gbit/s ]

Sorry, you are correct. ceil=1Gbit/s for all leaf-nodes is correct.
ceil=100Mbit/s was a cut and paste error.
Simon Horman Oct. 8, 2008, 12:21 a.m. UTC | #8
On Wed, Oct 08, 2008 at 12:00:22AM +0200, Jarek Poplawski wrote:
> Patrick McHardy wrote, On 10/07/2008 02:48 PM:
> 
> > Jarek Poplawski wrote:
> >>>> Prior to this patch the result looks like this:
> >>>>
> >>>> 10194: 545134589bits/s 545Mbits/s
> >>>> 10197: 205358520bits/s 205Mbits/s
> >>>> 10196: 205311416bits/s 205Mbits/s
> >>>> -----------------------------------
> >>>> total: 955804525bits/s 955Mbits/s
> >>>>
> >>>> And after the patch the result looks like this:
> >>>> 10194: 384248522bits/s 384Mbits/s
> >>>> 10197: 284706778bits/s 284Mbits/s
> >>>> 10196: 288119464bits/s 288Mbits/s
> >>>> -----------------------------------
> >>>> total: 957074765bits/s 957Mbits/s
> > 
> > I've misinterpreted the numbers, please disregard my previous mail.
> > 
> > I'm wondering though, even before this patch, the sharing doesn't
> > seem to be proportional to the allocated rates. Assuming the upper
> > limit is somewhere around 950mbit, we have 250 mbit for sharing
> > above the allocated rates, so it should be:
> > 
> > 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> > 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> > 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> > 
> > But maybe my understanding of how excess bandwidth is distributed
> > with HTB is wrong.
> 
> Good point, but the numbers a bit wrong:
> 
> 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
> 100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
> 					==========
> 					950.00mbit
> 
> > 
> > I still can't really make anything of this bug, but the only two
> > visible differences to HTB resulting from requeing on an upper level
> > should be that
> > 
> > 1) it doesn't reactivate classes that went passive by the last dequeue
> > 2) the time checkpoint from the last dequeue event is different
> > 
> > I guess its in fact the second thing, if a lower priority packet
> > is requeued and dequeued again, HTB doesn't notice and might allow
> > the class to send earlier again than it would have previously.
> 
> With high requeuing the timing has to be wrong, but I'm not sure why
> just lower priority has to gain here.
> 
> Anyway, IMHO this regression is really doubtful: since the digits are
> wrong in both cases I can only agree the old method gives better wrong
> results...

I first started looking into this problem because I noticed that
borrowing wasn't working in the correct proportions. That is
the problem that Patrick pointed out and you re-did the maths for above.

I noticed this on 2.6.26-rc7. So I did some testing on older kernels and
noticed that although 2.6.26-rc7 was imperfect, it did seem that progress
was being made in the right direction.  Though unfortunately there is noise
in the results, so the trend may not be real. It was also unfortunate that
I was not able to get any older kernels to boot on the hw that I was using
for testing (an HP dl360-g5 - any kernel-config tips welcome).

2.6.27-rc7
----------
10194: 568641840bits/s 568Mbits/s
10197: 193942866bits/s 193Mbits/s
10196: 194073184bits/s 194Mbits/s
-----------------------------------
total: 956657890bits/s 956Mbits/s

2.6.26
------
10194: 507581709bits/s 507Mbits/s
10197: 224391677bits/s 224Mbits/s
10196: 224863501bits/s 224Mbits/s
-----------------------------------
total: 956836888bits/s 956Mbits/s

2.6.25
------
10194: 426211904bits/s 426Mbits/s
10197: 265862037bits/s 265Mbits/s
10196: 264875210bits/s 264Mbits/s
-----------------------------------
total: 956949152bits/s 956Mbits/s


Then I tested net-next-2.6 and noticed that things were not good, as I
reported in my opening post for this thread. Curiously, the trivial revert
patch that I posted, when applied on top of yesterday's net-next-2.6
("tcp: Respect SO_RCVLOWAT in tcp_poll()"), gives the closest to ideal
result that I have seen in any test.

10194: 666780666bits/s 666Mbits/s
10197: 141154197bits/s 141Mbits/s
10196: 141023090bits/s 141Mbits/s
-----------------------------------
total: 948957954bits/s 948Mbits/s


That does indeed seem promising. Though I do realise that my methods
have essentially been stabs in the dark and the problem needs to
be understood.
Patrick McHardy Oct. 8, 2008, 12:31 a.m. UTC | #9
Simon Horman wrote:
> On Wed, Oct 08, 2008 at 12:00:22AM +0200, Jarek Poplawski wrote:
>   
>> Anyway, IMHO this regression is really doubtful: since the digits are
>> wrong in both cases I can only agree the old method gives better wrong
>> results...
>>     
>
> I first started looking into this problem because I noticed that
> borrowing wasn't working in the correct proportions. That is
> the problem that Patrick pointed out and you re-did the maths for above.
>
> I noticed this on 2.6.26-rc7. So I did some testing on older kernels and
> noticed that although 2.6.26-rc7 was imperfect, it did seem that progress
> was being made in the right direction.  Though unfortunately there is noise
> in the results, so the trend may not be real. It was also unfortunate that
> I was not able to get any older kernels to boot on the hw that I was using
> for testing (an HP dl360-g5 - any kernel-config tips welcome).
>
> [...]

> noticed that things were not good, as I
> reported in my opening post for this thread. Curiously, the trivial revert
> patch that I posted, when applied on top of yesterdays's net-next-2.6
> ("tcp: Respect SO_RCVLOWAT in tcp_poll()"), gives the closest to ideal
> result that I have seen in any test.
>
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
>
>
> That does indeed seem promising. Though I do realise that my methods
> have essentially been stabs in the dark and the problem needs to
> be understood.
>   

I'm pretty sure that the differences are caused by HTB not being
in control of the queue, since the device is the real bottleneck
in this configuration. It's quite possible that there simply might
be a subtle timing change that causes feedback through HTB's
borrowing and ceiling.

So what would really be useful to understand this is to make HTB
control the queue and see if it behaves as expected.

Patrick McHardy Oct. 8, 2008, 12:40 a.m. UTC | #10
Patrick McHardy wrote:
> So what would really be useful to understand this is to make HTB
> control the queue and see if it behaves as expected. 

What also might help understand this better would be some
runtime data. An easy way to get some halfway usable data
is to just run

tc -s -d qdisc show dev ...; tc -s -d class show dev ...

in a loop.
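
For example, a minimal sampling loop along those lines (assuming eth0
and a one-second interval; the log file name is arbitrary):

while :; do
	date
	tc -s -d qdisc show dev eth0
	tc -s -d class show dev eth0
	sleep 1
done >> htb-stats.log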
Jarek Poplawski Oct. 8, 2008, 6:37 a.m. UTC | #11
On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
...
> > So, thanks for testing and reporting this. (BTW, what network card
> > do you use and is there multiqueuing on?)
> 
> The network card is an e1000. Acutally there are two, but I
> am only using the first one. I am not sure what the Broadcom
> controllers are (I can get someone to go and inspect the
> machine if its important).
> 
> # lspci | grep Ether
> 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> 
> And from dmsg
> 
> e1000e 0000:0b:00.0: setting latency timer to 64
> 0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> 0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
> 0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
> 0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
> e1000e 0000:0b:00.1: setting latency timer to 64
> 0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> 0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
> 0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
> 0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003
> 
> It is eth0 that I am using.

I only wondered whether we should consider any new effects because of
this, but e1000/e1000e don't have multiqueue support yet (bnx2 has).

> My kernel is compiled with CONFIG_NET_SCH_MULTIQ disabled.

Multiqueuing can work without this too.

Jarek P.
Jarek Poplawski Oct. 8, 2008, 6:55 a.m. UTC | #12
On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
...
> I'm pretty sure that the differences are caused by HTB not being
> in control of the queue since the device is the real bottleneck
> in this configuration.

Yes, otherwise there would be no requeuing. And, BTW, the golden rule
of scheduling/shaping is to limit below the "hardware" limits.

> Its quite possible that there simply might
> a subtle timing change that causes feedback through HTBs borrowing
> and ceiling.

I'd add my previous suspicion that there could be not enough enqueuing
in time for the fastest class (it could also be very bursty), so the
other classes can borrow more.

>
> So what would really be useful to understand this is to make HTB
> control the queue and see if it behaves as expected.
>

Right, trying with lower rates/ceils should explain this.

Jarek P.
Denys Fedoryshchenko Oct. 8, 2008, 7:06 a.m. UTC | #13
On Wednesday 08 October 2008, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> ...
>
> > I'm pretty sure that the differences are caused by HTB not being
> > in control of the queue since the device is the real bottleneck
> > in this configuration.
>
> Yes, otherwise there would be no requeuing. And, btw. the golden rule
> of scheduling/shaping is limiting below "hardware" limits.
By the way, HTB counts ethernet headers (e.g. if you send a 1500-byte
ping, it shows up as 1514 bytes in the tc counters), so the ethernet
overhead is possibly being counted as well.
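
For example (a sketch; the byte counts assume a plain 14-byte Ethernet
header, no VLAN tag, and a placeholder destination address):

# 1472 bytes of ICMP payload + 8 (ICMP) + 20 (IP) = a 1500-byte IP packet
ping -c 1 -s 1472 192.0.2.1
# the matching class/qdisc counters then grow by 1514 bytes (1500 + 14)
tc -s -d class show dev eth0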


Simon Horman Oct. 8, 2008, 7:22 a.m. UTC | #14
On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> ...
> > I'm pretty sure that the differences are caused by HTB not being
> > in control of the queue since the device is the real bottleneck
> > in this configuration.
> 
> Yes, otherwise there would be no requeuing. And, btw. the golden rule
> of scheduling/shaping is limiting below "hardware" limits.
> 
> > Its quite possible that there simply might
> > a subtle timing change that causes feedback through HTBs borrowing
> > and ceiling.
> 
> I'd add my previous suspicion there could be not enough enqeuing on
> time for the fastest class (could be also very bursty), so other
> classes can borrow more.
> 
> >
> > So what would really be useful to understand this is to make HTB
> > control the queue and see if it behaves as expected.
> >
> 
> Right, trying with lower rates/ceils should explain this.

As I mentioned earlier, things seem to work quite well with lower
rates/ceilings. When I set up the classes with 10x lower values
for rate and ceiling, as follows:


                           [ rate=100Mbit/s ]
                           [ ceil=100Mbit/s ]
                                  |
             +--------------------+--------------------+
             |                    |                    |
      [ rate= 50Mbit/s ]   [ rate= 10Mbit/s ]   [ rate= 10Mbit/s ]
      [ ceil=100Mbit/s ]   [ ceil=100Mbit/s ]   [ ceil= 100Mbit/s ]

Then I get results that are fairly close to the ideal values.

net-next-2.6 - d877984
----------------------
10194: 68075482bits/s 68Mbits/s
10197: 14464848bits/s 14Mbits/s
10196: 14465632bits/s 14Mbits/s
-----------------------------------
total: 97005962bits/s 97Mbits/s

And I get those kinds of results consistently across various
kernel versions.
Simon Horman Oct. 8, 2008, 7:22 a.m. UTC | #15
On Wed, Oct 08, 2008 at 06:37:01AM +0000, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> ...
> > > So, thanks for testing and reporting this. (BTW, what network card
> > > do you use and is there multiqueuing on?)
> > 
> > The network card is an e1000. Acutally there are two, but I
> > am only using the first one. I am not sure what the Broadcom
> > controllers are (I can get someone to go and inspect the
> > machine if its important).
> > 
> > # lspci | grep Ether
> > 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> > 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> > 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> > 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> > 
> > And from dmsg
> > 
> > e1000e 0000:0b:00.0: setting latency timer to 64
> > 0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> > 0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
> > 0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
> > 0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
> > e1000e 0000:0b:00.1: setting latency timer to 64
> > 0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> > 0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
> > 0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
> > 0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003
> > 
> > It is eth0 that I am using.
> 
> I've only wandered if we should consider any new effects because of
> this, but e1000/1000e don't have this yet (bnx2 has).
> 
> > My kernel is compiled with CONFIG_NET_SCH_MULTIQ disabled.
> 
> Multiquing can work without this too.

Ok, what should I be checking for?
Martin Devera Oct. 8, 2008, 7:34 a.m. UTC | #16
Patrick McHardy wrote:
> Patrick McHardy wrote:
>> So what would really be useful to understand this is to make HTB
>> control the queue and see if it behaves as expected. 
> 
> What also might help understand this better would be some
> runtime data. An easy way to get some halfway usable data
> is to just run
> 
> tc -s -d qdisc show dev ...; tc -s -d class show dev ...
> 
> in a loop.

IMHO the problem is with packets held outside of the qdisc.
HTB, like other non-work-conserving qdiscs, relies on information
about each class's backlogged state.
When there is a packet in a queue, the queue is active (has
a demand to send).
The algorithm samples queue states at deterministic but irregular
intervals to see which classes want service and which can lend.

If you hold a packet outside, the relevant class thinks that it is
not backlogged - and if it is sampled at this time then the algorithm
decides to lend that class's time.

And because the most active class has the biggest probability
of seeing its packet caught outside, the final effect is that all
classes end up equalized with (almost) the same rates.

Thus, from the qdisc's point of view, it is not good to keep a packet
out of the qdisc for any longer than necessary.

devik
Jarek Poplawski Oct. 8, 2008, 7:53 a.m. UTC | #17
On Wed, Oct 08, 2008 at 06:22:44PM +1100, Simon Horman wrote:
> On Wed, Oct 08, 2008 at 06:37:01AM +0000, Jarek Poplawski wrote:
> > On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> > > On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> > ...
> > > > So, thanks for testing and reporting this. (BTW, what network card
> > > > do you use and is there multiqueuing on?)
> > > 
> > > The network card is an e1000. Acutally there are two, but I
...
> > I've only wandered if we should consider any new effects because of
> > this, but e1000/1000e don't have this yet (bnx2 has).
...
> Ok, what should I be checking for?

I think that's all there is to it... We simply know multiqueuing
doesn't matter here.

Jarek P. 
Jarek Poplawski Oct. 8, 2008, 8:03 a.m. UTC | #18
On Wed, Oct 08, 2008 at 06:22:04PM +1100, Simon Horman wrote:
> On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote:
> > On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> > ...
> > > I'm pretty sure that the differences are caused by HTB not being
> > > in control of the queue since the device is the real bottleneck
> > > in this configuration.
> > 
> > Yes, otherwise there would be no requeuing. And, btw. the golden rule
> > of scheduling/shaping is limiting below "hardware" limits.
> > 
> > > Its quite possible that there simply might
> > > a subtle timing change that causes feedback through HTBs borrowing
> > > and ceiling.
> > 
> > I'd add my previous suspicion there could be not enough enqeuing on
> > time for the fastest class (could be also very bursty), so other
> > classes can borrow more.
> > 
> > >
> > > So what would really be useful to understand this is to make HTB
> > > control the queue and see if it behaves as expected.
> > >
> > 
> > Right, trying with lower rates/ceils should explain this.
> 
> As I mentioned earlier things seem to work quite well with lower
> rates/ceilings. When I set up the classes with 10x lower values
> for rate and celing, as follows:
> 
> 
>                            [ rate=100Mbit/s ]
>                            [ ceil=100Mbit/s ]
>                                   |
>              +--------------------+--------------------+
>              |                    |                    |
>       [ rate= 50Mbit/s ]   [ rate= 10Mbit/s ]   [ rate= 10Mbit/s ]
>       [ ceil=100Mbit/s ]   [ ceil=100Mbit/s ]   [ ceil= 100Mbit/s ]
> 
> Then I get results that are fairly close to the ideal values.
> 
> net-next-2.6 - d877984
> ----------------------
> 10194: 68075482bits/s 68Mbits/s
> 10197: 14464848bits/s 14Mbits/s
> 10196: 14465632bits/s 14Mbits/s
> -----------------------------------
> total: 97005962bits/s 97Mbits/s
> 
> And I get those kind of results consistently for various
> different kernel versions.

OK. But as Patrick mentioned, it would be interesting to try a little
below the hardware limit: 950Mbit, or maybe lower, until HTB becomes
accurate.

Jarek P.
Jarek Poplawski Oct. 8, 2008, 8:53 a.m. UTC | #19
On Wed, Oct 08, 2008 at 09:34:30AM +0200, Martin Devera wrote:
> Patrick McHardy wrote:
>> Patrick McHardy wrote:
>>> So what would really be useful to understand this is to make HTB
>>> control the queue and see if it behaves as expected. 
>>
>> What also might help understand this better would be some
>> runtime data. An easy way to get some halfway usable data
>> is to just run
>>
>> tc -s -d qdisc show dev ...; tc -s -d class show dev ...
>>
>> in a loop.
>
> IMHO there is problem with packet hold outside of the qdisc.
> HTB like other non-workconserving qdiscs rely on information
> about class backlogged state.
> When there is packet in a queue then the queue is active (has
> a demand to send).
> The algorithm samples queue states at deterministic but unregular
> intervals to see whose classes wants service and whose can lend.
>
> If you hold a packet outside, relevant class thinks that it is
> not backlogged - and if sampled at this time then the algorithm
> decides to lend classe's time.

Right, but on the other hand I can't see any correction of these
times/tokens, so it seems this can't give us "right" results
anyway? E.g. with 100% requeuing (each packet requeued once) HTB
should think it "gave" a rate 2x higher than what is seen on the
other side - or am I missing something?

>
> And because the most active class has the biggest probability
> to see its packet catched outside, as final effect all classes
> will end up equalized with (almost) the same rates.

Yes, but if this most active class always has packets in its
queue, and is the most prioritized, it seems it should get its
share anyway.

>
> Thus, from qdisc point, it is not good to keep a packet for
> more time out of the qdisc.

Sure, the question is how useful this is weighed against the
associated code complications and additional CPU usage.

Thanks,
Jarek P.
Martin Devera Oct. 8, 2008, 10:47 a.m. UTC | #20
>> The algorithm samples queue states at deterministic but unregular
>> intervals to see whose classes wants service and whose can lend.
>>
>> If you hold a packet outside, relevant class thinks that it is
>> not backlogged - and if sampled at this time then the algorithm
>> decides to lend classe's time.
> 
> Right, but on the other hand I can't see any correction of these
> times/tokens, so it seems this can't give us "right" results
> anyway? E.g. with 100% requeuing (each packet requeued once) HTB
> should think it "gave" the rate 2x higher than seen on the other
> side - or I miss something?

Yes, it is another problem - double accounting a packet when it is
requeued... Well, you are right, the numbers are not too supportive
of this explanation...
It seems that the first class didn't get its basic "rate", which
should be guaranteed.

Simon, can you try these things (separately):
a/ increase the quantum of the first class (say 10x)
b/ set ceil=rate on all three classes
(a minimal tc sketch of both follows below)

The idea is a/ to make sure there is no requeue-related change
to the DRR pointer which could boost the requeued class, and
b/ to see whether the prioritized class has problems sending or
the other classes are sending even when they should not.
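
A minimal sketch of both, assuming the classids from the script earlier
in the thread (500Mbit leaf as 1:11; with r2q 10000 its default quantum
is 62,500,000 bytes/s / 10000 = 6250 bytes, so 10x is 62500):

# a/ bigger quantum for the 500Mbit leaf only
tc class change dev eth0 parent 1:1 classid 1:11 htb \
	rate 500Mbit ceil 1Gbit quantum 62500

# b/ ceil=rate on all three leaf classes
tc class change dev eth0 parent 1:1 classid 1:11 htb rate 500Mbit ceil 500Mbit
tc class change dev eth0 parent 1:1 classid 1:12 htb rate 100Mbit ceil 100Mbit
tc class change dev eth0 parent 1:1 classid 1:13 htb rate 100Mbit ceil 100Mbit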

>> Thus, from qdisc point, it is not good to keep a packet for
>> more time out of the qdisc.
> 
> Sure, the question is how much it's useful against associated
> code complications and additional cpu usage.

Honestly, I'm not familiar with the new code. Can you tell me
briefly what gso_skb is and where the skb goes now if it is not requeued?

thanks, Martin
Jarek Poplawski Oct. 8, 2008, 12:04 p.m. UTC | #21
On Wed, Oct 08, 2008 at 12:47:57PM +0200, Martin Devera wrote:
...
>>> Thus, from qdisc point, it is not good to keep a packet for
>>> more time out of the qdisc.
>>
>> Sure, the question is how much it's useful against associated
>> code complications and additional cpu usage.
>
> honestly, I'm not familiar with the new code. Can you tell me
> in short what is gso_skb and where the skb goes now if not requeued ?

I hope I don't mislead you too much: gso_skb is actually a list of
skbs created by dividing a previous skb in dev_gso_segment(), called
from dev_hard_start_xmit(), which because of some device/driver error
gets back to qdisc_run()/qdisc_restart(). It is stored as q->gso_skb
and retried on the next xmits, before any new dequeuing, until it is
sent.

And currently, "normal" skbs are treated exactly the same way (so they
are not returned to the qdisc tree). There is still some possible
internal use of the old ->requeue() functions between qdiscs for some
strange cases, but that will probably be replaced too with storing to
the q->requeue lists (one level deep). So both of these cases could be
treated as: the skb is out of the qdiscs, with one foot in the driver's
door (probably because of some error - otherwise the driver should have
stopped the queue earlier).

This is supposed to simplify the code, especially error handling, and
especially wrt. multiqueuing.

Cheers,
Jarek P.
Simon Horman Oct. 9, 2008, 12:54 a.m. UTC | #22
On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 06:22:04PM +1100, Simon Horman wrote:
> > On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote:
> > > On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> > > ...
> > > > I'm pretty sure that the differences are caused by HTB not being
> > > > in control of the queue since the device is the real bottleneck
> > > > in this configuration.
> > > 
> > > Yes, otherwise there would be no requeuing. And, btw. the golden rule
> > > of scheduling/shaping is limiting below "hardware" limits.
> > > 
> > > > Its quite possible that there simply might
> > > > a subtle timing change that causes feedback through HTBs borrowing
> > > > and ceiling.
> > > 
> > > I'd add my previous suspicion there could be not enough enqeuing on
> > > time for the fastest class (could be also very bursty), so other
> > > classes can borrow more.
> > > 
> > > >
> > > > So what would really be useful to understand this is to make HTB
> > > > control the queue and see if it behaves as expected.
> > > >
> > > 
> > > Right, trying with lower rates/ceils should explain this.
> > 
> > As I mentioned earlier things seem to work quite well with lower
> > rates/ceilings. When I set up the classes with 10x lower values
> > for rate and celing, as follows:
> > 
> > 
> >                            [ rate=100Mbit/s ]
> >                            [ ceil=100Mbit/s ]
> >                                   |
> >              +--------------------+--------------------+
> >              |                    |                    |
> >       [ rate= 50Mbit/s ]   [ rate= 10Mbit/s ]   [ rate= 10Mbit/s ]
> >       [ ceil=100Mbit/s ]   [ ceil=100Mbit/s ]   [ ceil= 100Mbit/s ]
> > 
> > Then I get results that are fairly close to the ideal values.
> > 
> > net-next-2.6 - d877984
> > ----------------------
> > 10194: 68075482bits/s 68Mbits/s
> > 10197: 14464848bits/s 14Mbits/s
> > 10196: 14465632bits/s 14Mbits/s
> > -----------------------------------
> > total: 97005962bits/s 97Mbits/s
> > 
> > And I get those kind of results consistently for various
> > different kernel versions.
> 
> OK. But as Patrick mentioned it would be interesting to try a little
> below hardware limits: 950, or maybe lower, until HTB starts getting
> accuracy.

Hi,

it seems that for my particular setup the magic number is 935Mbit/s.


Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
("ipvs: Remove stray file left over from ipvs move"),
which is after Jarek's "pkt_sched: Update qdisc requeue stats in
dev_requeue_skb()" patch.

ideal (based on 950Mbit/s aggregate)
500mbit class (10194): 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class (10196): 100mbit + 250mbit/7*1 == 135.71mbit
100mbit class (10197): 100mbit + 250mbit/7*1 == 135.71mbit
                                                ==========
						950.00mbit
n=900
10194: 677727637bits/s 677Mbits/s
10197: 136662048bits/s 136Mbits/s
10196: 136725637bits/s 136Mbits/s
-----------------------------------
total: 951115322bits/s 951Mbits/s

n=920
10194: 676271338bits/s 676Mbits/s
10197: 137301090bits/s 137Mbits/s
10196: 137301877bits/s 137Mbits/s
-----------------------------------
total: 950874306bits/s 950Mbits/s

n=930
10194: 674681581bits/s 674Mbits/s
10197: 137538965bits/s 137Mbits/s
10196: 137541320bits/s 137Mbits/s
-----------------------------------
total: 949761866bits/s 949Mbits/s

n=933
10194: 675568704bits/s 675Mbits/s
10197: 137661437bits/s 137Mbits/s
10196: 137662221bits/s 137Mbits/s
-----------------------------------
total: 950892362bits/s 950Mbits/s

n=934
10194: 675399128bits/s 675Mbits/s
10197: 137653586bits/s 137Mbits/s
10196: 137704613bits/s 137Mbits/s
-----------------------------------
total: 950757328bits/s 950Mbits/s

n=935
10194: 675169893bits/s 675Mbits/s
10197: 137667714bits/s 137Mbits/s
10196: 137670858bits/s 137Mbits/s
-----------------------------------
total: 950508466bits/s 950Mbits/s

n=936
10194: 385295016bits/s 385Mbits/s
10197: 285078114bits/s 285Mbits/s
10196: 286588581bits/s 286Mbits/s
-----------------------------------
total: 956961712bits/s 956Mbits/s

n=937
10194: 384569616bits/s 384Mbits/s
10197: 285480072bits/s 285Mbits/s
10196: 286627050bits/s 286Mbits/s
-----------------------------------
total: 956676738bits/s 956Mbits/s

n=940
10194: 384577466bits/s 384Mbits/s
10197: 285655138bits/s 285Mbits/s
10196: 286846872bits/s 286Mbits/s
-----------------------------------
total: 957079477bits/s 957Mbits/s

n=950
10194: 384097789bits/s 384Mbits/s
10197: 285950325bits/s 285Mbits/s
10196: 286894760bits/s 286Mbits/s
-----------------------------------
total: 956942874bits/s 956Mbits/s

n=1000
10194: 384639482bits/s 384Mbits/s
10197: 285133069bits/s 285Mbits/s
10196: 287172674bits/s 287Mbits/s
-----------------------------------
total: 956945226bits/s 956Mbits/s


# HTB
###########################################################################
n=933

tc qdisc del dev eth0 root

tc qdisc add dev eth0 root handle 1: htb default 10 r2q $((n * 10))

tc class add dev eth0 parent 1:  classid 1:1   htb \
        rate ${n}Mbit ceil ${n}Mbit

tc class add dev eth0 parent 1:1 classid 1:10 htb \
        rate ${n}Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:12 htb \
        rate 100Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:13 htb \
        rate 100Mbit ceil ${n}Mbit

#tc filter add dev eth0 protocol ip parent 1: \
#       u32 match ip src 172.17.60.194 flowid 1:11
#tc filter add dev eth0 protocol ip parent 1: \
#       u32 match ip src 172.17.60.196 flowid 1:12
#tc filter add dev eth0 protocol ip parent 1: \
#       u32 match ip src 172.17.60.197 flowid 1:13

tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10194 0xffff flowid 1:11
tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10196 0xffff flowid 1:12
tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10197 0xffff flowid 1:13

tc -s -d class show dev eth0

Simon Horman Oct. 9, 2008, 1:09 a.m. UTC | #23
On Wed, Oct 08, 2008 at 12:47:57PM +0200, Martin Devera wrote:
>>> The algorithm samples queue states at deterministic but unregular
>>> intervals to see whose classes wants service and whose can lend.
>>>
>>> If you hold a packet outside, relevant class thinks that it is
>>> not backlogged - and if sampled at this time then the algorithm
>>> decides to lend classe's time.
>>
>> Right, but on the other hand I can't see any correction of these
>> times/tokens, so it seems this can't give us "right" results
>> anyway? E.g. with 100% requeuing (each packet requeued once) HTB
>> should think it "gave" the rate 2x higher than seen on the other
>> side - or I miss something?
>
> Yes, it is another problem - double acounting packet when requeued...
> Well, you are right, the number are not too supportive to this
> explanation...
> It seems that the first class didn't get its basic "rate", which
> is should be guaranteed.
>
> Simon, can you try to these things (separately):
> a/ increase quantum to the first class (say 10x)

Hi Martin,

Do you mean increase r2q? If so, here are some results:

r2q=10000 (original setting)
10194: 383950984bits/s 383Mbits/s
10197: 285071834bits/s 285Mbits/s
10196: 287241757bits/s 287Mbits/s
-----------------------------------
total: 956264576bits/s 956Mbits/s

r2q=25000
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 375149600bits/s 375Mbits/s
10197: 289728064bits/s 289Mbits/s
10196: 291783370bits/s 291Mbits/s
-----------------------------------
total: 956661034bits/s 956Mbits/s

r2q=50000
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 367905789bits/s 367Mbits/s
10197: 292223005bits/s 292Mbits/s
10196: 296510256bits/s 296Mbits/s
-----------------------------------
total: 956639050bits/s 956Mbits/s

r2q=100000
HTB: quantum of class 10011 is small. Consider r2q change.
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 367925416bits/s 367Mbits/s
10197: 293556834bits/s 293Mbits/s
10196: 295315384bits/s 295Mbits/s
-----------------------------------
total: 956797634bits/s 956Mbits/s

> b/ set ceil=rate on all three classes

This seems pretty close to the expected/ideal result.

10194: 496575074bits/s 496Mbits/s
10197:  96969861bits/s  96Mbits/s
10196:  96973002bits/s  96Mbits/s
-----------------------------------
total: 690517938bits/s 690Mbits/s
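
For reference, a minimal sketch of the ceil=rate variant of the leaf
classes, assuming the same script as before (the exact commands are
illustrative, not copied from the test box):

# ceil=rate on all three leaf classes, everything else as in the script
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil 500Mbit
tc class add dev eth0 parent 1:1 classid 1:12 htb \
        rate 100Mbit ceil 100Mbit
tc class add dev eth0 parent 1:1 classid 1:13 htb \
        rate 100Mbit ceil 100Mbit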


Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
("ipvs: Remove stray file left over from ipvs move"),
which is after Jarek's "pkt_sched: Update qdisc requeue stats in
dev_requeue_skb()" patch.

>
> The idea is to a/ make sure there is no requeue-related change
> to the drr pointer which could boost the requeued class, and
> b/ to see whether the prioritized class has trouble sending or
> other classes are sending even when they should not.
>
>>> Thus, from the qdisc's point of view, it is not good to keep a packet
>>> out of the qdisc for too long.
>>
>> Sure, the question is how much it's worth against the associated
>> code complications and additional cpu usage.
>
> honestly, I'm not familiar with the new code. Can you tell me
> in short what gso_skb is and where the skb goes now if it is not requeued?
>
> thanks, Martin
Jarek Poplawski Oct. 9, 2008, 6:21 a.m. UTC | #24
On Thu, Oct 09, 2008 at 11:54:40AM +1100, Simon Horman wrote:
> On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote:
...
> > OK. But as Patrick mentioned it would be interesting to try a little
> > below hardware limits: 950, or maybe lower, until HTB starts getting
> > accuracy.
> 
> Hi,

Hi Simon,

> 
> it seems that for my particular setup the magic number is 935Mbit/s.
> 
> 
> Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
> ("ipvs: Remove stray file left over from ipvs move"),
> which is after Jarek's "pkt_sched: Update qdisc requeue stats in
> dev_requeue_skb()" patch.
> 
> ideal (based on 950Mbit/s aggregate)
> 500mbit class (10194): 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class (10196): 100mbit + 250mbit/7*1 == 135.71mbit
> 100mbit class (10197): 100mbit + 250mbit/7*1 == 135.71mbit
>                                                 ==========
> 						950.00mbit
> n=900
> 10194: 677727637bits/s 677Mbits/s
> 10197: 136662048bits/s 136Mbits/s
> 10196: 136725637bits/s 136Mbits/s
> -----------------------------------
> total: 951115322bits/s 951Mbits/s
> 
> n=920
> 10194: 676271338bits/s 676Mbits/s
> 10197: 137301090bits/s 137Mbits/s
> 10196: 137301877bits/s 137Mbits/s
> -----------------------------------
> total: 950874306bits/s 950Mbits/s
> 
> n=930
> 10194: 674681581bits/s 674Mbits/s
> 10197: 137538965bits/s 137Mbits/s
> 10196: 137541320bits/s 137Mbits/s
> -----------------------------------
> total: 949761866bits/s 949Mbits/s
> 
> n=933
> 10194: 675568704bits/s 675Mbits/s
> 10197: 137661437bits/s 137Mbits/s
> 10196: 137662221bits/s 137Mbits/s
> -----------------------------------
> total: 950892362bits/s 950Mbits/s
> 
> n=934
> 10194: 675399128bits/s 675Mbits/s
> 10197: 137653586bits/s 137Mbits/s
> 10196: 137704613bits/s 137Mbits/s
> -----------------------------------
> total: 950757328bits/s 950Mbits/s
> 
> n=935
> 10194: 675169893bits/s 675Mbits/s
> 10197: 137667714bits/s 137Mbits/s
> 10196: 137670858bits/s 137Mbits/s
> -----------------------------------
> total: 950508466bits/s 950Mbits/s
> 
> n=936
> 10194: 385295016bits/s 385Mbits/s
> 10197: 285078114bits/s 285Mbits/s
> 10196: 286588581bits/s 286Mbits/s
> -----------------------------------
> total: 956961712bits/s 956Mbits/s

It's hard to believe such a small change can matter so much here!
Great testing!

It seems it's safer to use something below such a magic number, and
generally staying 10% below the hardware limit could be the rule.
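
For illustration only - a minimal sketch of that rule of thumb for a 1Gbit
NIC, with 900Mbit as an assumed "10% below" value (not a number measured
in this thread):

n=900   # ~10% below the 1Gbit line rate
tc class add dev eth0 parent 1:  classid 1:1 htb \
        rate ${n}Mbit ceil ${n}Mbit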

It also looks like my idea that the first class might not be getting
enough packets was wrong. It seems it's rather blocked by other classes
when HTB thinks it can lend more.

Anyway, I'm also surprised HTB can still be so exact: congratulations,
Martin!

Simon, if I'm not missing something, I guess you should be OK with these
last changes in requeuing? The older way provided some safety for such
overestimated configs by hiding the real problem, but there were costs:
cpu etc.

Thanks,
Jarek P.

Martin Devera Oct. 9, 2008, 6:22 a.m. UTC | #25
>>
 >> Simon, can you try to these things (separately):
 >> a/ increase quantum to the first class (say 10x)
 >
 > Hi Martin,
 >
 > Do you mean increase r2q ? If so, here are some results

no, no, rather add ... "quantum 50000" to the first class only;
r2q affects the computation for all classes
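
Something along these lines - a sketch only, reusing the class ids and
${n} from the script posted earlier in the thread:

# quantum: bytes the class may send per DRR round when borrowing
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil ${n}Mbit quantum 50000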

 >> b/ set ceil=rate on all three classes
 >
 > This seems pretty close to the expected/ideal result.
 >
 > 10194: 496575074bits/s 496Mbits/s
 > 10197:  96969861bits/s  96Mbits/s
 > 10196:  96973002bits/s  96Mbits/s
 > -----------------------------------
 > total: 690517938bits/s 690Mbits/s
 >

Hmm, it seems that the problem is with the drr loop or borrowing;
rate computation seems to work ok even in this corner case.

In any case, it follows from your other test that when the NIC
throttles, HTB can't arrive at the correct rates - I consider
this an HTB bug and am adding it to my (very long) todo list...

thanks, Martin
Martin Devera Oct. 9, 2008, 6:53 a.m. UTC | #26
> It also looks like my idea that the first class could get not enough
> packets was wrong. It seems it's rather blocked by other classes when
> HTB thinks it can lend more.
> 
> Anyway, I'm also surprised HTB could be still so exact: congratulations
> Martin!

thanks for making my depression smaller ;-) I'd like to make it work
even on the edge..

devik
Jarek Poplawski Oct. 9, 2008, 9:56 a.m. UTC | #27
On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
> >>
> >> Simon, can you try to these things (separately):
> >> a/ increase quantum to the first class (say 10x)
> >
> > Hi Martin,
> >
> > Do you mean increase r2q ? If so, here are some results
>
> no, no, add rather ... "quantum 50000" to the first class only
> r2q affects computation for all classes

Yes, it's a half-documented parameter: not in the man page, but shown e.g. by:
tc class add htb help

BTW, I wonder if a better priority/quantum doesn't actually hurt here.
When HTB lends the rate, the first class here is always served at the
beginning, and if that is over the hardware limit, this class's packets
end up waiting. The other classes get their chunks later, probably after
the card has done some cleaning, so they may never have to wait in this
situation. So, maybe it would be interesting to try it the other way:
lower the priority of the first class with a prio param, e.g. 1?
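
A sketch of that idea, assuming the leaf classes from the script posted
earlier in the thread; only the 500Mbit class gets the extra parameter:

# prio 1 = lower priority than the other leaves, which keep the default prio 0
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil ${n}Mbit prio 1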

Jarek P.
Jarek Poplawski Oct. 9, 2008, 10:14 a.m. UTC | #28
On Thu, Oct 09, 2008 at 09:56:00AM +0000, Jarek Poplawski wrote:
...
> So, maybe it would be interesting to try this other way:
> lower priority of the first class with a prio param e.g. 1?

Actually, this alone might not be enough; additionally lowering the quantum
of this class (e.g. quantum 10000) could be useful.
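
Extending the previous sketch with the lower quantum - again just an
assumed form of the command, not a tested configuration:

tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil ${n}Mbit prio 1 quantum 10000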

Jarek P.
Martin Devera Oct. 9, 2008, 10:52 a.m. UTC | #29
Jarek Poplawski wrote:
> On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
>>>> Simon, can you try to these things (separately):
>>>> a/ increase quantum to the first class (say 10x)
>>> Hi Martin,
>>>
>>> Do you mean increase r2q ? If so, here are some results
>> no, no, add rather ... "quantum 50000" to the first class only
>> r2q affects computation for all classes
> 
> Yes, it's a half-documented parameter: not in the man, but e.g. here:
> tc class add htb help
> 
> BTW, I wonder if better priority/quantum doesn't actually harm here.
> When HTB lends the rate the first class here is always served at the
> beginning. And if it's over the hardware limit, this class's packets
> are waiting. Other classes get their chunks later, probably after the
> card did some cleaning, so they may never have to wait at this
> situation. So, maybe it would be interesting to try this other way:
> lower priority of the first class with a prio param e.g. 1?

It is starting to seem too complex to me (too many race-related hypotheses).
The original intent was to investigate a possibly corrupted drr pointer,
but by now I think the only way is to stuff enough debug code into
htb and try it myself...
It will probably be faster - once I find the time to do it.
Jarek Poplawski Oct. 9, 2008, 11:04 a.m. UTC | #30
On Thu, Oct 09, 2008 at 12:52:27PM +0200, Martin Devera wrote:
...
> It starts to seem too complex to me (too many races related hypotheses).
> The original intent was to investigate possibly corrupted drr pointer,
> but just now I think the only way is to stuff enough debug code into
> htb and try myself...
> It will be probable faster - once I find time to do it.

Sure, but, on the other hand, isn't it too simple...

Jarek P.
Simon Horman Oct. 9, 2008, 11:11 a.m. UTC | #31
On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
> >>
> >> Simon, can you try to these things (separately):
> >> a/ increase quantum to the first class (say 10x)
> >
> > Hi Martin,
> >
> > Do you mean increase r2q ? If so, here are some results
>
> no, no, add rather ... "quantum 50000" to the first class only
> r2q affects computation for all classes

Sorry, I wasn't aware of that parameter.
I have now run a test. With the default quantum (6250) for the 500Mbit/s
class I got:

10194: 383384949bits/s 383Mbits/s
10197: 285879669bits/s 285Mbits/s
10196: 287648424bits/s 287Mbits/s
-----------------------------------
total: 956913042bits/s 956Mbits/s

And with a quantum of 50000 for the 500Mbit/s class I got:

10194: 396345613bits/s 396Mbits/s
10197: 279511994bits/s 279Mbits/s
10196: 281062498bits/s 281Mbits/s
-----------------------------------
total: 956920106bits/s 956Mbits/s

> >> b/ set ceil=rate on all three classes
> >
> > This seems pretty close to the expected/ideal result.
> >
> > 10194: 496575074bits/s 496Mbits/s
> > 10197:  96969861bits/s  96Mbits/s
> > 10196:  96973002bits/s  96Mbits/s
> > -----------------------------------
> > total: 690517938bits/s 690Mbits/s
> >
>
> Hmm, it seems that problem is with drr loop or borrowing,
> rate computation seems to work ok even in this corner case.
>
> In any case from your other test follows that when NIC
> throttles then HTB can't arrive at correct rates - I feel
> this as HTB bug and add it to my (very long) todo list...
>
> thanks, Martin
Simon Horman Oct. 9, 2008, 11:18 a.m. UTC | #32
On Thu, Oct 09, 2008 at 06:21:45AM +0000, Jarek Poplawski wrote:
> On Thu, Oct 09, 2008 at 11:54:40AM +1100, Simon Horman wrote:
> > On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote:
> ...
> > > OK. But as Patrick mentioned it would be interesting to try a little
> > > below hardware limits: 950, or maybe lower, until HTB starts getting
> > > accuracy.
> > 
> > Hi,
> 
> Hi Simon,
> 
> > 
> > it seems that for my particular setup the magic number is 935Mbit/s.

[snip]

> It's hard to believe so small change can matter so much here!!!
> Great testing!

Thanks, I was rather surprised myself.

> It seems it's safer to use something below such magic number, and
> generally 10% below hardware limit could be the rule.

That kind of rule would certainly solve this case,
but it would be nice for such a rule not to be necessary.

> It also looks like my idea that the first class could get not enough
> packets was wrong. It seems it's rather blocked by other classes when
> HTB thinks it can lend more.
> 
> Anyway, I'm also surprised HTB could be still so exact: congratulations
> Martin!
> 
> Simon, if I don't miss something, I guess you should be OK with these
> last changes in requeuing? The older way did some safety for such
> overestimated configs by hiding the real problem, but there were the
> costs: cpu etc.

I think that, as Martin pointed out in another post, there
are really too many hypotheses in play at this point.
It would be nice to try and get to the bottom of this,
but I'm prepared to concede that it isn't a show-stopper.
Martin Devera Oct. 9, 2008, 11:22 a.m. UTC | #33
>> r2q affects computation for all classes
> 
> Sorry, I wasn't aware of that parameter.
> I ran a test now. With the default quantum (6250) for the 500Mbit/s
> class I got:
> 
> 10194: 383384949bits/s 383Mbits/s
> 10197: 285879669bits/s 285Mbits/s
> 10196: 287648424bits/s 287Mbits/s
> -----------------------------------
> total: 956913042bits/s 956Mbits/s
> 
> And when a quantum of 50000 for the 500Mbit/s class I got:
> 
> 10194: 396345613bits/s 396Mbits/s
> 10197: 279511994bits/s 279Mbits/s
> 10196: 281062498bits/s 281Mbits/s
> -----------------------------------
> total: 956920106bits/s 956Mbits/s

thanks! it seems there is no strong dependence. I'm out of simple
explanations and I'd have to simulate it myself.
Unfortunately I can't say when I can do that, as I'm in a complicated
part of life right now...
So let's live with the 90%-of-max-rate limitation for now.
Patrick McHardy Oct. 9, 2008, 11:58 a.m. UTC | #34
Simon Horman wrote:
>> It seems it's safer to use something below such magic number, and
>> generally 10% below hardware limit could be the rule.
> 
> That kind of rule would certainly solve this case,
> but it would be nice for such a rule not to be necessary.

It would be interesting to see if higher burst rates allow you
to use higher rates. Userspace automatically calculates the
bursts based on the timer frequency, which requires that the
timers actually reach that frequency during runtime. With high
packet rates that might not always be the case, so perhaps we
can do better by adding a few percent on top.
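
A minimal sketch of such a test - the burst sizes here are arbitrary
guesses for experimentation, not values derived from the timer frequency:

# explicit burst/cburst instead of the userspace-calculated defaults
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil ${n}Mbit burst 200k cburst 200k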
Jarek Poplawski Oct. 9, 2008, 12:36 p.m. UTC | #35
On Thu, Oct 09, 2008 at 10:18:38PM +1100, Simon Horman wrote:
> On Thu, Oct 09, 2008 at 06:21:45AM +0000, Jarek Poplawski wrote:
...
> > It seems it's safer to use something below such magic number, and
> > generally 10% below hardware limit could be the rule.
> 
> That kind of rule would certainly solve this case,
> not it would be nice for such a rule not to be necessary.

As a matter of fact I think most admins are used to this, especially
with various DSLs etc. As you can see, your 1Gbit isn't the real rate here,
so what you propose amounts to ignoring the parameters and doing
autodetection, which is possible but probably error-prone as well.

> > Simon, if I don't miss something, I guess you should be OK with these
> > last changes in requeuing? The older way did some safety for such
> > overestimated configs by hiding the real problem, but there were the
> > costs: cpu etc.
> 
> I think that as Martin pointed out in another post, there
> is really too much hypotheses going on at this point.
> I think that it would be nice to try and get to the bottom of this,
> but I'm prepared to concede that it isn't a show-stopper.

Actually, I really doubt it's any kind of stopper at all. It's a low-level
tool, not intended for common use, and it really needs some tweaking.
(Some people still prefer CBQ for more control...)

Thanks,
Jarek P.
Jarek Poplawski Oct. 10, 2008, 6:59 a.m. UTC | #36
On 08-10-2008 00:00, Jarek Poplawski wrote:
> Patrick McHardy wrote, On 10/07/2008 02:48 PM:
...
>> I still can't really make anything of this bug, but the only two
>> visible differences to HTB resulting from requeing on an upper level
>> should be that
>>
>> 1) it doesn't reactivate classes that went passive by the last dequeue
>> 2) the time checkpoint from the last dequeue event is different
>>
>> I guess its in fact the second thing, if a lower priority packet
>> is requeued and dequeued again, HTB doesn't notice and might allow
>> the class to send earlier again than it would have previously.
> 
> With high requeuing the timing has to be wrong, but I'm not sure why
> just the lower priority classes should gain here.

I think it's quite simple and not hypothetical: lower-priority classes
gain from overestimated ceils, not from requeuing. Because of lending
over the hardware limit, all buckets get more tokens than their rate,
and they are allowed to send more and more until the hardware stops.
HTB still doesn't know about this and can keep lending until the mbuffer
limit is reached. So at some point all buffers are full, everyone can
send (bigger buffers could still give some advantage) and everything is
controlled only by the hardware. This is a clear gain for the less
privileged classes.

Requeuing can actually prevent this to some extent, which, IMHO,
shouldn't matter here because it's not the intended way to shape traffic.

But we could consider whether, after removing the requeuing which
mattered here, some change is needed in the "proper" way of limiting
such effects of wrong parameters or hardware errors (like the size of
mbuffer etc.)?

Thanks,
Jarek P.
diff mbox

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 31f6b61..d2e0da6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,7 +44,10 @@  static inline int qdisc_qlen(struct Qdisc *q)
 
 static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
-	q->gso_skb = skb;
+	if (unlikely(skb->next))
+		q->gso_skb = skb;
+	else
+		q->ops->requeue(skb, q);
 	__netif_schedule(q);
 
 	return 0;