Message ID:   20081007045145.GA23883@verge.net.au
State:        RFC, archived
Delegated to: David Miller
On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > Hi Dave, Hi Jarek,

Hi Simon,

> > I know that you guys were/are playing around a lot in here, but
> > unfortunately I think that "pkt_sched: Always use q->requeue in
> > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > introduced a performance regression for HTB.
...
> > The test machine with the tc rules and udp-sending processes
> > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > is SMP x86_64.
>
> With the following patch (basically a reversal of "pkt_sched: Always use
> q->requeue in dev_requeue_skb()") forward ported to the current
> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> rather nice numbers (IMHO).
>
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
>
> I'm not sure what evil things this patch does to other aspects
> of the qdisc code.

I'd like to establish this too. This patch was meant to remove some
other problems in possibly the simplest way. Maybe it's too simple.
Anyway, it's kind of an RFC, so the rest of the requeuing code is left
unchanged, just for easy revoking, as below. But first we should try
to understand this more.

So, thanks for testing and reporting this. (BTW, what network card do
you use, and is multiqueuing on?)

Jarek P.
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 31f6b61..d2e0da6 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -44,7 +44,10 @@ static inline int qdisc_qlen(struct Qdisc *q)
> >
> >  static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >  {
> > -	q->gso_skb = skb;
> > +	if (unlikely(skb->next))
> > +		q->gso_skb = skb;
> > +	else
> > +		q->ops->requeue(skb, q);
> >  	__netif_schedule(q);
> >
> >  	return 0;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
>> With the following patch (basically a reversal of "pkt_sched: Always use
>> q->requeue in dev_requeue_skb()") forward ported to the current
>> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
>> rather nice numbers (IMHO).
>>
>> 10194: 666780666bits/s 666Mbits/s
>> 10197: 141154197bits/s 141Mbits/s
>> 10196: 141023090bits/s 141Mbits/s
>> -----------------------------------
>> total: 948957954bits/s 948Mbits/s
>>
>> I'm not sure what evil things this patch does to other aspects
>> of the qdisc code.
>
> I'd like to establish this too. This patch was meant to remove some
> other problems in possibly the simplest way. Maybe it's too simple.
> Anyway, it's kind of an RFC, so the rest of the requeuing code is left
> unchanged, just for easy revoking, as below. But first we should try
> to understand this more.

Shooting in the dark: I don't see how this change could affect the
bandwidth except by introducing higher dequeue latency due to using
netif_schedule instead of qdisc_watchdog. Does anyone know how device
scheduling latencies compare to hrtimers?
On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > Hi Dave, Hi Jarek,
> >
> > I know that you guys were/are playing around a lot in here, but
> > unfortunately I think that "pkt_sched: Always use q->requeue in
> > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > introduced a performance regression for HTB.

Since this looks to me like a possibly theoretical problem I added
Martin and Patrick to Cc.

> > My tc rules are below, but in a nutshell I have 3 leaf classes.
> > One with a rate of 500Mbit/s and the other two with 100Mbit/s.
> > The ceiling for all classes is 1Gb/s and that is also both
> > the rate and ceiling for the parent class.
> >
> >                      [ rate=1Gbit/s ]
> >                      [ ceil=1Gbit/s ]
> >                              |
> >    +--------------------+--------------------+
> >    |                    |                    |
> > [ rate=500Mbit/s ]  [ rate=100Mbit/s ]  [ rate=100Mbit/s ]
> > [ ceil= 1Gbit/s  ]  [ ceil=100Mbit/s ]  [ ceil= 1Gbit/s  ]

?! [ ceil= 1Gbit/s  ]  [ ceil=1Gbit/s   ]  [ ceil= 1Gbit/s  ]

> > The tc rules have an extra class for all other traffic,
> > but it's idle, so I left it out of the diagram.
> >
> > In order to test this I set up filters so that traffic to each of
> > ports 10194, 10196 and 10197 is directed to one of the leaf classes.
> > I then set up a process on the same host for each port sending UDP
> > as fast as it could in a while() { send(); } loop. On another host
> > I set up processes listening for the UDP traffic in a
> > while() { recv(); } loop. And I measured the results.
> >
> > (I should be able to provide the code used for testing,
> > but it's not mine and my colleague who wrote it is off
> > with the flu today.
> > )
> >
> > Prior to this patch the result looks like this:
> >
> > 10194: 545134589bits/s 545Mbits/s
> > 10197: 205358520bits/s 205Mbits/s
> > 10196: 205311416bits/s 205Mbits/s
> > -----------------------------------
> > total: 955804525bits/s 955Mbits/s
> >
> > And after the patch the result looks like this:
> >
> > 10194: 384248522bits/s 384Mbits/s
> > 10197: 284706778bits/s 284Mbits/s
> > 10196: 288119464bits/s 288Mbits/s
> > -----------------------------------
> > total: 957074765bits/s 957Mbits/s

So, in short, the results with requeuing off show the first class
doesn't get its rate while the others can borrow.

My first (maybe wrong) idea is that requeuing could be used here for
something it probably wasn't meant for. The scenario could be like
this: the first (and most privileged) class is sending up to the card
limit, and when xmit is stopped and requeuing is on, it slows the
others down (while it has to wait anyway) with requeuing procedures,
plus gets an "additional" packet in its queue. In the "requeuing off"
case there should be a bit more time for the others, and each packet
is seen only once.

Since it looks like HTB was lending unused rate, it had to try the
first class first, so if this didn't get used, probably there were not
enough packets in its queue, and as mentioned above, the requeuing
code could help to get them, and so prevent lending to the others when
there is not enough enqueuing in the meantime.

So, maybe my diagnosis is totally wrong, but here are the questions:

1) Is HTB or other similar scheduling code expected to limit correctly
   while we substantially overlimit (given that requeuing is used so
   much)?
2) Should requeuing be considered such an important factor in
   controlling the rates?

I have some doubts it should work like this.

Jarek P.
> > There is some noise in these results, but I think that it's clear
> > that before the patch all leaf classes received at least their rate,
> > and after the patch the rate=500Mbit/s class received much less than
> > its rate. This I believe is a regression.
> >
> > I do not believe that this happens at lower bit rates, for instance
> > if you reduce the ceiling and rate of all classes by a factor of 10.
> > I can produce some numbers on that if you want them.
> >
> > The test machine with the tc rules and udp-sending processes
> > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > is SMP x86_64.
>
> With the following patch (basically a reversal of "pkt_sched: Always use
> q->requeue in dev_requeue_skb()") forward ported to the current
> net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> rather nice numbers (IMHO).
>
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
>
> I'm not sure what evil things this patch does to other aspects
> of the qdisc code.
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 31f6b61..d2e0da6 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -44,7 +44,10 @@ static inline int qdisc_qlen(struct Qdisc *q)
> >
> >  static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >  {
> > -	q->gso_skb = skb;
> > +	if (unlikely(skb->next))
> > +		q->gso_skb = skb;
> > +	else
> > +		q->ops->requeue(skb, q);
> >  	__netif_schedule(q);
> >
> >  	return 0;

-------------------------

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb default 10 r2q 10000

tc class add dev eth0 parent 1: classid 1:1 htb \
        rate 1Gbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:10 htb \
        rate 1Gbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:11 htb \
        rate 500Mbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:12 htb \
        rate 100Mbit ceil 1Gbit
tc class add dev eth0 parent 1:1 classid 1:13 htb \
        rate 100Mbit ceil 1Gbit

tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10194 0xffff flowid 1:11
tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10196 0xffff flowid 1:12
tc filter add dev eth0 protocol ip parent 1: \
        u32 match ip dport 10197 0xffff flowid 1:13
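[Editor's note: a side note on the r2q 10000 in the rules above. HTB
derives each class's DRR quantum by dividing the class rate, in bytes
per second, by r2q. The sketch below is my own illustration of that
arithmetic (the function name and loop are mine, not from the thread);
it also shows why the default r2q of 10 is unusable at these rates.]

```python
# Sketch (assumption): HTB's per-class DRR quantum is the class rate
# in bytes/sec divided by the qdisc-level r2q divisor.
def htb_quantum(rate_bits_per_sec, r2q):
    return rate_bits_per_sec // 8 // r2q

# The classes from the tc rules above, with the r2q 10000 used there.
for classid, rate in [("1:1/1:10", 10**9), ("1:11", 500 * 10**6),
                      ("1:12/1:13", 100 * 10**6)]:
    print(classid, htb_quantum(rate, 10000), "bytes")

# With the default r2q of 10, a 1Gbit class would get a 12.5MB quantum
# per round; r2q 10000 brings it down to 12500 bytes, roughly eight
# full-size Ethernet frames.
print(htb_quantum(10**9, 10), "bytes with default r2q")
```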
Jarek Poplawski wrote:
>>> Prior to this patch the result looks like this:
>>>
>>> 10194: 545134589bits/s 545Mbits/s
>>> 10197: 205358520bits/s 205Mbits/s
>>> 10196: 205311416bits/s 205Mbits/s
>>> -----------------------------------
>>> total: 955804525bits/s 955Mbits/s
>>>
>>> And after the patch the result looks like this:
>>>
>>> 10194: 384248522bits/s 384Mbits/s
>>> 10197: 284706778bits/s 284Mbits/s
>>> 10196: 288119464bits/s 288Mbits/s
>>> -----------------------------------
>>> total: 957074765bits/s 957Mbits/s

I've misinterpreted the numbers, please disregard my previous mail.

I'm wondering though, even before this patch, the sharing doesn't
seem to be proportional to the allocated rates. Assuming the upper
limit is somewhere around 950mbit, we have 250mbit for sharing
above the allocated rates, so it should be:

500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class: 100mbit + 250mbit/1*5 == 150mbit
100mbit class: 100mbit + 250mbit/1*5 == 150mbit

But maybe my understanding of how excess bandwidth is distributed
with HTB is wrong.

> So, in short, the results with requeuing off show the first class
> doesn't get its rate while the others can borrow.
>
> My first (maybe wrong) idea is that requeuing could be used here for
> something it probably wasn't meant for. The scenario could be like
> this: the first (and most privileged) class is sending up to the card
> limit, and when xmit is stopped and requeuing is on, it slows the
> others down (while it has to wait anyway) with requeuing procedures,
> plus gets an "additional" packet in its queue.
>
> In the "requeuing off" case there should be a bit more time for the
> others, and each packet is seen only once.
> Since it looks like HTB was lending unused rate, it had to try the
> first class first, so if this didn't get used, probably there were
> not enough packets in its queue, and as mentioned above, the
> requeuing code could help to get them, and so prevent lending to the
> others when there is not enough enqueuing in the meantime.
>
> So, maybe my diagnosis is totally wrong, but here are the questions:
>
> 1) Is HTB or other similar scheduling code expected to limit correctly
>    while we substantially overlimit (given that requeuing is used so
>    much)?
> 2) Should requeuing be considered such an important factor in
>    controlling the rates?
>
> I have some doubts it should work like this.

I still can't really make anything of this bug, but the only two
visible differences to HTB resulting from requeuing on an upper level
should be that

1) it doesn't reactivate classes that went passive by the last dequeue
2) the time checkpoint from the last dequeue event is different

I guess it's in fact the second thing: if a lower priority packet is
requeued and dequeued again, HTB doesn't notice and might allow the
class to send earlier again than it would have previously.

Simon, if you set the ceiling to something around the real limit
you're able to reach (maybe try 940mbit), do the proportions change
significantly?
Patrick McHardy wrote, On 10/07/2008 02:48 PM:
> Jarek Poplawski wrote:
>>>> Prior to this patch the result looks like this:
>>>>
>>>> 10194: 545134589bits/s 545Mbits/s
>>>> 10197: 205358520bits/s 205Mbits/s
>>>> 10196: 205311416bits/s 205Mbits/s
>>>> -----------------------------------
>>>> total: 955804525bits/s 955Mbits/s
>>>>
>>>> And after the patch the result looks like this:
>>>>
>>>> 10194: 384248522bits/s 384Mbits/s
>>>> 10197: 284706778bits/s 284Mbits/s
>>>> 10196: 288119464bits/s 288Mbits/s
>>>> -----------------------------------
>>>> total: 957074765bits/s 957Mbits/s
>
> I've misinterpreted the numbers, please disregard my previous mail.
>
> I'm wondering though, even before this patch, the sharing doesn't
> seem to be proportional to the allocated rates. Assuming the upper
> limit is somewhere around 950mbit, we have 250mbit for sharing
> above the allocated rates, so it should be:
>
> 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
>
> But maybe my understanding of how excess bandwidth is distributed
> with HTB is wrong.

Good point, but the numbers are a bit wrong:

500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
                                        ==========
                                        950.00mbit

> I still can't really make anything of this bug, but the only two
> visible differences to HTB resulting from requeuing on an upper level
> should be that
>
> 1) it doesn't reactivate classes that went passive by the last dequeue
> 2) the time checkpoint from the last dequeue event is different
>
> I guess it's in fact the second thing: if a lower priority packet
> is requeued and dequeued again, HTB doesn't notice and might allow
> the class to send earlier again than it would have previously.

With high requeuing the timing has to be wrong, but I'm not sure why
just lower priority has to gain here.
Anyway, IMHO this regression is really doubtful: since the numbers
are wrong in both cases, I can only agree that the old method gives
better wrong results...

Jarek P.
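[Editor's note: Jarek's corrected arithmetic can be reproduced
mechanically. This is my own sketch of the expected figures, assuming,
as Patrick did, an achievable ceiling of about 950mbit and excess
bandwidth shared in proportion to the configured rates.]

```python
# Expected HTB shares, assuming excess bandwidth is distributed in
# proportion to the configured rates. The 950mbit ceiling is Patrick's
# assumption from the thread, not a measured value.
ceiling = 950.0                                        # Mbit/s
rates = {"1:11": 500.0, "1:12": 100.0, "1:13": 100.0}  # Mbit/s

excess = ceiling - sum(rates.values())   # 250 Mbit/s left to share
total = sum(rates.values())              # 700 Mbit/s allocated
expected = {c: r + excess * r / total for c, r in rates.items()}

for c, bw in sorted(expected.items()):
    print(f"{c}: {bw:.2f} Mbit/s")   # 1:11 678.57, 1:12/1:13 135.71
```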
On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > > Hi Dave, Hi Jarek,
>
> Hi Simon,
>
> > > I know that you guys were/are playing around a lot in here, but
> > > unfortunately I think that "pkt_sched: Always use q->requeue in
> > > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > > introduced a performance regression for HTB.
> ...
> > > The test machine with the tc rules and udp-sending processes
> > > has two Intel Xeon Quad-cores running at 1.86GHz. The kernel
> > > is SMP x86_64.
> >
> > With the following patch (basically a reversal of "pkt_sched: Always use
> > q->requeue in dev_requeue_skb()") forward ported to the current
> > net-next-2.6 tree (tcp: Respect SO_RCVLOWAT in tcp_poll()), I get some
> > rather nice numbers (IMHO).
> >
> > 10194: 666780666bits/s 666Mbits/s
> > 10197: 141154197bits/s 141Mbits/s
> > 10196: 141023090bits/s 141Mbits/s
> > -----------------------------------
> > total: 948957954bits/s 948Mbits/s
> >
> > I'm not sure what evil things this patch does to other aspects
> > of the qdisc code.
>
> I'd like to establish this too. This patch was meant to remove some
> other problems in possibly the simplest way. Maybe it's too simple.
> Anyway, it's kind of an RFC, so the rest of the requeuing code is left
> unchanged, just for easy revoking, as below. But first we should try
> to understand this more.

Agreed.

> So, thanks for testing and reporting this. (BTW, what network card
> do you use, and is multiqueuing on?)

The network card is an e1000. Actually there are two, but I am only
using the first one. I am not sure what the Broadcom controllers are
(I can get someone to go and inspect the machine if it's important).
# lspci | grep Ether
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

And from dmesg:

e1000e 0000:0b:00.0: setting latency timer to 64
0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
e1000e 0000:0b:00.1: setting latency timer to 64
0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003

It is eth0 that I am using. My kernel is compiled with
CONFIG_NET_SCH_MULTIQ disabled.
On Tue, Oct 07, 2008 at 12:20:52PM +0000, Jarek Poplawski wrote:
> On Tue, Oct 07, 2008 at 03:51:47PM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 12:15:52PM +1100, Simon Horman wrote:
> > > Hi Dave, Hi Jarek,
> > >
> > > I know that you guys were/are playing around a lot in here, but
> > > unfortunately I think that "pkt_sched: Always use q->requeue in
> > > dev_requeue_skb()" (f0876520b0b721bedafd9cec3b1b0624ae566eee) has
> > > introduced a performance regression for HTB.
>
> Since this looks to me like a possibly theoretical problem I added
> Martin and Patrick to Cc.
>
> > > My tc rules are below, but in a nutshell I have 3 leaf classes.
> > > One with a rate of 500Mbit/s and the other two with 100Mbit/s.
> > > The ceiling for all classes is 1Gb/s and that is also both
> > > the rate and ceiling for the parent class.
> > >
> > >                      [ rate=1Gbit/s ]
> > >                      [ ceil=1Gbit/s ]
> > >                              |
> > >    +--------------------+--------------------+
> > >    |                    |                    |
> > > [ rate=500Mbit/s ]  [ rate=100Mbit/s ]  [ rate=100Mbit/s ]
> > > [ ceil= 1Gbit/s  ]  [ ceil=100Mbit/s ]  [ ceil= 1Gbit/s  ]
>
> ?! [ ceil= 1Gbit/s  ]  [ ceil=1Gbit/s   ]  [ ceil= 1Gbit/s  ]

Sorry, you are correct. ceil=1Gbit/s for all leaf nodes is correct;
the ceil=100Mbit/s was a cut-and-paste error.
On Wed, Oct 08, 2008 at 12:00:22AM +0200, Jarek Poplawski wrote:
> Patrick McHardy wrote, On 10/07/2008 02:48 PM:
> > Jarek Poplawski wrote:
> >>>> Prior to this patch the result looks like this:
> >>>>
> >>>> 10194: 545134589bits/s 545Mbits/s
> >>>> 10197: 205358520bits/s 205Mbits/s
> >>>> 10196: 205311416bits/s 205Mbits/s
> >>>> -----------------------------------
> >>>> total: 955804525bits/s 955Mbits/s
> >>>>
> >>>> And after the patch the result looks like this:
> >>>>
> >>>> 10194: 384248522bits/s 384Mbits/s
> >>>> 10197: 284706778bits/s 284Mbits/s
> >>>> 10196: 288119464bits/s 288Mbits/s
> >>>> -----------------------------------
> >>>> total: 957074765bits/s 957Mbits/s
> >
> > I've misinterpreted the numbers, please disregard my previous mail.
> >
> > I'm wondering though, even before this patch, the sharing doesn't
> > seem to be proportional to the allocated rates. Assuming the upper
> > limit is somewhere around 950mbit, we have 250mbit for sharing
> > above the allocated rates, so it should be:
> >
> > 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> > 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> > 100mbit class: 100mbit + 250mbit/1*5 == 150mbit
> >
> > But maybe my understanding of how excess bandwidth is distributed
> > with HTB is wrong.
> Good point, but the numbers are a bit wrong:
>
> 500mbit class: 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
> 100mbit class: 100mbit + 250mbit/7*1 == 135.71mbit
>                                         ==========
>                                         950.00mbit
>
> > I still can't really make anything of this bug, but the only two
> > visible differences to HTB resulting from requeuing on an upper level
> > should be that
> >
> > 1) it doesn't reactivate classes that went passive by the last dequeue
> > 2) the time checkpoint from the last dequeue event is different
> >
> > I guess it's in fact the second thing: if a lower priority packet
> > is requeued and dequeued again, HTB doesn't notice and might allow
> > the class to send earlier again than it would have previously.
>
> With high requeuing the timing has to be wrong, but I'm not sure why
> just lower priority has to gain here.
>
> Anyway, IMHO this regression is really doubtful: since the numbers
> are wrong in both cases, I can only agree that the old method gives
> better wrong results...

I first started looking into this problem because I noticed that
borrowing wasn't working in the correct proportions. That is the
problem that Patrick pointed out and you re-did the maths for above.

I noticed this on 2.6.26-rc7. So I did some testing on older kernels
and noticed that although 2.6.26-rc7 was imperfect, it did seem that
progress was being made in the right direction. Though unfortunately
there is noise in the results, so the trend may not be real. It was
also unfortunate that I was not able to get any older kernels to boot
on the hw that I was using for testing (an HP dl360-g5 - any
kernel-config tips welcome).
2.6.27-rc7
----------
10194: 568641840bits/s 568Mbits/s
10197: 193942866bits/s 193Mbits/s
10196: 194073184bits/s 194Mbits/s
-----------------------------------
total: 956657890bits/s 956Mbits/s

2.6.26
------
10194: 507581709bits/s 507Mbits/s
10197: 224391677bits/s 224Mbits/s
10196: 224863501bits/s 224Mbits/s
-----------------------------------
total: 956836888bits/s 956Mbits/s

2.6.25
------
10194: 426211904bits/s 426Mbits/s
10197: 265862037bits/s 265Mbits/s
10196: 264875210bits/s 264Mbits/s
-----------------------------------
total: 956949152bits/s 956Mbits/s

Then I tested net-next-2.6 and noticed that things were not good, as I
reported in my opening post for this thread. Curiously, the trivial
revert patch that I posted, when applied on top of yesterday's
net-next-2.6 ("tcp: Respect SO_RCVLOWAT in tcp_poll()"), gives the
closest to ideal result that I have seen in any test.

10194: 666780666bits/s 666Mbits/s
10197: 141154197bits/s 141Mbits/s
10196: 141023090bits/s 141Mbits/s
-----------------------------------
total: 948957954bits/s 948Mbits/s

That does indeed seem promising. Though I do realise that my methods
have essentially been stabs in the dark and the problem needs to be
understood.
Simon Horman wrote:
> On Wed, Oct 08, 2008 at 12:00:22AM +0200, Jarek Poplawski wrote:
>
>> Anyway, IMHO this regression is really doubtful: since the numbers
>> are wrong in both cases, I can only agree that the old method gives
>> better wrong results...
>
> I first started looking into this problem because I noticed that
> borrowing wasn't working in the correct proportions. That is
> the problem that Patrick pointed out and you re-did the maths for above.
>
> I noticed this on 2.6.26-rc7. So I did some testing on older kernels and
> noticed that although 2.6.26-rc7 was imperfect, it did seem that progress
> was being made in the right direction. Though unfortunately there is noise
> in the results, so the trend may not be real. It was also unfortunate that
> I was not able to get any older kernels to boot on the hw that I was using
> for testing (an HP dl360-g5 - any kernel-config tips welcome).
>
> [...] noticed that things were not good, as I
> reported in my opening post for this thread. Curiously, the trivial revert
> patch that I posted, when applied on top of yesterday's net-next-2.6
> ("tcp: Respect SO_RCVLOWAT in tcp_poll()"), gives the closest to ideal
> result that I have seen in any test.
>
> 10194: 666780666bits/s 666Mbits/s
> 10197: 141154197bits/s 141Mbits/s
> 10196: 141023090bits/s 141Mbits/s
> -----------------------------------
> total: 948957954bits/s 948Mbits/s
>
> That does indeed seem promising. Though I do realise that my methods
> have essentially been stabs in the dark and the problem needs to
> be understood.

I'm pretty sure that the differences are caused by HTB not being in
control of the queue, since the device is the real bottleneck in this
configuration. It's quite possible that there simply might be a subtle
timing change that causes feedback through HTB's borrowing and
ceilings.

So what would really be useful for understanding this is to make HTB
control the queue and see whether it behaves as expected.
Patrick McHardy wrote:
> So what would really be useful to understand this is to make HTB
> control the queue and see if it behaves as expected.

What also might help understand this better would be some runtime
data. An easy way to get some halfway usable data is to just run

    tc -s -d qdisc show dev ...; tc -s -d class show dev ...

in a loop.
On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
...
> > So, thanks for testing and reporting this. (BTW, what network card
> > do you use, and is multiqueuing on?)
>
> The network card is an e1000. Actually there are two, but I
> am only using the first one. I am not sure what the Broadcom
> controllers are (I can get someone to go and inspect the
> machine if it's important).
>
> # lspci | grep Ether
> 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>
> And from dmesg:
>
> e1000e 0000:0b:00.0: setting latency timer to 64
> 0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> 0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
> 0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
> 0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
> e1000e 0000:0b:00.1: setting latency timer to 64
> 0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> 0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
> 0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
> 0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003
>
> It is eth0 that I am using.

I only wondered whether we should consider any new effects because of
this, but e1000/e1000e don't have this yet (bnx2 has).

> My kernel is compiled with CONFIG_NET_SCH_MULTIQ disabled.

Multiqueuing can work without this too.

Jarek P.
On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
...
> I'm pretty sure that the differences are caused by HTB not being
> in control of the queue since the device is the real bottleneck
> in this configuration.

Yes, otherwise there would be no requeuing. And, btw, the golden rule
of scheduling/shaping is limiting below the "hardware" limits.

> It's quite possible that there simply might be
> a subtle timing change that causes feedback through HTB's borrowing
> and ceiling.

I'd add my earlier suspicion that there could be not enough enqueuing
in time for the fastest class (which could also be very bursty), so
the other classes can borrow more.

> So what would really be useful to understand this is to make HTB
> control the queue and see if it behaves as expected.

Right, trying with lower rates/ceils should explain this.

Jarek P.
On Wednesday 08 October 2008, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> ...
> > I'm pretty sure that the differences are caused by HTB not being
> > in control of the queue since the device is the real bottleneck
> > in this configuration.
>
> Yes, otherwise there would be no requeuing. And, btw, the golden rule
> of scheduling/shaping is limiting below the "hardware" limits.

By the way, HTB counts ethernet headers (e.g. if you send a 1500-byte
ping, it shows up as 1514 bytes in the tc counters), so some of the
ethernet overhead is already accounted for.
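[Editor's note: a back-of-envelope on that overhead, using standard
Ethernet framing constants rather than anything stated in the thread.
tc counts the 14-byte ethernet header but not the FCS, preamble, or
inter-frame gap, so even a saturated gigabit link tops out below
1000Mbit/s in the tc counters.]

```python
# Back-of-envelope (assumption: standard Ethernet framing constants).
payload = 1500                 # IP packet, as in the 1500-byte ping example
eth_hdr = 14                   # counted by tc/HTB -> the 1514 in the counters
fcs, preamble, ifg = 4, 8, 12  # wire overhead that tc does not see

counted = payload + eth_hdr                  # bytes per packet in tc counters
on_wire = counted + fcs + preamble + ifg     # byte-times actually consumed

# Best-case tc-visible rate on a 1Gbit/s link with full-size frames:
max_tc_rate_mbit = 1000 * counted / on_wire
print(counted, on_wire, round(max_tc_rate_mbit, 1))
```

With full-size frames the counters could show at most roughly
984Mbit/s, which is at least in the same ballpark as the ~950Mbit/s
totals reported earlier in the thread.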
On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote:
> ...
> > I'm pretty sure that the differences are caused by HTB not being
> > in control of the queue since the device is the real bottleneck
> > in this configuration.
>
> Yes, otherwise there would be no requeuing. And, btw, the golden rule
> of scheduling/shaping is limiting below the "hardware" limits.
>
> > It's quite possible that there simply might be
> > a subtle timing change that causes feedback through HTB's borrowing
> > and ceiling.
>
> I'd add my earlier suspicion that there could be not enough enqueuing
> in time for the fastest class (which could also be very bursty), so
> the other classes can borrow more.
>
> > So what would really be useful to understand this is to make HTB
> > control the queue and see if it behaves as expected.
>
> Right, trying with lower rates/ceils should explain this.

As I mentioned earlier, things seem to work quite well with lower
rates/ceilings. When I set up the classes with 10x lower values for
rate and ceiling, as follows:

                    [ rate=100Mbit/s ]
                    [ ceil=100Mbit/s ]
                            |
  +--------------------+--------------------+
  |                    |                    |
[ rate= 50Mbit/s ]  [ rate= 10Mbit/s ]  [ rate= 10Mbit/s ]
[ ceil=100Mbit/s ]  [ ceil=100Mbit/s ]  [ ceil=100Mbit/s ]

then I get results that are fairly close to the ideal values.

net-next-2.6 - d877984
----------------------
10194: 68075482bits/s 68Mbits/s
10197: 14464848bits/s 14Mbits/s
10196: 14465632bits/s 14Mbits/s
-----------------------------------
total: 97005962bits/s 97Mbits/s

And I get those kinds of results consistently for various different
kernel versions.
On Wed, Oct 08, 2008 at 06:37:01AM +0000, Jarek Poplawski wrote:
> On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> > On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> ...
> > > So, thanks for testing and reporting this. (BTW, what network card
> > > do you use, and is multiqueuing on?)
> >
> > The network card is an e1000. Actually there are two, but I
> > am only using the first one. I am not sure what the Broadcom
> > controllers are (I can get someone to go and inspect the
> > machine if it's important).
> >
> > # lspci | grep Ether
> > 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> > 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
> > 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> > 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> >
> > And from dmesg:
> >
> > e1000e 0000:0b:00.0: setting latency timer to 64
> > 0000:0b:00.0: 0000:0b:00.0: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> > 0000:0b:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f2
> > 0000:0b:00.0: eth0: Intel(R) PRO/1000 Network Connection
> > 0000:0b:00.0: eth0: MAC: 0, PHY: 4, PBA No: d51930-003
> > e1000e 0000:0b:00.1: setting latency timer to 64
> > 0000:0b:00.1: 0000:0b:00.1: Failed to initialize MSI interrupts.  Falling back to legacy interrupts.
> > 0000:0b:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:1b:78:57:dc:f3
> > 0000:0b:00.1: eth1: Intel(R) PRO/1000 Network Connection
> > 0000:0b:00.1: eth1: MAC: 0, PHY: 4, PBA No: d51930-003
> >
> > It is eth0 that I am using.
>
> I only wondered whether we should consider any new effects because of
> this, but e1000/e1000e don't have this yet (bnx2 has).
>
> > My kernel is compiled with CONFIG_NET_SCH_MULTIQ disabled.
>
> Multiqueuing can work without this too.
Ok, what should I be checking for?
Patrick McHardy wrote:
> Patrick McHardy wrote:
>> So what would really be useful to understand this is to make HTB
>> control the queue and see if it behaves as expected.
>
> What also might help understand this better would be some
> runtime data. An easy way to get some halfway usable data
> is to just run
>
> tc -s -d qdisc show dev ...; tc -s -d class show dev ...
>
> in a loop.

IMHO the problem is with packets held outside of the qdisc.
HTB, like other non-work-conserving qdiscs, relies on information
about class backlog state.
When there is a packet in a queue, the queue is active (has
a demand to send).
The algorithm samples queue states at deterministic but irregular
intervals to see which classes want service and which can lend.

If you hold a packet outside, the relevant class thinks that it is
not backlogged - and if it is sampled at this time, the algorithm
decides to lend that class's time.
And because the most active class has the biggest probability
of seeing its packet caught outside, the final effect is that all
classes end up equalized with (almost) the same rates.

Thus, from the qdisc's point of view, it is not good to keep a
packet out of the qdisc for any longer than necessary.

devik
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
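Martin's sampling argument can be shown with a toy model (an illustrative Python sketch, not HTB code): a scheduler that decides activity from the in-qdisc backlog sees a class as idle whenever its only packet happens to be parked outside the qdisc at sampling time.

```python
# Toy model of the point above (not HTB code): a scheduler that samples
# queue backlog decides a class is inactive whenever its head packet
# happens to be parked outside the qdisc at sampling time.

class Cls:
    def __init__(self, name):
        self.name = name
        self.queue = ["pkt"]   # the class always has traffic to send
        self.parked = None     # packet held outside the qdisc

def sample_active(cls):
    # What the scheduler sees: only the in-qdisc backlog.
    return len(cls.queue) > 0

busy = Cls("busy")
assert sample_active(busy)        # backlogged: keeps its bandwidth

# The driver rejects the dequeued packet; it is parked outside the qdisc.
busy.parked = busy.queue.pop(0)
assert not sample_active(busy)    # looks idle: its time gets lent out
```

The more often a class's packets sit in that parked slot, the more often a sample catches it "idle", which is how the busiest class loses the most.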
On Wed, Oct 08, 2008 at 06:22:44PM +1100, Simon Horman wrote:
> On Wed, Oct 08, 2008 at 06:37:01AM +0000, Jarek Poplawski wrote:
> > On Wed, Oct 08, 2008 at 11:09:28AM +1100, Simon Horman wrote:
> > > On Tue, Oct 07, 2008 at 07:44:35AM +0000, Jarek Poplawski wrote:
> > ...
> > > > So, thanks for testing and reporting this. (BTW, what network card
> > > > do you use and is there multiqueuing on?)
> > >
> > > The network card is an e1000. Actually there are two, but I
...
> > I've only wondered if we should consider any new effects because of
> > this, but e1000/e1000e don't have this yet (bnx2 has).
...
> Ok, what should I be checking for?

I think that's all there is to it... We simply know multiqueuing
doesn't matter here.

Jarek P.
On Wed, Oct 08, 2008 at 06:22:04PM +1100, Simon Horman wrote: > On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote: > > On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote: > > ... > > > I'm pretty sure that the differences are caused by HTB not being > > > in control of the queue since the device is the real bottleneck > > > in this configuration. > > > > Yes, otherwise there would be no requeuing. And, btw. the golden rule > > of scheduling/shaping is limiting below "hardware" limits. > > > > > Its quite possible that there simply might > > > a subtle timing change that causes feedback through HTBs borrowing > > > and ceiling. > > > > I'd add my previous suspicion there could be not enough enqeuing on > > time for the fastest class (could be also very bursty), so other > > classes can borrow more. > > > > > > > > So what would really be useful to understand this is to make HTB > > > control the queue and see if it behaves as expected. > > > > > > > Right, trying with lower rates/ceils should explain this. > > As I mentioned earlier things seem to work quite well with lower > rates/ceilings. When I set up the classes with 10x lower values > for rate and celing, as follows: > > > [ rate=100Mbit/s ] > [ ceil=100Mbit/s ] > | > +--------------------+--------------------+ > | | | > [ rate= 50Mbit/s ] [ rate= 10Mbit/s ] [ rate= 10Mbit/s ] > [ ceil=100Mbit/s ] [ ceil=100Mbit/s ] [ ceil= 100Mbit/s ] > > Then I get results that are fairly close to the ideal values. > > net-next-2.6 - d877984 > ---------------------- > 10194: 68075482bits/s 68Mbits/s > 10197: 14464848bits/s 14Mbits/s > 10196: 14465632bits/s 14Mbits/s > ----------------------------------- > total: 97005962bits/s 97Mbits/s > > And I get those kind of results consistently for various > different kernel versions. OK. But as Patrick mentioned it would be interesting to try a little below hardware limits: 950, or maybe lower, until HTB starts getting accuracy. Jarek P. 
On Wed, Oct 08, 2008 at 09:34:30AM +0200, Martin Devera wrote:
> Patrick McHardy wrote:
>> Patrick McHardy wrote:
>>> So what would really be useful to understand this is to make HTB
>>> control the queue and see if it behaves as expected.
>>
>> What also might help understand this better would be some
>> runtime data. An easy way to get some halfway usable data
>> is to just run
>>
>> tc -s -d qdisc show dev ...; tc -s -d class show dev ...
>>
>> in a loop.
>
> IMHO the problem is with packets held outside of the qdisc.
> HTB, like other non-work-conserving qdiscs, relies on information
> about class backlog state.
> When there is a packet in a queue, the queue is active (has
> a demand to send).
> The algorithm samples queue states at deterministic but irregular
> intervals to see which classes want service and which can lend.
>
> If you hold a packet outside, the relevant class thinks that it is
> not backlogged - and if it is sampled at this time, the algorithm
> decides to lend that class's time.

Right, but on the other hand I can't see any correction of these
times/tokens, so it seems this can't give us "right" results
anyway? E.g. with 100% requeuing (each packet requeued once) HTB
should think it "gave" a rate 2x higher than seen on the other
side - or am I missing something?

> And because the most active class has the biggest probability
> of seeing its packet caught outside, the final effect is that all
> classes end up equalized with (almost) the same rates.

Yes, but if this most active class always has packets in its queue,
and is the most prioritized, it seems it should get its share anyway.

> Thus, from the qdisc's point of view, it is not good to keep a
> packet out of the qdisc for any longer than necessary.

Sure, the question is how much it's worth against the associated
code complications and additional cpu usage.

Thanks,
Jarek P.
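Jarek's double-accounting concern can be made concrete with a toy model (a hypothetical sketch, not HTB's actual token code): if a shaper charges a class on every dequeue and a requeued packet is dequeued again, then with every packet requeued once the accounted rate is exactly twice the wire rate.

```python
def accounted_vs_sent(pkt_len, n_pkts, requeues_per_pkt=1):
    """Toy shaper accounting: charge the class on every dequeue.

    A packet that is requeued is dequeued again later, so it is
    charged again, although it only hits the wire once.
    """
    charged = sent = 0
    for _ in range(n_pkts):
        charged += pkt_len                      # first dequeue: charged
        charged += pkt_len * requeues_per_pkt   # re-dequeues: charged again
        sent += pkt_len                         # transmitted only once
    return charged, sent

charged, sent = accounted_vs_sent(pkt_len=1500, n_pkts=1000)
# With 100% requeuing the shaper believes it "gave" 2x the real rate.
assert charged == 2 * sent
```

This is only the accounting half of the story; whether HTB's clock/token updates compensate for it elsewhere is exactly what the thread is debating.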
>> The algorithm samples queue states at deterministic but irregular
>> intervals to see which classes want service and which can lend.
>>
>> If you hold a packet outside, the relevant class thinks that it is
>> not backlogged - and if it is sampled at this time, the algorithm
>> decides to lend that class's time.
>
> Right, but on the other hand I can't see any correction of these
> times/tokens, so it seems this can't give us "right" results
> anyway? E.g. with 100% requeuing (each packet requeued once) HTB
> should think it "gave" a rate 2x higher than seen on the other
> side - or am I missing something?

Yes, it is another problem - double accounting a packet when it is
requeued... Well, you are right, the numbers are not too supportive
of this explanation... It seems that the first class didn't get its
basic "rate", which should be guaranteed.

Simon, can you try these things (separately):
a/ increase the quantum of the first class (say 10x)
b/ set ceil=rate on all three classes

The idea is to a/ make sure there is no requeue-related change
to the drr pointer which could boost a requeued class, and
b/ see whether the prioritized class has problems sending or
other classes are sending even when they should not.

>> Thus, from the qdisc's point of view, it is not good to keep a
>> packet out of the qdisc for any longer than necessary.
>
> Sure, the question is how much it's worth against the associated
> code complications and additional cpu usage.

Honestly, I'm not familiar with the new code. Can you tell me
in short what gso_skb is and where the skb goes now if it is not
requeued?

thanks, Martin
On Wed, Oct 08, 2008 at 12:47:57PM +0200, Martin Devera wrote:
...
>>> Thus, from the qdisc's point of view, it is not good to keep a
>>> packet out of the qdisc for any longer than necessary.
>>
>> Sure, the question is how much it's worth against the associated
>> code complications and additional cpu usage.
>
> honestly, I'm not familiar with the new code. Can you tell me
> in short what is gso_skb and where the skb goes now if not requeued ?

I hope I don't mislead you too much: gso_skb is actually a list of
skbs produced by dividing a previous skb in dev_gso_segment() in
dev_hard_start_xmit(), which gets back to qdisc_run()/qdisc_restart()
because of some device/driver error. It's stored as q->gso_skb and
retried on subsequent xmits, before any new dequeuing, until it is
sent. And currently "normal" skbs are treated exactly the same (so
they are not returned to the qdisc tree).

There is still some possible internal use of the old ->requeue()
functions between qdiscs for some strange cases, but that will
probably be replaced too with storing to the q->requeue lists (one
level deep). So both these cases can be thought of as: the skb is out
of the qdiscs, with one foot in the driver's door (probably because of
some error - otherwise the driver should have stopped the queue
earlier). This is supposed to simplify the code, especially error
handling, especially wrt. multiqueuing.

Cheers,
Jarek P.
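The flow Jarek describes can be modeled in a few lines (an illustrative Python sketch, not the kernel C code; the names mirror q->gso_skb and the requeue/dequeue pair for readability): a rejected skb is parked at q.gso_skb and retried before anything new is dequeued from the tree.

```python
class Qdisc:
    """Toy model of the requeue path under discussion (not kernel code)."""
    def __init__(self, backlog):
        self.gso_skb = None      # skb parked "one foot in the driver's door"
        self.backlog = backlog   # the qdisc tree, modeled as a simple list

    def requeue(self, skb):
        # Driver rejected the skb: park it outside the qdisc tree
        # instead of pushing it back into the tree.
        self.gso_skb = skb

    def dequeue(self):
        # Retry the parked skb first, before any new dequeuing.
        if self.gso_skb is not None:
            skb, self.gso_skb = self.gso_skb, None
            return skb
        return self.backlog.pop(0) if self.backlog else None

q = Qdisc(backlog=["skb-B", "skb-C"])
q.requeue("skb-A")              # driver said "busy", skb-A is parked
assert q.dequeue() == "skb-A"   # parked skb is retried first
assert q.dequeue() == "skb-B"   # then normal dequeuing resumes
```

The key property for this thread is that while "skb-A" is parked, the qdisc tree itself no longer contains it, so a backlog-sampling scheduler cannot see it.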
On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote: > On Wed, Oct 08, 2008 at 06:22:04PM +1100, Simon Horman wrote: > > On Wed, Oct 08, 2008 at 06:55:51AM +0000, Jarek Poplawski wrote: > > > On Wed, Oct 08, 2008 at 02:31:26AM +0200, Patrick McHardy wrote: > > > ... > > > > I'm pretty sure that the differences are caused by HTB not being > > > > in control of the queue since the device is the real bottleneck > > > > in this configuration. > > > > > > Yes, otherwise there would be no requeuing. And, btw. the golden rule > > > of scheduling/shaping is limiting below "hardware" limits. > > > > > > > Its quite possible that there simply might > > > > a subtle timing change that causes feedback through HTBs borrowing > > > > and ceiling. > > > > > > I'd add my previous suspicion there could be not enough enqeuing on > > > time for the fastest class (could be also very bursty), so other > > > classes can borrow more. > > > > > > > > > > > So what would really be useful to understand this is to make HTB > > > > control the queue and see if it behaves as expected. > > > > > > > > > > Right, trying with lower rates/ceils should explain this. > > > > As I mentioned earlier things seem to work quite well with lower > > rates/ceilings. When I set up the classes with 10x lower values > > for rate and celing, as follows: > > > > > > [ rate=100Mbit/s ] > > [ ceil=100Mbit/s ] > > | > > +--------------------+--------------------+ > > | | | > > [ rate= 50Mbit/s ] [ rate= 10Mbit/s ] [ rate= 10Mbit/s ] > > [ ceil=100Mbit/s ] [ ceil=100Mbit/s ] [ ceil= 100Mbit/s ] > > > > Then I get results that are fairly close to the ideal values. > > > > net-next-2.6 - d877984 > > ---------------------- > > 10194: 68075482bits/s 68Mbits/s > > 10197: 14464848bits/s 14Mbits/s > > 10196: 14465632bits/s 14Mbits/s > > ----------------------------------- > > total: 97005962bits/s 97Mbits/s > > > > And I get those kind of results consistently for various > > different kernel versions. > > OK. 
> But as Patrick mentioned it would be interesting to try a little
> below hardware limits: 950, or maybe lower, until HTB starts getting
> accuracy.

Hi,

it seems that for my particular setup the magic number is 935Mbit/s.

Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
("ipvs: Remove stray file left over from ipvs move"),
which is after Jarek's "pkt_sched: Update qdisc requeue stats in
dev_requeue_skb()" patch.

ideal (based on 950Mbit/s aggregate)
500mbit class (10194): 500mbit + 250mbit/7*5 == 678.57mbit
100mbit class (10196): 100mbit + 250mbit/7*1 == 135.71mbit
100mbit class (10197): 100mbit + 250mbit/7*1 == 135.71mbit
                                                ==========
                                                950.00mbit

n=900
10194: 677727637bits/s 677Mbits/s
10197: 136662048bits/s 136Mbits/s
10196: 136725637bits/s 136Mbits/s
-----------------------------------
total: 951115322bits/s 951Mbits/s

n=920
10194: 676271338bits/s 676Mbits/s
10197: 137301090bits/s 137Mbits/s
10196: 137301877bits/s 137Mbits/s
-----------------------------------
total: 950874306bits/s 950Mbits/s

n=930
10194: 674681581bits/s 674Mbits/s
10197: 137538965bits/s 137Mbits/s
10196: 137541320bits/s 137Mbits/s
-----------------------------------
total: 949761866bits/s 949Mbits/s

n=933
10194: 675568704bits/s 675Mbits/s
10197: 137661437bits/s 137Mbits/s
10196: 137662221bits/s 137Mbits/s
-----------------------------------
total: 950892362bits/s 950Mbits/s

n=934
10194: 675399128bits/s 675Mbits/s
10197: 137653586bits/s 137Mbits/s
10196: 137704613bits/s 137Mbits/s
-----------------------------------
total: 950757328bits/s 950Mbits/s

n=935
10194: 675169893bits/s 675Mbits/s
10197: 137667714bits/s 137Mbits/s
10196: 137670858bits/s 137Mbits/s
-----------------------------------
total: 950508466bits/s 950Mbits/s

n=936
10194: 385295016bits/s 385Mbits/s
10197: 285078114bits/s 285Mbits/s
10196: 286588581bits/s 286Mbits/s
-----------------------------------
total: 956961712bits/s 956Mbits/s

n=937
10194: 384569616bits/s 384Mbits/s
10197: 285480072bits/s 285Mbits/s
10196: 286627050bits/s 286Mbits/s
-----------------------------------
total: 956676738bits/s 956Mbits/s

n=940
10194: 384577466bits/s 384Mbits/s
10197: 285655138bits/s 285Mbits/s
10196: 286846872bits/s 286Mbits/s
-----------------------------------
total: 957079477bits/s 957Mbits/s

n=950
10194: 384097789bits/s 384Mbits/s
10197: 285950325bits/s 285Mbits/s
10196: 286894760bits/s 286Mbits/s
-----------------------------------
total: 956942874bits/s 956Mbits/s

n=1000
10194: 384639482bits/s 384Mbits/s
10197: 285133069bits/s 285Mbits/s
10196: 287172674bits/s 287Mbits/s
-----------------------------------
total: 956945226bits/s 956Mbits/s

# HTB
###########################################################################
n=933

tc qdisc del dev eth0 root

tc qdisc add dev eth0 root handle 1: htb default 10 r2q $((n * 10))

tc class add dev eth0 parent 1: classid 1:1 htb \
    rate ${n}Mbit ceil ${n}Mbit

tc class add dev eth0 parent 1:1 classid 1:10 htb \
    rate ${n}Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:11 htb \
    rate 500Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:12 htb \
    rate 100Mbit ceil ${n}Mbit
tc class add dev eth0 parent 1:1 classid 1:13 htb \
    rate 100Mbit ceil ${n}Mbit

#tc filter add dev eth0 protocol ip parent 1: \
#    u32 match ip src 172.17.60.194 flowid 1:11
#tc filter add dev eth0 protocol ip parent 1: \
#    u32 match ip src 172.17.60.196 flowid 1:12
#tc filter add dev eth0 protocol ip parent 1: \
#    u32 match ip src 172.17.60.197 flowid 1:13

tc filter add dev eth0 protocol ip parent 1: \
    u32 match ip dport 10194 0xffff flowid 1:11
tc filter add dev eth0 protocol ip parent 1: \
    u32 match ip dport 10196 0xffff flowid 1:12
tc filter add dev eth0 protocol ip parent 1: \
    u32 match ip dport 10197 0xffff flowid 1:13

tc -s -d class show dev eth0
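The "ideal" figures in Simon's table are just proportional sharing of the excess: each class gets its configured rate plus a share of the spare 250Mbit/s in proportion to its rate. A quick check of that arithmetic (plain Python, not HTB code; HTB actually distributes excess by quantum, which tc derives from rate here, so rate-proportional is the expected ideal):

```python
def htb_ideal_shares(link_mbit, rates_mbit):
    """Share excess bandwidth in proportion to each class's rate."""
    total = sum(rates_mbit)
    excess = link_mbit - total
    return [r + excess * r / total for r in rates_mbit]

shares = htb_ideal_shares(950, [500, 100, 100])
# 500 + 250/7*5 and 100 + 250/7*1, as in the ideal table above
assert [round(s, 2) for s in shares] == [678.57, 135.71, 135.71]
assert round(sum(shares), 2) == 950.00
```

The n=935 results sit within 1% of these values; at n=936 and above the classes collapse toward roughly equal rates, which is the equalization effect Martin describes.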
On Wed, Oct 08, 2008 at 12:47:57PM +0200, Martin Devera wrote:
>>> The algorithm samples queue states at deterministic but irregular
>>> intervals to see which classes want service and which can lend.
>>>
>>> If you hold a packet outside, the relevant class thinks that it is
>>> not backlogged - and if it is sampled at this time, the algorithm
>>> decides to lend that class's time.
>>
>> Right, but on the other hand I can't see any correction of these
>> times/tokens, so it seems this can't give us "right" results
>> anyway? E.g. with 100% requeuing (each packet requeued once) HTB
>> should think it "gave" a rate 2x higher than seen on the other
>> side - or am I missing something?
>
> Yes, it is another problem - double accounting a packet when it is
> requeued... Well, you are right, the numbers are not too supportive
> of this explanation... It seems that the first class didn't get its
> basic "rate", which should be guaranteed.
>
> Simon, can you try these things (separately):
> a/ increase the quantum of the first class (say 10x)

Hi Martin,

Do you mean increase r2q? If so, here are some results.

r2q=10000 (original setting)
10194: 383950984bits/s 383Mbits/s
10197: 285071834bits/s 285Mbits/s
10196: 287241757bits/s 287Mbits/s
-----------------------------------
total: 956264576bits/s 956Mbits/s

r2q=25000
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 375149600bits/s 375Mbits/s
10197: 289728064bits/s 289Mbits/s
10196: 291783370bits/s 291Mbits/s
-----------------------------------
total: 956661034bits/s 956Mbits/s

r2q=50000
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 367905789bits/s 367Mbits/s
10197: 292223005bits/s 292Mbits/s
10196: 296510256bits/s 296Mbits/s
-----------------------------------
total: 956639050bits/s 956Mbits/s

r2q=100000
HTB: quantum of class 10011 is small. Consider r2q change.
HTB: quantum of class 10012 is small. Consider r2q change.
HTB: quantum of class 10013 is small. Consider r2q change.
10194: 367925416bits/s 367Mbits/s
10197: 293556834bits/s 293Mbits/s
10196: 295315384bits/s 295Mbits/s
-----------------------------------
total: 956797634bits/s 956Mbits/s

> b/ set ceil=rate on all three classes

This seems pretty close to the expected/ideal result.

10194: 496575074bits/s 496Mbits/s
10197:  96969861bits/s  96Mbits/s
10196:  96973002bits/s  96Mbits/s
-----------------------------------
total: 690517938bits/s 690Mbits/s

Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
("ipvs: Remove stray file left over from ipvs move"),
which is after Jarek's "pkt_sched: Update qdisc requeue stats in
dev_requeue_skb()" patch.

> The idea is to a/ make sure there is no requeue-related change
> to the drr pointer which could boost a requeued class, and
> b/ see whether the prioritized class has problems sending or
> other classes are sending even when they should not.
>
>>> Thus, from the qdisc's point of view, it is not good to keep a
>>> packet out of the qdisc for any longer than necessary.
>>
>> Sure, the question is how much it's worth against the associated
>> code complications and additional cpu usage.
>
> honestly, I'm not familiar with the new code. Can you tell me
> in short what gso_skb is and where the skb goes now if it is not
> requeued?
>
> thanks, Martin
On Thu, Oct 09, 2008 at 11:54:40AM +1100, Simon Horman wrote:
> On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote:
...
> > OK. But as Patrick mentioned it would be interesting to try a little
> > below hardware limits: 950, or maybe lower, until HTB starts getting
> > accuracy.
>
> Hi,

Hi Simon,

> it seems that for my particular setup the magic number is 935Mbit/s.
>
> Kernel is net-next-2.6 071d7ab6649eb34a873a53e71635186e9117101d
> ("ipvs: Remove stray file left over from ipvs move"),
> which is after Jarek's "pkt_sched: Update qdisc requeue stats in
> dev_requeue_skb()" patch.
>
> ideal (based on 950Mbit/s aggregate)
> 500mbit class (10194): 500mbit + 250mbit/7*5 == 678.57mbit
> 100mbit class (10196): 100mbit + 250mbit/7*1 == 135.71mbit
> 100mbit class (10197): 100mbit + 250mbit/7*1 == 135.71mbit
>                                                 ==========
>                                                 950.00mbit
>
> n=900
> 10194: 677727637bits/s 677Mbits/s
> 10197: 136662048bits/s 136Mbits/s
> 10196: 136725637bits/s 136Mbits/s
> -----------------------------------
> total: 951115322bits/s 951Mbits/s
>
> n=920
> 10194: 676271338bits/s 676Mbits/s
> 10197: 137301090bits/s 137Mbits/s
> 10196: 137301877bits/s 137Mbits/s
> -----------------------------------
> total: 950874306bits/s 950Mbits/s
>
> n=930
> 10194: 674681581bits/s 674Mbits/s
> 10197: 137538965bits/s 137Mbits/s
> 10196: 137541320bits/s 137Mbits/s
> -----------------------------------
> total: 949761866bits/s 949Mbits/s
>
> n=933
> 10194: 675568704bits/s 675Mbits/s
> 10197: 137661437bits/s 137Mbits/s
> 10196: 137662221bits/s 137Mbits/s
> -----------------------------------
> total: 950892362bits/s 950Mbits/s
>
> n=934
> 10194: 675399128bits/s 675Mbits/s
> 10197: 137653586bits/s 137Mbits/s
> 10196: 137704613bits/s 137Mbits/s
> -----------------------------------
> total: 950757328bits/s 950Mbits/s
>
> n=935
> 10194: 675169893bits/s 675Mbits/s
> 10197: 137667714bits/s 137Mbits/s
> 10196: 137670858bits/s 137Mbits/s
> -----------------------------------
> total: 950508466bits/s 950Mbits/s
>
> n=936
> 10194: 385295016bits/s 385Mbits/s
> 10197: 285078114bits/s 285Mbits/s
> 10196: 286588581bits/s 286Mbits/s
> -----------------------------------
> total: 956961712bits/s 956Mbits/s

It's hard to believe so small a change can matter so much here!!!
Great testing!

It seems it's safer to use something below such a magic number, and
generally 10% below the hardware limit could be the rule.

It also looks like my idea that the first class could get not enough
packets was wrong. It seems it's rather blocked by other classes when
HTB thinks it can lend more.

Anyway, I'm also surprised HTB could still be so exact: congratulations
Martin!

Simon, if I don't miss something, I guess you should be OK with these
last changes in requeuing? The older way provided some safety for such
overestimated configs by hiding the real problem, but there were the
costs: cpu etc.

Thanks,
Jarek P.

> n=937
> 10194: 384569616bits/s 384Mbits/s
> 10197: 285480072bits/s 285Mbits/s
> 10196: 286627050bits/s 286Mbits/s
> -----------------------------------
> total: 956676738bits/s 956Mbits/s
>
> n=940
> 10194: 384577466bits/s 384Mbits/s
> 10197: 285655138bits/s 285Mbits/s
> 10196: 286846872bits/s 286Mbits/s
> -----------------------------------
> total: 957079477bits/s 957Mbits/s
>
> n=950
> 10194: 384097789bits/s 384Mbits/s
> 10197: 285950325bits/s 285Mbits/s
> 10196: 286894760bits/s 286Mbits/s
> -----------------------------------
> total: 956942874bits/s 956Mbits/s
>
> n=1000
> 10194: 384639482bits/s 384Mbits/s
> 10197: 285133069bits/s 285Mbits/s
> 10196: 287172674bits/s 287Mbits/s
> -----------------------------------
> total: 956945226bits/s 956Mbits/s
>
> # HTB
> ###########################################################################
> n=933
>
> tc qdisc del dev eth0 root
>
> tc qdisc add dev eth0 root handle 1: htb default 10 r2q $((n * 10))
>
> tc class add dev eth0 parent 1: classid 1:1 htb \
>     rate ${n}Mbit ceil ${n}Mbit
>
> tc class add dev eth0 parent 1:1 classid 1:10 htb \
>     rate ${n}Mbit ceil ${n}Mbit
> tc class add dev eth0 parent 1:1 classid 1:11 htb \
>     rate 500Mbit ceil ${n}Mbit
> tc class add dev eth0 parent 1:1 classid 1:12 htb \
>     rate 100Mbit ceil ${n}Mbit
> tc class add dev eth0 parent 1:1 classid 1:13 htb \
>     rate 100Mbit ceil ${n}Mbit
>
> #tc filter add dev eth0 protocol ip parent 1: \
> #    u32 match ip src 172.17.60.194 flowid 1:11
> #tc filter add dev eth0 protocol ip parent 1: \
> #    u32 match ip src 172.17.60.196 flowid 1:12
> #tc filter add dev eth0 protocol ip parent 1: \
> #    u32 match ip src 172.17.60.197 flowid 1:13
>
> tc filter add dev eth0 protocol ip parent 1: \
>     u32 match ip dport 10194 0xffff flowid 1:11
> tc filter add dev eth0 protocol ip parent 1: \
>     u32 match ip dport 10196 0xffff flowid 1:12
> tc filter add dev eth0 protocol ip parent 1: \
>     u32 match ip dport 10197 0xffff flowid 1:13
>
> tc -s -d class show dev eth0
>>
>> Simon, can you try these things (separately):
>> a/ increase the quantum of the first class (say 10x)
>
> Hi Martin,
>
> Do you mean increase r2q? If so, here are some results

No, no - rather add "quantum 50000" to the first class only;
r2q affects the computation for all classes.

>> b/ set ceil=rate on all three classes
>
> This seems pretty close to the expected/ideal result.
>
> 10194: 496575074bits/s 496Mbits/s
> 10197:  96969861bits/s  96Mbits/s
> 10196:  96973002bits/s  96Mbits/s
> -----------------------------------
> total: 690517938bits/s 690Mbits/s

Hmm, it seems that the problem is with the drr loop or borrowing;
rate computation seems to work ok even in this corner case.

In any case, from your other test it follows that when the NIC
throttles, HTB can't arrive at the correct rates - I see this as
an HTB bug and have added it to my (very long) todo list...

thanks, Martin
> It also looks like my idea that the first class could get not enough > packets was wrong. It seems it's rather blocked by other classes when > HTB thinks it can lend more. > > Anyway, I'm also surprised HTB could be still so exact: congratulations > Martin! thanks for making my depressions smaller ;-) I'd like to make it work even on edge.. devik -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
> >>
> >> Simon, can you try these things (separately):
> >> a/ increase the quantum of the first class (say 10x)
> >
> > Hi Martin,
> >
> > Do you mean increase r2q? If so, here are some results
>
> no, no, add rather ... "quantum 50000" to the first class only
> r2q affects computation for all classes

Yes, it's a half-documented parameter: not in the man page, but e.g.
here: tc class add htb help

BTW, I wonder if a better priority/quantum doesn't actually harm here.
When HTB lends out rate, the first class here is always served at the
beginning. And if it's over the hardware limit, this class's packets
end up waiting. Other classes get their chunks later, probably after
the card has done some cleaning, so they may never have to wait in
this situation. So, maybe it would be interesting to try it the other
way: lower the priority of the first class with a prio param, e.g. 1?

Jarek P.
On Thu, Oct 09, 2008 at 09:56:00AM +0000, Jarek Poplawski wrote:
...
> So, maybe it would be interesting to try it the other way:
> lower the priority of the first class with a prio param, e.g. 1?

Actually, this alone might not be enough; additionally lowering the
quantum of this class (e.g. quantum 10000) could be useful.

Jarek P.
Jarek Poplawski wrote:
> On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
>>>> Simon, can you try these things (separately):
>>>> a/ increase the quantum of the first class (say 10x)
>>> Hi Martin,
>>>
>>> Do you mean increase r2q? If so, here are some results
>> no, no, add rather ... "quantum 50000" to the first class only
>> r2q affects computation for all classes
>
> Yes, it's a half-documented parameter: not in the man page, but e.g.
> here: tc class add htb help
>
> BTW, I wonder if a better priority/quantum doesn't actually harm here.
> When HTB lends out rate, the first class here is always served at the
> beginning. And if it's over the hardware limit, this class's packets
> end up waiting. Other classes get their chunks later, probably after
> the card has done some cleaning, so they may never have to wait in
> this situation. So, maybe it would be interesting to try it the other
> way: lower the priority of the first class with a prio param, e.g. 1?

It starts to seem too complex to me (too many race-related hypotheses).
The original intent was to investigate a possibly corrupted drr pointer,
but just now I think the only way is to stuff enough debug code into
htb and try it myself... It will probably be faster - once I find time
to do it.
On Thu, Oct 09, 2008 at 12:52:27PM +0200, Martin Devera wrote:
...
> It starts to seem too complex to me (too many race-related hypotheses).
> The original intent was to investigate a possibly corrupted drr pointer,
> but just now I think the only way is to stuff enough debug code into
> htb and try it myself...
> It will probably be faster - once I find time to do it.

Sure, but, on the other hand, isn't it too simple...

Jarek P.
On Thu, Oct 09, 2008 at 08:22:49AM +0200, Martin Devera wrote:
> >>
> >> Simon, can you try these things (separately):
> >> a/ increase the quantum of the first class (say 10x)
> >
> > Hi Martin,
> >
> > Do you mean increase r2q? If so, here are some results
>
> no, no, add rather ... "quantum 50000" to the first class only
> r2q affects computation for all classes

Sorry, I wasn't aware of that parameter.
I ran a test now. With the default quantum (6250) for the 500Mbit/s
class I got:

10194: 383384949bits/s 383Mbits/s
10197: 285879669bits/s 285Mbits/s
10196: 287648424bits/s 287Mbits/s
-----------------------------------
total: 956913042bits/s 956Mbits/s

And with a quantum of 50000 for the 500Mbit/s class I got:

10194: 396345613bits/s 396Mbits/s
10197: 279511994bits/s 279Mbits/s
10196: 281062498bits/s 281Mbits/s
-----------------------------------
total: 956920106bits/s 956Mbits/s

> >> b/ set ceil=rate on all three classes
> >
> > This seems pretty close to the expected/ideal result.
> >
> > 10194: 496575074bits/s 496Mbits/s
> > 10197:  96969861bits/s  96Mbits/s
> > 10196:  96973002bits/s  96Mbits/s
> > -----------------------------------
> > total: 690517938bits/s 690Mbits/s
>
> Hmm, it seems that the problem is with the drr loop or borrowing;
> rate computation seems to work ok even in this corner case.
>
> In any case, from your other test it follows that when the NIC
> throttles, HTB can't arrive at the correct rates - I see this as
> an HTB bug and have added it to my (very long) todo list...
>
> thanks, Martin
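For reference, the quantum figures in this exchange follow from how HTB derives a DRR quantum when none is given: quantum = rate (in bytes/s) / r2q, with a warning when it falls outside roughly [1000, 200000] bytes (the bounds are recalled from htb_change_class() of that era; treat them as an assumption). A sketch reproducing both the 6250 default and the "quantum is small" warnings seen at higher r2q:

```python
def htb_quantum(rate_bit_per_s, r2q):
    """Sketch of HTB's derived quantum when no explicit quantum is set.

    Assumed bounds (1000/200000 bytes) mirror the kernel's
    'quantum is small/big, consider r2q change' checks.
    """
    quantum = (rate_bit_per_s // 8) // r2q   # bytes/s divided by r2q
    if quantum < 1000:
        note = "HTB would warn: quantum is small, consider r2q change"
    elif quantum > 200000:
        note = "HTB would warn: quantum is big, consider r2q change"
    else:
        note = "ok"
    return quantum, note

# r2q=10000 gives the 6250-byte quantum Simon saw for the 500Mbit class.
assert htb_quantum(500_000_000, 10000) == (6250, "ok")
# At r2q=25000 the 100Mbit classes drop to 500 bytes -> "small" warnings,
# matching the messages for classes 10012/10013 above.
assert "small" in htb_quantum(100_000_000, 25000)[1]
# At r2q=100000 even the 500Mbit class (625 bytes) warns, as class 10011 did.
assert "small" in htb_quantum(500_000_000, 100000)[1]
```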
On Thu, Oct 09, 2008 at 06:21:45AM +0000, Jarek Poplawski wrote:
> On Thu, Oct 09, 2008 at 11:54:40AM +1100, Simon Horman wrote:
> > On Wed, Oct 08, 2008 at 08:03:40AM +0000, Jarek Poplawski wrote:
> ...
> > > OK. But as Patrick mentioned it would be interesting to try a little
> > > below hardware limits: 950, or maybe lower, until HTB starts getting
> > > accuracy.
> >
> > Hi,
>
> Hi Simon,
>
> > it seems that for my particular setup the magic number is 935Mbit/s.

[snip]

> It's hard to believe so small a change can matter so much here!!!
> Great testing!

Thanks, I was rather surprised myself.

> It seems it's safer to use something below such a magic number, and
> generally 10% below the hardware limit could be the rule.

That kind of rule would certainly solve this case, but it would be
nice for such a rule not to be necessary.

> It also looks like my idea that the first class could get not enough
> packets was wrong. It seems it's rather blocked by other classes when
> HTB thinks it can lend more.
>
> Anyway, I'm also surprised HTB could still be so exact: congratulations
> Martin!
>
> Simon, if I don't miss something, I guess you should be OK with these
> last changes in requeuing? The older way provided some safety for such
> overestimated configs by hiding the real problem, but there were the
> costs: cpu etc.

I think that, as Martin pointed out in another post, there are really
too many hypotheses in play at this point.
I think that it would be nice to try and get to the bottom of this,
but I'm prepared to concede that it isn't a show-stopper.
>> r2q affects the computation for all classes
>
> Sorry, I wasn't aware of that parameter.
> I ran a test now. With the default quantum (6250) for the 500Mbit/s
> class I got:
>
> 10194: 383384949bits/s 383Mbits/s
> 10197: 285879669bits/s 285Mbits/s
> 10196: 287648424bits/s 287Mbits/s
> -----------------------------------
> total: 956913042bits/s 956Mbits/s
>
> And with a quantum of 50000 for the 500Mbit/s class I got:
>
> 10194: 396345613bits/s 396Mbits/s
> 10197: 279511994bits/s 279Mbits/s
> 10196: 281062498bits/s 281Mbits/s
> -----------------------------------
> total: 956920106bits/s 956Mbits/s

Thanks! It seems there is no strong dependence. I'm out of simple
explanations and I'd have to simulate it myself. Unfortunately I can't
say when I can do that, as I'm in a complicated part of life now...
So let's live with that 90%-of-max-rate limitation for now.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Simon Horman wrote:
>> It seems it's safer to use something below such a magic number, and
>> generally 10% below the hardware limit could be the rule.
>
> That kind of rule would certainly solve this case,
> but it would be nice for such a rule not to be necessary.

It would be interesting to see whether higher burst rates allow you to
use higher rates. Userspace automatically calculates the bursts based
on the timer frequency, which requires that the timers actually reach
that frequency during runtime. With high packet rates that might not
always be the case, so perhaps we can do better by adding a few
percent on top.
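Patrick's point can be sketched numerically: to sustain a given rate, each timer tick must be able to release at least one tick's worth of bytes, so the minimum burst is roughly rate divided by the timer frequency, padded a little for ticks that fire late. The numbers below are illustrative (a 1000Hz timer is an assumption, not something stated in the thread):

```python
def min_burst_bytes(rate_bps, timer_hz, headroom=0.0):
    """Bytes that must be releasable per timer tick to sustain rate_bps,
    optionally padded by a fractional headroom for late-firing timers."""
    per_tick = rate_bps / 8 / timer_hz
    return int(per_tick * (1 + headroom))

# 1Gbit/s with an assumed 1000Hz timer needs 125000 bytes of burst
# per tick; a 5% pad covers ticks that arrive slightly late:
print(min_burst_bytes(1_000_000_000, 1000))        # -> 125000
print(min_burst_bytes(1_000_000_000, 1000, 0.05))  # -> 131250
```

If the timer in practice fires less often than its nominal frequency, the unpadded burst is too small to keep the pipe full, which is exactly why a few percent on top might help.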
On Thu, Oct 09, 2008 at 10:18:38PM +1100, Simon Horman wrote:
> On Thu, Oct 09, 2008 at 06:21:45AM +0000, Jarek Poplawski wrote:
...
> > It seems it's safer to use something below such a magic number, and
> > generally 10% below the hardware limit could be the rule.
>
> That kind of rule would certainly solve this case,
> but it would be nice for such a rule not to be necessary.

As a matter of fact, I think most admins are used to this, especially
with various dsls etc. As you can see, your 1Gbit isn't the real rate
here, so what you propose amounts to ignoring the parameters and doing
autodetection, which is possible but probably erroneous as well.

> > Simon, if I don't miss something, I guess you should be OK with these
> > last changes in requeuing? The older way provided some safety for such
> > overestimated configs by hiding the real problem, but there were
> > costs: cpu etc.
>
> I think that, as Martin pointed out in another post, there are really
> too many hypotheses going around at this point.
> I think that it would be nice to try and get to the bottom of this,
> but I'm prepared to concede that it isn't a show-stopper.

Actually, I really doubt it's any stopper at all. It's a low-level
tool, not intended for common use, which really needs some tweaking.
(Some people still prefer CBQ for more control...)

Thanks,
Jarek P.
On 08-10-2008 00:00, Jarek Poplawski wrote:
> Patrick McHardy wrote, On 10/07/2008 02:48 PM:
...
>> I still can't really make anything of this bug, but the only two
>> visible differences to HTB resulting from requeuing at an upper level
>> should be that
>>
>> 1) it doesn't reactivate classes that went passive on the last dequeue
>> 2) the time checkpoint from the last dequeue event is different
>>
>> I guess it's in fact the second thing: if a lower-priority packet
>> is requeued and dequeued again, HTB doesn't notice and might allow
>> the class to send earlier again than it would have previously.
>
> With high requeuing the timing has to be wrong, but I'm not sure why
> just the lower priority has to gain here.

I think it's quite simple and not hypothetical: lower-priority classes
gain from overestimated ceils, not from requeuing. Because of lending
over the hardware limit, all buckets get more tokens than their rate
allows, and they are allowed to send more and more until the hardware
stops them. HTB still doesn't know about this and can lend until the
mbuffer limit is reached. So at some moment all buffers are full,
everyone can send (bigger buffers could still give some advantage),
and everything is controlled only by the hardware. This is a clear
gain for the less privileged.

Requeuing can actually prevent this in some way, which IMHO shouldn't
matter here because it's not the intended way to shape the traffic.
But we could consider whether, after removing the requeuing that
mattered here, some change is needed in the "proper" way of limiting
such effects of wrong parameters or hardware errors (like the size of
mbuffer etc.)?

Thanks,
Jarek P.
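Jarek's argument above can be illustrated with a toy token-bucket simulation: when the configured ceil exceeds what the hardware can actually send, the surplus tokens accumulate every interval until the bucket saturates at its cap (playing the role HTB's mbuffer limit plays here), after which the class is always eligible to send and only the hardware arbitrates. All numbers below are illustrative, not taken from the thread:

```python
def simulate(ceil_bps, hw_bps, cap_bytes, seconds):
    """Track the token surplus of one class whose ceil exceeds what the
    hardware delivers to it; accumulation is bounded by cap_bytes."""
    tokens = 0
    for _ in range(seconds):
        tokens += ceil_bps // 8            # tokens granted at the configured ceil
        tokens -= min(tokens, hw_bps // 8)  # hardware drains less than the ceil
        tokens = min(tokens, cap_bytes)     # bounded accumulation (mbuffer role)
    return tokens

# Hypothetical numbers: ceil says 500Mbit/s but the NIC only delivers
# 300Mbit/s to this class, so 25MB/s of surplus accumulates and the
# bucket saturates at its 50MB cap after two seconds:
print(simulate(500_000_000, 300_000_000, 50_000_000, 10))  # -> 50000000
```

Once every class sits at its cap like this, the shaper no longer differentiates between them, which matches Jarek's observation that the less-privileged classes are the ones that gain.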
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 31f6b61..d2e0da6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,7 +44,10 @@ static inline int qdisc_qlen(struct Qdisc *q)
 
 static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
-	q->gso_skb = skb;
+	if (unlikely(skb->next))
+		q->gso_skb = skb;
+	else
+		q->ops->requeue(skb, q);
 	__netif_schedule(q);
 
 	return 0;