Patchwork HTB accuracy for high speed

Submitter Jarek Poplawski
Date May 16, 2009, 2:14 p.m.
Message ID <20090516141430.GB3013@ami.dom.local>
Permalink /patch/27303/
State RFC
Delegated to: David Miller

Comments

Jarek Poplawski - May 16, 2009, 2:14 p.m.
On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:
...
> I also note that, for HTB rate configurations over 500Mbit/s on leaf
> class, when I stop the traffic, in the output of "tc -s -d class ls
> dev eth1" command, I see that leaf's rate (in bits/s) is growing
> instead of decreasing (as expected since I've stopped the traffic).
> Rate in pps is ok and decreases until 0pps. Rate in bits/s increases
> above 1000Mbit and stays there for a few minutes. After two or three
> minutes it becomes 0bit. The same happens for its ancestors (also for
> root class). Here's tc output of my leaf class for this situation:
> 
> class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
> 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
> 70901b/8 mpu 0b overhead 0b level 0
>  Sent 120267768144 bytes 242475339 pkt (dropped 62272599, overlimits 0
> requeues 0)
>  rate 1074Mbit 0pps backlog 0b 0p requeues 0
>  lended: 242475339 borrowed: 0 giants: 0
>  tokens: 8 ctokens: 8

This looks like a regular bug. I guess it's an overflow in
gen_estimator(), but I'm not sure that's the whole story. Could you
try the patch below? (An offset warning when patching 2.6.25 is OK.)

Thanks,
Jarek P.
---

 net/core/gen_estimator.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)
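
To illustrate the suspected overflow, here is a minimal C sketch (not kernel code) that compares the old and the patched averaging step. It assumes a 32-bit `long`, emulated here with `int32_t`, and relies on the wrapping conversion and arithmetic right shift that gcc provides. The function names `ewma_old` and `ewma_new` and the `ewma_log` value mentioned below are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Old update, emulating the (long) casts on a 32-bit build.
 * Assumption: converting to int32_t wraps (two's complement) and >> on a
 * negative value is an arithmetic shift, as with gcc.
 */
static uint32_t ewma_old(uint32_t avbps, uint32_t rate, unsigned int ewma_log)
{
	assert(ewma_log < 32);
	/* Subtract as unsigned to avoid signed-overflow UB, then reinterpret. */
	int32_t diff = (int32_t)(rate - avbps);
	return avbps + (uint32_t)(diff >> ewma_log);
}

/* Patched update: pick the direction first, so every term stays unsigned. */
static uint32_t ewma_new(uint32_t avbps, uint32_t rate, unsigned int ewma_log)
{
	assert(ewma_log < 32);
	if (rate > avbps)
		return avbps + ((rate - avbps) >> ewma_log);
	return avbps - ((avbps - rate) >> ewma_log);
}
```

With a scaled average of 2220000000 (555000Kbit expressed in bytes per second, times 32), a measured rate of 0 and an assumed `ewma_log` of 3, `ewma_old` returns a larger value than it was given, while `ewma_new` returns a smaller one. For averages below 2^31 the two functions agree.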

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Antonio Almeida - May 18, 2009, 2:36 p.m.
This patch works perfectly!
rate (bits/s) is now decreasing along with pps when I stop the traffic
(doesn't grow as it used to for rates over 500Mbit/s).

# tc -s -d class ls dev eth1 | head -21 | tail -1
 rate 651960Kbit 97482pps backlog 0b 0p requeues 0
 rate 541134Kbit 80911pps backlog 0b 0p requeues 0
 rate 405850Kbit 60683pps backlog 0b 0p requeues 0
 rate 304388Kbit 45512pps backlog 0b 0p requeues 0
 rate 304388Kbit 45512pps backlog 0b 0p requeues 0
 rate 228291Kbit 34134pps backlog 0b 0p requeues 0
 rate 171218Kbit 25601pps backlog 0b 0p requeues 0
 rate 171218Kbit 25601pps backlog 0b 0p requeues 0
 rate 128414Kbit 19201pps backlog 0b 0p requeues 0
 rate 96310Kbit 14400pps backlog 0b 0p requeues 0
 rate 96310Kbit 14400pps backlog 0b 0p requeues 0
 rate 72233Kbit 10800pps backlog 0b 0p requeues 0
 rate 54174Kbit 8100pps backlog 0b 0p requeues 0


Thanks to you!
  Antonio Almeida




On Sat, May 16, 2009 at 3:14 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:
> ...
>> I also note that, for HTB rate configurations over 500Mbit/s on leaf
>> class, when I stop the traffic, in the output of "tc -s -d class ls
>> dev eth1" command, I see that leaf's rate (in bits/s) is growing
>> instead of decreasing (as expected since I've stopped the traffic).
>> Rate in pps is ok and decreases until 0pps. Rate in bits/s increases
>> above 1000Mbit and stays there for a few minutes. After two or three
>> minutes it becomes 0bit. The same happens for its ancestors (also for
>> root class). Here's tc output of my leaf class for this situation:
>>
>> class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
>> 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
>> 70901b/8 mpu 0b overhead 0b level 0
>>  Sent 120267768144 bytes 242475339 pkt (dropped 62272599, overlimits 0
>> requeues 0)
>>  rate 1074Mbit 0pps backlog 0b 0p requeues 0
>>  lended: 242475339 borrowed: 0 giants: 0
>>  tokens: 8 ctokens: 8
>
> This looks like a regular bug. I guess it's an overflow in
> gen_estimator(), but I'm not sure there is nothing more. Could you
> try the patch below? (An offset warning when patching 2.6.25 is OK)
>
> Thanks,
> Jarek P.
> ---
>
>  net/core/gen_estimator.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> index 9cc9f95..87f0ced 100644
> --- a/net/core/gen_estimator.c
> +++ b/net/core/gen_estimator.c
> @@ -127,7 +127,11 @@ static void est_timer(unsigned long arg)
>                npackets = e->bstats->packets;
>                rate = (nbytes - e->last_bytes)<<(7 - idx);
>                e->last_bytes = nbytes;
> -               e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
> +               if (rate > e->avbps)
> +                       e->avbps += (rate - e->avbps) >> e->ewma_log;
> +               else
> +                       e->avbps -= (e->avbps - rate) >> e->ewma_log;
> +
>                e->rate_est->bps = (e->avbps+0xF)>>5;
>
>                rate = (npackets - e->last_packets)<<(12 - idx);
>
Vladimir Ivashchenko - May 18, 2009, 11:14 p.m.
On Mon, 2009-05-18 at 15:36 +0100, Antonio Almeida wrote:
> This patch works perfectly!
> rate (bits/s) is now decreasing along with pps when I stop the traffic
> (doesn't grow as it used to for rates over 500Mbit/s).

I'm not able to reach full speed with bond + HTB + sfq on 2.6.29.1, both
with and without these patches. I seem to get a lot of drops on sfq
qdiscs, whatever quantum I set. Playing with IRQ affinity doesn't help.
I didn't check without bond.

With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
overspill with 580 mbps load. Jarek, would your patches help with HFSC
overspill? I will check tomorrow under 750 mbps load.

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
large receive offload: off

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Vladimir Ivashchenko - May 18, 2009, 11:27 p.m.
> With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
> overspill with 580 mbps load. Jarek, would your patches help with HFSC
> overspill? I will check tomorrow under 750 mbps load.

Please disregard my comment about HFSC. It still overspills heavily.

On a 400 mbps limit, I'm getting 520 mbps actual throughput.
Jarek Poplawski - May 19, 2009, 11:03 a.m.
On Tue, May 19, 2009 at 02:27:47AM +0300, Vladimir Ivashchenko wrote:
> 
> > With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
> > overspill with 580 mbps load. Jarek, would your patches help with HFSC
> > overspill? I will check tomorrow under 750 mbps load.

The gen_estimator patch should fix only the effect of rising rate
after flow stop, and maybe similar overflows while reporting rates
around 1Gbit. It would show on tc stats of HFSC or HTB, but doesn't
affect actual scheduling rates.
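
The reporting side can be sketched in C as well. The patch context shows the reported value computed as `(e->avbps+0xF)>>5`, so the average is kept scaled by 32. Assuming `avbps` is a 32-bit unsigned value holding bytes per second times 32, and that tc prints the byte rate times 8, the thresholds seen in this thread follow. The helper names are hypothetical.

```c
#include <stdint.h>

/* Reported byte rate from the scaled average, as in the patch context. */
static uint32_t reported_bytes_per_sec(uint32_t avbps)
{
	return (avbps + 0xF) >> 5;	/* wraps if avbps is within 15 of 2^32 */
}

/* Lowest rate in bit/s whose scaled average needs the top (sign) bit. */
static uint64_t sign_bit_threshold_bits(void)
{
	return ((UINT64_C(1) << 31) >> 5) * 8;	/* 2^31 / 32 bytes/s, in bits */
}
```

Under these assumptions the sign bit is reached at 536870912 bit/s, which fits the report that only rates over 500Mbit/s misbehave. The largest value that can be reported is 134217727 bytes/s, about 1074Mbit, as in Antonio's output, and an average within 15 of 2^32 wraps in the addition and reports 0.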

The iproute2 tc_core patch can matter for HTB scheduling rates if
there are a lot of small packets (e.g. 100 byte for rate 500Mbit)
possibly mixed with bigger ones. It doesn't matter for HFSC or
rates <100Mbit.

> Please disregard my comment about HFSC. It still overspills heavily.
> 
> On a 400 mbps limit, I'm getting 520 mbps actual throughput.

I guess you should send some logs. Your previous report seems to show
that the sum of the children's sc rates could be too high. You seem to
expect the parent's sc and ul to limit this, but actually the children's
rates decide, and the parent's rates are mainly for lending/borrowing (at
least in HTB). So it would be nice to try with one leaf class first
(similarly to Antonio) to see how well high rates are respected.

High drop should be OK if the flow is much faster than scheduling/
hardware send rate. It could be a bit higher than in older kernels
because of limited requeuing, but this could be corrected with
longer queue lengths (sfq has a very short queue: max 127).

Jarek P.
Vladimir Ivashchenko - May 19, 2009, 2:04 p.m.
> > Please disregard my comment about HFSC. It still overspills heavily.
> > 
> > On a 400 mbps limit, I'm getting 520 mbps actual throughput.
> 
> I guess you should send some logs. Your previous report seem to show

Can you give some hints on which logs you would like to see?

> the sum of sc rates of of children could be too high. You seem to
> expect the parent's sc and ul should limit this, but actually children
> rates decide and parent's rates are mainly for lending/borrowing (at

The children's ceil rate is 70% of the parent 1:2 class rate.

> least in HTB). So, it would be nice to try with one leaf class first,
> (similarly to Antonio) how high rates are respected.

Unfortunately it's difficult for me to play with classes as it's real traffic.
I'll try to get a traffic generator.

> High drop should be OK if the flow is much faster than scheduling/
> hardware send rate. It could be a bit higher than in older kernels
> because of limited requeuing, but this could be corrected with
> longer queue lengths (sfq has a very short queue: max 127).

I don't think it's sfq, since I have the same sfq qdiscs with HFSC.

Also I'm comparing this to my production HTB box, which runs 2.6.21.5 with
esfq and no bond (just eth); esfq also has a 127p limit.

I tried to get rid of bond on the outbound traffic, I balanced traffic
via eth0 and eth2 manually by splitting routes going through them.

I still had the same issue with HTB not reaching the full speed.

I'm going to try testing exactly the same configuration on 2.6.29 as I have
on 2.6.21.5 tonight. The only difference would be that I use sfq(dst) instead of
esfq(dst) which is not available on 2.6.29.
Jarek Poplawski - May 19, 2009, 8:10 p.m.
On Tue, May 19, 2009 at 05:04:16PM +0300, Vladimir Ivashchenko wrote:
> > > Please disregard my comment about HFSC. It still overspills heavily.
> > > 
> > > On a 400 mbps limit, I'm getting 520 mbps actual throughput.
> > 
> > I guess you should send some logs. Your previous report seem to show
> 
> Can you give some hints on which logs you would like to see?

Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at
the beginning and at the end of testing.

> > the sum of sc rates of of children could be too high. You seem to
> > expect the parent's sc and ul should limit this, but actually children
> > rates decide and parent's rates are mainly for lending/borrowing (at
> 
> The children's ceil rate is 70% of the parent 1:2 class rate.

How about children's main rates?

> > least in HTB). So, it would be nice to try with one leaf class first,
> > (similarly to Antonio) how high rates are respected.
> 
> Unfortunately its difficult for me to play with classes as its real traffic. 
> I'll try to get a traffic generator.

Let it be the real traffic, but please re-check these rates sums.

> > High drop should be OK if the flow is much faster than scheduling/
> > hardware send rate. It could be a bit higher than in older kernels
> > because of limited requeuing, but this could be corrected with
> > longer queue lengths (sfq has a very short queue: max 127).
> 
> I don't think its sfq, since I have the same sfq qdiscs with HSFC.
> 
> Also I'm comparing this to my production HTB box has 2.6.21.5 with esfq 
> and no bond (just eth), esfq also has 127p limit.
> 
> I tried to get rid of bond on the outbound traffic, I balanced traffic
> via eth0 and eth2 manually by splitting routes going through them.
> 
> I still had the same issue with HTB not reaching the full speed.
> 
> I'm going to try testing exactly the same configuration on 2.6.29 as I have
> on 2.6.21.5 tonight. The only difference would be that I use sfq(dst) instead of
> esfq(dst) which is not available on 2.6.29.

I'm a bit lost about your configs/results and not reaching vs.
overspilled, so please send some new data to compare (gzipped?).

Jarek P. 
Vladimir Ivashchenko - May 20, 2009, 10:07 p.m.
> > > 
> > > I guess you should send some logs. Your previous report seem to show
> > 
> > Can you give some hints on which logs you would like to see?
> 
> Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at
> the beginning and at the end of testing.

Ok, it seems that I finally found what is causing my HTB on 2.6.29 not
to reach full throughput: dst hashing on sfq with high divisor value.

2.6.21 esfq divisor 13 depth 4096 hash dst - 680 mbps
2.6.29 sfq WITHOUT "flow hash keys dst ... " (default sfq) - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 64" filter - 680 mbps
2.6.29 sfq + "flow hash keys dst divisor 256" filter - 660 mbps
2.6.29 sfq + "flow hash keys dst divisor 2048" filters - 460 mbps

I'm using high sfq hash divisor in order to decrease the number of
collisions, there are several thousands of hosts behind each of the
classes. 

Any ideas why increasing the sfq divisor size results in a drop in
throughput?

Attached are diagnostics gathered in case of divisor 2048.
Eric Dumazet - May 20, 2009, 10:46 p.m.
Vladimir Ivashchenko wrote:
>>>> I guess you should send some logs. Your previous report seem to show
>>> Can you give some hints on which logs you would like to see?
>> Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at
>> the beginning and at the end of testing.
> 
> Ok, it seems that I finally found what is causing my HTB on 2.6.29 not
> to reach full throughput: dst hashing on sfq with high divisor value.
> 
> 2.6.21 esfq divisor 13 depth 4096 hash dst - 680 mbps
> 2.6.29 sfq WITHOUT "flow hash keys dst ... " (default sfq) - 680 mbps
> 2.6.29 sfq + "flow hash keys dst divisor 64" filter - 680 mbps
> 2.6.29 sfq + "flow hash keys dst divisor 256" filter - 660 mbps
> 2.6.29 sfq + "flow hash keys dst divisor 2048" filters - 460 mbps
> 
> I'm using high sfq hash divisor in order to decrease the number of
> collisions, there are several thousands of hosts behind each of the
> classes. 
> 
> Any ideas why increasing the sfq divisor size results in drop of
> throughput ?
> 
> Attached are diagnostics gathered in case of divisor 2048.
> 


But... it appears sfq currently supports a fixed divisor of 1024

net/sched/sch_sfq.c

 IMPLEMENTATION:
 This implementation limits maximal queue length to 128;
 maximal mtu to 2^15-1; number of hash buckets to 1024.
 The only goal of this restrictions was that all data
 fit into one 4K page :-). Struct sfq_sched_data is
 organized in anti-cache manner: all the data for a bucket
 are scattered over different locations. This is not good,
 but it allowed me to put it into 4K.

 It is easy to increase these values, but not in flight.  */

#define SFQ_DEPTH   128
#define SFQ_HASH_DIVISOR    1024


Apparently Corey Hickey's 2007 work on SFQ was not merged.

http://kerneltrap.org/mailarchive/linux-netdev/2007/9/28/325048
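
As a rough illustration of the mismatch, the C sketch below counts how many results of a filter using a given divisor fall outside the fixed bucket count quoted above. It assumes the filter yields values spread evenly over `divisor` slots. The thread does not say how sfq treats results beyond its 1024 buckets, so the sketch only counts them. The helper name is hypothetical.

```c
#include <stdint.h>

#define SFQ_HASH_DIVISOR 1024	/* fixed bucket count from sch_sfq.c */

/* Hypothetical helper: filter slots that have no matching sfq bucket. */
static uint32_t slots_without_bucket(uint32_t filter_divisor)
{
	if (filter_divisor <= SFQ_HASH_DIVISOR)
		return 0;
	return filter_divisor - SFQ_HASH_DIVISOR;
}
```

For the divisors tried in the thread, 64 and 256 give 0, while 2048 gives 1024, that is, half of the slots.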


Jarek Poplawski - May 21, 2009, 7:20 a.m.
On Thu, May 21, 2009 at 12:46:16AM +0200, Eric Dumazet wrote:
> Vladimir Ivashchenko wrote:
> >>>> I guess you should send some logs. Your previous report seem to show
> >>> Can you give some hints on which logs you would like to see?
> >> Similarly to Antonio's: ifconfigs and tc -s for qdiscs and classes at
> >> the beginning and at the end of testing.
> > 
> > Ok, it seems that I finally found what is causing my HTB on 2.6.29 not
> > to reach full throughput: dst hashing on sfq with high divisor value.
> > 
> > 2.6.21 esfq divisor 13 depth 4096 hash dst - 680 mbps
> > 2.6.29 sfq WITHOUT "flow hash keys dst ... " (default sfq) - 680 mbps
> > 2.6.29 sfq + "flow hash keys dst divisor 64" filter - 680 mbps
> > 2.6.29 sfq + "flow hash keys dst divisor 256" filter - 660 mbps
> > 2.6.29 sfq + "flow hash keys dst divisor 2048" filters - 460 mbps
> > 
> > I'm using high sfq hash divisor in order to decrease the number of
> > collisions, there are several thousands of hosts behind each of the
> > classes. 
> > 
> > Any ideas why increasing the sfq divisor size results in drop of
> > throughput ?
> > 
> > Attached are diagnostics gathered in case of divisor 2048.
> > 
> 
> 
> But... it appears sfq currently supports a fixed divisor of 1024
> 
> net/sched/sch_sfq.c
> 
>  IMPLEMENTATION:
>  This implementation limits maximal queue length to 128;
>  maximal mtu to 2^15-1; number of hash buckets to 1024.
>  The only goal of this restrictions was that all data
>  fit into one 4K page :-). Struct sfq_sched_data is
>  organized in anti-cache manner: all the data for a bucket
>  are scattered over different locations. This is not good,
>  but it allowed me to put it into 4K.
> 
>  It is easy to increase these values, but not in flight.  */
> 
> #define SFQ_DEPTH   128
> #define SFQ_HASH_DIVISOR    1024
> 
> 
> Apparently Corey Hickey 2007 work on SFQ was not merged.
> 
> http://kerneltrap.org/mailarchive/linux-netdev/2007/9/28/325048

Yes, sfq has its design limits, and as a matter of fact, because of
max length (127) it should be treated as a toy or "personal" qdisc.

I don't know why more of esfq wasn't merged; anyway, similar
functionality could be achieved in current kernels with sch_drr +
cls_flow, alas not documented well enough. Here is a hint:
http://markmail.org/message/h24627xkrxyqxn4k

Jarek P.

PS: I guess you weren't very consistent about whether your main problem
was exceeding or not reaching the htb rate, and there is quite a difference.

Vladimir Ivashchenko wrote, On 05/08/2009 10:46 PM:

> Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> 2.6.29 is still worse at rate control than 2.6.21.
> 
> With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> With 2.6.29, same ceil/burst -> actual rate 890 mbits.
...

Vladimir Ivashchenko wrote, On 05/17/2009 10:29 PM:

> Hi Antonio,
> 
> FYI, these are exactly the same problems I get in real life.
> Check the later posts in "bond + tc regression" thread.
...
Vladimir Ivashchenko - May 21, 2009, 7:44 a.m.
> I don't know why more of esfq wasn't merged, anyway similar
> functionality could be achieved in current kernels with sch_drr +
> cls_flow, alas not enough documented. Here is some hint:
> http://markmail.org/message/h24627xkrxyqxn4k

Can I balance only by destination IP using this approach? 
Normal IP flow-based balancing is not good for me, I need 
to ensure equality between destination hosts.

> 
> Jarek P.
> 
> PS: I guess, you wasn't very consistent if your main problem was
> exceeding or not reaching htb rate, and there is quite a difference.

Yes indeed :(

I'm trying to migrate from 2.6.21 eth/htb/esfq to 2.6.29 
bond/htb/sfq, and that introduces a lot of changes.

Apparently at some point I changed the sfq divisor from 1024
to 2048 and forgot about it.

Now I realize that the problems I reported were as follows:

1) HTB exceeds target when I use HTB + sfq + divisor 1024
2) HFSC exceeds target when I use HFSC + sfq + divisor 1024
3) HTB does not reach target when I use HTB + sfq + divisor 2048

I will check again scenario 1) with the latest patches from
the list.

> Vladimir Ivashchenko wrote, On 05/08/2009 10:46 PM:
> 
> > Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> > to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> > 2.6.29 is still worse at rate control than 2.6.21.
> > 
> > With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> > With 2.6.29, same ceil/burst -> actual rate 890 mbits.
> ...
> 
> Vladimir Ivashchenko wrote, On 05/17/2009 10:29 PM:
> 
> > Hi Antonio,
> > 
> > FYI, these are exactly the same problems I get in real life.
> > Check the later posts in "bond + tc regression" thread.
> ...
Jarek Poplawski - May 21, 2009, 8:28 a.m.
On Thu, May 21, 2009 at 10:44:00AM +0300, Vladimir Ivashchenko wrote:
> > I don't know why more of esfq wasn't merged, anyway similar
> > functionality could be achieved in current kernels with sch_drr +
> > cls_flow, alas not enough documented. Here is some hint:
> > http://markmail.org/message/h24627xkrxyqxn4k
> 
> Can I balance only by destination IP using this approach? 
> Normal IP flow-based balancing is not good for me, I need 
> to ensure equality between destination hosts.

Yes, you need to use the flow "dst" key, I guess. (tc filter add flow help)

Jarek P.

> > PS: I guess, you wasn't very consistent if your main problem was
> > exceeding or not reaching htb rate, and there is quite a difference.
> 
> Yes indeed :(

Generally, the most common reasons are:
- too short (or zero) tx queue length and/or some disturbances in
  maintaining the flow - for not reaching the rate
- gso/tso or other non-standard packet sizes - for exceeding the
  rate.
Eric Dumazet - May 21, 2009, 9:07 a.m.
Jarek Poplawski wrote:
> On Thu, May 21, 2009 at 10:44:00AM +0300, Vladimir Ivashchenko wrote:
>>> I don't know why more of esfq wasn't merged, anyway similar
>>> functionality could be achieved in current kernels with sch_drr +
>>> cls_flow, alas not enough documented. Here is some hint:
>>> http://markmail.org/message/h24627xkrxyqxn4k
>> Can I balance only by destination IP using this approach? 
>> Normal IP flow-based balancing is not good for me, I need 
>> to ensure equality between destination hosts.
> 
> Yes, you need to use flow "dst" key, I guess. (tc filter add flow help)
> 
> Jarek P.
> 
>>> PS: I guess, you wasn't very consistent if your main problem was
>>> exceeding or not reaching htb rate, and there is quite a difference.
>> Yes indeed :(
> 
> Generally, the most common reasons are:
> - too short (or zero) tx queue length or/plus some disturbances in
>   maintaining the flow - for not reaching the rate


> - gso/tso or other non standard packets sizes - for exceeding the
>   rate.

Could we detect this at runtime and emit a warning (once)?

Or should we assume guys using this stuff should be smart enough?
I confess I made this error once and this was not so easy to spot...

Jarek Poplawski - May 21, 2009, 9:22 a.m.
On Thu, May 21, 2009 at 11:07:24AM +0200, Eric Dumazet wrote:
...
> > - gso/tso or other non standard packets sizes - for exceeding the
> >   rate.
> 
> Could we detect this at runtime and emit a warning (once) ?

I guess, it's a rhetorical question...

Jarek P.
Vladimir Ivashchenko - May 23, 2009, 10:37 a.m.
> > > cls_flow, alas not enough documented. Here is some hint:
> > > http://markmail.org/message/h24627xkrxyqxn4k
> > 
> > Can I balance only by destination IP using this approach? 
> > Normal IP flow-based balancing is not good for me, I need 
> > to ensure equality between destination hosts.
> 
> Yes, you need to use flow "dst" key, I guess. (tc filter add flow
> help)

What is the number of DRR classes I need to create, a separate class for
each host? I have around 20000 hosts.

I figured out that WRR does what I want and it's documented, so I'm using
a 2.6.27 kernel with WRR now.

I was still hitting a wall with bonding. I played with a lot of
combinations and could not find a way to make it scale to multiple
cores. Cores which handle incoming traffic would get hit to 0-20% idle.

So, I got rid of bonding completely and instead configured PBR on Cisco
+ Linux routing in such a way that a packet gets received and
transmitted using NICs connected to the same pair of cores with common
cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
case scenarios before.

> - gso/tso or other non standard packets sizes - for exceeding the
>   rate.

Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso
off, netdevice.h and tc_core.c patches applied:

class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst
98328b
Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
rate 821332Kbit 112572pps backlog 0b 0p requeues 0
lended: 21736738 borrowed: 0 giants: 0

In any case, exceeding the rate is not a big problem for me.

Thanks a lot to everyone for their help.
Jarek Poplawski - May 23, 2009, 2:34 p.m.
On Sat, May 23, 2009 at 01:37:32PM +0300, Vladimir Ivashchenko wrote:
> 
> > > > cls_flow, alas not enough documented. Here is some hint:
> > > > http://markmail.org/message/h24627xkrxyqxn4k
> > > 
> > > Can I balance only by destination IP using this approach? 
> > > Normal IP flow-based balancing is not good for me, I need 
> > > to ensure equality between destination hosts.
> > 
> > Yes, you need to use flow "dst" key, I guess. (tc filter add flow
> > help)
> 
> What is the number of DRR classes I need to create, a separate class for
> each host? I have around 20000 hosts.

One class per divisor.

> I figured out that WRR does what I want and its documented, so I'm using
> a 2.6.27 kernel with WRR now.

OK if it works for you.
 
> I was still hitting a wall with bonding. I played with a lot of
> combinations and could not find a way to make it scale to multiple
> cores. Cores which handle incoming traffic would get hit to 0-20% idle.
> 
> So, I got rid of bonding completely and instead configured PBR on Cisco
> + Linux routing in such a way so that packet gets received and
> transmitted using NICs connected to the same pair of cores with common
> cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> case scenarios before.

As a matter of fact I don't understand this bonding idea vs. smp: I
guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
expert but I think the most efficient use is with separate NICs per
cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
but they would currently need a common HTB etc., so again a common
locking/cache problem.

> > - gso/tso or other non standard packets sizes - for exceeding the
> >   rate.
> 
> Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso
> off, netdevice.h and tc_core.c patches applied:
> 
> class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst
> 98328b
> Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
> rate 821332Kbit 112572pps backlog 0b 0p requeues 0
> lended: 21736738 borrowed: 0 giants: 0
> 
> In any case, exceeding the rate is not big of a problem for me.

Anyway, I'd be interested in the full tc -s class & qdisc report.

Thanks,
Jarek P.
Vladimir Ivashchenko - May 23, 2009, 3:06 p.m.
> > So, I got rid of bonding completely and instead configured PBR on Cisco
> > + Linux routing in such a way so that packet gets received and
> > transmitted using NICs connected to the same pair of cores with common
> > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > case scenarios before.
> 
> As a matter of fact I don't understand this bonding idea vs. smp: I
> guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> expert but I think the most efficient use is with separate NICs per
> cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -

I tried the following scenario: 2 NICs used for receive + another 2 NICs 
used for transmit having HTB. Each NIC on a separate core. No bonding, 
just manual load balancing using IP routing.

The result was that RX cores would be 20% and 40% idle respectively, even 
though the amount of traffic they were receiving was roughly the same. 
The TX cores were idling at around 90%. 

I found this strange personally, but I'm completely ignorant in internals of
kernel operation.
Jarek Poplawski - May 23, 2009, 3:35 p.m.
On Sat, May 23, 2009 at 06:06:30PM +0300, Vladimir Ivashchenko wrote:
> > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > + Linux routing in such a way so that packet gets received and
> > > transmitted using NICs connected to the same pair of cores with common
> > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > case scenarios before.
> > 
> > As a matter of fact I don't understand this bonding idea vs. smp: I
> > guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> > expert but I think the most efficient use is with separate NICs per
> > cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
> 
> I tried the following scenario: 2 NICs used for receive + another 2 NICs 
> used for transmit having HTB. Each NIC on a separate core. No bonding, 
> just manual load balancing using IP routing.
> 
> The result was that RX cores would be 20% and 40% idle respectively, even 
> though the amount of traffic they were receiving was roughly the same. 
> The TX cores were idling at around 90%. 

There is not enough data to analyse this, but generally you should aim
at maintaining one flow (RX + TX) on the same cpu cache.

Jarek P.
Vladimir Ivashchenko - May 23, 2009, 3:53 p.m.
> > > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > > + Linux routing in such a way so that packet gets received and
> > > > transmitted using NICs connected to the same pair of cores with common
> > > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > > case scenarios before.
> > > 
> > > As a matter of fact I don't understand this bonding idea vs. smp: I
> > > guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> > > expert but I think the most efficient use is with separate NICs per
> > > cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
> > 
> > I tried the following scenario: 2 NICs used for receive + another 2 NICs 
> > used for transmit having HTB. Each NIC on a separate core. No bonding, 
> > just manual load balancing using IP routing.
> > 
> > The result was that RX cores would be 20% and 40% idle respectively, even 
> > though the amount of traffic they were receiving was roughly the same. 
> > The TX cores were idling at around 90%. 
> 
> There is not enough data to analyse this, but generally you should aim
> at maintaining one flow (RX + TX) on the same cpu cache.

Yep, that's what I did in the end (as per the top paragraph).
Jarek Poplawski - May 23, 2009, 4:02 p.m.
On Sat, May 23, 2009 at 06:53:21PM +0300, Vladimir Ivashchenko wrote:
> > > > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > > > + Linux routing in such a way so that packet gets received and
> > > > > transmitted using NICs connected to the same pair of cores with common
> > > > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > > > case scenarios before.
...
> > There is not enough data to analyse this, but generally you should aim
> > at maintaining one flow (RX + TX) on the same cpu cache.
> 
> Yep, that's what I did in the end (as per the top paragraph).

So, stop writing: "I'm completely ignorant in internals of kernel
operation" because you're an smp expert now! ;-)

Jarek P.

Patch

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..87f0ced 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -127,7 +127,11 @@  static void est_timer(unsigned long arg)
 		npackets = e->bstats->packets;
 		rate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+		if (rate > e->avbps)
+			e->avbps += (rate - e->avbps) >> e->ewma_log;
+		else
+			e->avbps -= (e->avbps - rate) >> e->ewma_log;
+
 		e->rate_est->bps = (e->avbps+0xF)>>5;
 
 		rate = (npackets - e->last_packets)<<(12 - idx);