diff mbox

[RESEND] iproute2 : invalid burst/cburst calculation with hrtimers

Message ID 200902021926.17768.denys@visp.net.lb
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Denys Fedoryshchenko Feb. 2, 2009, 5:26 p.m. UTC
------------->
iproute2 : invalid burst/cburst calculation with hrtimers

If hrtimers on, /proc/net/psched shows 4th variable 
as 1000000000
Because burst calculated by division rate to this variable, 
it will be almost always zero. As result, we will get higher system 
load on low rates, and on high rates shaper will not able to process 
data. So it is kind of critical bugfix for systems with hrtimers.
It is checked and proved. Core 2 Quad was not able to 
shape 200Mbps, and gave only 180-190. It is more safe to set it 
to 1000HZ. If user wants, he can set custom "env" HZ variable.

Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
---

-------------------------------------------------------

Comments

Stephen Hemminger Feb. 2, 2009, 6:07 p.m. UTC | #1
On Mon, 2 Feb 2009 19:26:17 +0200
Denys Fedoryschenko <denys@visp.net.lb> wrote:

> ------------->
> iproute2 : invalid burst/cburst calculation with hrtimers
> 
> If hrtimers on, /proc/net/psched shows 4th variable 
> as 1000000000
> Because burst calculated by division rate to this variable, 
> it will be almost always zero. As result, we will get higher system 
> load on low rates, and on high rates shaper will not able to process 
> data. So it is kind of critical bugfix for systems with hrtimers.
> It is checked and proved. Core 2 Quad was not able to 
> shape 200Mbps, and gave only 180-190. It is more safe to set it 
> to 1000HZ. If user wants, he can set custom "env" HZ variable.
> 
> Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
> ---
> 
> -------------------------------------------------------

I would rather this be converted to floating point.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko Feb. 2, 2009, 6:21 p.m. UTC | #2
Why floating point?
It seems get_hz (and burst in result) just cannot be too low.
Still seems some code in HTB rely on jiffies, so get_hz() probably just wrong.

Already i'm tried to increase burst/cburst calculation precision, but it will 
fail still with such high get_hz variable.

Ii dont know way to get real HZ variable from userspace.

Here is Martin Devera answer
>but it really
>seems as if get_hz() is too high.
>For 1000Mbps - it is 1MB per 1 ms and assuming HZ=1000, then burst
>should be about 1MB.
>But I must admit, I'm not familiar with latest state of HZ in kernel.
>There was "NO_NZ" effort IIRC, where the HZ granularity can be
>considerably finer - but still not infinite.



On Monday 02 February 2009 20:07:42 Stephen Hemminger wrote:
> On Mon, 2 Feb 2009 19:26:17 +0200
>
> Denys Fedoryschenko <denys@visp.net.lb> wrote:
> > ------------->
> > iproute2 : invalid burst/cburst calculation with hrtimers
> >
> > If hrtimers on, /proc/net/psched shows 4th variable
> > as 1000000000
> > Because burst calculated by division rate to this variable,
> > it will be almost always zero. As result, we will get higher system
> > load on low rates, and on high rates shaper will not able to process
> > data. So it is kind of critical bugfix for systems with hrtimers.
> > It is checked and proved. Core 2 Quad was not able to
> > shape 200Mbps, and gave only 180-190. It is more safe to set it
> > to 1000HZ. If user wants, he can set custom "env" HZ variable.
> >
> > Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
> > ---
> >
> > -------------------------------------------------------
>
> I would rather this be converted to floating point.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 4, 2009, 9:20 p.m. UTC | #3
Denys Fedoryschenko wrote, On 02/02/2009 07:21 PM:

> Why floating point?


Yes, floating point is not enough. get_hz() == 1000000000 could be
a theoretical resolution, but I guess we don't expect to use this,
even if it were possible. Since high resolution can schedule more
often than 1000HZ, it seems these buffers could be set lower (10x,
100x?), but there should be some reasonable limit.

Jarek P.

> It seems get_hz (and burst in result) just cannot be too low.
> Still seems some code in HTB rely on jiffies, so get_hz() probably just wrong.
> 
> Already i'm tried to increase burst/cburst calculation precision, but it will 
> fail still with such high get_hz variable.
> 
> Ii dont know way to get real HZ variable from userspace.
> 
> Here is Martin Devera answer
>> but it really
>> seems as if get_hz() is too high.
>> For 1000Mbps - it is 1MB per 1 ms and assuming HZ=1000, then burst
>> should be about 1MB.
>> But I must admit, I'm not familiar with latest state of HZ in kernel.
>> There was "NO_NZ" effort IIRC, where the HZ granularity can be
>> considerably finer - but still not infinite.
> 
> 
> 
> On Monday 02 February 2009 20:07:42 Stephen Hemminger wrote:
>> On Mon, 2 Feb 2009 19:26:17 +0200
>>
>> Denys Fedoryschenko <denys@visp.net.lb> wrote:
>>> ------------->
>>> iproute2 : invalid burst/cburst calculation with hrtimers
>>>
>>> If hrtimers on, /proc/net/psched shows 4th variable
>>> as 1000000000
>>> Because burst calculated by division rate to this variable,
>>> it will be almost always zero. As result, we will get higher system
>>> load on low rates, and on high rates shaper will not able to process
>>> data. So it is kind of critical bugfix for systems with hrtimers.
>>> It is checked and proved. Core 2 Quad was not able to
>>> shape 200Mbps, and gave only 180-190. It is more safe to set it
>>> to 1000HZ. If user wants, he can set custom "env" HZ variable.
>>>
>>> Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
>>> ---
>>>
>>> -------------------------------------------------------
>> I would rather this be converted to floating point.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 4, 2009, 9:53 p.m. UTC | #4
Jarek Poplawski wrote, On 02/04/2009 10:20 PM:

> Denys Fedoryschenko wrote, On 02/02/2009 07:21 PM:
> 
>> Why floating point?
> 
> 
> Yes, floating point is not enough. get_hz() == 1000000000 could be
> a theoretical resolution, but I guess we don't expect to use this,
> even if it were possible. Since high resolution can schedule more
> often than 1000HZ, it seems these buffers could be set lower (10x,
> 100x?), but there should be some reasonable limit.


Anyway, it's rather a matter of tweaking. IMHO 1000HZ limit for setting
these defaults looks safe, so I think this patch is OK.

Jarek P.
 
>> It seems get_hz (and burst in result) just cannot be too low.
>> Still seems some code in HTB rely on jiffies, so get_hz() probably just wrong.
>>
>> Already i'm tried to increase burst/cburst calculation precision, but it will 
>> fail still with such high get_hz variable.
>>
>> Ii dont know way to get real HZ variable from userspace.
>>
>> Here is Martin Devera answer
>>> but it really
>>> seems as if get_hz() is too high.
>>> For 1000Mbps - it is 1MB per 1 ms and assuming HZ=1000, then burst
>>> should be about 1MB.
>>> But I must admit, I'm not familiar with latest state of HZ in kernel.
>>> There was "NO_NZ" effort IIRC, where the HZ granularity can be
>>> considerably finer - but still not infinite.
>>
>>
>> On Monday 02 February 2009 20:07:42 Stephen Hemminger wrote:
>>> On Mon, 2 Feb 2009 19:26:17 +0200
>>>
>>> Denys Fedoryschenko <denys@visp.net.lb> wrote:
>>>> ------------->
>>>> iproute2 : invalid burst/cburst calculation with hrtimers
>>>>
>>>> If hrtimers on, /proc/net/psched shows 4th variable
>>>> as 1000000000
>>>> Because burst calculated by division rate to this variable,
>>>> it will be almost always zero. As result, we will get higher system
>>>> load on low rates, and on high rates shaper will not able to process
>>>> data. So it is kind of critical bugfix for systems with hrtimers.
>>>> It is checked and proved. Core 2 Quad was not able to
>>>> shape 200Mbps, and gave only 180-190. It is more safe to set it
>>>> to 1000HZ. If user wants, he can set custom "env" HZ variable.
>>>>
>>>> Signed-off-by: Denys Fedoryschenko <denys@visp.net.lb>
>>>> ---
>>>>
>>>> -------------------------------------------------------
>>> I would rather this be converted to floating point.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko Feb. 4, 2009, 10:26 p.m. UTC | #5
On Wednesday 04 February 2009 23:53:14 Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 02/04/2009 10:20 PM:
> > Denys Fedoryschenko wrote, On 02/02/2009 07:21 PM:
> >> Why floating point?
> >
> > Yes, floating point is not enough. get_hz() == 1000000000 could be
> > a theoretical resolution, but I guess we don't expect to use this,
> > even if it were possible. Since high resolution can schedule more
> > often than 1000HZ, it seems these buffers could be set lower (10x,
> > 100x?), but there should be some reasonable limit.
>
> Anyway, it's rather a matter of tweaking. IMHO 1000HZ limit for setting
> these defaults looks safe, so I think this patch is OK.
>
> Jarek P.
Maybe we can put note in documentation or even htb help output, that user can 
increase precision with hrtimers enabled, by setting HZ variable?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 4, 2009, 10:49 p.m. UTC | #6
On Thu, Feb 05, 2009 at 12:26:46AM +0200, Denys Fedoryschenko wrote:
...
> Maybe we can put note in documentation or even htb help output, that user can 
> increase precision with hrtimers enabled, by setting HZ variable?

Hmm... I rather meant using burst and cburst params. But hrtimers are
very hardware dependent, so it's hard to advise anything.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko Feb. 4, 2009, 11:24 p.m. UTC | #7
On Thursday 05 February 2009 00:49:29 Jarek Poplawski wrote:
> On Thu, Feb 05, 2009 at 12:26:46AM +0200, Denys Fedoryschenko wrote:
> ...
>
> > Maybe we can put note in documentation or even htb help output, that user
> > can increase precision with hrtimers enabled, by setting HZ variable?
>
> Hmm... I rather meant using burst and cburst params. But hrtimers are
> very hardware dependent, so it's hard to advise anything.
>
> Jarek P.

By default burst/cburst calculated depends on timer resolution, as i 
understood, by taking 4th parameter of /proc/net/psched, which as i 
understood - precision of hrtimer .

With hrtimers it is nanosecond resolution, precision is very high, but 
overhead too high too. And as i understand, maybe it can be problem, that htb 
still uses HZ variable in some cases like:
	return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ;


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 5, 2009, 7:38 a.m. UTC | #8
On Thu, Feb 05, 2009 at 01:24:05AM +0200, Denys Fedoryschenko wrote:
> On Thursday 05 February 2009 00:49:29 Jarek Poplawski wrote:
> > On Thu, Feb 05, 2009 at 12:26:46AM +0200, Denys Fedoryschenko wrote:
> > ...
> >
> > > Maybe we can put note in documentation or even htb help output, that user
> > > can increase precision with hrtimers enabled, by setting HZ variable?
> >
> > Hmm... I rather meant using burst and cburst params. But hrtimers are
> > very hardware dependent, so it's hard to advise anything.
> >
> > Jarek P.
> 
> By default burst/cburst calculated depends on timer resolution, as i 
> understood, by taking 4th parameter of /proc/net/psched, which as i 
> understood - precision of hrtimer .
> 
> With hrtimers it is nanosecond resolution, precision is very high, but 
> overhead too high too. And as i understand, maybe it can be problem, that htb 
> still uses HZ variable in some cases like:
> 	return q->now + 2 * PSCHED_TICKS_PER_SEC / HZ;

I think HZ doesn't matter much in this place, because if it's ever
triggered there is something really wrong and precision is lost.
Anyway, this place was changed a bit in net-next.

I guess I could misunderstand your previous question. If you mean this
previous thread, then yes, IMHO CONFIG_HZ should matter for precision
(especially places where similar loops use all such jiffies in "normal"
situations). On the other hand, maybe I'm wrong, but using HZ 1000
with HIGH_RES_TIMERS looks quite obvious to me...

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko Feb. 5, 2009, 9:01 a.m. UTC | #9
Well, i don't know well what's wrong, but:

I have enough powerful router, 2x Quad Core 2 - 2.8 Ghz
4 x e1000e

And if i just try to shape interface for 200 Mbps with default iproute2 from 
git (and latest stable release same situation), i am getting too low 
burst/cburst. Sometimes it is even miscalculated (it can become 1500 byte, 
sometimes, while 1600 is minimum) , because yes, Stephen right, it needs 
float.
BUT.
Even i set manually twice more burst/cburst, it is just not enough for 200 
Mbps. I am checking on Cisco, rate is reaching max 180-185Mbps, and even on 
this rate buffering in qdiscs and as result jittering hard.
So this 100 bytes, which probably missing, really doesn't matter.

With 1000 HZ i have
class htb 1:10 parent 1:2 rate 220000Kbit ceil 220000Kbit burst 29067b cburst 
29067b
 Sent 14027734219153 bytes 590886842 pkt (dropped 0, overlimits 0 requeues 0)
 rate 122452Kbit 17131pps backlog 0b 0p requeues 0
 lended: 171333348 borrowed: 0 giants: 0
 tokens: 980 ctokens: 980

And it works quite well and precise.

P.S. Maybe a case, when some guy was not able to reach his gigabit with HTB 
shaper, i remember there was flowing some discussing, was somehow similar as 
my case.

On Thursday 05 February 2009 09:38:32 Jarek Poplawski wrote:
>
> I think HZ doesn't matter much in this place, because if it's ever
> triggered there is something really wrong and precision is lost.
> Anyway, this place was changed a bit in net-next.
>
> I guess I could misunderstand your previous question. If you mean this
> previous thread, then yes, IMHO CONFIG_HZ should matter for precision
> (especially places where similar loops use all such jiffies in "normal"
> situations). On the other hand, maybe I'm wrong, but using HZ 1000
> with HIGH_RES_TIMERS looks quite obvious to me...
>
> Jarek P.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 5, 2009, 10:08 a.m. UTC | #10
On Thu, Feb 05, 2009 at 11:01:35AM +0200, Denys Fedoryschenko wrote:
> Well, i don't know well what's wrong, but:
> 
> I have enough powerful router, 2x Quad Core 2 - 2.8 Ghz
> 4 x e1000e
> 
> And if i just try to shape interface for 200 Mbps with default iproute2 from 
> git (and latest stable release same situation), i am getting too low 
> burst/cburst. Sometimes it is even miscalculated (it can become 1500 byte, 
> sometimes, while 1600 is minimum) , because yes, Stephen right, it needs 
> float.
> BUT.
> Even i set manually twice more burst/cburst, it is just not enough for 200 
> Mbps. I am checking on Cisco, rate is reaching max 180-185Mbps, and even on 
> this rate buffering in qdiscs and as result jittering hard.
> So this 100 bytes, which probably missing, really doesn't matter.
> 
> With 1000 HZ i have
> class htb 1:10 parent 1:2 rate 220000Kbit ceil 220000Kbit burst 29067b cburst 
> 29067b
>  Sent 14027734219153 bytes 590886842 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 122452Kbit 17131pps backlog 0b 0p requeues 0
>  lended: 171333348 borrowed: 0 giants: 0
>  tokens: 980 ctokens: 980
> 
> And it works quite well and precise.

I lost your point... I think it's just like in your (and Martin's)
description to your patch. Plus you wrote it was tested OK?!

BTW, did you try to set the same (29kb) burst/cburst manually? Did
you try with half of it (15kb)?

> P.S. Maybe a case, when some guy was not able to reach his gigabit with HTB 
> shaper, i remember there was flowing some discussing, was somehow similar as 
> my case.

I don't remember this, but with current defaults it's really a problem
for higher rates.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denys Fedoryshchenko Feb. 5, 2009, 11:09 a.m. UTC | #11
On Thursday 05 February 2009 12:08:53 Jarek Poplawski wrote:
>
> I lost your point... I think it's just like in your (and Martin's)
> description to your patch. Plus you wrote it was tested OK?!
Martin just pointed me, that burst/cburst probably is too small.

I found, that it is too small because of get_hz(). Actually on highres it will 
be always 1600 (default minimal value).

>
> BTW, did you try to set the same (29kb) burst/cburst manually? Did
> you try with half of it (15kb)?
I didn't, i guess i must calculate corresponding rates for another classes 
equally. But i try to set HZ variable to 2000 for example, i will see:

class htb 1:10 parent 1:2 rate 220000Kbit ceil 220000Kbit burst 15317b cburst 
15317b
 Sent 100864803 bytes 115615 pkt (dropped 0, overlimits 0 requeues 0)
 rate 15426Kbit 2106pps backlog 0b 0p requeues 0
 lended: 29467 borrowed: 0 giants: 0
 tokens: 533 ctokens: 533

But it is difficult to know, if it works fine or not. I will try to do more 
tests for 2000 and 10000 HZ. The issue, that on that point where i had fault, 
now is less load, so i have to find another place where i can test 
differences.

>
> > P.S. Maybe a case, when some guy was not able to reach his gigabit with
> > HTB shaper, i remember there was flowing some discussing, was somehow
> > similar as my case.
>
> I don't remember this, but with current defaults it's really a problem
> for higher rates.
>
> Jarek P.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski Feb. 5, 2009, 11:56 a.m. UTC | #12
On Thu, Feb 05, 2009 at 01:09:06PM +0200, Denys Fedoryschenko wrote:
> On Thursday 05 February 2009 12:08:53 Jarek Poplawski wrote:
> >
> > I lost your point... I think it's just like in your (and Martin's)
> > description to your patch. Plus you wrote it was tested OK?!
> Martin just pointed me, that burst/cburst probably is too small.
> 
> I found, that it is too small because of get_hz(). Actually on highres it will 
> be always 1600 (default minimal value).

And these are very good points.

> > BTW, did you try to set the same (29kb) burst/cburst manually? Did
> > you try with half of it (15kb)?
> I didn't, i guess i must calculate corresponding rates for another classes 
> equally. But i try to set HZ variable to 2000 for example, i will see:
> 
> class htb 1:10 parent 1:2 rate 220000Kbit ceil 220000Kbit burst 15317b cburst 
> 15317b
>  Sent 100864803 bytes 115615 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 15426Kbit 2106pps backlog 0b 0p requeues 0
>  lended: 29467 borrowed: 0 giants: 0
>  tokens: 533 ctokens: 533
> 
> But it is difficult to know, if it works fine or not. I will try to do more 
> tests for 2000 and 10000 HZ. The issue, that on that point where i had fault, 
> now is less load, so i have to find another place where i can test 
> differences.

I don't know where else in tc this HZ env. variable is used, so this
could change more than you need. I only wondered if setting this
manually works, so one try should be enough.

Since you tested with double size burst (~3kb?) and it didn't work,
then we know even 10000 (10x) get_hz() would be wrong. This 2x case
is only for curiosity but I think there is no reason to use more than
1000HZ for setting these defaults in tc.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -Naur iproute2/tc/q_htb.c iproute2-new/tc/q_htb.c
--- iproute2/tc/q_htb.c	2009-01-22 11:10:36.000000000 +0200
+++ iproute2-new/tc/q_htb.c	2009-01-22 11:31:40.000000000 +0200
@@ -110,6 +110,7 @@ 
 	unsigned mtu;
 	unsigned short mpu = 0;
 	unsigned short overhead = 0;
+	unsigned int bursthz;
 	unsigned int linklayer  = LINKLAYER_ETHERNET; /* Assume ethernet */
 	struct rtattr *tail;
 
@@ -209,9 +210,16 @@ 
 	if (!opt.ceil.rate) opt.ceil = opt.rate;
 
 	/* compute minimal allowed burst from rate; mtu is added here to make
-	   sute that buffer is larger than mtu and to have some safeguard space */
-	if (!buffer) buffer = opt.rate.rate / get_hz() + mtu;
-	if (!cbuffer) cbuffer = opt.ceil.rate / get_hz() + mtu;
+	   sure that buffer is larger than mtu and to have some safeguard space 
+	   With hrtimers enabled, hz variable gets too high resolution. If we
+	   use it for computing burst, burst will be too low. In worst case 
+	   we will not reach specified rate/ceil, in best - load will be too 
+	   high.
+	*/
+	bursthz = get_hz() == 1000000000 ? 1000 : get_hz();	
+
+	if (!buffer) buffer = opt.rate.rate / bursthz + mtu;
+	if (!cbuffer) cbuffer = opt.ceil.rate / bursthz + mtu;
 
 	opt.ceil.overhead = overhead;
 	opt.rate.overhead = overhead;