
[iproute2] Re: HTB accuracy for high speed

Message ID 20090530200756.GF3166@ami.dom.local
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Jarek Poplawski May 30, 2009, 8:07 p.m. UTC
On Fri, May 29, 2009 at 06:02:39PM +0100, Antonio Almeida wrote:
> On Thu, May 28, 2009 at 10:12 PM, Jarek Poplawski wrote:
...
> > -#define PSCHED_US2NS(x)                        ((s64)(x) << 10)
> > -#define PSCHED_NS2US(x)                        ((x) >> 10)
> > +#define PSCHED_US2NS(x)                        ((s64)(x) << 6)
> > +#define PSCHED_NS2US(x)                        ((x) >> 6)
> >
> >  #define PSCHED_TICKS_PER_SEC           PSCHED_NS2US(NSEC_PER_SEC)
> >  #define PSCHED_PASTPERFECT             0
> 
> It's better! This patch gives HTB more accuracy. Here are some values:
> Note that these are boundary values, so, e.g., any HTB configuration
> between 377000Kbit and 400000Kbit falls into the same step - close
> to 397977Kbit.
> This test was made under the same conditions: generating 950Mbit/s of
> unidirectional TCP traffic with 800-byte packets.

Here is a tc patch which should minimize these boundaries, so please
repeat this test with the previous patches/conditions plus this one.

Thanks,
Jarek P.
---

 tc/tc_core.c |   10 +++++-----
 tc/tc_core.h |    4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Antonio Almeida June 2, 2009, 10:12 a.m. UTC | #1
On Sat, May 30, 2009 at 9:07 PM, Jarek Poplawski wrote:
> Here is a tc patch which should minimize these boundaries, so please
> repeat this test with the previous patches/conditions plus this one.
>
> Thanks,
> Jarek P.
> ---
>
>  tc/tc_core.c |   10 +++++-----
>  tc/tc_core.h |    4 ++--
>  2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/tc/tc_core.c b/tc/tc_core.c
> index 9a0ff39..6d74287 100644
> --- a/tc/tc_core.c
> +++ b/tc/tc_core.c
> @@ -27,18 +27,18 @@
>  static double tick_in_usec = 1;
>  static double clock_factor = 1;
>
> -int tc_core_time2big(unsigned time)
> +int tc_core_time2big(double time)
>  {
> -       __u64 t = time;
> +       __u64 t;
>
> -       t *= tick_in_usec;
> +       t = time * tick_in_usec + 0.5;
>        return (t >> 32) != 0;
>  }
>
>
> -unsigned tc_core_time2tick(unsigned time)
> +unsigned tc_core_time2tick(double time)
>  {
> -       return time*tick_in_usec;
> +       return time * tick_in_usec + 0.5;
>  }
>
>  unsigned tc_core_tick2time(unsigned tick)
> diff --git a/tc/tc_core.h b/tc/tc_core.h
> index 5a693ba..0ac65aa 100644
> --- a/tc/tc_core.h
> +++ b/tc/tc_core.h
> @@ -13,8 +13,8 @@ enum link_layer {
>  };
>
>
> -int  tc_core_time2big(unsigned time);
> -unsigned tc_core_time2tick(unsigned time);
> +int  tc_core_time2big(double time);
> +unsigned tc_core_time2tick(double time);
>  unsigned tc_core_tick2time(unsigned tick);
>  unsigned tc_core_time2ktime(unsigned time);
>  unsigned tc_core_ktime2time(unsigned ktime);
>

I'm getting great values with this patch!

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70970b/8 mpu 0b overhead 0b cburst
70970b/8 mpu 0b overhead 0b level 0
 Sent 14270693572 bytes 17928007 pkt (dropped 12579262, overlimits 0 requeues 0)
 rate 552755Kbit 86802pps backlog 0b 127p requeues 0
 lended: 17927880 borrowed: 0 giants: 0
 tokens: -16095 ctokens: -16095

(for packets of 800 bytes)
I'll get back to you with more values.

  Antonio Almeida
Antonio Almeida June 2, 2009, 11:45 a.m. UTC | #2
On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
> I'm getting great values with this patch!
>...
> I'll get back to you with more values.

The steps are much smaller and the error stays below 1%.
Injecting over 950Mbps of TCP traffic with 800-byte packets I get these values:

Configuration	Sent rate		error (%)
498000Kbit	495023Kbit	0.60
499000Kbit	497456Kbit	0.31
500000Kbit	497498Kbit	0.50
501000Kbit	497496Kbit	0.70
502000Kbit	499986Kbit	0.40
503000Kbit	499978Kbit	0.60
504000Kbit	502520Kbit	0.29

696000Kbit	690964Kbit	0.72
697000Kbit	695782Kbit	0.17
698000Kbit	695783Kbit	0.32
699000Kbit	695783Kbit	0.46
700000Kbit	695795Kbit	0.60
701000Kbit	695786Kbit	0.74
702000Kbit	700703Kbit	0.18

896000Kbit	888383Kbit	0.85
897000Kbit	896289Kbit	0.08
904000Kbit	896389Kbit	0.84
905000Kbit	904542Kbit	0.05

  Antonio Almeida
Jarek Poplawski June 2, 2009, 12:36 p.m. UTC | #3
On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
> > I'm getting great values with this patch!
> >...
> > I'll get back to you with more values.
> 
> The steps are much smaller and the error stays below 1%.
> Injecting over 950Mbps of TCP traffic with 800-byte packets I get these values:

Nice values - should be acceptable, I guess. Alas this is not all, and
I'll ask you soon for re-testing HFSC (after another patch) or maybe
even some simple CBQ setup ;-)

Thank you very much for testing,
Jarek P.

> 
> Configuration	Sent rate		error (%)
> 498000Kbit	495023Kbit	0.60
> 499000Kbit	497456Kbit	0.31
> 500000Kbit	497498Kbit	0.50
> 501000Kbit	497496Kbit	0.70
> 502000Kbit	499986Kbit	0.40
> 503000Kbit	499978Kbit	0.60
> 504000Kbit	502520Kbit	0.29
> 
> 696000Kbit	690964Kbit	0.72
> 697000Kbit	695782Kbit	0.17
> 698000Kbit	695783Kbit	0.32
> 699000Kbit	695783Kbit	0.46
> 700000Kbit	695795Kbit	0.60
> 701000Kbit	695786Kbit	0.74
> 702000Kbit	700703Kbit	0.18
> 
> 896000Kbit	888383Kbit	0.85
> 897000Kbit	896289Kbit	0.08
> 904000Kbit	896389Kbit	0.84
> 905000Kbit	904542Kbit	0.05
> 
>   Antonio Almeida
Patrick McHardy June 2, 2009, 12:45 p.m. UTC | #4
Jarek Poplawski wrote:
> On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
>> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
>>> I'm getting great values with this patch!
>>> ...
>>> I'll get back to you with more values.
>> The steps are much smaller and the error stays below 1%.
>> Injecting over 950Mbps of TCP traffic with 800-byte packets I get these values:
> 
> Nice values - should be acceptable, I guess. Alas this is not all, and
> I'll ask you soon for re-testing HFSC (after another patch) or maybe
> even some simple CBQ setup ;-)

I didn't follow the full discussion, so I'm not sure which kind of
arithmetic error you're attempting to cure. For the HFSC scaling
factors, please just keep in mind that it's also supposed to be
very accurate at low bandwidths.

Jarek Poplawski June 2, 2009, 1:08 p.m. UTC | #5
On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
>>> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
>>>> I'm getting great values with this patch!
>>>> ...
>>>> I'll get back to you with more values.
>>> The steps are much smaller and the error stays below 1%.
>>> Injecting over 950Mbps of TCP traffic with 800-byte packets I get these values:
>>
>> Nice values - should be acceptable, I guess. Alas this is not all, and
>> I'll ask you soon for re-testing HFSC (after another patch) or maybe
>> even some simple CBQ setup ;-)
>
> I didn't follow the full discussion, so I'm not sure which kind of
> arithmetic error you're attempting to cure. For the HFSC scaling
> factors, please just keep in mind that it's also supposed to be
> very accurate at low bandwidths.

It's all here:

http://permalink.gmane.org/gmane.linux.network/129301

Of course, I'd appreciate any suggestions.

Thanks,
Jarek P.
Patrick McHardy June 2, 2009, 1:20 p.m. UTC | #6
Jarek Poplawski wrote:
> On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
>> I didn't follow the full discussion, so I'm not sure which kind of
>> arithmetic error you're attempting to cure. For the HFSC scaling
>> factors, please just keep in mind that it's also supposed to be
>> very accurate at low bandwidths.
> 
> It's all here:
> 
> http://permalink.gmane.org/gmane.linux.network/129301

I've read through the mails where you suggested changing the scaling
factors. I wasn't able to find the reasoning (IOW: where does it
overflow or lose precision in which case) though.

> Of course, I'd appreciate any suggestions.

The HFSC shifts would indeed need adjustments if the US<->NS conversion
factor were to change.

Jarek Poplawski June 2, 2009, 9:37 p.m. UTC | #7
On Tue, Jun 02, 2009 at 03:20:20PM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
>>> I didn't follow the full discussion, so I'm not sure which kind of
>>> arithmetic error you're attempting to cure. For the HFSC scaling
>>> factors, please just keep in mind that it's also supposed to be
>>> very accurate at low bandwidths.
>>
>> It's all here:
>>
>> http://permalink.gmane.org/gmane.linux.network/129301
>
> I've read through the mails where you suggested changing the scaling
> factors. I wasn't able to find the reasoning (IOW: where does it
> overflow or lose precision in which case) though.

I described the reasoning here:
http://permalink.gmane.org/gmane.linux.network/128189

Of course, we could try some solution other than changing the scaling.
I considered doing it internally in HTB, even skipping rate tables,
but changing the scaling seems to be the most generic way (alas, there
are some odd compatibility issues in iproute/tc, like
TIME_UNITS_PER_SEC or "if (nom == 1000000)", that keep it from being
really consistent/readable).

Jarek P.
Jarek Poplawski June 2, 2009, 9:50 p.m. UTC | #8
Jarek Poplawski wrote, On 06/02/2009 11:37 PM:
...

> I described the reasoning here:
> http://permalink.gmane.org/gmane.linux.network/128189

The link is stuck now, so here is a quote:

Jarek Poplawski wrote, On 05/17/2009 10:15 PM:

> Here is some additional explanation. It looks like these rates above
> 500Mbit hit the design limits of packet scheduling. Currently used
> internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
> with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
> 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
> we leave 0.6*85938 = 51563 ticks per second, allowing additional
> sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
> it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
> rates, and the effect rises for higher rates.


Jarek P.
Patrick McHardy June 3, 2009, 7:06 a.m. UTC | #9
Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 06/02/2009 11:37 PM:
> ...
> 
>> I described the reasoning here:
>> http://permalink.gmane.org/gmane.linux.network/128189
> 
> The link is stuck now, so here is a quote:

Thanks.

> Jarek Poplawski wrote, On 05/17/2009 10:15 PM:
> 
>> Here is some additional explanation. It looks like these rates above
>> 500Mbit hit the design limits of packet scheduling. Currently used
>> internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
>> with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
>> 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
>> we leave 0.6*85938 = 51563 ticks per second, letting for additional
>> sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
>> it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
>> rates, and the effect rises for higher rates.

I see. Unfortunately, changing the scaling factors pushes the lower
end towards overflowing. For example, Denys Fedoryshchenko reported
breakage a few years ago, triggered by this command, when I changed
the iproute-internal factors:

.. tbf buffer 1024kb latency 500ms rate 128kbit peakrate 256kbit 
minburst 16384

The burst size calculated by TBF with the current parameters is
64000000. Increasing it by a factor of 16, as in your patch, results
in 1024000000, which means we're getting dangerously close to
overflowing: a buffer-size increase or a rate decrease of slightly
more than a factor of 4 will already overflow.

Mid-term we really need to move to 64-bit values and ns resolution;
otherwise this problem will just reappear as soon as someone tries
10Gbit. I'm not sure what the best short-term fix is; I feel a bit
uneasy about changing the current factors given how close this brings
us to overflowing.
Jarek Poplawski June 3, 2009, 7:40 a.m. UTC | #10
On Wed, Jun 03, 2009 at 09:06:37AM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> Jarek Poplawski wrote, On 06/02/2009 11:37 PM:
>> ...
>>
>>> I described the reasoning here:
>>> http://permalink.gmane.org/gmane.linux.network/128189
>>
>> The link is stuck now, so here is a quote:
>
> Thanks.
>
>> Jarek Poplawski wrote, On 05/17/2009 10:15 PM:
>>
>>> Here is some additional explanation. It looks like these rates above
>>> 500Mbit hit the design limits of packet scheduling. Currently used
>>> internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
>>> with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
>>> 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
>>> we leave 0.6*85938 = 51563 ticks per second, letting for additional
>>> sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
>>> it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
>>> rates, and the effect rises for higher rates.
>
> I see. Unfortunately changing the scaling factors is pushing the lower
> end towards overflowing. For example Denys Fedoryshchenko reported some
> breakage a few years ago when I changed the iproute-internal factors
> triggered by this command:
>
> .. tbf buffer 1024kb latency 500ms rate 128kbit peakrate 256kbit  
> minburst 16384
>
> The burst size calculated by TBF with the current parameters is
> 64000000. Increasing it by a factor of 16 as in your patch results
> in 1024000000. Which means we're getting dangerously close to
> overflowing, a buffer size increase or a rate decrease of slightly
> bigger than factor 4 will already overflow.
>
> Mid-term we really need to move to 64 bit values and ns resolution,
> otherwise this problem is just going to reappear as soon as someone
> tries 10gbit. Not sure what the best short term fix is, I feel a bit
> uneasy about changing the current factors given how close this brings
> us towards overflowing.

I completely agree it's on the verge of overflow, and it actually
would overflow for some insanely low (by today's standards) rates. So
I treat it as a temporary solution, until people start asking about
more than 1 or 2Gbit. And of course we will have to move to 64 bit
anyway. Or we can do it now...

Btw., I have some doubts about HFSC; it's really different from the
others wrt. rate tables/time accounting, and these PSCHED_TICKS look
like unnecessary compatibility only; it works OK with usecs and
doesn't need this change now, unless I'm missing something. So maybe
we should simply stop using the common psched_get_time() for it, and
only do a conversion for qdisc_watchdog_schedule() etc.?

Thanks,
Jarek P.
Patrick McHardy June 3, 2009, 7:53 a.m. UTC | #11
Jarek Poplawski wrote:
> On Wed, Jun 03, 2009 at 09:06:37AM +0200, Patrick McHardy wrote:
>> Mid-term we really need to move to 64 bit values and ns resolution,
>> otherwise this problem is just going to reappear as soon as someone
>> tries 10gbit. Not sure what the best short term fix is, I feel a bit
>> uneasy about changing the current factors given how close this brings
>> us towards overflowing.
> 
> I completely agree it's on the verge of overflow, and actually would
> overflow for some insanely low (for today's standards) rates. So I
> treat it as a temporary solution, until people start asking about
> more than 1 or 2Gbit. And of course we will have to move to 64 bit
> anyway. Or we can do it now...

That (now) would certainly be the best solution, but it's a
non-trivial task since all the ABIs use 32-bit values.

> Btw., I've some doubts about HFSC; it's really different than others
> wrt. rate tables/time accounting, and these PSCHED_TICKS look only
> like an unnecessary compatibility; it works OK with usecs and doesn't
> need this change now, unless I miss something. So maybe we would
> simply stop using common psched_get_time() for it, and only do a
> conversion for qdisc_watchdog_schedule() etc.?

Yes, it would work perfectly fine with usecs, which is actually (and
unfortunately) the unit it uses in its ABI. But I think it's better
to convert the values once during initialization, instead of again
and again when scheduling the watchdog. The necessary changes are
really trivial: all you need to do when changing the scaling factors
is to increase SM_MASK and decrease ISM_MASK accordingly.
Jarek Poplawski June 3, 2009, 8:01 a.m. UTC | #12
On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
...
> Yes, it would work perfectly fine with usecs, which is actually (and
> unfortunately) the unit it uses in its ABI. But I think it's better
> to convert the values once during initialization, instead of again
> and again when scheduling the watchdog. The necessary changes are
> really trivial, all you need to do when changing the scaling factors
> is to increase SM_MASK and decrease ISM_MASK accordingly.

Right! (On the other hand we could consider a separate watchdog too...)

Jarek P.
Patrick McHardy June 3, 2009, 8:29 a.m. UTC | #13
Jarek Poplawski wrote:
> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
> ...
>> Yes, it would work perfectly fine with usecs, which is actually (and
>> unfortunately) the unit it uses in its ABI. But I think it's better
>> to convert the values once during initialization, instead of again
>> and again when scheduling the watchdog. The necessary changes are
>> really trivial, all you need to do when changing the scaling factors
>> is to increase SM_MASK and decrease ISM_MASK accordingly.
> 
> Right! (On the other hand we could consider a separate watchdog too...)

We could :) But I don't see any benefit in doing that, especially
given that eventually everything should be using ns resolution anyway.
Jarek Poplawski June 3, 2009, 8:45 a.m. UTC | #14
On Wed, Jun 03, 2009 at 10:29:58AM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
>> ...
>>> Yes, it would work perfectly fine with usecs, which is actually (and
>>> unfortunately) the unit it uses in its ABI. But I think it's better
>>> to convert the values once during initialization, instead of again
>>> and again when scheduling the watchdog. The necessary changes are
>>> really trivial, all you need to do when changing the scaling factors
>>> is to increase SM_MASK and decrease ISM_MASK accordingly.
>>
>> Right! (On the other hand we could consider a separate watchdog too...)
>
> We could :) But I don't see any benefit doing that, especially given
> that eventually everything should be using ns resolution anyways.

The main benefit would be readability... I guess it's no problem for
you, but I'm currently trying to make sure things like this are/will
be OK :-)

        dx = ((u64)d * PSCHED_TICKS_PER_SEC);
        dx += USEC_PER_SEC - 1;
 
Jarek P.
David Miller June 4, 2009, 4:53 a.m. UTC | #15
From: Patrick McHardy <kaber@trash.net>
Date: Wed, 03 Jun 2009 09:53:11 +0200

> Jarek Poplawski wrote:
>> On Wed, Jun 03, 2009 at 09:06:37AM +0200, Patrick McHardy wrote:
>>> Mid-term we really need to move to 64 bit values and ns resolution,
>>> otherwise this problem is just going to reappear as soon as someone
>>> tries 10gbit. Not sure what the best short term fix is, I feel a bit
>>> uneasy about changing the current factors given how close this brings
>>> us towards overflowing.
>> I completely agree it's on the verge of overflow, and actually would
>> overflow for some insanely low (for today's standards) rates. So I
>> treat it as a temporary solution, until people start asking about
>> more than 1 or 2Gbit. And of course we will have to move to 64 bit
>> anyway. Or we can do it now...
> 
> That (now) would certainly be the best solution, but it's a non-trivial
> task since all the ABIs use 32 bit values.

We could pass in a new attribute which provides the upper 32 bits
of the value.  I'm not sure if that works in this case, but it's
an idea.

Jarek Poplawski June 4, 2009, 7:50 a.m. UTC | #16
On Wed, Jun 03, 2009 at 09:53:14PM -0700, David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Wed, 03 Jun 2009 09:53:11 +0200
> 
> > Jarek Poplawski wrote:
> >> On Wed, Jun 03, 2009 at 09:06:37AM +0200, Patrick McHardy wrote:
> >>> Mid-term we really need to move to 64 bit values and ns resolution,
> >>> otherwise this problem is just going to reappear as soon as someone
> >>> tries 10gbit. Not sure what the best short term fix is, I feel a bit
> >>> uneasy about changing the current factors given how close this brings
> >>> us towards overflowing.
> >> I completely agree it's on the verge of overflow, and actually would
> >> overflow for some insanely low (for today's standards) rates. So I
> >> treat it as a temporary solution, until people start asking about
> >> more than 1 or 2Gbit. And of course we will have to move to 64 bit
> >> anyway. Or we can do it now...
> > 
> > That (now) would certainly be the best solution, but it's a non-trivial
> > task since all the ABIs use 32 bit values.
> 
> We could pass in a new attribute which provides the upper-32bits
> of the value.  I'm not sure if that works in this case but it's
> an idea.

I'm not sure it could be that simple: I guess Patrick is concerned
about a new tc talking to an old kernel (otherwise the kernel should
recognize an old format). Then it would need something reasonable in 32 bits.

But I'm not even sure we need 64-bit rate tables. We could
alternatively (after checking that the kernel can handle this) simply
use a shift to expand these values in the kernel to u64:
- static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
+ static inline u64 qdisc_l2t(struct qdisc_rate_table* rtab, unsigned int pktlen)
  {
	...
-        return rtab->data[slot];
+        return rtab->data[slot] << rtab->rate.rate_log;
  }

Since these overflows are for low rates, the rounding away of the
lower bits shouldn't matter here. So, IMHO, it's more a question of
whether to add the overhead of u64 to the kernel now.

Jarek P.

Patch

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..6d74287 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -27,18 +27,18 @@ 
 static double tick_in_usec = 1;
 static double clock_factor = 1;
 
-int tc_core_time2big(unsigned time)
+int tc_core_time2big(double time)
 {
-	__u64 t = time;
+	__u64 t;
 
-	t *= tick_in_usec;
+	t = time * tick_in_usec + 0.5;
 	return (t >> 32) != 0;
 }
 
 
-unsigned tc_core_time2tick(unsigned time)
+unsigned tc_core_time2tick(double time)
 {
-	return time*tick_in_usec;
+	return time * tick_in_usec + 0.5;
 }
 
 unsigned tc_core_tick2time(unsigned tick)
diff --git a/tc/tc_core.h b/tc/tc_core.h
index 5a693ba..0ac65aa 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -13,8 +13,8 @@  enum link_layer {
 };
 
 
-int  tc_core_time2big(unsigned time);
-unsigned tc_core_time2tick(unsigned time);
+int  tc_core_time2big(double time);
+unsigned tc_core_time2tick(double time);
 unsigned tc_core_tick2time(unsigned tick);
 unsigned tc_core_time2ktime(unsigned time);
 unsigned tc_core_ktime2time(unsigned ktime);