Patchwork pkt_sched: gen_estimator: use 64 bits intermediate counters for bps

login
register
mail settings
Submitter Eric Dumazet
Date May 18, 2009, 11:59 p.m.
Message ID <4A11F67B.3050805@cosmosbay.com>
Download mbox | patch
Permalink /patch/27383/
State Accepted
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - May 18, 2009, 11:59 p.m.
David Miller a écrit :
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 18 May 2009 19:23:49 +0200
> 
>> On Mon, May 18, 2009 at 06:40:56PM +0200, Eric Dumazet wrote:
>>> With a typical estimator "1sec 8sec", ewma_log value is 3
>>>
>>> At gigabit speeds, we are very close to overflow yes, since
>>> we only have 27 bits available, so 134217728 bytes per second
>>> or 1073741824 bits per second.
>>>
>>> So formula :
>>> e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>>> is going to overflow.
>>>
>>> One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec" 
>>>
>>> Or use a 64bits rate & avbps, this is needed fo 10Gb speeds I suppose...
>> Yes, I considered this too, but because of an overhead I decided to
>> fix as designed (according to the comment) for now. But probably you
>> are right, and we should go further, so I'm OK with your patch.
> 
> I like this patch too, Eric can you submit this formally with
> proper signoffs etc.?
> 

Sure, here it is. We might need a similar patch to get a correct pps value
too, since we currently are limited to ~ 2^21 packets per second.

[PATCH] pkt_sched: gen_estimator: use 64 bit intermediate counters for bps

gen_estimator can overflow bps (bytes per second) with Gb links, while
it was designed with a u32 API, with a theorical limit of 34360Mbit (2^32 bytes)

Using 64 bit intermediate avbps/brate counters can allow us to reach this
theorical limit.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - May 19, 2009, 2:27 a.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 19 May 2009 01:59:55 +0200

> Sure, here it is.

Applied, thanks!

> We might need a similar patch to get a correct pps value
> too, since we currently are limited to ~ 2^21 packets per second.

True, but it is a less urgent issue than bps overflow.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 7:02 a.m.
On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
...
> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
...
> -		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
> +		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;

Btw., I'm a bit concerned about the syntax here: isn't such shifting
of signed ints implementation dependant?

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - May 19, 2009, 7:31 a.m.
Jarek Poplawski a écrit :
> On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
> ...
>> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> ...
>> -		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>> +		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
> 
> Btw., I'm a bit concerned about the syntax here: isn't such shifting
> of signed ints implementation dependant?
> 

You are right Jarek, I very often forget to never ever use signed quantities
at all ! (But also note original code has same undefined behavior)


Quoting wikipedia : (http://en.wikipedia.org/wiki/Arithmetic_shift)

The (1999) ISO standard for the, C programming language defines the C language's 
right shift operator in terms of divisions by powers of 2. Because of the 
aforementioned non-equivalence, the standard explicitly excludes from that
 definition the right shifts of signed numbers that have negative values.
 It doesn't specify the behaviour of the right shift operator in such circumstances,
 but instead requires each individual C compiler to specify the behaviour of shifting 
negative values right.

Apparently gcc does the *right* thing on x86_32, but we probably want something
stronger here. I could not find gcc documentation statement on right shifts of 
negative values.


 436:   8b 4b 14                mov    0x14(%ebx),%ecx
 439:   89 73 18                mov    %esi,0x18(%ebx)
 43c:   89 7b 1c                mov    %edi,0x1c(%ebx)
 43f:   8b 73 20                mov    0x20(%ebx),%esi
 442:   8b 7b 24                mov    0x24(%ebx),%edi
 445:   29 f0                   sub    %esi,%eax
 447:   19 fa                   sbb    %edi,%edx
 449:   0f ad d0                shrd   %cl,%edx,%eax
 44c:   d3 fa                   sar    %cl,%edx         << good >>
 44e:   f6 c1 20                test   $0x20,%cl
 451:   74 05                   je     458 <est_timer+0xb8>
 453:   89 d0                   mov    %edx,%eax
 455:   c1 fa 1f                sar    $0x1f,%edx       
 458:   01 f0                   add    %esi,%eax
 45a:   8b 4b 0c                mov    0xc(%ebx),%ecx
 45d:   89 43 20                mov    %eax,0x20(%ebx)
 460:   11 fa                   adc    %edi,%edx
 462:   83 c0 0f                add    $0xf,%eax
 465:   89 53 24                mov    %edx,0x24(%ebx)
 468:   83 d2 00                adc    $0x0,%edx
 46b:   0f ac d0 05             shrd   $0x5,%edx,%eax
 46f:   89 01                   mov    %eax,(%ecx)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 7:42 a.m.
On Tue, May 19, 2009 at 09:31:36AM +0200, Eric Dumazet wrote:
> Jarek Poplawski a écrit :
> > On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
> > ...
> >> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> > ...
> >> -		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
> >> +		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
> > 
> > Btw., I'm a bit concerned about the syntax here: isn't such shifting
> > of signed ints implementation dependant?
> > 
> 
> You are right Jarek, I very often forget to never ever use signed quantities
> at all ! (But also note original code has same undefined behavior)

Sure, I've meant the original code including 5 lines below.

> Apparently gcc does the *right* thing on x86_32, but we probably want something
> stronger here. I could not find gcc documentation statement on right shifts of 
> negative values.

I guess gcc and most of others do this "right"; but it looks
"unkosher" anyway.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 7:57 a.m.
On Tue, May 19, 2009 at 07:42:47AM +0000, Jarek Poplawski wrote:
> On Tue, May 19, 2009 at 09:31:36AM +0200, Eric Dumazet wrote:
> > Jarek Poplawski a écrit :
> > > On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
> > > ...
> > >> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> > > ...
> > >> -		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
> > >> +		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
> > > 
> > > Btw., I'm a bit concerned about the syntax here: isn't such shifting
> > > of signed ints implementation dependant?
> > > 
> > 
> > You are right Jarek, I very often forget to never ever use signed quantities
> > at all ! (But also note original code has same undefined behavior)
> 
> Sure, I've meant the original code including 5 lines below.
> 
> > Apparently gcc does the *right* thing on x86_32, but we probably want something
> > stronger here. I could not find gcc documentation statement on right shifts of 
> > negative values.
> 
> I guess gcc and most of others do this "right"; but it looks
> "unkosher" anyway.

I might have missed your point here, but would it be so costly to do
these shifts separately here?

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - May 19, 2009, 8:18 a.m.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 19 May 2009 09:31:36 +0200

> Apparently gcc does the *right* thing on x86_32, but we probably
> want something stronger here. I could not find gcc documentation
> statement on right shifts of negative values.

It emits an "arithmetic shift right" for every CPU I've ever checked.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - May 19, 2009, 6:03 p.m.
Jarek Poplawski a écrit :
> On Tue, May 19, 2009 at 07:42:47AM +0000, Jarek Poplawski wrote:
>> On Tue, May 19, 2009 at 09:31:36AM +0200, Eric Dumazet wrote:
>>> Jarek Poplawski a écrit :
>>>> On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
>>>> ...
>>>>> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
>>>> ...
>>>>> -		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>>>>> +		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
>>>> Btw., I'm a bit concerned about the syntax here: isn't such shifting
>>>> of signed ints implementation dependant?
>>>>
>>> You are right Jarek, I very often forget to never ever use signed quantities
>>> at all ! (But also note original code has same undefined behavior)
>> Sure, I've meant the original code including 5 lines below.
>>
>>> Apparently gcc does the *right* thing on x86_32, but we probably want something
>>> stronger here. I could not find gcc documentation statement on right shifts of 
>>> negative values.
>> I guess gcc and most of others do this "right"; but it looks
>> "unkosher" anyway.
> 
> I might have missed your point here, but would it be so costly to do
> these shifts separately here?

You replied to yourself Jarek :)

As I said earlier, I found your concern right, so please submit a patch ?

I found many occurrences of a right shift on a signed int/long in kernel.
One example being :

arch/x86/mm/init_64.c

int kern_addr_valid(unsigned long addr)
{
	unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;


and another rate estimator in drivers/atm/idt77252.c

static void
idt77252_est_timer(unsigned long data)


We could aso check net/netfilter/ipvs/ip_vs_est.c (estimation_timer())

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..ea28659 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -66,9 +66,9 @@ 
 
    NOTES.
 
-   * The stored value for avbps is scaled by 2^5, so that maximal
-     rate is ~1Gbit, avpps is scaled by 2^10.
-
+   * avbps is scaled by 2^5, avpps is scaled by 2^10.
+   * both values are reported as 32 bit unsigned values. bps can
+     overflow for fast links : max speed being 34360Mbit/sec
    * Minimal interval is HZ/4=250msec (it is the greatest common divisor
      for HZ=100 and HZ=1024 8)), maximal interval
      is (HZ*2^EST_MAX_INTERVAL)/4 = 8sec. Shorter intervals
@@ -86,9 +86,9 @@  struct gen_estimator
 	spinlock_t		*stats_lock;
 	int			ewma_log;
 	u64			last_bytes;
+	u64			avbps;
 	u32			last_packets;
 	u32			avpps;
-	u32			avbps;
 	struct rcu_head		e_rcu;
 	struct rb_node		node;
 };
@@ -115,6 +115,7 @@  static void est_timer(unsigned long arg)
 	rcu_read_lock();
 	list_for_each_entry_rcu(e, &elist[idx].list, list) {
 		u64 nbytes;
+		u64 brate;
 		u32 npackets;
 		u32 rate;
 
@@ -125,9 +126,9 @@  static void est_timer(unsigned long arg)
 
 		nbytes = e->bstats->bytes;
 		npackets = e->bstats->packets;
-		rate = (nbytes - e->last_bytes)<<(7 - idx);
+		brate = (nbytes - e->last_bytes)<<(7 - idx);
 		e->last_bytes = nbytes;
-		e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+		e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
 		e->rate_est->bps = (e->avbps+0xF)>>5;
 
 		rate = (npackets - e->last_packets)<<(12 - idx);