
[net-next-2.6] bridge: 64bit rx/tx counters

Message ID 1276598376.2541.93.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet June 15, 2010, 10:39 a.m. UTC
Note : should be applied after "net: Introduce 
u64_stats_sync infrastructure", if accepted.


Thanks

[PATCH net-next-2.6] bridge: 64bit rx/tx counters

Use u64_stats_sync infrastructure to provide 64bit rx/tx 
counters even on 32bit hosts.

It is safe to use a single u64_stats_sync for rx and tx,
because BH is disabled on both, and we use per_cpu data.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/bridge/br_device.c  |   24 +++++++++++++++---------
 net/bridge/br_input.c   |    2 ++
 net/bridge/br_private.h |    9 +++++----
 3 files changed, 22 insertions(+), 13 deletions(-)
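
For readers unfamiliar with the u64_stats_sync API, the general pattern looks
roughly like the sketch below (hypothetical names, not the bridge code itself;
it needs <linux/u64_stats_sync.h> and <linux/percpu.h>):

	struct demo_pcpu_stats {
		u64			rx_packets;
		u64			rx_bytes;
		struct u64_stats_sync	syncp;
	};

	/* writer: runs with BH disabled, so no other writer on this CPU */
	static void demo_update(struct demo_pcpu_stats *s, unsigned int len)
	{
		u64_stats_update_begin(&s->syncp);
		s->rx_packets++;
		s->rx_bytes += len;
		u64_stats_update_end(&s->syncp);
	}

	/* reader: retries a CPU's snapshot if it raced with that CPU's writer */
	static void demo_fold(struct demo_pcpu_stats __percpu *pcpu,
			      u64 *packets, u64 *bytes)
	{
		int cpu;

		*packets = *bytes = 0;
		for_each_possible_cpu(cpu) {
			const struct demo_pcpu_stats *s = per_cpu_ptr(pcpu, cpu);
			unsigned int start;
			u64 p, b;

			do {
				start = u64_stats_fetch_begin(&s->syncp);
				p = s->rx_packets;
				b = s->rx_bytes;
			} while (u64_stats_fetch_retry(&s->syncp, start));
			*packets += p;
			*bytes += b;
		}
	}

On 64bit hosts the seqcount inside syncp is compiled out and these helpers do
nothing, which is why the extra instructions discussed below only appear on
32bit.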




Comments

David Miller June 22, 2010, 5:25 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Jun 2010 12:39:36 +0200

> Note : should be applied after "net: Introduce 
> u64_stats_sync infrastructure", if accepted.
> 
> 
> Thanks
> 
> [PATCH net-next-2.6] bridge: 64bit rx/tx counters
> 
> Use u64_stats_sync infrastructure to provide 64bit rx/tx 
> counters even on 32bit hosts.
> 
> It is safe to use a single u64_stats_sync for rx and tx,
> because BH is disabled on both, and we use per_cpu data.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
Andrew Morton Aug. 10, 2010, 4:47 a.m. UTC | #2
On Tue, 15 Jun 2010 12:39:36 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Note : should be applied after "net: Introduce 
> u64_stats_sync infrastructure", if accepted.
> 
> 
> Thanks
> 
> [PATCH net-next-2.6] bridge: 64bit rx/tx counters
> 
> Use u64_stats_sync infrastructure to provide 64bit rx/tx 
> counters even on 32bit hosts.
> 
> It is safe to use a single u64_stats_sync for rx and tx,
> because BH is disabled on both, and we use per_cpu data.
>

Oh for fuck's sake.  Will you guys just stop adding generic kernel
infrastructure behind everyone's backs?

Had I actually been aware that this stuff was going into the tree I'd
have pointed out that the u64_stats_* api needs renaming. 
s/stats/counter/ because it has no business assuming that the counter
is being used for statistics.


And all this open-coded per-cpu counter stuff added all over the place.
Were percpu_counters tested or reviewed and found inadequate and unfixable?
If so, please do tell.

Eric Dumazet Aug. 12, 2010, 12:16 p.m. UTC | #3
On Monday, 09 August 2010 at 21:47 -0700, Andrew Morton wrote:

> Oh for fuck's sake.  Will you guys just stop adding generic kernel
> infrastructure behind everyone's backs?
> 
> Had I actually been aware that this stuff was going into the tree I'd
> have pointed out that the u64_stats_* api needs renaming. 
> s/stats/counter/ because it has no business assuming that the counter
> is being used for statistics.
> 
> 

Sure. Someone suggested changing the name, considering the values could
also be signed (s64 instead of u64_...).

> And all this open-coded per-cpu counter stuff added all over the place.
> Were percpu_counters tested or reviewed and found inadequate and unfixable?
> If so, please do tell.
> 

percpu_counters tries hard to maintain a view of the current value of
the (global) counter. This adds a cost because of a shared cache line
and locking. (__percpu_counter_sum() is not very scalable on big hosts,
it locks the percpu_counter lock for a possibly long iteration)


For network stats we don't want to maintain this central value, we do the
folding only when necessary. And this folding has zero effect on
concurrent writers (counter updates)

For network stack, we also need to update two values, a packet counter
and a bytes counter. percpu_counter is not very good for the 'bytes
counter', since we would have to use an arbitrarily big bias value.
Using several percpu_counter would also probably use more cache lines.
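
For comparison, a percpu_counter based version of the same hot path would look
roughly like this (a sketch only; DEMO_BYTES_BATCH is a made-up batch value,
init/destroy are omitted, and it needs <linux/percpu_counter.h>):

	struct demo_counter_stats {
		struct percpu_counter rx_packets;
		struct percpu_counter rx_bytes;
	};

	/* the batch must be huge so the byte counter rarely takes the shared,
	 * locked slow path: this is the "arbitrarily big bias value" problem */
	#define DEMO_BYTES_BATCH	(1 << 20)

	static void demo_counter_update(struct demo_counter_stats *s,
					unsigned int len)
	{
		percpu_counter_inc(&s->rx_packets);
		__percpu_counter_add(&s->rx_bytes, len, DEMO_BYTES_BATCH);
	}

	/* exact read: takes the counter spinlock and walks all CPUs;
	 * percpu_counter_read_positive() would be cheap but approximate */
	static s64 demo_counter_bytes(struct demo_counter_stats *s)
	{
		return percpu_counter_sum(&s->rx_bytes);
	}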

Also please note this stuff is only needed for 32bit arches. 

Using percpu_counter would slow down network stack on modern arches.


I am very well aware of the percpu_counter stuff, I believe I tried to
optimize it a bit in the past.

Thanks


Andrew Morton Aug. 12, 2010, 3:07 p.m. UTC | #4
On Thu, 12 Aug 2010 14:16:15 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > And all this open-coded per-cpu counter stuff added all over the place.
> > Were percpu_counters tested or reviewed and found inadequate and unfixable?
> > If so, please do tell.
> > 
> 
> percpu_counters tries hard to maintain a view of the current value of
> the (global) counter. This adds a cost because of a shared cache line
> and locking. (__percpu_counter_sum() is not very scalable on big hosts,
> it locks the percpu_counter lock for a possibly long iteration)

Could be.  Is percpu_counter_read_positive() unsuitable?

> 
> For network stats we don't want to maintain this central value, we do the
> folding only when necessary.

hm.  Well, why?  That big walk across all possible CPUs could be really
expensive for some applications.  Especially if num_possible_cpus is
much larger than num_online_cpus, which iirc can happen in
virtualisation setups; probably it can happen in non-virtualised
machines too.
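
To make the trade-off concrete, a sketch (reusing the hypothetical
demo_pcpu_stats structure from the earlier sketch; the per-CPU seqcount retry
loop is omitted here for brevity):

	static u64 demo_fold_rx_packets(struct demo_pcpu_stats __percpu *stats,
					bool online_only)
	{
		u64 sum = 0;
		int cpu;

		if (online_only) {
			/* cheaper when num_possible_cpus() >> num_online_cpus(),
			 * but counts left on a CPU that has since gone offline
			 * are missed unless a hotplug notifier folds them into
			 * a global remainder */
			for_each_online_cpu(cpu)
				sum += per_cpu_ptr(stats, cpu)->rx_packets;
		} else {
			/* what the bridge patch (and the existing SNMP folding)
			 * does: walk every possible CPU, trading a longer loop
			 * for simpler bookkeeping */
			for_each_possible_cpu(cpu)
				sum += per_cpu_ptr(stats, cpu)->rx_packets;
		}
		return sum;
	}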

> And this folding has zero effect on
> concurrent writers (counter updates)

The fastpath looks a little expensive in the code you've added.  The
write_seqlock() does an rmw and a wmb() and the stats inc is a 64-bit
rmw whereas percpu_counters do a simple 32-bit add.  So I'd expect that
at some suitable batch value, percpu-counters are faster on 32-bit. 

They'll usually be slower on 64-bit, until that num_possible_cpus walk
bites you.

percpu_counters might need some work to make them irq-friendly.  That
bare spin_lock().

btw, I worry a bit about seqlocks in the presence of interrupts:

static inline void write_seqcount_begin(seqcount_t *s)
{
	s->sequence++;
	smp_wmb();
}

are we assuming that the ++ there is atomic wrt interrupts?  I think
so.  Is that always true for all architectures, compiler versions, etc?

> For network stack, we also need to update two values, a packet counter
> and a bytes counter. percpu_counter is not very good for the 'bytes
> counter', since we would have to use an arbitrarily big bias value.

OK, that's a nasty problem for percpu-counters.

> Using several percpu_counter would also probably use more cache lines.
> 
> Also please note this stuff is only needed for 32bit arches. 
> 
> Using percpu_counter would slow down network stack on modern arches.

Was this ever quantified?

> 
> I am very well aware of the percpu_counter stuff, I believe I tried to
> optimize it a bit in the past.
Eric Dumazet Aug. 12, 2010, 9:47 p.m. UTC | #5
On Thursday, 12 August 2010 at 08:07 -0700, Andrew Morton wrote:
> On Thu, 12 Aug 2010 14:16:15 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > And all this open-coded per-cpu counter stuff added all over the place.
> > > Were percpu_counters tested or reviewed and found inadequate and unfixable?
> > > If so, please do tell.
> > > 
> > 
> > percpu_counters tries hard to maintain a view of the current value of
> > the (global) counter. This adds a cost because of a shared cache line
> > and locking. (__percpu_counter_sum() is not very scalable on big hosts,
> > it locks the percpu_counter lock for a possibly long iteration)
> 
> Could be.  Is percpu_counter_read_positive() unsuitable?
> 

I bet most people want precise counters when doing 'ifconfig lo'

SNMP applications would be very surprised to get non increasing values
between two samples, or inexact values.

> > 
> > For network stats we don't want to maintain this central value, we do the
> > folding only when necessary.
> 
> hm.  Well, why?  That big walk across all possible CPUs could be really
> expensive for some applications.  Especially if num_possible_cpus is
> much larger than num_online_cpus, which iirc can happen in
> virtualisation setups; probably it can happen in non-virtualised
> machines too.
> 

Agreed.

> > And this folding has zero effect on
> > concurrent writers (counter updates)
> 
> The fastpath looks a little expensive in the code you've added.  The
> write_seqlock() does an rmw and a wmb() and the stats inc is a 64-bit
> rmw whereas percpu_counters do a simple 32-bit add.  So I'd expect that
> at some suitable batch value, percpu-counters are faster on 32-bit. 
> 

Hmm... 6 instructions (16 bytes of text) are a "little expensive" versus
120 instructions if we use percpu_counter ?

Following code from drivers/net/loopback.c

	u64_stats_update_begin(&lb_stats->syncp);
	lb_stats->bytes += len;
	lb_stats->packets++;
	u64_stats_update_end(&lb_stats->syncp);

maps on i386 to :

	ff 46 10             	incl   0x10(%esi)  // u64_stats_update_begin(&lb_stats->syncp);
	89 f8                	mov    %edi,%eax
	99                   	cltd   
	01 7e 08             	add    %edi,0x8(%esi)
	11 56 0c             	adc    %edx,0xc(%esi)
	83 06 01             	addl   $0x1,(%esi)
	83 56 04 00          	adcl   $0x0,0x4(%esi)
	ff 46 10             	incl   0x10(%esi) // u64_stats_update_end(&lb_stats->syncp);


Exactly 6 added instructions compared to previous kernel (32bit
counters), only on 32bit hosts. These instructions are not expensive (no
conditional branches, no extra register pressure) and access private cpu
data.

While two calls to __percpu_counter_add() add about 120 instructions,
even on 64bit hosts, wasting precious cpu cycles.



> They'll usually be slower on 64-bit, until that num_possible_cpus walk
> bites you.
> 

But are you aware we already fold SNMP values using for_each_possible()
macros, before adding 64bit counters ? Not related to 64bit stuff
really...

> percpu_counters might need some work to make them irq-friendly.  That
> bare spin_lock().
> 
> btw, I worry a bit about seqlocks in the presence of interrupts:
> 

Please note that nothing is assumed about interrupts and seqcounts

Both readers and writers must mask them if necessary.

In most situations, masking softirq is enough for networking cases
(updates are performed from softirq handler, reads from process context)
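
As an illustration of that rule, a sketch (hypothetical, not code from the
patch): a writer that did not already run with BH disabled would have to mask
BH around the update itself, whereas the bridge paths touched by this patch
already run with BH disabled:

	static void demo_update_from_process_context(
			struct demo_pcpu_stats __percpu *pcpu, unsigned int len)
	{
		struct demo_pcpu_stats *s;

		local_bh_disable();	/* keep the softirq writer off this CPU */
		s = this_cpu_ptr(pcpu);
		u64_stats_update_begin(&s->syncp);
		s->rx_packets++;
		s->rx_bytes += len;
		u64_stats_update_end(&s->syncp);
		local_bh_enable();
	}

The reader in process context needs no such masking: it only samples the
seqcount and retries, it never leaves it odd.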

> static inline void write_seqcount_begin(seqcount_t *s)
> {
> 	s->sequence++;
> 	smp_wmb();
> }
> 
> are we assuming that the ++ there is atomic wrt interrupts?  I think
> so.  Is that always true for all architectures, compiler versions, etc?
> 

s->sequence++ is certainly not atomic wrt interrupts on RISC arches
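
Spelled out, the increment is a plain read-modify-write (a sketch of what
write_seqcount_begin() amounts to on a load/store machine, not real kernel
code):

	static void demo_seqcount_bump(seqcount_t *s)
	{
		unsigned int tmp;

		tmp = s->sequence;	/* load               */
		tmp = tmp + 1;		/* add, in a register */
		s->sequence = tmp;	/* store              */
		/*
		 * If an interrupt that also takes the write side lands in
		 * between, one increment can be lost outright; and even with
		 * an atomic increment, a nested write section flips the count
		 * back to even while the data is half updated, so a concurrent
		 * reader could accept torn values.  Either way the writer must
		 * keep out any IRQ/BH context that can also write.
		 */
	}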

> > For network stack, we also need to update two values, a packet counter
> > and a bytes counter. percpu_counter is not very good for the 'bytes
> > counter', since we would have to use an arbitrarily big bias value.
> 
> OK, that's a nasty problem for percpu-counters.
> 
> > Using several percpu_counter would also probably use more cache lines.
> > 
> > Also please note this stuff is only needed for 32bit arches. 
> > 
> > Using percpu_counter would slow down network stack on modern arches.
> 
> Was this ever quantified?

A single misplaced dst refcount was responsible for a 25% tbench
slowdown on a small machine (8 cores), without any lock, only atomic
operations on a shared cache line...

So I think we could easily quantify a big slowdown by adding two
percpu_counter_add() calls in a driver fastpath on a 16 or 32 core machine.
(It would be a revert of the percpu work we added in recent years.)

Possible improvements would be:

0) Just forget about 64bit stuff on 32bit arches, as we did since linux
0.99. People should not run 40Gb links on 32bit kernels :)

1) If we really want the percpu_counter() stuff, find a way to make it
hierarchical, or use a very big BIAS (2^30 ?). And/or reduce
percpu_counter_add() complexity for increasing unsigned counters.

2) Avoid the write_seqcount_begin()/end() stuff when a writer changes
only the low order part of the 64bit counter.

   (i.e. maintain a 32bit percpu value, and only atomically touch the
shared upper 32bits (and the seqcount) when this 32bit percpu value
overflows; a sketch follows below.)

Not sure it's worth the added conditional branch.
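
To make idea (2) concrete, a purely hypothetical sketch:

	/* per-cpu 32bit low word written lock-free; the shared 64bit carry
	 * (and its seqcount) is only touched when the low word wraps */
	static void demo_split_add(u32 *pcpu_low, u64 *shared_high,
				   seqcount_t *seq, u32 val)
	{
		u32 old = *pcpu_low;

		*pcpu_low = old + val;
		if (unlikely(*pcpu_low < old)) {	/* 32bit wrap */
			/* caller must already exclude other writers (BH/IRQ) */
			write_seqcount_begin(seq);
			*shared_high += 1ULL << 32;
			write_seqcount_end(seq);
		}
	}

The reader would sum the per-cpu low words and add the shared high part under
the usual seqcount retry loop; the window between a wrap and the carry update
is the tricky part, and the extra branch on every update is the cost questioned
above.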

Thanks


Andrew Morton Aug. 12, 2010, 10:11 p.m. UTC | #6
On Thu, 12 Aug 2010 23:47:37 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thursday, 12 August 2010 at 08:07 -0700, Andrew Morton wrote:
> > On Thu, 12 Aug 2010 14:16:15 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > > And all this open-coded per-cpu counter stuff added all over the place.
> > > > Were percpu_counters tested or reviewed and found inadequate and unfixable?
> > > > If so, please do tell.
> > > > 
> > > 
> > > percpu_counters tries hard to maintain a view of the current value of
> > > the (global) counter. This adds a cost because of a shared cache line
> > > and locking. (__percpu_counter_sum() is not very scalable on big hosts,
> > > it locks the percpu_counter lock for a possibly long iteration)
> > 
> > Could be.  Is percpu_counter_read_positive() unsuitable?
> > 
> 
> I bet most people want precise counters when doing 'ifconfig lo'
> 
> SNMP applications would be very surprised to get non increasing values
> between two samples, or inexact values.

percpu_counter_read_positive() should be returning monotonically
increasing numbers - if it ever went backward that would be bad.  But
yes, the value will increase in a lumpy fashion.  Probably one would
need to make informed choices between percpu_counter_read_positive()
and percpu_counter_sum(), depending on the type of stat.

But that's all a bit academic.

>
> > > And this folding has zero effect on
> > > concurrent writers (counter updates)
> > 
> > The fastpath looks a little expensive in the code you've added.  The
> > write_seqlock() does an rmw and a wmb() and the stats inc is a 64-bit
> > rmw whereas percpu_counters do a simple 32-bit add.  So I'd expect that
> > at some suitable batch value, percpu-counters are faster on 32-bit. 
> > 
> 
> Hmm... 6 instructions (16 bytes of text) are a "little expensive" versus
> 120 instructions if we use percpu_counter ?
> 
> Following code from drivers/net/loopback.c
> 
> 	u64_stats_update_begin(&lb_stats->syncp);
> 	lb_stats->bytes += len;
> 	lb_stats->packets++;
> 	u64_stats_update_end(&lb_stats->syncp);
> 
> maps on i386 to :
> 
> 	ff 46 10             	incl   0x10(%esi)  // u64_stats_update_begin(&lb_stats->syncp);
> 	89 f8                	mov    %edi,%eax
> 	99                   	cltd   
> 	01 7e 08             	add    %edi,0x8(%esi)
> 	11 56 0c             	adc    %edx,0xc(%esi)
> 	83 06 01             	addl   $0x1,(%esi)
> 	83 56 04 00          	adcl   $0x0,0x4(%esi)
> 	ff 46 10             	incl   0x10(%esi) // u64_stats_update_end(&lb_stats->syncp);
> 
> 
> Exactly 6 added instructions compared to previous kernel (32bit
> counters), only on 32bit hosts. These instructions are not expensive (no
> conditional branches, no extra register pressure) and access private cpu
> data.
> 
> While two calls to __percpu_counter_add() add about 120 instructions,
> even on 64bit hosts, wasting precious cpu cycles.

Oy.  You omitted the per_cpu_ptr() evaluation and, I bet, included all
the executed-1/batch-times instructions.

> 
> > They'll usually be slower on 64-bit, until that num_possible_cpus walk
> > bites you.
> > 
> 
> But are you aware we already fold SNMP values using for_each_possible()
> macros, before adding 64bit counters ? Not related to 64bit stuff
> really...


> > percpu_counters might need some work to make them irq-friendly.  That
> > bare spin_lock().
> > 
> > btw, I worry a bit about seqlocks in the presence of interrupts:
> > 
> 
> Please note that nothing is assumed about interrupts and seqcounts
> 
> Both readers and writers must mask them if necessary.
> 
> In most situations, masking softirq is enough for networking cases
> (updates are performed from softirq handler, reads from process context)

Yup, write_seqcount_begin/end() are pretty dangerous-looking.  The
caller needs to protect the lock against other CPUs, against interrupts
and even against preemption.



Patch

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index b898364..d7bfe31 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -38,8 +38,10 @@  netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 #endif
 
+	u64_stats_update_begin(&brstats->syncp);
 	brstats->tx_packets++;
 	brstats->tx_bytes += skb->len;
+	u64_stats_update_end(&brstats->syncp);
 
 	BR_INPUT_SKB_CB(skb)->brdev = dev;
 
@@ -92,21 +94,25 @@  static int br_dev_stop(struct net_device *dev)
 	return 0;
 }
 
-static struct net_device_stats *br_get_stats(struct net_device *dev)
+static struct rtnl_link_stats64 *br_get_stats64(struct net_device *dev)
 {
 	struct net_bridge *br = netdev_priv(dev);
-	struct net_device_stats *stats = &dev->stats;
-	struct br_cpu_netstats sum = { 0 };
+	struct rtnl_link_stats64 *stats = &dev->stats64;
+	struct br_cpu_netstats tmp, sum = { 0 };
 	unsigned int cpu;
 
 	for_each_possible_cpu(cpu) {
+		unsigned int start;
 		const struct br_cpu_netstats *bstats
 			= per_cpu_ptr(br->stats, cpu);
-
-		sum.tx_bytes   += bstats->tx_bytes;
-		sum.tx_packets += bstats->tx_packets;
-		sum.rx_bytes   += bstats->rx_bytes;
-		sum.rx_packets += bstats->rx_packets;
+		do {
+			start = u64_stats_fetch_begin(&bstats->syncp);
+			memcpy(&tmp, bstats, sizeof(tmp));
+		} while (u64_stats_fetch_retry(&bstats->syncp, start));
+		sum.tx_bytes   += tmp.tx_bytes;
+		sum.tx_packets += tmp.tx_packets;
+		sum.rx_bytes   += tmp.rx_bytes;
+		sum.rx_packets += tmp.rx_packets;
 	}
 
 	stats->tx_bytes   = sum.tx_bytes;
@@ -288,7 +294,7 @@  static const struct net_device_ops br_netdev_ops = {
 	.ndo_open		 = br_dev_open,
 	.ndo_stop		 = br_dev_stop,
 	.ndo_start_xmit		 = br_dev_xmit,
-	.ndo_get_stats		 = br_get_stats,
+	.ndo_get_stats64	 = br_get_stats64,
 	.ndo_set_mac_address	 = br_set_mac_address,
 	.ndo_set_multicast_list	 = br_dev_set_multicast_list,
 	.ndo_change_mtu		 = br_change_mtu,
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 99647d8..86d357b 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -27,8 +27,10 @@  static int br_pass_frame_up(struct sk_buff *skb)
 	struct net_bridge *br = netdev_priv(brdev);
 	struct br_cpu_netstats *brstats = this_cpu_ptr(br->stats);
 
+	u64_stats_update_begin(&brstats->syncp);
 	brstats->rx_packets++;
 	brstats->rx_bytes += skb->len;
+	u64_stats_update_end(&brstats->syncp);
 
 	indev = skb->dev;
 	skb->dev = brdev;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index c83519b..f078c54 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -146,10 +146,11 @@  struct net_bridge_port
 };
 
 struct br_cpu_netstats {
-	unsigned long	rx_packets;
-	unsigned long	rx_bytes;
-	unsigned long	tx_packets;
-	unsigned long	tx_bytes;
+	u64			rx_packets;
+	u64			rx_bytes;
+	u64			tx_packets;
+	u64			tx_bytes;
+	struct u64_stats_sync	syncp;
 };
 
 struct net_bridge