[03/18] ipvs: zero percpu stats

Message ID alpine.LFD.2.00.1103061218470.1879@ja.ssi.bg
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Julian Anastasov March 6, 2011, 12:18 p.m. UTC
Hello,

On Sun, 6 Mar 2011, Eric Dumazet wrote:

>>  	Zero the new percpu stats because we copy from there.
>>
>> Signed-off-by: Julian Anastasov <ja@ssi.bg>
>> Signed-off-by: Simon Horman <horms@verge.net.au>
>> ---
>>  net/netfilter/ipvs/ip_vs_ctl.c |   17 +++++++++++++++++
>>  1 files changed, 17 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
>> index a2a67ad..fd74527 100644
>> --- a/net/netfilter/ipvs/ip_vs_ctl.c
>> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
>> @@ -715,8 +715,25 @@ static void ip_vs_trash_cleanup(struct net *net)
>>  static void
>>  ip_vs_zero_stats(struct ip_vs_stats *stats)
>>  {
>> +	struct ip_vs_cpu_stats *cpustats = stats->cpustats;
>> +	int i;
>> +
>>  	spin_lock_bh(&stats->lock);
>>
>> +	for_each_possible_cpu(i) {
>> +		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
>> +		unsigned int start;
>> +
>> +		/* Do not pretend to be writer, it is enough to
>> +		 * sync with writers that modify the u64 counters
>> +		 * because under stats->lock we are the only reader.
>> +		 */
>> +		do {
>> +			start = u64_stats_fetch_begin(&u->syncp);
>> +			memset(&u->ustats, 0, sizeof(u->ustats));
>> +		} while (u64_stats_fetch_retry(&u->syncp, start));
>
>
> Sorry this makes no sense to me.

 	Hm, yes, the comment is a little bit misleading.
I fixed it below...

> This code _is_ a writer, and hardly a hot path.

 	Yes, the picture is as follows:

- in 2.6.38-rc we removed the global spin lock (stats->lock)
from packet processing, which is a hot path, adding percpu
counters instead

- we need protection for percpu counters and for the sum

- the chain is: interrupts increment the percpu counters, the
estimation timer reads them and creates the sum every 2 seconds,
then user context can read the sum or even show the percpu
counters, not forgetting the zeroing of the sum and counters

The players in detail:

- packet processing:
 	- softirq context, hot path
 	- increments counters by using u64_stats_update_begin and
 	u64_stats_update_end, does not wait for readers or zeroing
 	(see the sketch after this list)
 	- sum not touched, stats->lock usage removed in 2.6.38-rc

- 2-second estimation timer:
 	- funcs: estimation_timer()
 	- timer context, softirq
 	- reads percpu counters with u64_stats_fetch_begin and
 	u64_stats_fetch_retry to sync with counter incrementing
 	- uses spin_lock (stats->lock) to protect the written sum,
 	which is later read by user context: this provides at
 	least u64 atomicity and additionally preserves the relation
 	between packets and bytes

- sum readers:
 	- funcs: ip_vs_stats_show(), ip_vs_stats_percpu_show(),
 	ip_vs_copy_stats(), ip_vs_genl_fill_stats()
 	- user context, not a hot path
 	- uses spin_lock_bh (stats->lock) for atomic reading of
 	the sum created by estimation_timer()

- show percpu counters:
 	- funcs: ip_vs_stats_percpu_show()
 	- user context, not a hot path
 	- uses u64_stats_fetch_begin_bh and u64_stats_fetch_retry_bh
 	to synchronize with counter incrementing
 	- still missing: should use spin_lock_bh (stats->lock)
 	to synchronize with ip_vs_zero_stats() that modifies
 	percpu counters.

- zero stats and percpu counters
 	- funcs: ip_vs_zero_stats()
 	- user context, not a hot path
 	- uses spin_lock_bh (stats->lock) while modifying the
 	sum but also while zeroing the percpu counters: we are
 	a hidden writer that excludes other percpu counter
 	readers, yet we still synchronize with percpu counter
 	incrementing without delaying it
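
For illustration, the hot-path increment and the counter readers follow
roughly this pattern (a simplified sketch of the u64_stats usage, not the
exact ip_vs code; variable names are abbreviated):

	/* hot path (softirq): per-CPU increment, never waits for readers */
	struct ip_vs_cpu_stats *s = this_cpu_ptr(cpustats);

	u64_stats_update_begin(&s->syncp);
	s->ustats.inpkts++;
	s->ustats.inbytes += skb->len;
	u64_stats_update_end(&s->syncp);

	/* reader (estimation_timer or /proc): snapshot one CPU's counters,
	 * retrying only if an increment raced with the read
	 */
	unsigned int start;
	__u64 inbytes;

	do {
		start = u64_stats_fetch_begin(&s->syncp);
		inbytes = s->ustats.inbytes;
	} while (u64_stats_fetch_retry(&s->syncp, start));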

To summarize, I see 2 solutions, in order of preference:

1. all players except packet processing should use stats->lock
when reading/writing sum or when reading/zeroing percpu
counters. Use u64_stats to avoid delays in incrementing.

2. Use a seqlock instead of u64_stats if we want to treat the
percpu counter zeroing as a writer. This takes us back to before
2.6.38-rc, where we used the global stats->lock even for counter
incrementing, except that now we could use a percpu seqlock
just to register the zeroing as a writer.
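
A rough sketch of what solution 2 would mean (hypothetical code: assume a
per-CPU seqlock_t named "lock" replaces syncp in struct ip_vs_cpu_stats):

	/* hot path: incrementing becomes a real seqlock writer */
	struct ip_vs_cpu_stats *s = this_cpu_ptr(cpustats);

	write_seqlock(&s->lock);
	s->ustats.inpkts++;
	s->ustats.inbytes += skb->len;
	write_sequnlock(&s->lock);

	/* zeroing (user context): also a writer, excludes readers and the
	 * incrementing above
	 */
	for_each_possible_cpu(i) {
		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);

		write_seqlock_bh(&u->lock);
		memset(&u->ustats, 0, sizeof(u->ustats));
		write_sequnlock_bh(&u->lock);
	}

	/* readers would use read_seqbegin()/read_seqretry() on the same lock */

Even per-CPU, this puts a spin_lock back into the hot path, which is the
main drawback.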

> Why try to pretend it's a reader and confuse people?
>
> Either :
>
> - Another writer can modify the counters at the same time, and we must
> synchronize with them (we are a writer after all)

 	The global mutex allows only one zeroing at a time.
But zeroing runs in parallel with incrementing, so we
have two writers for the per-CPU state. This sounds like
solution 2 above with a percpu seqlock? But it adds an extra
spin_lock to the hot path, even if it is per-CPU, and it only
saves the spin_lock_bh while reading the percpu counters in
ip_vs_stats_percpu_show(). That is why I prefer solution 1.

> - Another reader can read the counters at the same time, and we must let
> them catch that we might have cleared half of their values.

 	Yes, zeroing can run in parallel with /proc reading,
that is why I now try to serialize all readers with the
stats spin lock to guarantee u64 atomicity.

> - No reader or writer can access data, no synch is needed, a pure
> memset() is OK.

 	Packet processing can damage the counters while we
do memset, so we need at least u64_stats_fetch_* to sync
with incrementing.

>> +	}
>> +
>>  	memset(&stats->ustats, 0, sizeof(stats->ustats));
>>  	ip_vs_zero_estimator(stats);

 	So, here is solution 1 fully implemented:


Comments

Simon Horman March 10, 2011, 1:34 a.m. UTC | #1
On Sun, Mar 06, 2011 at 02:18:35PM +0200, Julian Anastasov wrote:

[ snip ]

> 	Zero the new percpu stats because we copy from there.
> Use the stats spin lock to synchronize the percpu zeroing with
> the percpu reading, both in user context and not in a hot path.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Eric, do you have any thoughts on this?
It seems clean to me.

> ---
> 
> diff -urp lvs-test-2.6-8a80c79/linux/net/netfilter/ipvs/ip_vs_ctl.c linux/net/netfilter/ipvs/ip_vs_ctl.c
> --- lvs-test-2.6-8a80c79/linux/net/netfilter/ipvs/ip_vs_ctl.c	2011-03-06 13:39:59.000000000 +0200
> +++ linux/net/netfilter/ipvs/ip_vs_ctl.c	2011-03-06 13:44:56.108275455 +0200
> @@ -713,8 +713,25 @@ static void ip_vs_trash_cleanup(struct n
>  static void
>  ip_vs_zero_stats(struct ip_vs_stats *stats)
>  {
> +	struct ip_vs_cpu_stats *cpustats = stats->cpustats;
> +	int i;
> +
>  	spin_lock_bh(&stats->lock);
> 
> +	for_each_possible_cpu(i) {
> +		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
> +		unsigned int start;
> +
> +		/* Do not pretend to be writer, it is enough to
> +		 * sync with writers that modify the u64 counters
> +		 * because under stats->lock there are no other readers
> +		 */
> +		do {
> +			start = u64_stats_fetch_begin(&u->syncp);
> +			memset(&u->ustats, 0, sizeof(u->ustats));
> +		} while (u64_stats_fetch_retry(&u->syncp, start));
> +	}
> +
>  	memset(&stats->ustats, 0, sizeof(stats->ustats));
>  	ip_vs_zero_estimator(stats);
> 
> @@ -2015,16 +2032,19 @@ static int ip_vs_stats_percpu_show(struc
>  	seq_printf(seq,
>  		   "CPU    Conns  Packets  Packets            Bytes            Bytes\n");
> 
> +	/* Use spin lock early to synchronize with percpu zeroing */
> +	spin_lock_bh(&tot_stats->lock);
> +
>  	for_each_possible_cpu(i) {
>  		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
>  		unsigned int start;
>  		__u64 inbytes, outbytes;
> 
>  		do {
> -			start = u64_stats_fetch_begin_bh(&u->syncp);
> +			start = u64_stats_fetch_begin(&u->syncp);
>  			inbytes = u->ustats.inbytes;
>  			outbytes = u->ustats.outbytes;
> -		} while (u64_stats_fetch_retry_bh(&u->syncp, start));
> +		} while (u64_stats_fetch_retry(&u->syncp, start));
> 
>  		seq_printf(seq, "%3X %8X %8X %8X %16LX %16LX\n",
>  			   i, u->ustats.conns, u->ustats.inpkts,
> @@ -2032,7 +2052,6 @@ static int ip_vs_stats_percpu_show(struc
>  			   (__u64)outbytes);
>  	}
> 
> -	spin_lock_bh(&tot_stats->lock);
>  	seq_printf(seq, "  ~ %8X %8X %8X %16LX %16LX\n\n",
>  		   tot_stats->ustats.conns, tot_stats->ustats.inpkts,
>  		   tot_stats->ustats.outpkts,
> 
David Miller March 10, 2011, 2:53 a.m. UTC | #2
From: Simon Horman <horms@verge.net.au>
Date: Thu, 10 Mar 2011 10:34:42 +0900

> On Sun, Mar 06, 2011 at 02:18:35PM +0200, Julian Anastasov wrote:
> 
> [ snip ]
> 
>> 	Zero the new percpu stats because we copy from there.
>> Use the stats spin lock to synchronize the percpu zeroing with
>> the percpu reading, both in user context and not in a hot path.
>> 
>> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> 
> Eric, do you have any thoughts on this?
> It seems clean to me.

Eric is away until this weekend, so don't be alarmed by a
late response :-)
Simon Horman March 10, 2011, 4:27 a.m. UTC | #3
On Wed, Mar 09, 2011 at 06:53:44PM -0800, David Miller wrote:
> From: Simon Horman <horms@verge.net.au>
> Date: Thu, 10 Mar 2011 10:34:42 +0900
> 
> > On Sun, Mar 06, 2011 at 02:18:35PM +0200, Julian Anastasov wrote:
> > 
> > [ snip ]
> > 
> >> 	Zero the new percpu stats because we copy from there.
> >> Use the stats spin lock to synchronize the percpu zeroing with
> >> the percpu reading, both in user context and not in a hot path.
> >> 
> >> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > 
> > Eric, do you have any thoughts on this?
> > It seems clean to me.
> 
> Eric is away until this weekend, so don't be alarmed by a
> late response :-)

Thanks, I'll wait longer :-)
Eric Dumazet March 13, 2011, 10:57 a.m. UTC | #4
On Sunday, 06 March 2011 at 14:18 +0200, Julian Anastasov wrote:
> [ snip ]
>
> > - No reader or writer can access data, no synch is needed, a pure
> > memset() is OK.
> 
>  	Packet processing can damage the counters while we
> do memset, so we need at least u64_stats_fetch_* to sync
> with incrementing.
> 

OK, I now understand what you wanted to do.

The problem is that you synchronize your memset() with a concurrent
writer in one direction only: you detect that a writer changed the
counters while you memset() them, but a writer has no way to detect
your writes (they could be partially committed to main memory), so it
could read a corrupted value.

I feel the memory barriers are wrong here and not really fixable without
slowing down the hot path.

As implied in include/linux/u64_stats_sync.h, a "writer" should be
alone :)

One other way to handle that (and leave hot-path packet processing
without extra locking) would be to never memset() this data, but to use
a separate "summed" value as a relative reference point, and subtract
that sum from the current one (all of this in the slow path, so not a
problem).
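
Something along these lines perhaps (an untested sketch; the per-CPU
ustats0 baseline field is hypothetical, and slow-path readers of it would
still be serialized by stats->lock):

	/* "zeroing" only records a baseline; the live per-CPU counters are
	 * never written from user context, so the hot path needs no change
	 */
	static void ip_vs_zero_stats(struct ip_vs_stats *stats)
	{
		int i;

		for_each_possible_cpu(i) {
			struct ip_vs_cpu_stats *u = per_cpu_ptr(stats->cpustats, i);
			unsigned int start;

			do {
				start = u64_stats_fetch_begin(&u->syncp);
				u->ustats0 = u->ustats;	/* snapshot as the new zero point */
			} while (u64_stats_fetch_retry(&u->syncp, start));
		}
		/* sum and estimator are still reset under stats->lock as before */
	}

	/* readers then report u->ustats.inbytes - u->ustats0.inbytes, etc. */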




Patch

==============

 	Zero the new percpu stats because we copy from there.
Use the stats spin lock to synchronize the percpu zeroing with
the percpu reading, both in user context and not in a hot path.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp lvs-test-2.6-8a80c79/linux/net/netfilter/ipvs/ip_vs_ctl.c linux/net/netfilter/ipvs/ip_vs_ctl.c
--- lvs-test-2.6-8a80c79/linux/net/netfilter/ipvs/ip_vs_ctl.c	2011-03-06 13:39:59.000000000 +0200
+++ linux/net/netfilter/ipvs/ip_vs_ctl.c	2011-03-06 13:44:56.108275455 +0200
@@ -713,8 +713,25 @@  static void ip_vs_trash_cleanup(struct n
  static void
  ip_vs_zero_stats(struct ip_vs_stats *stats)
  {
+	struct ip_vs_cpu_stats *cpustats = stats->cpustats;
+	int i;
+
  	spin_lock_bh(&stats->lock);

+	for_each_possible_cpu(i) {
+		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
+		unsigned int start;
+
+		/* Do not pretend to be writer, it is enough to
+		 * sync with writers that modify the u64 counters
+		 * because under stats->lock there are no other readers
+		 */
+		do {
+			start = u64_stats_fetch_begin(&u->syncp);
+			memset(&u->ustats, 0, sizeof(u->ustats));
+		} while (u64_stats_fetch_retry(&u->syncp, start));
+	}
+
  	memset(&stats->ustats, 0, sizeof(stats->ustats));
  	ip_vs_zero_estimator(stats);

@@ -2015,16 +2032,19 @@  static int ip_vs_stats_percpu_show(struc
  	seq_printf(seq,
  		   "CPU    Conns  Packets  Packets            Bytes            Bytes\n");

+	/* Use spin lock early to synchronize with percpu zeroing */
+	spin_lock_bh(&tot_stats->lock);
+
  	for_each_possible_cpu(i) {
  		struct ip_vs_cpu_stats *u = per_cpu_ptr(cpustats, i);
  		unsigned int start;
  		__u64 inbytes, outbytes;

  		do {
-			start = u64_stats_fetch_begin_bh(&u->syncp);
+			start = u64_stats_fetch_begin(&u->syncp);
  			inbytes = u->ustats.inbytes;
  			outbytes = u->ustats.outbytes;
-		} while (u64_stats_fetch_retry_bh(&u->syncp, start));
+		} while (u64_stats_fetch_retry(&u->syncp, start));

  		seq_printf(seq, "%3X %8X %8X %8X %16LX %16LX\n",
  			   i, u->ustats.conns, u->ustats.inpkts,
@@ -2032,7 +2052,6 @@  static int ip_vs_stats_percpu_show(struc
  			   (__u64)outbytes);
  	}

-	spin_lock_bh(&tot_stats->lock);
  	seq_printf(seq, "  ~ %8X %8X %8X %16LX %16LX\n\n",
  		   tot_stats->ustats.conns, tot_stats->ustats.inpkts,
  		   tot_stats->ustats.outpkts,