
netfilter question

Message ID 20161117000725.GK30581@breakpoint.cc
State RFC
Delegated to: Pablo Neira

Commit Message

Florian Westphal Nov. 17, 2016, 12:07 a.m. UTC
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <ericd@miromedia.ca> wrote:
> > > > Hi Eric,
> > > >
> > > > My name is Eric and I'm reaching out to you today as I found your name in multiple netfilter kernel commits, and I was hoping we could discuss a potential regression.
> > > >
> > > > I identified (git bisect) this commit [https://github.com/torvalds/linux/commit/71ae0dff02d756e4d2ca710b79f2ff5390029a5f] as the one introducing a serious performance slowdown when using the binary ip/ip6tables with a large number of policies.
> > > >
> > > > I also tried with the latest and greatest v4.9-rc4 mainline kernel, and the slowdown is still present.
> > > >
> > > > So even commit [https://github.com/torvalds/linux/commit/a1a56aaa0735c7821e85c37268a7c12e132415cb], which introduces a 16-byte alignment on xt_counter percpu allocations so that bytes and packets sit in the same cache line, doesn't have any impact either.
> > > >
> > > >
> > > > Everything I found is detailed in the following bug : https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1640786
> > > >
> > > > Of course, I'm totally aware that "iptables-restore" should be the preferred choice as it is far more efficient (note that using iptables-restore doesn't exhibit the problem), but some folks still rely on ip/ip6tables and might face this performance slowdown.
> > > >
> > > > I found the problem today and will continue to investigate on my side, but I was wondering if we could have a discussion about this subject.
> > > >
> > > > Thanks in advance.
> > 
> > [..]
> > 
> > > The key point is that we really care about the fast path: packet processing.
> > > And the cited commit helps this a lot by lowering the memory footprint on
> > > hosts with many cores.
> > > This is a step in the right direction.
> > > 
> > > Now we probably should batch the percpu allocations one page at a
> > > time, or ask Tejun if percpu allocations could be really really fast
> > > (probably much harder)
> > > 
> > > But really you should not use iptables one rule at a time...
> > > This will never compete with iptables-restore. ;)
> > > 
> > > Florian, would you have time to work on a patch trying to group the
> > > percpu allocations one page at a time ?
> > 
> > You mean something like this ? :
> >         xt_entry_foreach(iter, entry0, newinfo->size) {
> > -               ret = find_check_entry(iter, net, repl->name, repl->size);
> > -               if (ret != 0)
> > +               if (pcpu_alloc == 0) {
> > +                       pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));
> 
> alignment should be a page.

[..]

> > Error unwind will also be a mess (we can abuse .bcnt to tell if pcpu offset should be free'd or not).
> 
> Free if the address is aligned to a page boundary ?

Good idea.  This seems to work for me.  Eric (Desrochers), does this
improve the situation for you as well?
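
For illustration only (not part of the posted patch), here is a minimal standalone
sketch of the ownership rule discussed above: counters are handed out back-to-back
from one block, so only the counter sitting at the very start of a block owns the
underlying allocation, and checking that alignment on free releases each block
exactly once.  BLOCK_SIZE and struct counter are stand-ins for PAGE_SIZE and
struct xt_counters; since the percpu block is allocated block-aligned, testing the
address alignment is equivalent to testing for a zero offset within the block.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096u

struct counter { uint64_t pcnt, bcnt; };	/* stand-in for struct xt_counters */

int main(void)
{
	unsigned int off, owners = 0, total = 0;

	/* Walk the offsets a block-packing allocator would hand out. */
	for (off = 0; off + sizeof(struct counter) <= BLOCK_SIZE;
	     off += sizeof(struct counter)) {
		total++;
		/* Free-side test from the discussion above: only the entry
		 * whose offset is aligned to the block start is released,
		 * so each block is freed exactly once.
		 */
		if ((off & (BLOCK_SIZE - 1)) == 0)
			owners++;
	}

	printf("%u counters packed per block, %u free() call(s) per block\n",
	       total, owners);
	return 0;
}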


Comments

Eric Dumazet Nov. 17, 2016, 2:34 a.m. UTC | #1
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

Seems very nice !

> +
> +void xt_percpu_counter_free(struct xt_counters *counters)
> +{
> +	unsigned long pcnt = counters->pcnt;
> +
> +	if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
> +		free_percpu((void __percpu *) (unsigned long)pcnt);
> +}


pcnt is already an "unsigned long"

This packing might also speed up "iptables -nvL" dumps ;)


Eric Desrochers Nov. 17, 2016, 3:49 p.m. UTC | #2
Hi Florian,

Thanks for the quick response. I will give it a try and get back to you with the outcome of my test.

On 2016-11-17 01:07 AM, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <ericd@miromedia.ca> wrote:
>>>>> Hi Eric,
>>>>>
>>>>> My name is Eric and I'm reaching out to you today as I found your name in multiple netfilter kernel commits, and I was hoping we could discuss a potential regression.
>>>>>
>>>>> I identified (git bisect) this commit [https://github.com/torvalds/linux/commit/71ae0dff02d756e4d2ca710b79f2ff5390029a5f] as the one introducing a serious performance slowdown when using the binary ip/ip6tables with a large number of policies.
>>>>>
>>>>> I also tried with the latest and greatest v4.9-rc4 mainline kernel, and the slowdown is still present.
>>>>>
>>>>> So even commit [https://github.com/torvalds/linux/commit/a1a56aaa0735c7821e85c37268a7c12e132415cb], which introduces a 16-byte alignment on xt_counter percpu allocations so that bytes and packets sit in the same cache line, doesn't have any impact either.
>>>>>
>>>>>
>>>>> Everything I found is detailed in the following bug : https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1640786
>>>>>
>>>>> Of course, I'm totally aware that "iptables-restore" should be the preferred choice as it is far more efficient (note that using iptables-restore doesn't exhibit the problem), but some folks still rely on ip/ip6tables and might face this performance slowdown.
>>>>>
>>>>> I found the problem today and will continue to investigate on my side, but I was wondering if we could have a discussion about this subject.
>>>>>
>>>>> Thanks in advance.
>>> [..]
>>>
>>>> The key point is that we really care about the fast path: packet processing.
>>>> And the cited commit helps this a lot by lowering the memory footprint on
>>>> hosts with many cores.
>>>> This is a step in the right direction.
>>>>
>>>> Now we probably should batch the percpu allocations one page at a
>>>> time, or ask Tejun if percpu allocations could be really really fast
>>>> (probably much harder)
>>>>
>>>> But really you should not use iptables one rule at a time...
>>>> This will never compete with iptables-restore. ;)
>>>>
>>>> Florian, would you have time to work on a patch trying to group the
>>>> percpu allocations one page at a time ?
>>> You mean something like this ? :
>>>         xt_entry_foreach(iter, entry0, newinfo->size) {
>>> -               ret = find_check_entry(iter, net, repl->name, repl->size);
>>> -               if (ret != 0)
>>> +               if (pcpu_alloc == 0) {
>>> +                       pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));
>> alignment should be a page.
> [..]
>
>>> Error unwind will also be a mess (we can abuse .bcnt to tell if pcpu offset should be free'd or not).
>> Free if the address is aligned to a page boundary ?
> Good idea.  This seems to work for me.  Eric (Desrochers), does this
> improve the situation for you as well?
>
> diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
> --- a/include/linux/netfilter/x_tables.h
> +++ b/include/linux/netfilter/x_tables.h
> @@ -403,38 +403,14 @@ static inline unsigned long ifname_compare_aligned(const char *_a,
>  	return ret;
>  }
>  
> +struct xt_percpu_counter_alloc_state {
> +	unsigned int off;
> +	const char *mem;
> +};
>  
> -/* On SMP, ip(6)t_entry->counters.pcnt holds address of the
> - * real (percpu) counter.  On !SMP, its just the packet count,
> - * so nothing needs to be done there.
> - *
> - * xt_percpu_counter_alloc returns the address of the percpu
> - * counter, or 0 on !SMP. We force an alignment of 16 bytes
> - * so that bytes/packets share a common cache line.
> - *
> - * Hence caller must use IS_ERR_VALUE to check for error, this
> - * allows us to return 0 for single core systems without forcing
> - * callers to deal with SMP vs. NONSMP issues.
> - */
> -static inline unsigned long xt_percpu_counter_alloc(void)
> -{
> -	if (nr_cpu_ids > 1) {
> -		void __percpu *res = __alloc_percpu(sizeof(struct xt_counters),
> -						    sizeof(struct xt_counters));
> -
> -		if (res == NULL)
> -			return -ENOMEM;
> -
> -		return (__force unsigned long) res;
> -	}
> -
> -	return 0;
> -}
> -static inline void xt_percpu_counter_free(u64 pcnt)
> -{
> -	if (nr_cpu_ids > 1)
> -		free_percpu((void __percpu *) (unsigned long) pcnt);
> -}
> +bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
> +			     struct xt_counters *counter);
> +void xt_percpu_counter_free(struct xt_counters *cnt);
>  
>  static inline struct xt_counters *
>  xt_get_this_cpu_counter(struct xt_counters *cnt)
> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index 39004da318e2..cbea0cb030da 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -411,17 +411,15 @@ static inline int check_target(struct arpt_entry *e, const char *name)
>  }
>  
>  static inline int
> -find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
> +find_check_entry(struct arpt_entry *e, const char *name, unsigned int size,
> +		 struct xt_percpu_counter_alloc_state *alloc_state)
>  {
>  	struct xt_entry_target *t;
>  	struct xt_target *target;
> -	unsigned long pcnt;
>  	int ret;
>  
> -	pcnt = xt_percpu_counter_alloc();
> -	if (IS_ERR_VALUE(pcnt))
> +	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
>  		return -ENOMEM;
> -	e->counters.pcnt = pcnt;
>  
>  	t = arpt_get_target(e);
>  	target = xt_request_find_target(NFPROTO_ARP, t->u.user.name,
> @@ -439,7 +437,7 @@ find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
>  err:
>  	module_put(t->u.kernel.target->me);
>  out:
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  
>  	return ret;
>  }
> @@ -519,7 +517,7 @@ static inline void cleanup_entry(struct arpt_entry *e)
>  	if (par.target->destroy != NULL)
>  		par.target->destroy(&par);
>  	module_put(par.target->me);
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  }
>  
>  /* Checks and translates the user-supplied table segment (held in
> @@ -528,6 +526,7 @@ static inline void cleanup_entry(struct arpt_entry *e)
>  static int translate_table(struct xt_table_info *newinfo, void *entry0,
>  			   const struct arpt_replace *repl)
>  {
> +	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
>  	struct arpt_entry *iter;
>  	unsigned int *offsets;
>  	unsigned int i;
> @@ -590,7 +589,7 @@ static int translate_table(struct xt_table_info *newinfo, void *entry0,
>  	/* Finally, each sanity check must pass */
>  	i = 0;
>  	xt_entry_foreach(iter, entry0, newinfo->size) {
> -		ret = find_check_entry(iter, repl->name, repl->size);
> +		ret = find_check_entry(iter, repl->name, repl->size, &alloc_state);
>  		if (ret != 0)
>  			break;
>  		++i;
> diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
> index 46815c8a60d7..0024550516d1 100644
> --- a/net/ipv4/netfilter/ip_tables.c
> +++ b/net/ipv4/netfilter/ip_tables.c
> @@ -531,7 +531,8 @@ static int check_target(struct ipt_entry *e, struct net *net, const char *name)
>  
>  static int
>  find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
> -		 unsigned int size)
> +		 unsigned int size,
> +		 struct xt_percpu_counter_alloc_state *alloc_state)
>  {
>  	struct xt_entry_target *t;
>  	struct xt_target *target;
> @@ -539,12 +540,9 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
>  	unsigned int j;
>  	struct xt_mtchk_param mtpar;
>  	struct xt_entry_match *ematch;
> -	unsigned long pcnt;
>  
> -	pcnt = xt_percpu_counter_alloc();
> -	if (IS_ERR_VALUE(pcnt))
> +	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
>  		return -ENOMEM;
> -	e->counters.pcnt = pcnt;
>  
>  	j = 0;
>  	mtpar.net	= net;
> @@ -582,7 +580,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
>  		cleanup_match(ematch, net);
>  	}
>  
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  
>  	return ret;
>  }
> @@ -670,7 +668,7 @@ cleanup_entry(struct ipt_entry *e, struct net *net)
>  	if (par.target->destroy != NULL)
>  		par.target->destroy(&par);
>  	module_put(par.target->me);
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  }
>  
>  /* Checks and translates the user-supplied table segment (held in
> @@ -679,6 +677,7 @@ static int
>  translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
>  		const struct ipt_replace *repl)
>  {
> +	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
>  	struct ipt_entry *iter;
>  	unsigned int *offsets;
>  	unsigned int i;
> @@ -738,7 +737,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
>  	/* Finally, each sanity check must pass */
>  	i = 0;
>  	xt_entry_foreach(iter, entry0, newinfo->size) {
> -		ret = find_check_entry(iter, net, repl->name, repl->size);
> +		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
>  		if (ret != 0)
>  			break;
>  		++i;
> diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
> index 6ff42b8301cc..123d9af6742e 100644
> --- a/net/ipv6/netfilter/ip6_tables.c
> +++ b/net/ipv6/netfilter/ip6_tables.c
> @@ -562,7 +562,8 @@ static int check_target(struct ip6t_entry *e, struct net *net, const char *name)
>  
>  static int
>  find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
> -		 unsigned int size)
> +		 unsigned int size,
> +		 struct xt_percpu_counter_alloc_state *alloc_state)
>  {
>  	struct xt_entry_target *t;
>  	struct xt_target *target;
> @@ -570,12 +571,9 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
>  	unsigned int j;
>  	struct xt_mtchk_param mtpar;
>  	struct xt_entry_match *ematch;
> -	unsigned long pcnt;
>  
> -	pcnt = xt_percpu_counter_alloc();
> -	if (IS_ERR_VALUE(pcnt))
> +	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
>  		return -ENOMEM;
> -	e->counters.pcnt = pcnt;
>  
>  	j = 0;
>  	mtpar.net	= net;
> @@ -612,7 +610,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
>  		cleanup_match(ematch, net);
>  	}
>  
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  
>  	return ret;
>  }
> @@ -699,8 +697,7 @@ static void cleanup_entry(struct ip6t_entry *e, struct net *net)
>  	if (par.target->destroy != NULL)
>  		par.target->destroy(&par);
>  	module_put(par.target->me);
> -
> -	xt_percpu_counter_free(e->counters.pcnt);
> +	xt_percpu_counter_free(&e->counters);
>  }
>  
>  /* Checks and translates the user-supplied table segment (held in
> @@ -709,6 +706,7 @@ static int
>  translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
>  		const struct ip6t_replace *repl)
>  {
> +	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
>  	struct ip6t_entry *iter;
>  	unsigned int *offsets;
>  	unsigned int i;
> @@ -768,7 +766,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
>  	/* Finally, each sanity check must pass */
>  	i = 0;
>  	xt_entry_foreach(iter, entry0, newinfo->size) {
> -		ret = find_check_entry(iter, net, repl->name, repl->size);
> +		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
>  		if (ret != 0)
>  			break;
>  		++i;
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index ad818e52859b..a4d1084b163f 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1615,6 +1615,59 @@ void xt_proto_fini(struct net *net, u_int8_t af)
>  }
>  EXPORT_SYMBOL_GPL(xt_proto_fini);
>  
> +/**
> + * xt_percpu_counter_alloc - allocate x_tables rule counter
> + *
> + * @state: pointer to xt_percpu allocation state
> + * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
> + *
> + * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
> + * contain the address of the real (percpu) counter.
> + *
> + * Rule evaluation needs to use xt_get_this_cpu_counter() helper
> + * to fetch the real percpu counter.
> + *
> + * To speed up allocation and improve data locality, an entire
> + * page is allocated.
> + *
> + * xt_percpu_counter_alloc_state contains the base address of the
> + * allocated page and the current sub-offset.
> + *
> + * returns false on error.
> + */
> +bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
> +			     struct xt_counters *counter)
> +{
> +	BUILD_BUG_ON(PAGE_SIZE < (sizeof(*counter) * 2));
> +
> +	if (nr_cpu_ids <= 1)
> +		return true;
> +
> +	if (state->mem == NULL) {
> +		state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
> +		if (!state->mem)
> +			return false;
> +	}
> +	counter->pcnt = (__force unsigned long)(state->mem + state->off);
> +	state->off += sizeof(*counter);
> +	if (state->off > (PAGE_SIZE - sizeof(*counter))) {
> +		state->mem = NULL;
> +		state->off = 0;
> +	}
> +
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
> +
> +void xt_percpu_counter_free(struct xt_counters *counters)
> +{
> +	unsigned long pcnt = counters->pcnt;
> +
> +	if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
> +		free_percpu((void __percpu *) (unsigned long)pcnt);
> +}
> +EXPORT_SYMBOL_GPL(xt_percpu_counter_free);
> +
>  static int __net_init xt_net_init(struct net *net)
>  {
>  	int i;

Eric Dumazet Nov. 20, 2016, 6:33 a.m. UTC | #3
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

> +	if (state->mem == NULL) {
> +		state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
> +		if (!state->mem)
> +			return false;
> +	}

This will fail on arches where PAGE_SIZE=65536

The percpu allocator limit is PCPU_MIN_UNIT_SIZE (32 KB).

So maybe use a smaller value, like 4096?

#define XT_PCPU_BLOC_SIZE 4096
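
For illustration only (not from the posted patch), a sketch of how the allocation
helper might look if it adopted this suggestion and used a fixed block size below
PCPU_MIN_UNIT_SIZE.  XT_PCPU_BLOCK_SIZE is a hypothetical spelling of the constant
proposed above; the free path would then have to mask with (XT_PCPU_BLOCK_SIZE - 1)
instead of (PAGE_SIZE - 1).

#define XT_PCPU_BLOCK_SIZE 4096

bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
			     struct xt_counters *counter)
{
	BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));

	if (nr_cpu_ids <= 1)
		return true;

	if (!state->mem) {
		/* 4 KB stays below PCPU_MIN_UNIT_SIZE on every arch,
		 * including those with 64 KB pages.
		 */
		state->mem = __alloc_percpu(XT_PCPU_BLOCK_SIZE,
					    XT_PCPU_BLOCK_SIZE);
		if (!state->mem)
			return false;
	}
	counter->pcnt = (__force unsigned long)(state->mem + state->off);
	state->off += sizeof(*counter);
	if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
		state->mem = NULL;
		state->off = 0;
	}
	return true;
}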


Eric Dumazet Nov. 20, 2016, 5:31 p.m. UTC | #4
On Sun, 2016-11-20 at 12:22 -0500, Eric D wrote:
> I'm currently abroad for work and will come back home soon. I will
> test the solution and provide feedback to Florian by end of week.
> 
> Thanks for jumping on this quickly.
> 
> Eric
> 
> 
> On Nov 20, 2016 7:33 AM, "Eric Dumazet" <eric.dumazet@gmail.com>
> wrote:
>         On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:
>         
>         > +     if (state->mem == NULL) {
>         > +             state->mem = __alloc_percpu(PAGE_SIZE,
>         PAGE_SIZE);
>         > +             if (!state->mem)
>         > +                     return false;
>         > +     }
>         
>         This will fail on arches where PAGE_SIZE=65536
>         
>         percpu allocator limit is PCPU_MIN_UNIT_SIZE  ( 32 KB )
>         
>         So maybe use a smaller value like 4096 ?
>         
>         #define XT_PCPU_BLOC_SIZE 4096
>         
Thanks Eric, I will test the patch myself, because I believe we need it
asap ;)




Eric Dumazet Nov. 20, 2016, 5:55 p.m. UTC | #5
On Sun, 2016-11-20 at 09:31 -0800, Eric Dumazet wrote:

> Thanks Eric, I will test the patch myself, because I believe we need it
> asap ;)


Current net-next without Florian's patch:

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real	0m12.856s
user	0m0.590s
sys	0m11.131s


perf record ...; perf report ->

    47.45%  iptables  [kernel.kallsyms]  [k] pcpu_alloc_area                      
     8.49%  iptables  [kernel.kallsyms]  [k] memset_erms                          
     7.35%  iptables  [kernel.kallsyms]  [k] get_counters                         
     2.87%  iptables  [kernel.kallsyms]  [k] __memmove                            
     2.33%  iptables  [kernel.kallsyms]  [k] pcpu_alloc                           
     2.07%  iptables  [kernel.kallsyms]  [k] _find_next_bit.part.0                
     1.62%  iptables  xtables-multi      [.] 0x000000000001bb9d                   
     1.25%  iptables  [kernel.kallsyms]  [k] page_fault                           
     1.01%  iptables  [kernel.kallsyms]  [k] memcmp                               
     0.94%  iptables  [kernel.kallsyms]  [k] translate_table                      
     0.76%  iptables  [kernel.kallsyms]  [k] find_next_bit                        
     0.73%  iptables  [kernel.kallsyms]  [k] filemap_map_pages                    
     0.68%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string       
     0.54%  iptables  [kernel.kallsyms]  [k] __get_user_8                         
     0.54%  iptables  [kernel.kallsyms]  [k] clear_page_c_e                

After the patch:

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real	0m3.867s
user	0m0.559s
sys	0m2.216s

    22.15%  iptables  [kernel.kallsyms]  [k] get_counters                           
     5.85%  iptables  xtables-multi      [.] 0x000000000001bbac                     
     3.99%  iptables  [kernel.kallsyms]  [k] page_fault                             
     2.37%  iptables  [kernel.kallsyms]  [k] memcmp                                 
     2.19%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string         
     1.89%  iptables  [kernel.kallsyms]  [k] translate_table                        
     1.78%  iptables  [kernel.kallsyms]  [k] memset_erms                            
     1.74%  iptables  [kernel.kallsyms]  [k] clear_page_c_e                         
     1.73%  iptables  [kernel.kallsyms]  [k] __get_user_8                           
     1.72%  iptables  [kernel.kallsyms]  [k] perf_iterate_ctx                       
     1.21%  iptables  [kernel.kallsyms]  [k] handle_mm_fault                        
     0.98%  iptables  [kernel.kallsyms]  [k] unmap_page_range          

So this is a huge win. And I suspect the data path will also gain from all
pcpu counters being in the same area of memory (this is where I am very interested).



Patch

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -403,38 +403,14 @@  static inline unsigned long ifname_compare_aligned(const char *_a,
 	return ret;
 }
 
+struct xt_percpu_counter_alloc_state {
+	unsigned int off;
+	const char *mem;
+};
 
-/* On SMP, ip(6)t_entry->counters.pcnt holds address of the
- * real (percpu) counter.  On !SMP, its just the packet count,
- * so nothing needs to be done there.
- *
- * xt_percpu_counter_alloc returns the address of the percpu
- * counter, or 0 on !SMP. We force an alignment of 16 bytes
- * so that bytes/packets share a common cache line.
- *
- * Hence caller must use IS_ERR_VALUE to check for error, this
- * allows us to return 0 for single core systems without forcing
- * callers to deal with SMP vs. NONSMP issues.
- */
-static inline unsigned long xt_percpu_counter_alloc(void)
-{
-	if (nr_cpu_ids > 1) {
-		void __percpu *res = __alloc_percpu(sizeof(struct xt_counters),
-						    sizeof(struct xt_counters));
-
-		if (res == NULL)
-			return -ENOMEM;
-
-		return (__force unsigned long) res;
-	}
-
-	return 0;
-}
-static inline void xt_percpu_counter_free(u64 pcnt)
-{
-	if (nr_cpu_ids > 1)
-		free_percpu((void __percpu *) (unsigned long) pcnt);
-}
+bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
+			     struct xt_counters *counter);
+void xt_percpu_counter_free(struct xt_counters *cnt);
 
 static inline struct xt_counters *
 xt_get_this_cpu_counter(struct xt_counters *cnt)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 39004da318e2..cbea0cb030da 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -411,17 +411,15 @@  static inline int check_target(struct arpt_entry *e, const char *name)
 }
 
 static inline int
-find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
+find_check_entry(struct arpt_entry *e, const char *name, unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
-	unsigned long pcnt;
 	int ret;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	t = arpt_get_target(e);
 	target = xt_request_find_target(NFPROTO_ARP, t->u.user.name,
@@ -439,7 +437,7 @@  find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
 err:
 	module_put(t->u.kernel.target->me);
 out:
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -519,7 +517,7 @@  static inline void cleanup_entry(struct arpt_entry *e)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -528,6 +526,7 @@  static inline void cleanup_entry(struct arpt_entry *e)
 static int translate_table(struct xt_table_info *newinfo, void *entry0,
 			   const struct arpt_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct arpt_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -590,7 +589,7 @@  static int translate_table(struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, repl->name, repl->size);
+		ret = find_check_entry(iter, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 46815c8a60d7..0024550516d1 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -531,7 +531,8 @@  static int check_target(struct ipt_entry *e, struct net *net, const char *name)
 
 static int
 find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
-		 unsigned int size)
+		 unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
@@ -539,12 +540,9 @@  find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
 	unsigned int j;
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
-	unsigned long pcnt;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	j = 0;
 	mtpar.net	= net;
@@ -582,7 +580,7 @@  find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
 		cleanup_match(ematch, net);
 	}
 
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -670,7 +668,7 @@  cleanup_entry(struct ipt_entry *e, struct net *net)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -679,6 +677,7 @@  static int
 translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 		const struct ipt_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct ipt_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -738,7 +737,7 @@  translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, net, repl->name, repl->size);
+		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 6ff42b8301cc..123d9af6742e 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -562,7 +562,8 @@  static int check_target(struct ip6t_entry *e, struct net *net, const char *name)
 
 static int
 find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
-		 unsigned int size)
+		 unsigned int size,
+		 struct xt_percpu_counter_alloc_state *alloc_state)
 {
 	struct xt_entry_target *t;
 	struct xt_target *target;
@@ -570,12 +571,9 @@  find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
 	unsigned int j;
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
-	unsigned long pcnt;
 
-	pcnt = xt_percpu_counter_alloc();
-	if (IS_ERR_VALUE(pcnt))
+	if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
 		return -ENOMEM;
-	e->counters.pcnt = pcnt;
 
 	j = 0;
 	mtpar.net	= net;
@@ -612,7 +610,7 @@  find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
 		cleanup_match(ematch, net);
 	}
 
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 
 	return ret;
 }
@@ -699,8 +697,7 @@  static void cleanup_entry(struct ip6t_entry *e, struct net *net)
 	if (par.target->destroy != NULL)
 		par.target->destroy(&par);
 	module_put(par.target->me);
-
-	xt_percpu_counter_free(e->counters.pcnt);
+	xt_percpu_counter_free(&e->counters);
 }
 
 /* Checks and translates the user-supplied table segment (held in
@@ -709,6 +706,7 @@  static int
 translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 		const struct ip6t_replace *repl)
 {
+	struct xt_percpu_counter_alloc_state alloc_state = { 0 };
 	struct ip6t_entry *iter;
 	unsigned int *offsets;
 	unsigned int i;
@@ -768,7 +766,7 @@  translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0,
 	/* Finally, each sanity check must pass */
 	i = 0;
 	xt_entry_foreach(iter, entry0, newinfo->size) {
-		ret = find_check_entry(iter, net, repl->name, repl->size);
+		ret = find_check_entry(iter, net, repl->name, repl->size, &alloc_state);
 		if (ret != 0)
 			break;
 		++i;
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index ad818e52859b..a4d1084b163f 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1615,6 +1615,59 @@  void xt_proto_fini(struct net *net, u_int8_t af)
 }
 EXPORT_SYMBOL_GPL(xt_proto_fini);
 
+/**
+ * xt_percpu_counter_alloc - allocate x_tables rule counter
+ *
+ * @state: pointer to xt_percpu allocation state
+ * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
+ *
+ * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
+ * contain the address of the real (percpu) counter.
+ *
+ * Rule evaluation needs to use xt_get_this_cpu_counter() helper
+ * to fetch the real percpu counter.
+ *
+ * To speed up allocation and improve data locality, an entire
+ * page is allocated.
+ *
+ * xt_percpu_counter_alloc_state contains the base address of the
+ * allocated page and the current sub-offset.
+ *
+ * returns false on error.
+ */
+bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
+			     struct xt_counters *counter)
+{
+	BUILD_BUG_ON(PAGE_SIZE < (sizeof(*counter) * 2));
+
+	if (nr_cpu_ids <= 1)
+		return true;
+
+	if (state->mem == NULL) {
+		state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+		if (!state->mem)
+			return false;
+	}
+	counter->pcnt = (__force unsigned long)(state->mem + state->off);
+	state->off += sizeof(*counter);
+	if (state->off > (PAGE_SIZE - sizeof(*counter))) {
+		state->mem = NULL;
+		state->off = 0;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
+
+void xt_percpu_counter_free(struct xt_counters *counters)
+{
+	unsigned long pcnt = counters->pcnt;
+
+	if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
+		free_percpu((void __percpu *) (unsigned long)pcnt);
+}
+EXPORT_SYMBOL_GPL(xt_percpu_counter_free);
+
 static int __net_init xt_net_init(struct net *net)
 {
 	int i;