diff mbox series

[RFC,2/4] net/fib: Provide fib_balance_budget sysctl

Message ID 20190326153026.24493-3-dima@arista.com
State RFC
Delegated to: David Miller
Headers show
Series net/fib: Speed up trie rebalancing for full view | expand

Commit Message

Dmitry Safonov March 26, 2019, 3:30 p.m. UTC
Unfortunately, MAX_WORK at this moment is broken: during
trie_rebalance() climbing upwards resize() is being called for each
tnode. Which makes the limit useless the bigger trie gets: at each level
there are 10 attempts to balance tnode and childs, resulting in
O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
the level where we changed the trie originally (adding/removing
alias/route).

Which results in the following price of removing one route under big
trie (similar for some other single routes that results in reallocating
tnodes by hitting the threshold limits for halve/inflate resize):

Before:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     1.99
           Max depth:      2
           Leaves:         77825
           Prefixes:       77825
           Internal nodes: 77
             9: 1  10: 76
           Pointers: 78336
   Null ptrs: 435
   Total size: 7912  kB
   Local:
   [omitted]

After:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     3.00
           Max depth:      3
           Leaves:         77824
           Prefixes:       77824
           Internal nodes: 20491
             1: 2048  2: 18432  6: 1  11: 10
           Pointers: 98368
   Null ptrs: 54
   Total size: 8865  kB
   Local:
   [omitted]

Provide a sysctl to control amount of pending balancing work.
(by default unlimited as it was)

Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
into inflate/halve")
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 Documentation/networking/ip-sysctl.txt |  6 +++
 include/net/ip.h                       |  1 +
 net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 4 files changed, 49 insertions(+), 25 deletions(-)

Comments

Alexander Duyck April 1, 2019, 6:09 p.m. UTC | #1
On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
> Unfortunately, MAX_WORK at this moment is broken: during
> trie_rebalance() climbing upwards resize() is being called for each
> tnode. Which makes the limit useless the bigger trie gets: at each level
> there are 10 attempts to balance tnode and childs, resulting in
> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
> the level where we changed the trie originally (adding/removing
> alias/route).
> 
> Which results in the following price of removing one route under big
> trie (similar for some other single routes that results in reallocating
> tnodes by hitting the threshold limits for halve/inflate resize):

The info below doesn't really tell us the price. It might be useful if
you could just time the removal of that one route and provide that
information.

> Before:
>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>    Main:
>            Aver depth:     1.99
>            Max depth:      2
>            Leaves:         77825
>            Prefixes:       77825
>            Internal nodes: 77
>              9: 1  10: 76
>            Pointers: 78336
>    Null ptrs: 435
>    Total size: 7912  kB
>    Local:
>    [omitted]
> 
> After:
>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>    Main:
>            Aver depth:     3.00
>            Max depth:      3
>            Leaves:         77824
>            Prefixes:       77824
>            Internal nodes: 20491
>              1: 2048  2: 18432  6: 1  11: 10
>            Pointers: 98368
>    Null ptrs: 54
>    Total size: 8865  kB
>    Local:
>    [omitted]
> 

Is this before/after the patch, or just before/after the removal of the
one route? I assume you are are just talking about the one route since
the number of prefixes has dropped by 1.

> Provide a sysctl to control amount of pending balancing work.
> (by default unlimited as it was)
> 
> Cc: linux-doc@vger.kernel.org
> Cc: Jonathan Corbet <corbet@lwn.net>
> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
> into inflate/halve")
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
>  Documentation/networking/ip-sysctl.txt |  6 +++
>  include/net/ip.h                       |  1 +
>  net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
>  net/ipv4/sysctl_net_ipv4.c             |  7 +++
>  4 files changed, 49 insertions(+), 25 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index acdfb5d2bcaa..fb71dacff4dd 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>  	0 - Layer 3
>  	1 - Layer 4
>  
> +fib_balance_budget - UNSIGNED INTEGER
> +	Limits the number of resize attempts during balancing fib trie
> +	on adding/removing new routes.
> +	Possible values:
> +	Default: UINT_MAX (0xFFFFFFFF)
> +

So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
much rather see this be a signed integer and the test be for the value
going negative instead of having to test in so many places if the value
is 0.

>  ip_forward_update_priority - INTEGER
>  	Whether to update SKB priority from "TOS" field in IPv4 header after it
>  	is forwarded. The new SKB priority is mapped from TOS field value
> diff --git a/include/net/ip.h b/include/net/ip.h
> index be3cad9c2e4c..305d0e43088b 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
>  	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
>  }
>  
> +extern unsigned int fib_balance_budget;
>  struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
>  					int fc_mx_len,
>  					struct netlink_ext_ack *extack);
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index ad7d56c421cb..d90cf9dfd443 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -182,8 +182,10 @@ struct trie {
>  #endif
>  };
>  
> -static struct key_vector *resize(struct trie *t, struct key_vector *tn);
> +static struct key_vector *resize(struct trie *t, struct key_vector *tn,
> +				 unsigned int *budget);
>  static size_t tnode_free_size;
> +unsigned int fib_balance_budget = UINT_MAX;
>  
>  /*
>   * synchronize_rcu after call_rcu for that many pages; it should be especially
> @@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn)
>  
>  static struct key_vector *replace(struct trie *t,
>  				  struct key_vector *oldtnode,
> -				  struct key_vector *tn)
> +				  struct key_vector *tn,
> +				  unsigned int *budget)
>  {
>  	struct key_vector *tp = node_parent(oldtnode);
>  	unsigned long i;
> @@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t,
>  	tnode_free(oldtnode);
>  
>  	/* resize children now that oldtnode is freed */
> -	for (i = child_length(tn); i;) {
> +	for (i = child_length(tn); i && *budget;) {

So this is an example of the kind of stuff I am talking about. Having a
check for budget here isn't going to add much in the way of value. I
would much rather keep the logic contained within resize, and just do a
check for *budget >= 0.

>  		struct key_vector *inode = get_child(tn, --i);
>  
>  		/* resize child node */
>  		if (tnode_full(tn, inode))
> -			tn = resize(t, inode);
> +			tn = resize(t, inode, budget);
>  	}
>  
>  	return tp;
>  }
>  
> -static struct key_vector *inflate(struct trie *t,
> -				  struct key_vector *oldtnode)
> +static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode,
> +				  unsigned int *budget)
>  {
>  	struct key_vector *tn;
>  	unsigned long i;
> @@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t,
>  	}
>  
>  	/* setup the parent pointers into and out of this node */
> -	return replace(t, oldtnode, tn);
> +	return replace(t, oldtnode, tn, budget);
>  nomem:
>  	/* all pointers should be clean so we are done */
>  	tnode_free(tn);
> @@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t,
>  	return NULL;
>  }
>  
> -static struct key_vector *halve(struct trie *t,
> -				struct key_vector *oldtnode)
> +static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode,
> +				unsigned int *budget)
>  {
>  	struct key_vector *tn;
>  	unsigned long i;
> @@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t,
>  	}
>  
>  	/* setup the parent pointers into and out of this node */
> -	return replace(t, oldtnode, tn);
> +	return replace(t, oldtnode, tn, budget);
>  nomem:
>  	/* all pointers should be clean so we are done */
>  	tnode_free(tn);
> @@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn)
>  	return used < 2;
>  }
>  
> -#define MAX_WORK 10
> -static struct key_vector *resize(struct trie *t, struct key_vector *tn)
> +static struct key_vector *resize(struct trie *t, struct key_vector *tn,
> +				 unsigned int *budget)
>  {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  	struct trie_use_stats __percpu *stats = t->stats;
>  #endif
>  	struct key_vector *tp = node_parent(tn);
>  	unsigned long cindex = get_index(tn->key, tp);
> -	int max_work = MAX_WORK;
> +	bool inflated = false;
>  
>  	pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n",
>  		 tn, inflate_threshold, halve_threshold);
> @@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  	/* Double as long as the resulting node has a number of
>  	 * nonempty nodes that are above the threshold.
>  	 */
> -	while (should_inflate(tp, tn) && max_work) {
> -		tp = inflate(t, tn);
> +	while (should_inflate(tp, tn) && *budget) {
> +		tp = inflate(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  			this_cpu_inc(stats->resize_node_skipped);
> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  			break;
>  		}
>  
> -		max_work--;
> +		(*budget)--;

One thing we might explore here would be to look at pulling out the
check of budget at the start of the loop and instead combine the
decrement with the check for < 0 here so we have something like:
		if (--*budget < 0)
			break;

That way what should happen is that we will achieve maximum depth in
any resize and make at least one pass optimizing from the bottom up
instead of the top down.

> +		inflated = true;
>  		tn = get_child(tp, cindex);
>  	}
>  
>  	/* update parent in case inflate failed */
>  	tp = node_parent(tn);
>  
> -	/* Return if at least one inflate is run */
> -	if (max_work != MAX_WORK)
> +	/* Return if at least one inflate is run:
> +	 * microoptimization to not recalculate thresholds
> +	 */
> +	if (inflated)
>  		return tp;

An alternative to having to add another variable would be to look at
switching the loop to a combination of an if and a do/while like so:
	if (should_inflate(tp, tn)) {
		do {
			...
		} while (should_inflate(tp, tn));
	
		/* update parent in case inflate failed */
		return node_parent(tn);
	}
		

>  	/* Halve as long as the number of empty children in this
>  	 * node is above threshold.
>  	 */
> -	while (should_halve(tp, tn) && max_work) {
> -		tp = halve(t, tn);
> +	while (should_halve(tp, tn) && *budget) {
> +		tp = halve(t, tn, budget);
>  		if (!tp) {
>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>  			this_cpu_inc(stats->resize_node_skipped);
> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>  			break;
>  		}
>  
> -		max_work--;
> +		(*budget)--;

Same here. It might make sense to pull the budget check out of the
start of the loop and move it here so that it becomes a depth first
optimization instead of being a shallow one.

>  		tn = get_child(tp, cindex);
>  	}
>  
> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>  
>  static void trie_rebalance(struct trie *t, struct key_vector *tn)
>  {
> -	while (!IS_TRIE(tn))
> -		tn = resize(t, tn);
> +	unsigned int budget = fib_balance_budget;
> +
> +	while (budget && !IS_TRIE(tn))
> +		tn = resize(t, tn, &budget);

I would not have the budget check here. At a minimum we should be
replacing trie nodes with leaves in the case that we are down to one
leaf node in the trie.

>  }
>  
>  static int fib_insert_node(struct trie *t, struct key_vector *tp,
> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>  void fib_table_flush_external(struct fib_table *tb)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	unsigned int budget = fib_balance_budget;
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  				update_suffix(pn);
>  
>  			/* resize completed node */
> -			pn = resize(t, pn);
> +			pn = resize(t, pn, &budget);
>  			cindex = get_index(pkey, pn);
>  
>  			continue;

So we would probably want a completely different budget here, or at
least we would want the budget to be multiplied by the leaves we are
removing. The problem is we aren't pulling single addresses, but are
literally splitting the tree into two separate tries. The same budget
isn't going to work for both cases.

> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  {
>  	struct trie *t = (struct trie *)tb->tb_data;
> +	unsigned int budget = fib_balance_budget;
>  	struct key_vector *pn = t->kv;
>  	unsigned long cindex = 1;
>  	struct hlist_node *tmp;
> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>  				update_suffix(pn);
>  
>  			/* resize completed node */
> -			pn = resize(t, pn);
> +			pn = resize(t, pn, &budget);
>  			cindex = get_index(pkey, pn);
>  
>  			continue;

Similar issues here, though I don't know what the typical amount of
change is we expect to see from a fib_table_flush call versus something
like the fib_table_flush_external call.

> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index ba0fc4b18465..d7274cc442af 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "fib_balance_budget",
> +		.data		= &fib_balance_budget,
> +		.maxlen		= sizeof(fib_balance_budget),
> +		.mode		= 0644,
> +		.proc_handler	= proc_douintvec,
> +	},
>  	{
>  		.procname	= "inet_peer_threshold",
>  		.data		= &inet_peer_threshold,
Dmitry Safonov April 4, 2019, 6:31 p.m. UTC | #2
On 4/1/19 7:09 PM, Alexander Duyck wrote:
> On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote:
>> Unfortunately, MAX_WORK at this moment is broken: during
>> trie_rebalance() climbing upwards resize() is being called for each
>> tnode. Which makes the limit useless the bigger trie gets: at each level
>> there are 10 attempts to balance tnode and childs, resulting in
>> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is
>> the level where we changed the trie originally (adding/removing
>> alias/route).
>>
>> Which results in the following price of removing one route under big
>> trie (similar for some other single routes that results in reallocating
>> tnodes by hitting the threshold limits for halve/inflate resize):
> 
> The info below doesn't really tell us the price. It might be useful if
> you could just time the removal of that one route and provide that
> information.

Yes, well, that was an attempt to point that MAX_WORK in the current
state doesn't limit much complexity to balance trie.

So, most of the time (up to a couple of seconds) is spent on
synchronise_rcu() (which was addressed by 4/4 and a patch from David
Ahern in -next).

But I thought fixing max_work is also worth attention, regardless the
fix for synchronising rcu. And playing with the sysctl limit that is
provided by the patch, I've found out that even with 4/4 applied one can
tune this sysctl knob the way that results in dividing the work between
consecutive route manipulations. So that on a 4-core switch the time
spent to add a route in maximum was more than a second and adjusting
work limit can bring it down to 300msec in max. Though, the total time
spent to add the whole route table was the same.

I will provide more details for v2.

> 
>> Before:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     1.99
>>            Max depth:      2
>>            Leaves:         77825
>>            Prefixes:       77825
>>            Internal nodes: 77
>>              9: 1  10: 76
>>            Pointers: 78336
>>    Null ptrs: 435
>>    Total size: 7912  kB
>>    Local:
>>    [omitted]
>>
>> After:
>>    Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
>>    Main:
>>            Aver depth:     3.00
>>            Max depth:      3
>>            Leaves:         77824
>>            Prefixes:       77824
>>            Internal nodes: 20491
>>              1: 2048  2: 18432  6: 1  11: 10
>>            Pointers: 98368
>>    Null ptrs: 54
>>    Total size: 8865  kB
>>    Local:
>>    [omitted]
>>
> 
> Is this before/after the patch, or just before/after the removal of the
> one route? I assume you are are just talking about the one route since
> the number of prefixes has dropped by 1.

Yes, just for removing one route. Should have mentioned that explicitly,
sorry.

> 
>> Provide a sysctl to control amount of pending balancing work.
>> (by default unlimited as it was)
>>
>> Cc: linux-doc@vger.kernel.org
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down
>> into inflate/halve")
>> Signed-off-by: Dmitry Safonov <dima@arista.com>
>> ---
>>  Documentation/networking/ip-sysctl.txt |  6 +++
>>  include/net/ip.h                       |  1 +
>>  net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
>>  net/ipv4/sysctl_net_ipv4.c             |  7 +++
>>  4 files changed, 49 insertions(+), 25 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index acdfb5d2bcaa..fb71dacff4dd 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
>>  	0 - Layer 3
>>  	1 - Layer 4
>>  
>> +fib_balance_budget - UNSIGNED INTEGER
>> +	Limits the number of resize attempts during balancing fib trie
>> +	on adding/removing new routes.
>> +	Possible values:
>> +	Default: UINT_MAX (0xFFFFFFFF)
>> +
> 
> So I am not really a fan of the use of UNSIGNED_INTEGER here. I would
> much rather see this be a signed integer and the test be for the value
> going negative instead of having to test in so many places if the value
> is 0.

Ok, sounds good to me - will change the type for v2.

[..]
>> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> One thing we might explore here would be to look at pulling out the
> check of budget at the start of the loop and instead combine the
> decrement with the check for < 0 here so we have something like:
> 		if (--*budget < 0)
> 			break;
> 
> That way what should happen is that we will achieve maximum depth in
> any resize and make at least one pass optimizing from the bottom up
> instead of the top down.

Yeah, sounds reasonable - will do this way and retest.

> 
>> +		inflated = true;
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>>  	/* update parent in case inflate failed */
>>  	tp = node_parent(tn);
>>  
>> -	/* Return if at least one inflate is run */
>> -	if (max_work != MAX_WORK)
>> +	/* Return if at least one inflate is run:
>> +	 * microoptimization to not recalculate thresholds
>> +	 */
>> +	if (inflated)
>>  		return tp;
> 
> An alternative to having to add another variable would be to look at
> switching the loop to a combination of an if and a do/while like so:
> 	if (should_inflate(tp, tn)) {
> 		do {
> 			...
> 		} while (should_inflate(tp, tn));
> 	
> 		/* update parent in case inflate failed */
> 		return node_parent(tn);
> 	}

That's neat! Will drop the variable.

> 		
> 
>>  	/* Halve as long as the number of empty children in this
>>  	 * node is above threshold.
>>  	 */
>> -	while (should_halve(tp, tn) && max_work) {
>> -		tp = halve(t, tn);
>> +	while (should_halve(tp, tn) && *budget) {
>> +		tp = halve(t, tn, budget);
>>  		if (!tp) {
>>  #ifdef CONFIG_IP_FIB_TRIE_STATS
>>  			this_cpu_inc(stats->resize_node_skipped);
>> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>>  			break;
>>  		}
>>  
>> -		max_work--;
>> +		(*budget)--;
> 
> Same here. It might make sense to pull the budget check out of the
> start of the loop and move it here so that it becomes a depth first
> optimization instead of being a shallow one.

Makes sense, will rework.

> 
>>  		tn = get_child(tp, cindex);
>>  	}
>>  
>> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
>>  
>>  static void trie_rebalance(struct trie *t, struct key_vector *tn)
>>  {
>> -	while (!IS_TRIE(tn))
>> -		tn = resize(t, tn);
>> +	unsigned int budget = fib_balance_budget;
>> +
>> +	while (budget && !IS_TRIE(tn))
>> +		tn = resize(t, tn, &budget);
> 
> I would not have the budget check here. At a minimum we should be
> replacing trie nodes with leaves in the case that we are down to one
> leaf node in the trie.

Yeah, I see.

> 
>>  }
>>  
>>  static int fib_insert_node(struct trie *t, struct key_vector *tp,
>> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
>>  void fib_table_flush_external(struct fib_table *tb)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> So we would probably want a completely different budget here, or at
> least we would want the budget to be multiplied by the leaves we are
> removing. The problem is we aren't pulling single addresses, but are
> literally splitting the tree into two separate tries. The same budget
> isn't going to work for both cases.

Which will multiply the work (and the time spent).
Hmm, well, I guess we can multiply here and relax rtnl_mutex pressure if
needed by introducing fib_lock. Does that sounds good?

> 
>> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
>>  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  {
>>  	struct trie *t = (struct trie *)tb->tb_data;
>> +	unsigned int budget = fib_balance_budget;
>>  	struct key_vector *pn = t->kv;
>>  	unsigned long cindex = 1;
>>  	struct hlist_node *tmp;
>> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
>>  				update_suffix(pn);
>>  
>>  			/* resize completed node */
>> -			pn = resize(t, pn);
>> +			pn = resize(t, pn, &budget);
>>  			cindex = get_index(pkey, pn);
>>  
>>  			continue;
> 
> Similar issues here, though I don't know what the typical amount of
> change is we expect to see from a fib_table_flush call versus something
> like the fib_table_flush_external call.

Yes. Well, I was also curious if it's possible to avoid resize() call
under (flush_all == true)?
So, that exiting a namespace wouldn't result in rebalancing a possibly
huge routing trie.

Thanks for your notes and review,
          Dmitry
diff mbox series

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index acdfb5d2bcaa..fb71dacff4dd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -81,6 +81,12 @@  fib_multipath_hash_policy - INTEGER
 	0 - Layer 3
 	1 - Layer 4
 
+fib_balance_budget - UNSIGNED INTEGER
+	Limits the number of resize attempts during balancing fib trie
+	on adding/removing new routes.
+	Possible values:
+	Default: UINT_MAX (0xFFFFFFFF)
+
 ip_forward_update_priority - INTEGER
 	Whether to update SKB priority from "TOS" field in IPv4 header after it
 	is forwarded. The new SKB priority is mapped from TOS field value
diff --git a/include/net/ip.h b/include/net/ip.h
index be3cad9c2e4c..305d0e43088b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -421,6 +421,7 @@  static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
 }
 
+extern unsigned int fib_balance_budget;
 struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
 					int fc_mx_len,
 					struct netlink_ext_ack *extack);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index ad7d56c421cb..d90cf9dfd443 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -182,8 +182,10 @@  struct trie {
 #endif
 };
 
-static struct key_vector *resize(struct trie *t, struct key_vector *tn);
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget);
 static size_t tnode_free_size;
+unsigned int fib_balance_budget = UINT_MAX;
 
 /*
  * synchronize_rcu after call_rcu for that many pages; it should be especially
@@ -506,7 +508,8 @@  static void tnode_free(struct key_vector *tn)
 
 static struct key_vector *replace(struct trie *t,
 				  struct key_vector *oldtnode,
-				  struct key_vector *tn)
+				  struct key_vector *tn,
+				  unsigned int *budget)
 {
 	struct key_vector *tp = node_parent(oldtnode);
 	unsigned long i;
@@ -522,19 +525,19 @@  static struct key_vector *replace(struct trie *t,
 	tnode_free(oldtnode);
 
 	/* resize children now that oldtnode is freed */
-	for (i = child_length(tn); i;) {
+	for (i = child_length(tn); i && *budget;) {
 		struct key_vector *inode = get_child(tn, --i);
 
 		/* resize child node */
 		if (tnode_full(tn, inode))
-			tn = resize(t, inode);
+			tn = resize(t, inode, budget);
 	}
 
 	return tp;
 }
 
-static struct key_vector *inflate(struct trie *t,
-				  struct key_vector *oldtnode)
+static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode,
+				  unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -621,7 +624,7 @@  static struct key_vector *inflate(struct trie *t,
 	}
 
 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -629,8 +632,8 @@  static struct key_vector *inflate(struct trie *t,
 	return NULL;
 }
 
-static struct key_vector *halve(struct trie *t,
-				struct key_vector *oldtnode)
+static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode,
+				unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -676,7 +679,7 @@  static struct key_vector *halve(struct trie *t,
 	}
 
 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -843,15 +846,15 @@  static inline bool should_collapse(struct key_vector *tn)
 	return used < 2;
 }
 
-#define MAX_WORK 10
-static struct key_vector *resize(struct trie *t, struct key_vector *tn)
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget)
 {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 	struct trie_use_stats __percpu *stats = t->stats;
 #endif
 	struct key_vector *tp = node_parent(tn);
 	unsigned long cindex = get_index(tn->key, tp);
-	int max_work = MAX_WORK;
+	bool inflated = false;
 
 	pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n",
 		 tn, inflate_threshold, halve_threshold);
@@ -865,8 +868,8 @@  static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	/* Double as long as the resulting node has a number of
 	 * nonempty nodes that are above the threshold.
 	 */
-	while (should_inflate(tp, tn) && max_work) {
-		tp = inflate(t, tn);
+	while (should_inflate(tp, tn) && *budget) {
+		tp = inflate(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -874,22 +877,25 @@  static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}
 
-		max_work--;
+		(*budget)--;
+		inflated = true;
 		tn = get_child(tp, cindex);
 	}
 
 	/* update parent in case inflate failed */
 	tp = node_parent(tn);
 
-	/* Return if at least one inflate is run */
-	if (max_work != MAX_WORK)
+	/* Return if at least one inflate is run:
+	 * microoptimization to not recalculate thresholds
+	 */
+	if (inflated)
 		return tp;
 
 	/* Halve as long as the number of empty children in this
 	 * node is above threshold.
 	 */
-	while (should_halve(tp, tn) && max_work) {
-		tp = halve(t, tn);
+	while (should_halve(tp, tn) && *budget) {
+		tp = halve(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -897,7 +903,7 @@  static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}
 
-		max_work--;
+		(*budget)--;
 		tn = get_child(tp, cindex);
 	}
 
@@ -1005,8 +1011,10 @@  static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
 
 static void trie_rebalance(struct trie *t, struct key_vector *tn)
 {
-	while (!IS_TRIE(tn))
-		tn = resize(t, tn);
+	unsigned int budget = fib_balance_budget;
+
+	while (budget && !IS_TRIE(tn))
+		tn = resize(t, tn, &budget);
 }
 
 static int fib_insert_node(struct trie *t, struct key_vector *tp,
@@ -1784,6 +1792,7 @@  struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
 void fib_table_flush_external(struct fib_table *tb)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1806,7 +1815,7 @@  void fib_table_flush_external(struct fib_table *tb)
 				update_suffix(pn);
 
 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);
 
 			continue;
@@ -1853,6 +1862,7 @@  void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1876,7 +1886,7 @@  int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 				update_suffix(pn);
 
 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);
 
 			continue;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ba0fc4b18465..d7274cc442af 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -443,6 +443,13 @@  static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "fib_balance_budget",
+		.data		= &fib_balance_budget,
+		.maxlen		= sizeof(fib_balance_budget),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
 	{
 		.procname	= "inet_peer_threshold",
 		.data		= &inet_peer_threshold,