rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits

Message ID	20090626151051.GA2714@ami.dom.local
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=RkPZ7qhRKdG5vhvuyuwzY1j+EpM9k9WEKRSBSwr2/O1QZ3M0QJqx4YigLx87r2y+Lt oj0s83VduZIGxEswUZVD6LWfL3Mwq0ItithUqExxDoGI4N4NDBH3xaU0LbAjFQdZTn60 ESjD96ATHlJ6sz1Xz5pQ6pgm1uMKDm/uha6lQ= Date: Fri, 26 Jun 2009 17:10:52 +0200 From: Jarek Poplawski <jarkao2@gmail.com> To: Robert Olsson <robert@robur.slu.se> Cc: Robert Olsson <Robert.Olsson@data.slu.se>, Eric Dumazet <dada1@cosmosbay.com>, =?us-ascii?B?PT9JU08tODg1OS0yP1E/UGF3ZT1CM19TdGFzemV3c2tpPz0=?= <pstaszewski@itcare.pl>, Robert Olsson <robert.olsson@its.uu.se>, Linux Network Development list <netdev@vger.kernel.org> Subject: Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Message-ID: <20090626151051.GA2714@ami.dom.local> References: <4A43E9F1.90209@cosmosbay.com> <20090626080302.GB5204@ff.dom.local> <19012.37515.146191.198843@robur.slu.se> <20090626093727.GA6832@ff.dom.local> <19012.49700.908412.410984@robur.slu.se> <20090626125449.GA8897@ff.dom.local> <20090626132820.GB8897@ff.dom.local> <19012.53943.734747.493480@robur.slu.se> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <19012.53943.734747.493480@robur.slu.se> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: netdev-owner@vger.kernel.org Precedence: bulk

Jarek Poplawski June 26, 2009, 3:10 p.m. UTC

On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  Thanks, 
> 
>  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
>  

Alas take 2 (nor 1) doesn't compile, so here it is again.

Thanks,
Jarek P.
--- (take 3 - for testing)

 net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paul E. McKenney June 26, 2009, 3:30 p.m. UTC | #1

On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > 
> > Jarek Poplawski writes:
> > 
> >  Thanks, 
> > 
> >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> >  
> 
> Alas take 2 (nor 1) doesn't compile, so here it is again.

So the idea is to balance memory and latency, so that large changes
(those affecting the root node) get at least one synchronize_rcu(),
while smaller changes just use call_rcu(), correct?  This means that
the amount of memory awaiting an RCU grace period is limited, but
the algorithm avoids per-node synchronize_rcu() overhead.

If I understand the goal correctly, looks good!  (Give or take my
limited understanding of fib_trie and is usage, of course.)

							Thanx, Paul

> Thanks,
> Jarek P.
> --- (take 3 - for testing)
> 
>  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
>  1 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1a4c4b7 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
>  	vfree(tn);
>  }
> 
> +static void __tnode_free(struct tnode *tn)
> +{
> +	size_t size = sizeof(struct tnode) +
> +		      (sizeof(struct node *) << tn->bits);
> +
> +	if (size <= PAGE_SIZE)
> +		kfree(tn);
> +	else
> +		vfree(tn);
> +}
> +
>  static void __tnode_free_rcu(struct rcu_head *head)
>  {
>  	struct tnode *tn = container_of(head, struct tnode, rcu);
> @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
>  	while ((tn = tnode_free_head)) {
>  		tnode_free_head = tn->tnode_free;
>  		tn->tnode_free = NULL;
> -		tnode_free(tn);
> +		__tnode_free(tn);
>  	}
>  }
> 
> @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  				      (struct node *)tn, wasfull);
> 
>  		tp = node_parent((struct node *) tn);
> -		tnode_free_flush();
>  		if (!tp)
>  			break;
>  		tn = tp;
>  	}
> 
> +	if (tnode_free_head) {
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	}
> +
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> -
> -	rcu_assign_pointer(t->trie, (struct node *)tn);
> -	tnode_free_flush();
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	} else {
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +	}
> 
>  	return;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jarek Poplawski June 26, 2009, 3:54 p.m. UTC | #2

On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > 
> > > Jarek Poplawski writes:
> > > 
> > >  Thanks, 
> > > 
> > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > >  
> > 
> > Alas take 2 (nor 1) doesn't compile, so here it is again.
> 
> So the idea is to balance memory and latency, so that large changes
> (those affecting the root node) get at least one synchronize_rcu(),
> while smaller changes just use call_rcu(), correct?  This means that
> the amount of memory awaiting an RCU grace period is limited, but
> the algorithm avoids per-node synchronize_rcu() overhead.
> 
> If I understand the goal correctly, looks good!  (Give or take my
> limited understanding of fib_trie and is usage, of course.)

The goal is practically to replace all call_rcu() during
trie_rebalance() with synchronize_rcu() (except some freeing after
ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
memory after trie_rebalance() has finished, that's why there were
problems with enabled preemption. So this patch tries to do/force
this a bit earlier - at least before the top/largest node is
rebalanced.

Thanks,
Jarek P.

> 
> 							Thanx, Paul
> 
> > Thanks,
> > Jarek P.
> > --- (take 3 - for testing)
> > 
> >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> >  1 files changed, 24 insertions(+), 6 deletions(-)
> > 
> > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > index 012cf5a..1a4c4b7 100644
> > --- a/net/ipv4/fib_trie.c
> > +++ b/net/ipv4/fib_trie.c
> > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> >  	vfree(tn);
> >  }
> > 
> > +static void __tnode_free(struct tnode *tn)
> > +{
> > +	size_t size = sizeof(struct tnode) +
> > +		      (sizeof(struct node *) << tn->bits);
> > +
> > +	if (size <= PAGE_SIZE)
> > +		kfree(tn);
> > +	else
> > +		vfree(tn);
> > +}
> > +
> >  static void __tnode_free_rcu(struct rcu_head *head)
> >  {
> >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> >  	while ((tn = tnode_free_head)) {
> >  		tnode_free_head = tn->tnode_free;
> >  		tn->tnode_free = NULL;
> > -		tnode_free(tn);
> > +		__tnode_free(tn);
> >  	}
> >  }
> > 
> > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> >  				      (struct node *)tn, wasfull);
> > 
> >  		tp = node_parent((struct node *) tn);
> > -		tnode_free_flush();
> >  		if (!tp)
> >  			break;
> >  		tn = tp;
> >  	}
> > 
> > +	if (tnode_free_head) {
> > +		synchronize_rcu();
> > +		tnode_free_flush();
> > +	}
> > +
> >  	/* Handle last (top) tnode */
> > -	if (IS_TNODE(tn))
> > +	if (IS_TNODE(tn)) {
> >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > -
> > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > -	tnode_free_flush();
> > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > +		synchronize_rcu();
> > +		tnode_free_flush();
> > +	} else {
> > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > +	}
> > 
> >  	return;
> >  }
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jarek Poplawski June 26, 2009, 4:15 p.m. UTC | #3

On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > 
> > > > Jarek Poplawski writes:
> > > > 
> > > >  Thanks, 
> > > > 
> > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > >  
> > > 
> > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > 
> > So the idea is to balance memory and latency, so that large changes
> > (those affecting the root node) get at least one synchronize_rcu(),
> > while smaller changes just use call_rcu(), correct?  This means that
> > the amount of memory awaiting an RCU grace period is limited, but
> > the algorithm avoids per-node synchronize_rcu() overhead.
> > 
> > If I understand the goal correctly, looks good!  (Give or take my
> > limited understanding of fib_trie and is usage, of course.)
> 
> The goal is practically to replace all call_rcu() during
> trie_rebalance() with synchronize_rcu() (except some freeing after
> ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> memory after trie_rebalance() has finished, that's why there were
> problems with enabled preemption. So this patch tries to do/force
> this a bit earlier - at least before the top/largest node is
> rebalanced.

On the other hand, we could probably stay with call_rcu() plus only
one synchronize_rcu() before the top node's resize() if you think it's
enough here?

Thanks,
Jarek P.

> 
> > 
> > 							Thanx, Paul
> > 
> > > Thanks,
> > > Jarek P.
> > > --- (take 3 - for testing)
> > > 
> > >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> > >  1 files changed, 24 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > > index 012cf5a..1a4c4b7 100644
> > > --- a/net/ipv4/fib_trie.c
> > > +++ b/net/ipv4/fib_trie.c
> > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> > >  	vfree(tn);
> > >  }
> > > 
> > > +static void __tnode_free(struct tnode *tn)
> > > +{
> > > +	size_t size = sizeof(struct tnode) +
> > > +		      (sizeof(struct node *) << tn->bits);
> > > +
> > > +	if (size <= PAGE_SIZE)
> > > +		kfree(tn);
> > > +	else
> > > +		vfree(tn);
> > > +}
> > > +
> > >  static void __tnode_free_rcu(struct rcu_head *head)
> > >  {
> > >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> > >  	while ((tn = tnode_free_head)) {
> > >  		tnode_free_head = tn->tnode_free;
> > >  		tn->tnode_free = NULL;
> > > -		tnode_free(tn);
> > > +		__tnode_free(tn);
> > >  	}
> > >  }
> > > 
> > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> > >  				      (struct node *)tn, wasfull);
> > > 
> > >  		tp = node_parent((struct node *) tn);
> > > -		tnode_free_flush();
> > >  		if (!tp)
> > >  			break;
> > >  		tn = tp;
> > >  	}
> > > 
> > > +	if (tnode_free_head) {
> > > +		synchronize_rcu();
> > > +		tnode_free_flush();
> > > +	}
> > > +
> > >  	/* Handle last (top) tnode */
> > > -	if (IS_TNODE(tn))
> > > +	if (IS_TNODE(tn)) {
> > >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > > -
> > > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > > -	tnode_free_flush();
> > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > +		synchronize_rcu();
> > > +		tnode_free_flush();
> > > +	} else {
> > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > +	}
> > > 
> > >  	return;
> > >  }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paul E. McKenney June 26, 2009, 4:23 p.m. UTC | #4

On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > 
> > > > > Jarek Poplawski writes:
> > > > > 
> > > > >  Thanks, 
> > > > > 
> > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > >  
> > > > 
> > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > 
> > > So the idea is to balance memory and latency, so that large changes
> > > (those affecting the root node) get at least one synchronize_rcu(),
> > > while smaller changes just use call_rcu(), correct?  This means that
> > > the amount of memory awaiting an RCU grace period is limited, but
> > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > 
> > > If I understand the goal correctly, looks good!  (Give or take my
> > > limited understanding of fib_trie and is usage, of course.)
> > 
> > The goal is practically to replace all call_rcu() during
> > trie_rebalance() with synchronize_rcu() (except some freeing after
> > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > memory after trie_rebalance() has finished, that's why there were
> > problems with enabled preemption. So this patch tries to do/force
> > this a bit earlier - at least before the top/largest node is
> > rebalanced.
> 
> On the other hand, we could probably stay with call_rcu() plus only
> one synchronize_rcu() before the top node's resize() if you think it's
> enough here?

Well, my first task is to understand the problem/goal.  ;-)

My guess from what you said above is that use of call_rcu(), when
combined with changes to the trie in rapid succession, is resulting
in excessive memory awaiting a grace period.  Is this the case, or am I
confused?

							Thanx, Paul

> Thanks,
> Jarek P.
> 
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Thanks,
> > > > Jarek P.
> > > > --- (take 3 - for testing)
> > > > 
> > > >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> > > >  1 files changed, 24 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > > > index 012cf5a..1a4c4b7 100644
> > > > --- a/net/ipv4/fib_trie.c
> > > > +++ b/net/ipv4/fib_trie.c
> > > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> > > >  	vfree(tn);
> > > >  }
> > > > 
> > > > +static void __tnode_free(struct tnode *tn)
> > > > +{
> > > > +	size_t size = sizeof(struct tnode) +
> > > > +		      (sizeof(struct node *) << tn->bits);
> > > > +
> > > > +	if (size <= PAGE_SIZE)
> > > > +		kfree(tn);
> > > > +	else
> > > > +		vfree(tn);
> > > > +}
> > > > +
> > > >  static void __tnode_free_rcu(struct rcu_head *head)
> > > >  {
> > > >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> > > >  	while ((tn = tnode_free_head)) {
> > > >  		tnode_free_head = tn->tnode_free;
> > > >  		tn->tnode_free = NULL;
> > > > -		tnode_free(tn);
> > > > +		__tnode_free(tn);
> > > >  	}
> > > >  }
> > > > 
> > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> > > >  				      (struct node *)tn, wasfull);
> > > > 
> > > >  		tp = node_parent((struct node *) tn);
> > > > -		tnode_free_flush();
> > > >  		if (!tp)
> > > >  			break;
> > > >  		tn = tp;
> > > >  	}
> > > > 
> > > > +	if (tnode_free_head) {
> > > > +		synchronize_rcu();
> > > > +		tnode_free_flush();
> > > > +	}
> > > > +
> > > >  	/* Handle last (top) tnode */
> > > > -	if (IS_TNODE(tn))
> > > > +	if (IS_TNODE(tn)) {
> > > >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > > > -
> > > > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > -	tnode_free_flush();
> > > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > +		synchronize_rcu();
> > > > +		tnode_free_flush();
> > > > +	} else {
> > > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > +	}
> > > > 
> > > >  	return;
> > > >  }
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jarek Poplawski June 26, 2009, 4:45 p.m. UTC | #5

On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > > 
> > > > > > Jarek Poplawski writes:
> > > > > > 
> > > > > >  Thanks, 
> > > > > > 
> > > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > > >  
> > > > > 
> > > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > > 
> > > > So the idea is to balance memory and latency, so that large changes
> > > > (those affecting the root node) get at least one synchronize_rcu(),
> > > > while smaller changes just use call_rcu(), correct?  This means that
> > > > the amount of memory awaiting an RCU grace period is limited, but
> > > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > > 
> > > > If I understand the goal correctly, looks good!  (Give or take my
> > > > limited understanding of fib_trie and is usage, of course.)
> > > 
> > > The goal is practically to replace all call_rcu() during
> > > trie_rebalance() with synchronize_rcu() (except some freeing after
> > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > > memory after trie_rebalance() has finished, that's why there were
> > > problems with enabled preemption. So this patch tries to do/force
> > > this a bit earlier - at least before the top/largest node is
> > > rebalanced.
> > 
> > On the other hand, we could probably stay with call_rcu() plus only
> > one synchronize_rcu() before the top node's resize() if you think it's
> > enough here?
> 
> Well, my first task is to understand the problem/goal.  ;-)
> 
> My guess from what you said above is that use of call_rcu(), when
> combined with changes to the trie in rapid succession, is resulting
> in excessive memory awaiting a grace period.  Is this the case, or am I
> confused?

Exactly! (I guess... ;-)

Thanks,
Jarek P.
> > 
> > > 
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > Thanks,
> > > > > Jarek P.
> > > > > --- (take 3 - for testing)
> > > > > 
> > > > >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> > > > >  1 files changed, 24 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > > > > index 012cf5a..1a4c4b7 100644
> > > > > --- a/net/ipv4/fib_trie.c
> > > > > +++ b/net/ipv4/fib_trie.c
> > > > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> > > > >  	vfree(tn);
> > > > >  }
> > > > > 
> > > > > +static void __tnode_free(struct tnode *tn)
> > > > > +{
> > > > > +	size_t size = sizeof(struct tnode) +
> > > > > +		      (sizeof(struct node *) << tn->bits);
> > > > > +
> > > > > +	if (size <= PAGE_SIZE)
> > > > > +		kfree(tn);
> > > > > +	else
> > > > > +		vfree(tn);
> > > > > +}
> > > > > +
> > > > >  static void __tnode_free_rcu(struct rcu_head *head)
> > > > >  {
> > > > >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > > > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> > > > >  	while ((tn = tnode_free_head)) {
> > > > >  		tnode_free_head = tn->tnode_free;
> > > > >  		tn->tnode_free = NULL;
> > > > > -		tnode_free(tn);
> > > > > +		__tnode_free(tn);
> > > > >  	}
> > > > >  }
> > > > > 
> > > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> > > > >  				      (struct node *)tn, wasfull);
> > > > > 
> > > > >  		tp = node_parent((struct node *) tn);
> > > > > -		tnode_free_flush();
> > > > >  		if (!tp)
> > > > >  			break;
> > > > >  		tn = tp;
> > > > >  	}
> > > > > 
> > > > > +	if (tnode_free_head) {
> > > > > +		synchronize_rcu();
> > > > > +		tnode_free_flush();
> > > > > +	}
> > > > > +
> > > > >  	/* Handle last (top) tnode */
> > > > > -	if (IS_TNODE(tn))
> > > > > +	if (IS_TNODE(tn)) {
> > > > >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > > > > -
> > > > > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > > -	tnode_free_flush();
> > > > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > > +		synchronize_rcu();
> > > > > +		tnode_free_flush();
> > > > > +	} else {
> > > > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > > > +	}
> > > > > 
> > > > >  	return;
> > > > >  }
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paul E. McKenney June 26, 2009, 5:05 p.m. UTC | #6

On Fri, Jun 26, 2009 at 06:45:57PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > > > 
> > > > > > > Jarek Poplawski writes:
> > > > > > > 
> > > > > > >  Thanks, 
> > > > > > > 
> > > > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > > > >  
> > > > > > 
> > > > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > > > 
> > > > > So the idea is to balance memory and latency, so that large changes
> > > > > (those affecting the root node) get at least one synchronize_rcu(),
> > > > > while smaller changes just use call_rcu(), correct?  This means that
> > > > > the amount of memory awaiting an RCU grace period is limited, but
> > > > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > > > 
> > > > > If I understand the goal correctly, looks good!  (Give or take my
> > > > > limited understanding of fib_trie and is usage, of course.)
> > > > 
> > > > The goal is practically to replace all call_rcu() during
> > > > trie_rebalance() with synchronize_rcu() (except some freeing after
> > > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > > > memory after trie_rebalance() has finished, that's why there were
> > > > problems with enabled preemption. So this patch tries to do/force
> > > > this a bit earlier - at least before the top/largest node is
> > > > rebalanced.
> > > 
> > > On the other hand, we could probably stay with call_rcu() plus only
> > > one synchronize_rcu() before the top node's resize() if you think it's
> > > enough here?
> > 
> > Well, my first task is to understand the problem/goal.  ;-)
> > 
> > My guess from what you said above is that use of call_rcu(), when
> > combined with changes to the trie in rapid succession, is resulting
> > in excessive memory awaiting a grace period.  Is this the case, or am I
> > confused?
> 
> Exactly! (I guess... ;-)

;-)

In that case, simply invoking synchronize_rcu() every once and awhile
should take care of things.  This could be at the end of every large
trie operation, or you could even count the call_rcu() invocations and
do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
the amount of memory available.

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits

Commit Message

Comments

Patch