Patchwork Fw: [Bug 13339] New: rtable leak in ipv4/route.c

login
register
mail settings
Submitter Jarek Poplawski
Date May 19, 2009, 5:17 p.m.
Message ID <20090519171703.GA2749@ami.dom.local>
Download mbox | patch
Permalink /patch/27403/
State RFC
Delegated to: David Miller
Headers show

Comments

Jarek Poplawski - May 19, 2009, 5:17 p.m.
On Tue, May 19, 2009 at 12:23:30PM -0400, Neil Horman wrote:
> On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
> > Jarek Poplawski a écrit :
> > > On 19-05-2009 04:35, Stephen Hemminger wrote:
> > >> Begin forwarded message:
> > >>
> > >> Date: Mon, 18 May 2009 14:10:20 GMT
> > >> From: bugzilla-daemon@bugzilla.kernel.org
> > >> To: shemminger@linux-foundation.org
> > >> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
> > >>
> > >>
> > >> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> > > ...
> > >> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
> > >> patch has at least one critical flaw, and another problem.
> > >>
> > >> rt_intern_hash calculates rthi pointer, which is later used for new entry
> > >> insertion. The same loop calculates cand pointer which is used to clean the
> > >> list. If the pointers are the same, rtable leak occurs, as first the cand is
> > >> removed then the new entry is appended to it.
> > >>
> > >> This leak leads to unregister_netdevice problem (usage count > 0).
> > >>
> > >> Another problem of the patch is that it tries to insert the entries in certain
> > >> order, to facilitate counting of entries distinct by all but QoS parameters.
> > >> Unfortunately, referencing an existing rtable entry moves it to list beginning,
> > >> to speed up further lookups, so the carefully built order is destroyed.
> > 
> > We could change rt_check_expire() to be smarter and handle any order in chains.
> > 
> > This would let rt_intern_hash() be simpler.
> > 
> > As its a more performance critical path, all would be good :)
> > 
> > >>
> > >> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
> > >> it will also destroy the ordering.
> > > 
> > > I think fixing this bug fast is more important than this
> > > ordering or counting. Could you send your patch proposal?
> > > 
> > 
> 
> Of course, it helps if I attach the patch :)
> 
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 6be3b08..a39db6d 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -47,6 +47,7 @@ struct dst_entry
>  #define DST_NOXFRM		2
>  #define DST_NOPOLICY		4
>  #define DST_NOHASH		8
> +#define DST_GRPLDR		16
>  	unsigned long		expires;
>  
>  	unsigned short		header_len;	/* more space at head required */
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index c4c60e9..0120f0e 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -610,6 +610,8 @@ static inline int ip_rt_proc_init(void)
>  
>  static inline void rt_free(struct rtable *rt)
>  {
> +	if (rt->u.dst.flags & DST_GRPLDR)
> +		rt->u.dst.rt_next->u.dst.flag |= DST_GRPLDR;
>  	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
>  }
>  
> @@ -1143,8 +1145,11 @@ restart:
>  		 * relvant to the hash function together, which we use to adjust
>  		 * our chain length
>  		 */
> -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> +		if (!*rthi && *rthp && 
> +		    compare_hash_inputs(&(*rthp)->fl, &rt->fl) &&
> +		    (cand != rth))
>  			rthi = rth;

Does it really prevent cand == rthi in the next loop?

I didn't check Eric's patch yet, but I really don't know what's wrong
with something as simple as below for -stable, until "proper" fix is
analyzed and tested.

Jarek P.
---

 net/ipv4/route.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 19, 2009, 5:45 p.m.
On Tue, May 19, 2009 at 07:17:03PM +0200, Jarek Poplawski wrote:
> On Tue, May 19, 2009 at 12:23:30PM -0400, Neil Horman wrote:
> > On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
> > > Jarek Poplawski a écrit :
> > > > On 19-05-2009 04:35, Stephen Hemminger wrote:
> > > >> Begin forwarded message:
> > > >>
> > > >> Date: Mon, 18 May 2009 14:10:20 GMT
> > > >> From: bugzilla-daemon@bugzilla.kernel.org
> > > >> To: shemminger@linux-foundation.org
> > > >> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
> > > >>
> > > >>
> > > >> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> > > > ...
> > > >> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
> > > >> patch has at least one critical flaw, and another problem.
> > > >>
> > > >> rt_intern_hash calculates rthi pointer, which is later used for new entry
> > > >> insertion. The same loop calculates cand pointer which is used to clean the
> > > >> list. If the pointers are the same, rtable leak occurs, as first the cand is
> > > >> removed then the new entry is appended to it.
> > > >>
> > > >> This leak leads to unregister_netdevice problem (usage count > 0).
> > > >>
> > > >> Another problem of the patch is that it tries to insert the entries in certain
> > > >> order, to facilitate counting of entries distinct by all but QoS parameters.
> > > >> Unfortunately, referencing an existing rtable entry moves it to list beginning,
> > > >> to speed up further lookups, so the carefully built order is destroyed.
> > > 
> > > We could change rt_check_expire() to be smarter and handle any order in chains.
> > > 
> > > This would let rt_intern_hash() be simpler.
> > > 
> > > As its a more performance critical path, all would be good :)
> > > 
> > > >>
> > > >> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
> > > >> it will also destroy the ordering.
> > > > 
> > > > I think fixing this bug fast is more important than this
> > > > ordering or counting. Could you send your patch proposal?
> > > > 
> > > 
> > 
> > Of course, it helps if I attach the patch :)
> > 
> > 
> > diff --git a/include/net/dst.h b/include/net/dst.h
> > index 6be3b08..a39db6d 100644
> > --- a/include/net/dst.h
> > +++ b/include/net/dst.h
> > @@ -47,6 +47,7 @@ struct dst_entry
> >  #define DST_NOXFRM		2
> >  #define DST_NOPOLICY		4
> >  #define DST_NOHASH		8
> > +#define DST_GRPLDR		16
> >  	unsigned long		expires;
> >  
> >  	unsigned short		header_len;	/* more space at head required */
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index c4c60e9..0120f0e 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -610,6 +610,8 @@ static inline int ip_rt_proc_init(void)
> >  
> >  static inline void rt_free(struct rtable *rt)
> >  {
> > +	if (rt->u.dst.flags & DST_GRPLDR)
> > +		rt->u.dst.rt_next->u.dst.flag |= DST_GRPLDR;
> >  	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
> >  }
> >  
> > @@ -1143,8 +1145,11 @@ restart:
> >  		 * relvant to the hash function together, which we use to adjust
> >  		 * our chain length
> >  		 */
> > -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> > +		if (!*rthi && *rthp && 
> > +		    compare_hash_inputs(&(*rthp)->fl, &rt->fl) &&
> > +		    (cand != rth))
> >  			rthi = rth;
> 
> Does it really prevent cand == rthi in the next loop?
> 
Yes, because cand and rthi are inspected during the same loop iteration, and
both assigned from rth.  since I added a check which requires !rthi (which is
actually a bug above that I need to fix), once rthi is set, it won't be moved,
and on the next iteration, if cand is assigned, it is assigned to rth, which
(being the next iteration), is a different rt cache entry


> I didn't check Eric's patch yet, but I really don't know what's wrong
> with something as simple as below for -stable, until "proper" fix is
> analyzed and tested.
> 
Because the above fixes it without continuing to break the ordering.  You're
change below prevents the leak, but still allows for disordered lists to form,
which IMHO doesn't really make it a candidate for -stable.  I'd much rather fix
both the leak and the ordering before pushing anything out

speaking of which, I'm going to ask again, I've been looking all morning, and
I'm unable to find the move to front heuristic that you mentioned creates furthe
list disordering.  If you can point it out to me, I can complete my patch and
present something for you to test more throughly.

Neil

> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 5:47 p.m.
On Tue, May 19, 2009 at 07:17:03PM +0200, Jarek Poplawski wrote:
> On Tue, May 19, 2009 at 12:23:30PM -0400, Neil Horman wrote:
...
> > Of course, it helps if I attach the patch :)
> > 
> > 
> > diff --git a/include/net/dst.h b/include/net/dst.h
> > index 6be3b08..a39db6d 100644
> > --- a/include/net/dst.h
> > +++ b/include/net/dst.h
> > @@ -47,6 +47,7 @@ struct dst_entry
> >  #define DST_NOXFRM		2
> >  #define DST_NOPOLICY		4
> >  #define DST_NOHASH		8
> > +#define DST_GRPLDR		16
> >  	unsigned long		expires;
> >  
> >  	unsigned short		header_len;	/* more space at head required */
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index c4c60e9..0120f0e 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -610,6 +610,8 @@ static inline int ip_rt_proc_init(void)
> >  
> >  static inline void rt_free(struct rtable *rt)
> >  {
> > +	if (rt->u.dst.flags & DST_GRPLDR)
> > +		rt->u.dst.rt_next->u.dst.flag |= DST_GRPLDR;
> >  	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
> >  }
> >  
> > @@ -1143,8 +1145,11 @@ restart:
> >  		 * relvant to the hash function together, which we use to adjust
> >  		 * our chain length
> >  		 */
> > -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> > +		if (!*rthi && *rthp && 
> > +		    compare_hash_inputs(&(*rthp)->fl, &rt->fl) &&
> > +		    (cand != rth))
> >  			rthi = rth;
> 
> Does it really prevent cand == rthi in the next loop?

Hmm.. It looks OK yet!

Sorry,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 5:53 p.m.
On Tue, May 19, 2009 at 01:45:04PM -0400, Neil Horman wrote:
> On Tue, May 19, 2009 at 07:17:03PM +0200, Jarek Poplawski wrote:
...
> > > diff --git a/include/net/dst.h b/include/net/dst.h
> > > index 6be3b08..a39db6d 100644
> > > --- a/include/net/dst.h
> > > +++ b/include/net/dst.h
> > > @@ -47,6 +47,7 @@ struct dst_entry
> > >  #define DST_NOXFRM		2
> > >  #define DST_NOPOLICY		4
> > >  #define DST_NOHASH		8
> > > +#define DST_GRPLDR		16
> > >  	unsigned long		expires;
> > >  
> > >  	unsigned short		header_len;	/* more space at head required */
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index c4c60e9..0120f0e 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -610,6 +610,8 @@ static inline int ip_rt_proc_init(void)
> > >  
> > >  static inline void rt_free(struct rtable *rt)
> > >  {
> > > +	if (rt->u.dst.flags & DST_GRPLDR)
> > > +		rt->u.dst.rt_next->u.dst.flag |= DST_GRPLDR;
> > >  	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
> > >  }
> > >  
> > > @@ -1143,8 +1145,11 @@ restart:
> > >  		 * relvant to the hash function together, which we use to adjust
> > >  		 * our chain length
> > >  		 */
> > > -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> > > +		if (!*rthi && *rthp && 
> > > +		    compare_hash_inputs(&(*rthp)->fl, &rt->fl) &&
> > > +		    (cand != rth))
> > >  			rthi = rth;
> > 
> > Does it really prevent cand == rthi in the next loop?
> > 
> Yes, because cand and rthi are inspected during the same loop iteration, and
> both assigned from rth.  since I added a check which requires !rthi (which is
> actually a bug above that I need to fix), once rthi is set, it won't be moved,
> and on the next iteration, if cand is assigned, it is assigned to rth, which
> (being the next iteration), is a different rt cache entry
> 
> 
> > I didn't check Eric's patch yet, but I really don't know what's wrong
> > with something as simple as below for -stable, until "proper" fix is
> > analyzed and tested.
> > 
> Because the above fixes it without continuing to break the ordering.  You're
> change below prevents the leak, but still allows for disordered lists to form,
> which IMHO doesn't really make it a candidate for -stable.  I'd much rather fix
> both the leak and the ordering before pushing anything out
> 
> speaking of which, I'm going to ask again, I've been looking all morning, and
> I'm unable to find the move to front heuristic that you mentioned creates furthe
> list disordering.  If you can point it out to me, I can complete my patch and
> present something for you to test more throughly.

As I've written already, your patch looks OK to me.

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - May 19, 2009, 6:05 p.m.
On Tue, May 19, 2009 at 01:45:04PM -0400, Neil Horman wrote:
...
> I'm unable to find the move to front heuristic that you mentioned creates furthe
> list disordering. [...]

I'd only like to make it clear the analysis you refer to is by
Alexander V. Lukyanov <lav@yar.ru>

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neil Horman - May 19, 2009, 6:16 p.m.
On Tue, May 19, 2009 at 08:05:05PM +0200, Jarek Poplawski wrote:
> On Tue, May 19, 2009 at 01:45:04PM -0400, Neil Horman wrote:
> ...
> > I'm unable to find the move to front heuristic that you mentioned creates furthe
> > list disordering. [...]
> 
> I'd only like to make it clear the analysis you refer to is by
> Alexander V. Lukyanov <lav@yar.ru>
> 
Ok, thank you for giving credit where its due :).  Alexander, can you point out
the location of the move-to-front heruistic that you mention in your analysis?

Neil

> Jarek P.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander V. Lukyanov - May 20, 2009, 6:36 a.m.
On Tue, May 19, 2009 at 02:16:07PM -0400, Neil Horman wrote:
> Alexander, can you point out
> the location of the move-to-front heruistic that you mention in your analysis?

It's near /* Put it first */ comment (if I read the code correctly).

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..f4e6c7a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1157,6 +1157,8 @@  restart:
 		if (chain_length > ip_rt_gc_elasticity) {
 			*candp = cand->u.dst.rt_next;
 			rt_free(cand);
+			if (rthi == cand)
+				rthi = NULL;
 		}
 	} else {
 		if (chain_length > rt_chain_length_max) {