
Fw: [Bug 13339] New: rtable leak in ipv4/route.c

Message ID 4A12FEDA.7040806@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 19, 2009, 6:47 p.m. UTC
Neil Horman wrote:
> On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
>> Jarek Poplawski wrote:
>>> On 19-05-2009 04:35, Stephen Hemminger wrote:
>>>> Begin forwarded message:
>>>>
>>>> Date: Mon, 18 May 2009 14:10:20 GMT
>>>> From: bugzilla-daemon@bugzilla.kernel.org
>>>> To: shemminger@linux-foundation.org
>>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>>>
>>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
>>> ...
>>>> A 2.6.29 patch introduced flexible route cache rebuilding. Unfortunately, the
>>>> patch has at least one critical flaw and another problem.
>>>>
>>>> rt_intern_hash calculates the rthi pointer, which is later used for new entry
>>>> insertion. The same loop calculates the cand pointer, which is used to clean the
>>>> list. If the pointers are the same, an rtable leak occurs: first cand is
>>>> removed, then the new entry is appended to it.
>>>>
>>>> This leak leads to an unregister_netdevice problem (usage count > 0).
>>>>
>>>> Another problem with the patch is that it tries to insert the entries in a
>>>> certain order, to facilitate counting of entries distinct by all but QoS
>>>> parameters. Unfortunately, referencing an existing rtable entry moves it to the
>>>> beginning of the list to speed up further lookups, so the carefully built order is destroyed.
>> We could change rt_check_expire() to be smarter and handle any order in chains.
>>
>> This would let rt_intern_hash() be simpler.
>>
>> As it's a more performance-critical path, all would be good :)
>>
>>>> For the first problem the simplest patch is to set rthi=0 when rthi==cand, but
>>>> it will also destroy the ordering.
>>> I think fixing this bug fast is more important than this
>>> ordering or counting. Could you send your patch proposal?
>>>
> 
> I was thinking something along these lines (also completely untested).  The
> extra check in rt_intern_hash prevents us from identifying a 'group leader' that
> is about to be deleted, and sets rthi to point at the group leader rather than
> the last entry in the group as it previously did, which we then mark with the
> new flag in the dst_entry (updating it on rt_free of course).  This lets us
> identify the boundaries of a group, so when we do a move-to-front heuristic, we
> can move the entire group rather than just the single entry.  I prefer this to
> Eric's approach since, even though rt_check_expire isn't in a performance
> critical path, the addition of the extra list search when the list is unordered
> makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
> significant impact on high traffic systems.  Since the ordered list creation was
> done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
> at all possible.
> 
> There is of course a shortcoming in my patch in that it doesn't actually do
> anything currently with DST_GRPLDR other than set/clear it appropriately; I've
> yet to identify where the route cache move-to-front heuristic is implemented.
> If someone could show me where that is, I'd be happy to finish this patch.
> 

O(n^2) ? Not exactly ...

N * (Y + x) instead of N * Y, where x <= Y/100, and with proper prefetching we
can in fact make x effectively 0.

The real cost, by two orders of magnitude, is bringing the needed information
into the CPU cache (one cache-line miss costs more than 200 cycles, and we need
at least two cache lines per rtable).

Adding a 'group' notion to dst entries is overengineering an already too
complex ipv4/route.c, which is almost unreadable these days; even you could
not spot the point where we move an entry to the front of its chain.

Hint: search for /* Put it first */ in rt_intern_hash()
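That block (abbreviated and paraphrased here, so don't take it as a verbatim
copy of the source) is the duplicate-insert path: when the route being inserted
matches an existing entry, the existing entry is pulled to the chain head and
the new one is dropped:

	while ((rth = *rthp) != NULL) {
		...
		if (compare_keys(&rth->fl, &rt->fl)) {
			/* Put it first */
			*rthp = rth->u.dst.rt_next;
			rcu_assign_pointer(rth->u.dst.rt_next,
					   rt_hash_table[hash].chain);
			rcu_assign_pointer(rt_hash_table[hash].chain, rth);

			dst_use(&rth->u.dst, now);
			spin_unlock_bh(rt_hash_lock_addr(hash));

			rt_drop(rt);
			*rp = rth;
			return 0;
		}
		...
	}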

Moving the whole group to the front would actually defeat the purpose of the
move, since an entry's rank in the chain is used to decay its timeout in the
garbage collector. (search for tmo >>= 1; )
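If it helps to see the effect of that decay in isolation, here is a toy,
userspace-only illustration (assumes the ip_rt_gc_timeout default of 300*HZ
with HZ=100; none of this is kernel code):

#include <stdio.h>

/*
 * Toy illustration of the rt_check_expire() decay: each entry the scan keeps
 * halves the idle budget (tmo >>= 1) granted to the entries behind it, so an
 * entry at depth k in a chain must have been used within roughly
 * gc_timeout >> k jiffies to survive.  Moving a whole group to the chain
 * front would hand every member of the group the most generous budget.
 */
int main(void)
{
        unsigned long gc_timeout = 300 * 100;   /* assumed: 300 * HZ, HZ == 100 */
        unsigned int depth;

        for (depth = 0; depth < 8; depth++)
                printf("depth %u: idle budget %lu jiffies\n",
                       depth, gc_timeout >> depth);
        return 0;
}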


Here is a revised patch, doing the length computation inside the loop while
the CPU is prefetching the next entry into its cache.

BTW, we actually have another bug, since length is set to 0 in the wrong place
(before the for (; goal > 0; goal--) loop).

We must of course reset length to 0 for each chain.

All credit to Alexander V. Lukyanov for doing the analysis, of course :)
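And for anyone who wants the first bug (the rthi/cand leak) boiled down, here
is a toy model in plain C; the names cand, rthi and rt mirror the kernel ones,
but nothing below is kernel code:

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model of the 2.6.29 rt_intern_hash() leak (bugzilla #13339) on a plain
 * singly linked list.  The buggy sequence: the entry chosen for eviction
 * (cand) is unlinked first, and the new entry is then chained in after rthi.
 * If rthi == cand, the new entry hangs off a node that is no longer in the
 * chain, so it never becomes reachable again: it is leaked, and it pins its
 * device (hence "unregister_netdevice: usage count > 0").
 */
struct node {
        int id;
        struct node *next;
};

static struct node *chain;

static struct node *new_node(int id)
{
        struct node *n = calloc(1, sizeof(*n));

        n->id = id;
        return n;
}

static void dump_chain(const char *when)
{
        struct node *n;

        printf("%s:", when);
        for (n = chain; n; n = n->next)
                printf(" %d", n->id);
        printf("\n");
}

int main(void)
{
        struct node *cand, **candp, *rthi, *rt;

        /* Chain 1 -> 2 -> 3; pretend entry 3 ends up as both cand and rthi. */
        chain = new_node(1);
        chain->next = new_node(2);
        chain->next->next = new_node(3);
        cand = chain->next->next;
        candp = &chain->next->next;
        rthi = cand;

        dump_chain("before");

        /* Step 1: evict cand (kernel: *candp = cand->u.dst.rt_next; rt_free(cand)). */
        *candp = cand->next;

        /* Step 2: insert the new entry after rthi -- but rthi was just unlinked. */
        rt = new_node(4);
        rt->next = rthi->next;
        rthi->next = rt;

        dump_chain("after");    /* entry 4 never shows up: it leaked */
        return 0;
}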


 net/ipv4/route.c |   57 ++++++++++++++-------------------------------
 1 files changed, 18 insertions(+), 39 deletions(-)



Comments

Neil Horman May 19, 2009, 7:24 p.m. UTC | #1
On Tue, May 19, 2009 at 08:47:54PM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
> > On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
> >> Jarek Poplawski wrote:
> >>> On 19-05-2009 04:35, Stephen Hemminger wrote:
> >>>> Begin forwarded message:
> >>>>
> >>>> Date: Mon, 18 May 2009 14:10:20 GMT
> >>>> From: bugzilla-daemon@bugzilla.kernel.org
> >>>> To: shemminger@linux-foundation.org
> >>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
> >>>>
> >>>>
> >>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> >>> ...
> >>>> A 2.6.29 patch introduced flexible route cache rebuilding. Unfortunately, the
> >>>> patch has at least one critical flaw and another problem.
> >>>>
> >>>> rt_intern_hash calculates the rthi pointer, which is later used for new entry
> >>>> insertion. The same loop calculates the cand pointer, which is used to clean the
> >>>> list. If the pointers are the same, an rtable leak occurs: first cand is
> >>>> removed, then the new entry is appended to it.
> >>>>
> >>>> This leak leads to an unregister_netdevice problem (usage count > 0).
> >>>>
> >>>> Another problem with the patch is that it tries to insert the entries in a
> >>>> certain order, to facilitate counting of entries distinct by all but QoS
> >>>> parameters. Unfortunately, referencing an existing rtable entry moves it to the
> >>>> beginning of the list to speed up further lookups, so the carefully built order is destroyed.
> >> We could change rt_check_expire() to be smarter and handle any order in chains.
> >>
> >> This would let rt_intern_hash() be simpler.
> >>
> >> As it's a more performance-critical path, all would be good :)
> >>
> >>>> For the first problem the simplest patch is to set rthi=0 when rthi==cand, but
> >>>> it will also destroy the ordering.
> >>> I think fixing this bug fast is more important than this
> >>> ordering or counting. Could you send your patch proposal?
> >>>
> > 
> > I was thinking something along these lines (also completely untested).  The
> > extra check in rt_intern_hash prevents us from identifying a 'group leader' that
> > is about to be deleted, and sets rthi to point at the group leader rather than
> > the last entry in the group as it previously did, which we then mark with the
> > new flag in the dst_entry (updating it on rt_free of course).  This lets us
> > identify the boundaries of a group, so when we do a move-to-front heuristic, we
> > can move the entire group rather than just the single entry.  I prefer this to
> > Eric's approach since, even though rt_check_expire isn't in a performance
> > critical path, the addition of the extra list search when the list is unordered
> > makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
> > significant impact on high traffic systems.  Since the ordered list creation was
> > done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
> > at all possible.
> > 
> > There is of course a shortcoming in my patch in that it doesn't actually do
> > anything currently with DST_GRPLDR other than set/clear it appropriately; I've
> > yet to identify where the route cache move-to-front heuristic is implemented.
> > If someone could show me where that is, I'd be happy to finish this patch.
> > 
> 
> O(n^2) ? Not exactly ...
> 
You're right, on closer examination it's not O(n^2), but I don't think it's
what you have below either.
 
> N * (Y + x) instead of N * Y, where x <= Y/100, and with proper prefetching we
> can in fact make x effectively 0.
> 
From the way I read your patch (which is actually pretty clever as I look at it),
you add an extra loop that searches from the start of the given chain up to the
point the outer loop has reached, looking for a duplicate, so that each rtable
entry which varies only by non-hash-relevant values is counted only once.
That's good.  But that means for a chain of n nodes, checking the ith node for
duplicates requires visiting at worst i-1 other nodes, so to process the entire
chain we need to visit SUM(i=1..n)(i-1) == n(n-1)/2 nodes rather than the n
nodes a simple linear search would inspect.  I see how prefetching can minimize
the cost of the additional node visits, but that's still a significantly larger
set of nodes to check per expiration run.

> The real cost, by two orders of magnitude, is bringing the needed information
> into the CPU cache (one cache-line miss costs more than 200 cycles, and we need
> at least two cache lines per rtable).

> Adding a 'group' notion to dst entries is overengineering an already too
> complex ipv4/route.c, which is almost unreadable these days; even you could
> not spot the point where we move an entry to the front of its chain.
> 
> Hint: search for /* Put it first */ in rt_intern_hash()
> 
Ah! So it's not actually on rtable access (i.e. route lookup) that we do a
move-to-front; it's only when we're hashing in new entries.  That explains why
I didn't see it: I was thinking route lookups moved the entry to the front, not
inserts.

> Moving the whole group to the front would actually defeat the purpose of the
> move, since an entry's rank in the chain is used to decay its timeout in the
> garbage collector. (search for tmo >>= 1; )
> 
Argh, so the list is implicitly ordered by expiration time.  That really
defeats the entire purpose of doing grouping in the list at all.  If that's the
case, then I agree, it's probably better to take the additional visitation hit
in check_expire above than to try to preserve ordering.

Thanks
Neil

David Miller May 19, 2009, 10:05 p.m. UTC | #2
From: Neil Horman <nhorman@tuxdriver.com>
Date: Tue, 19 May 2009 15:24:50 -0400

>> Moving the whole group to the front would actually defeat the purpose of the
>> move, since an entry's rank in the chain is used to decay its timeout in the
>> garbage collector. (search for tmo >>= 1; )
>> 
> Argh, so the list is implicitly ordered by expiration time.  That really
> defeats the entire purpose of doing grouping in the list at all.  If that's
> the case, then I agree, it's probably better to take the additional
> visitation hit in check_expire above than to try to preserve ordering.

Yes, this seems best.

I was worried that somehow the ordering also influences lookups: since the TOS
bits don't go into the hash, it might be important that explicit TOS values
appear before wildcard ones.  But it doesn't appear that this is an issue; we
don't have wildcard TOSs in the rtable entries, they are always explicit.

So I would like to see an explicit final patch from Eric so we can get
this fixed now.

Thanks!
Neil Horman May 19, 2009, 11:05 p.m. UTC | #3
On Tue, May 19, 2009 at 03:05:17PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Tue, 19 May 2009 15:24:50 -0400
> 
> >> Moving the whole group to the front would actually defeat the purpose of the
> >> move, since an entry's rank in the chain is used to decay its timeout in the
> >> garbage collector. (search for tmo >>= 1; )
> >> 
> > Argh, so the list is implicitly ordered by expiration time.  That really
> > defeats the entire purpose of doing grouping in the list at all.  If that's
> > the case, then I agree, it's probably better to take the additional
> > visitation hit in check_expire above than to try to preserve ordering.
> 
> Yes, this seems best.
> 
> I was worried that somehow the ordering also influences lookups: since the
> TOS bits don't go into the hash, it might be important that explicit TOS
> values appear before wildcard ones.  But it doesn't appear that this is an
> issue; we don't have wildcard TOSs in the rtable entries, they are always
> explicit.
> 
> So I would like to see an explicit final patch from Eric so we can get
> this fixed now.
> 
> Thanks!

Agreed, thanks once again for educating me, Eric!
Neil


Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..769f72e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@  static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
-	struct rtable *rth, **rthp;
+	struct rtable *rth, *aux, **rthp;
 	unsigned long length = 0, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
@@ -795,7 +795,6 @@  static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -809,8 +808,10 @@  static void rt_check_expire(void)
 
 		if (*rthp == NULL)
 			continue;
+		length = 0;
 		spin_lock_bh(rt_hash_lock_addr(i));
 		while ((rth = *rthp) != NULL) {
+			prefetch(rth->u.dst.rt_next);
 			if (rt_is_expired(rth)) {
 				*rthp = rth->u.dst.rt_next;
 				rt_free(rth);
@@ -819,33 +820,30 @@  static void rt_check_expire(void)
 			if (rth->u.dst.expires) {
 				/* Entry is expired even if it is in use */
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
+nofree:
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
 					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
+					 * We only count entries on
 					 * a chain with equal hash inputs once
 					 * so that entries for different QOS
 					 * levels, and other non-hash input
 					 * attributes don't unfairly skew
 					 * the length computation
 					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
+					for (aux = rt_hash_table[i].chain;;) {
+						if (aux == rth) {
+							length += ONE;
+							break;
+						}
+						if (compare_hash_inputs(&aux->fl, &rth->fl))
+							break;
+						aux = aux->u.dst.rt_next;
+					}
 					continue;
 				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
-				tmo >>= 1;
-				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
-				continue;
-			}
+			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
+				goto nofree;
 
 			/* Cleanup aged off entries. */
 			*rthp = rth->u.dst.rt_next;
@@ -1068,7 +1066,6 @@  out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32 		min_score;
@@ -1088,7 +1085,6 @@  restart:
 	}
 
 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1134,17 +1130,6 @@  restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt.  If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}
 
 	if (cand) {
@@ -1205,10 +1190,7 @@  restart:
 		}
 	}
 
-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1224,10 +1206,7 @@  restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
 
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;