From patchwork Tue May 19 18:47:54 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 27408
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A12FEDA.7040806@cosmosbay.com>
Date: Tue, 19 May 2009 20:47:54 +0200
From: Eric Dumazet
To: Neil Horman
Cc: Jarek Poplawski, lav@yar.ru, Stephen Hemminger, netdev@vger.kernel.org
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c
References: <20090519123417.GA7376@ff.dom.local> <4A12D10D.3000504@cosmosbay.com>
 <20090519162048.GB28034@hmsreliant.think-freely.org>
In-Reply-To: <20090519162048.GB28034@hmsreliant.think-freely.org>

Neil Horman wrote:
> On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
>> Jarek Poplawski wrote:
>>> On 19-05-2009 04:35, Stephen Hemminger wrote:
>>>> Begin forwarded message:
>>>>
>>>> Date: Mon, 18 May 2009 14:10:20 GMT
>>>> From: bugzilla-daemon@bugzilla.kernel.org
>>>> To: shemminger@linux-foundation.org
>>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
>>> ...
>>>> A 2.6.29 patch introduced flexible route cache rebuilding. Unfortunately the
>>>> patch has at least one critical flaw, and another problem.
>>>>
>>>> rt_intern_hash() calculates the rthi pointer, which is later used to insert
>>>> the new entry. The same loop calculates the cand pointer, which is used to
>>>> clean the list. If the two pointers are the same, an rtable leak occurs: first
>>>> cand is removed, then the new entry is appended to it.
>>>>
>>>> This leak leads to an unregister_netdevice problem (usage count > 0).
>>>>
>>>> Another problem with the patch is that it tries to insert the entries in a
>>>> certain order, to facilitate counting of entries distinct by all but QoS parameters.
>>>> Unfortunately, referencing an existing rtable entry moves it to the head of
>>>> the list, to speed up further lookups, so the carefully built order is destroyed.
>>
>> We could change rt_check_expire() to be smarter and handle any order in the chains.
>>
>> This would let rt_intern_hash() be simpler.
>>
>> As it's the more performance-critical path, all would be good :)
>>
>>>> For the first problem the simplest fix is to set rthi = NULL when rthi == cand,
>>>> but it will also destroy the ordering.
>>>
>>> I think fixing this bug fast is more important than this
>>> ordering or counting. Could you send your patch proposal?
>>>
>
> I was thinking something along these lines (also completely untested). The
> extra check in rt_intern_hash prevents us from identifying a 'group leader' that
> is about to be deleted, and sets rthi to point at the group leader rather than
> the last entry in the group as it previously did, which we then mark with the
> new flag in the dst_entry (updating it on rt_free of course). This lets us
> identify the boundaries of a group, so when we do the move-to-front heuristic,
> we can move the entire group rather than just the single entry. I prefer this
> to Eric's approach since, even though rt_check_expire isn't in a
> performance-critical path, the addition of the extra list search when the list
> is unordered makes rt_check_expire run in O(n^2) rather than O(n), which I
> think will have a significant impact on high-traffic systems. Since the
> ordered list creation was done at almost zero performance cost in
> rt_intern_hash, I'd like to keep it if at all possible.
>
> There is of course a shortcoming in my patch in that it doesn't actually do
> anything currently with DST_GRPLDR other than set/clear it appropriately; I've
> yet to identify where the route cache move-to-front heuristic is implemented.
> If someone could show me where that is, I'd be happy to finish this patch.
>

O(n^2)? Not exactly... N * (Y + x) instead of N * Y, where x <= Y/100, and we
can in fact make x effectively 0 with proper prefetching.

The real cost, by two orders of magnitude, is bringing the needed information
into the CPU cache: one cache-line miss costs more than 200 cycles, and we need
at least two cache lines per rtable.

Adding a 'group' notion to dst entries is over-engineering an already too
complex ipv4/route.c, which is almost unreadable these days, since even you
could not spot the point where we move an entry to the front of its chain.

Hint: search for /* Put it first */ in rt_intern_hash() (a user-space sketch of
the idea follows below).

Moving the whole group to the front would actually defeat the purpose of the
move, since an entry's rank in the chain is used to decay its timeout in the
garbage collector (search for "tmo >>= 1;").

Here is a revised patch, doing the length computation inside the loop, while
the CPU is prefetching the next entry into its cache.

BTW, we actually have another bug, since length is set to 0 in the wrong place
(before the "for (; goal > 0; goal--)" loop). We must of course reset length to
0 for each chain.

All credit to Alexander V. Lukyanov for doing the analysis, of course :)
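Since that move-to-front step is easy to miss in route.c, here is a minimal,
self-contained user-space sketch of the idea. Names here (struct node,
lookup_move_to_front) are toy stand-ins, not the real kernel types:

#include <stdio.h>

/* Toy stand-in for a hash chain entry: "key" plays the role of the
 * flow hash inputs, "next" the role of u.dst.rt_next. */
struct node {
        int key;
        struct node *next;
};

/*
 * On a cache hit, unlink the entry from its current position and
 * relink it at the chain head, so later lookups find it sooner.
 * This is the shape of the "Put it first" step in rt_intern_hash().
 */
static struct node *lookup_move_to_front(struct node **chain, int key)
{
        struct node **rthp = chain;     /* slot we may rewrite */
        struct node *rth;

        while ((rth = *rthp) != NULL) {
                if (rth->key == key) {
                        *rthp = rth->next;      /* unlink */
                        rth->next = *chain;     /* relink at head */
                        *chain = rth;
                        return rth;
                }
                rthp = &rth->next;
        }
        return NULL;
}

int main(void)
{
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        struct node *chain = &a, *n;

        lookup_move_to_front(&chain, 3);
        for (n = chain; n; n = n->next)
                printf("%d ", n->key);          /* prints: 3 1 2 */
        printf("\n");
        return 0;
}

In the real rt_intern_hash() the relink is done with rcu_assign_pointer()
under the per-chain spinlock, so lockless readers never see a half-updated
chain. And because an entry's rank in the chain drives the "tmo >>= 1" decay
in rt_check_expire(), moving a whole group forward would shield stale entries
behind a fresh leader from the garbage collector.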
 net/ipv4/route.c |   57 ++++++++++++++------------------------------
 1 files changed, 18 insertions(+), 39 deletions(-)

---

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..769f72e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@ static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
-	struct rtable *rth, **rthp;
+	struct rtable *rth, *aux, **rthp;
 	unsigned long length = 0, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
@@ -795,7 +795,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
@@ -809,8 +808,10 @@ static void rt_check_expire(void)
 		if (*rthp == NULL)
 			continue;
+		length = 0;
 		spin_lock_bh(rt_hash_lock_addr(i));
 		while ((rth = *rthp) != NULL) {
+			prefetch(rth->u.dst.rt_next);
 			if (rt_is_expired(rth)) {
 				*rthp = rth->u.dst.rt_next;
 				rt_free(rth);
@@ -819,33 +820,30 @@ static void rt_check_expire(void)
 			if (rth->u.dst.expires) {
 				/* Entry is expired even if it is in use */
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
+nofree:
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
 					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
+					 * We only count entries on
 					 * a chain with equal hash inputs once
 					 * so that entries for different QOS
 					 * levels, and other non-hash input
 					 * attributes don't unfairly skew
 					 * the length computation
 					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
+					for (aux = rt_hash_table[i].chain;;) {
+						if (aux == rth) {
+							length += ONE;
+							break;
+						}
+						if (compare_hash_inputs(&aux->fl, &rth->fl))
+							break;
+						aux = aux->u.dst.rt_next;
+					}
 					continue;
 				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
-				tmo >>= 1;
-				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
-				continue;
-			}
+			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
+				goto nofree;

 			/* Cleanup aged off entries. */
 			*rthp = rth->u.dst.rt_next;
@@ -1068,7 +1066,6 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32		min_score;
@@ -1088,7 +1085,6 @@ restart:
 	}

 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;

 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1134,17 +1130,6 @@ restart:
 		chain_length++;

 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt. If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}

 	if (cand) {
@@ -1205,10 +1190,7 @@ restart:
 		}
 	}

-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;

 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1224,10 +1206,7 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);

 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;
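The net effect of the rt_check_expire() hunk above is an order-independent
length estimate: an entry bumps length only if it is the first entry, scanning
from the chain head, with its hash inputs, so move-to-front reordering can no
longer skew the count. A standalone user-space sketch of that inner loop, with
hypothetical helpers (same_inputs() standing in for compare_hash_inputs(),
struct entry for struct rtable):

#include <stdio.h>

/* Toy stand-ins: "inputs" plays the role of the flowi hash inputs,
 * "next" the role of u.dst.rt_next. */
struct entry {
        int inputs;
        struct entry *next;
};

static int same_inputs(const struct entry *a, const struct entry *b)
{
        return a->inputs == b->inputs;
}

/*
 * Order-independent chain length: an entry is counted only if no
 * earlier entry in the chain shares its hash inputs, mirroring the
 * "for (aux = rt_hash_table[i].chain;;)" loop in the patch above.
 */
static unsigned int chain_length(struct entry *chain)
{
        struct entry *rth, *aux;
        unsigned int length = 0;

        for (rth = chain; rth != NULL; rth = rth->next) {
                for (aux = chain;; aux = aux->next) {
                        if (aux == rth) {       /* first of its group */
                                length++;
                                break;
                        }
                        if (same_inputs(aux, rth))
                                break;          /* group already counted */
                }
        }
        return length;
}

int main(void)
{
        /* Inputs 7, 5, 7, 9: the two 7-entries count as one group,
         * whatever order move-to-front leaves them in. */
        struct entry d = { 9, NULL }, c = { 7, &d }, b = { 5, &c }, a = { 7, &b };

        printf("%u\n", chain_length(&a));       /* prints: 3 */
        return 0;
}

The head-rescan is the "x" in Eric's N * (Y + x) estimate above: it is bounded
by the entry's depth in the chain and walks cache lines the outer loop has just
touched, which the added prefetch() helps hide.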