From patchwork Tue May 19 15:32:29 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 27398
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A12D10D.3000504@cosmosbay.com>
Date: Tue, 19 May 2009 17:32:29 +0200
From: Eric Dumazet
To: Jarek Poplawski
CC: lav@yar.ru, Stephen Hemminger, netdev@vger.kernel.org, Neil Horman
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c
In-Reply-To: <20090519123417.GA7376@ff.dom.local>
X-Mailing-List: netdev@vger.kernel.org

Jarek Poplawski wrote:
> On 19-05-2009 04:35, Stephen Hemminger wrote:
>> Begin forwarded message:
>>
>> Date: Mon, 18 May 2009 14:10:20 GMT
>> From: bugzilla-daemon@bugzilla.kernel.org
>> To: shemminger@linux-foundation.org
>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>
>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> ...
>> A 2.6.29 patch introduced flexible route cache rebuilding. Unfortunately, the
>> patch has at least one critical flaw, and another problem.
>>
>> rt_intern_hash() calculates the rthi pointer, which is later used to insert the
>> new entry. The same loop calculates the cand pointer, which is used to clean the
>> list. If the two pointers are the same, an rtable leak occurs: first cand is
>> removed, then the new entry is appended after it.
>>
>> This leak leads to an unregister_netdevice problem (usage count > 0).
>>
>> The other problem with the patch is that it tries to insert the entries in a
>> certain order, to facilitate counting of entries that are distinct in all but
>> QoS parameters. Unfortunately, referencing an existing rtable entry moves it to
>> the beginning of the list, to speed up further lookups, so the carefully built
>> order is destroyed.

We could change rt_check_expire() to be smarter and handle any order in
chains. This would let rt_intern_hash() be simpler. As it's a more
performance-critical path, all would be good :)
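To make the reported leak concrete, here is a minimal user-space sketch, not
kernel code (the node type, field names and values are invented for
illustration), of the sequence rt_intern_hash() goes through when cand and
rthi end up pointing at the same entry: the eviction candidate is unlinked
first, then the new route is chained behind it, leaving the new route
unreachable from the chain head.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for a hash chain entry; not the kernel's struct rtable. */
struct node {
	int id;
	struct node *next;
};

static struct node *new_node(int id, struct node *next)
{
	struct node *n = malloc(sizeof(*n));

	n->id = id;
	n->next = next;
	return n;
}

int main(void)
{
	/* chain: 1 -> 2 -> 3 */
	struct node *head = new_node(1, new_node(2, new_node(3, NULL)));
	struct node *p;

	/* The eviction candidate and the insertion point are the same entry. */
	struct node **candp = &head->next;
	struct node *cand = *candp;		/* node 2 */
	struct node *rthi = cand;

	/* Step 1: cand is unlinked from the chain (and would be rt_free()d). */
	*candp = cand->next;

	/*
	 * Step 2: the new entry is appended after rthi == cand, which is
	 * no longer on the chain, so it can never be found or reclaimed.
	 */
	struct node *rt = new_node(4, rthi->next);
	rthi->next = rt;

	for (p = head; p; p = p->next)
		printf("reachable: %d\n", p->id);	/* prints 1 and 3 only */
	return 0;
}

Since the leaked entry still holds a reference on its device, the usage count
never drops to zero, which matches the unregister_netdevice symptom described
in the report.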
>>
>> For the first problem the simplest patch is to set rthi=0 when rthi==cand, but
>> it will also destroy the ordering.
>
> I think fixing this bug fast is more important than this
> ordering or counting. Could you send your patch proposal?

Here is mine, only compiled, not tested yet.

All credit to Stephen for doing the analysis, of course :)

---
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..fbe77ad 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -780,12 +780,37 @@ static void rt_do_flush(int process_context)
 #define FRACT_BITS 3
 #define ONE (1UL << FRACT_BITS)
 
+static unsigned long compute_length(struct rtable *head)
+{
+	struct rtable *rth, *aux;
+	unsigned long length = 0;
+
+	for (rth = head; rth != NULL; rth = rth->u.dst.rt_next) {
+		/*
+		 * We ignore items having same hash inputs
+		 * so that entries for different QOS
+		 * levels, and other non-hash input
+		 * attributes don't unfairly skew
+		 * the length computation
+		 */
+		for (aux = head; ; aux = aux->u.dst.rt_next) {
+			if (aux == rth) {
+				length += ONE;
+				break;
+			}
+			if (compare_hash_inputs(&aux->fl, &rth->fl))
+				break;
+		}
+	}
+	return length;
+}
+
 static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
 	struct rtable *rth, **rthp;
-	unsigned long length = 0, samples = 0;
+	unsigned long length, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
 
@@ -795,7 +820,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -821,29 +845,11 @@ static void rt_check_expire(void)
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
-					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
-					 * a chain with equal hash inputs once
-					 * so that entries for different QOS
-					 * levels, and other non-hash input
-					 * attributes don't unfairly skew
-					 * the length computation
-					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
 					continue;
 				}
 			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
 				tmo >>= 1;
 				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
 				continue;
 			}
 
@@ -851,6 +857,7 @@ static void rt_check_expire(void)
 			*rthp = rth->u.dst.rt_next;
 			rt_free(rth);
 		}
+		length = compute_length(rt_hash_table[i].chain);
 		spin_unlock_bh(rt_hash_lock_addr(i));
 		sum += length;
 		sum2 += length*length;
@@ -1068,7 +1075,6 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32		min_score;
@@ -1088,7 +1094,6 @@ restart:
 	}
 
 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1134,17 +1139,6 @@ restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt.  If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}
 
 	if (cand) {
@@ -1205,10 +1199,7 @@ restart:
 		}
 	}
 
-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1224,10 +1215,7 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
 
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;
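As a side note on the counting rule: the following stand-alone user-space
sketch (the struct and function names are invented here, this is not the
kernel API) mirrors what compute_length() above computes. An entry contributes
to the measured chain length only if no earlier entry in the chain has the
same hash inputs, so cached routes that differ only in non-hash attributes
such as QoS/TOS are counted once. The scan is quadratic per chain, which is
acceptable in rt_check_expire() but would not be in the more
performance-critical rt_intern_hash() path mentioned above.

#include <stdio.h>
#include <stdbool.h>

#define FRACT_BITS 3
#define ONE (1UL << FRACT_BITS)

/* Invented stand-in for the hash inputs of a cached route. */
struct entry {
	unsigned int saddr, daddr, iif;	/* hash inputs */
	unsigned int tos;		/* not a hash input (QoS) */
};

static bool same_hash_inputs(const struct entry *a, const struct entry *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr && a->iif == b->iif;
}

/*
 * Same shape as compute_length(): count an entry only if no earlier
 * entry in the chain shares its hash inputs.
 */
static unsigned long chain_length(const struct entry *e, int n)
{
	unsigned long length = 0;
	int i, j;

	for (i = 0; i < n; i++) {
		for (j = 0; ; j++) {
			if (j == i) {
				length += ONE;
				break;
			}
			if (same_hash_inputs(&e[j], &e[i]))
				break;
		}
	}
	return length;
}

int main(void)
{
	struct entry chain[] = {
		{ .saddr = 1, .daddr = 2, .iif = 0, .tos = 0x00 },
		{ .saddr = 1, .daddr = 2, .iif = 0, .tos = 0x10 },	/* same hash inputs */
		{ .saddr = 3, .daddr = 4, .iif = 0, .tos = 0x00 },
	};

	/* Two distinct sets of hash inputs: prints 2 * ONE = 16. */
	printf("length = %lu\n", chain_length(chain, 3));
	return 0;
}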