From patchwork Wed Jan 9 17:44:19 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roland Dreier X-Patchwork-Id: 210803 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 57AF92C00DA for ; Thu, 10 Jan 2013 04:44:32 +1100 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932241Ab3AIRo2 (ORCPT ); Wed, 9 Jan 2013 12:44:28 -0500 Received: from na3sys010aog103.obsmtp.com ([74.125.245.74]:51610 "HELO na3sys010aog103.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S932233Ab3AIRo1 (ORCPT ); Wed, 9 Jan 2013 12:44:27 -0500 Received: from mail-pa0-f72.google.com ([209.85.220.72]) (using TLSv1) by na3sys010aob103.postini.com ([74.125.244.12]) with SMTP ID DSNKUO2seuCEeIf+JQeWG5r5awjU3F9/7Tcy@postini.com; Wed, 09 Jan 2013 09:44:27 PST Received: by mail-pa0-f72.google.com with SMTP id fb10so1807613pad.3 for ; Wed, 09 Jan 2013 09:44:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=purestorage.com; s=google; h=x-received:x-received:sender:from:to:cc:subject:date:message-id :x-mailer; bh=aa+Ri6TcrlXtS4BbqiDoNN9pgWW1DgE+q3+losjAXo4=; b=c4oRQp+ebC/rXU9Jb4RNzkBAj+0uNmYYYEKuOiwlKx5gflsds/W+Jgvk4l8L0vpyjA BLvFK0ilT/hHv3c9C+AGYqKmqvRKpq5UhZ8rBGqz5xNfhjfMAP+tqPBsFdjf5bHdv5rv UZ9MYRdqD2g28rRRCkaw46Fkw7+NG0+E+LEKU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:x-received:sender:from:to:cc:subject:date:message-id :x-mailer:x-gm-message-state; bh=aa+Ri6TcrlXtS4BbqiDoNN9pgWW1DgE+q3+losjAXo4=; b=g2E/qprkPoD0mzrVVIZNQwOuESb6QuGzcNULeKvFtlx9fMgcljYZjr2zvEHOCbFIGC Q5fuUUy0xEMZ1ovXCU1SHMBL8bOATduwSjbivRmw4eez4tH7DVO/jOnpaO2eK9I29nWj lNXBkuf+hIIKHkzd0FcO4wLjKid5xooTUXUIEkeAw5C3pvxJ/Vy3gvfusA84py1hdewN ucbtWNRn9JcL5LtoL5mzCjmMqT3RQfkJ21Z9QPgQ0NESNL+tD+8I2tYpy/MQHYaoFwtN 2255b2d5D8qs0VFAFEjATzrBzoe1wDDCk7xqukokL+xey8hmZEsmDVkvkKSP0n9qc8WI Y5Cg== X-Received: by 10.68.218.97 with SMTP id pf1mr211519633pbc.96.1357753466266; Wed, 09 Jan 2013 09:44:26 -0800 (PST) X-Received: by 10.68.218.97 with SMTP id pf1mr211519613pbc.96.1357753466099; Wed, 09 Jan 2013 09:44:26 -0800 (PST) Received: from roland-t410s.purestorage.com ([216.200.155.2]) by mx.google.com with ESMTPS id is6sm41969918pbc.55.2013.01.09.09.44.23 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 09 Jan 2013 09:44:24 -0800 (PST) From: Roland Dreier To: "David S. Miller" Cc: netdev@vger.kernel.org, Roland Dreier Subject: [PATCH/RFC] ipv6: fib: Drop cached routes with dead neighbours on fib GC Date: Wed, 9 Jan 2013 09:44:19 -0800 Message-Id: <1357753459-12872-1-git-send-email-roland@kernel.org> X-Mailer: git-send-email 1.8.0 X-Gm-Message-State: ALoCoQnzUgVcDlzqWoLIwPO8rOM83Xpl4g+e7biMZiyU38GXD267L/C359/SHOpsCWkD1+anisC0T/NqEbL1OBoK2e+xdzlDTIqQdbzKseIOu4t+vLVtxDS14chSAnN/Au5CFld4omhtwYticYbOBzA9ufjetv82/9Y22uLUjgREmpvHKVHYh84= Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Roland Dreier This patch is as much bug report as it is a proposal to merge this specific patch. The problem is definitely real; we hit it in a situation where we have two systems connected back-to-back with two bonded links between them, one system reboots, and the other system gets NETDEV_CHANGEADDR. This patch definitely fixes that case for us, but I'm not sure it's the right place to fix this, or if it covers all the cases where this could happen. Anyway... ------------ 8< ------------ When a bonding interface changes slaves (or goes from no active slaves to having a slave with link), the bonding driver generates a NETDEV_CHANGEADDR notification. In this case, the ipv6 neighbour discovery code calls neigh_changeaddr(), which basically just calls neigh_flush_dev(). Now, neigh_flush_dev() just goes through the neighbour hash table and tries to free every neighbour from the device being flushed. However, if someone else has an additional reference on the neighbour, we hit if (atomic_read(&n->refcnt) != 1) { /* The most unpleasant situation. We must destroy neighbour entry, but someone still uses it. The destroy will be delayed until the last user releases us, but we must kill timers etc. and move it to safe state. */ skb_queue_purge(&n->arp_queue); n->arp_queue_len_bytes = 0; n->output = neigh_blackhole; if (n->nud_state & NUD_VALID) n->nud_state = NUD_NOARP; else n->nud_state = NUD_NONE; NEIGH_PRINTK2("neigh %p is stray.\n", n); } which leaves the final freeing of the "stray" neighbour until the last reference is dropped; in the meantime the output function is set to neigh_blackhole, which does nothing but free the skb and return ENETDOWN. All of this is fine, unless we have something like a TCP over IPv6 to a system directly reachable via the bonding interface. In that case, we'll have a cached route for the flow. The route will hold a reference on a neighbour pointing to the destination, so the neighbour will become "stray" and won't be freed. The TCP socket will hold a reference on the route, so it won't be GCed after the aging interval. Since every packet we send via the "stray" neighbour will be dropped, a TCP connection can stay in the ESTABLISHED (or FIN-WAIT-1) state for a *long* time before it finally gives up. This leads to a situation where even new connections to or from that destination fail, because we just drop every packet we try to send (including things like neighbour discovery advertisements in response to the remote system trying to find us). One symptom of having the cached route with a "stray" neighbour around is that ping6 to the unreachable system up will fail with: # ping6 fe80::202:c903:f:185b%bond0 PING fe80::202:c903:f:185b%bond0(fe80::202:c903:f:185b) 56 data bytes ping: sendmsg: Network is down ping: sendmsg: Network is down ("sendmsg: Network is down" == ENETDOWN returned from neigh_blackhole()) A solution is much simpler than the problem's description: the ipv6 ndisc code calls fib6_run_gc() right after it calls neigh_changeaddr(), and we can add one line to fib6_age() to drop cached routes with a "stray" neighbour. This forcibly kills the routes that hold a reference to the "stray" neighbour so it can be freed without waiting. Signed-off-by: Roland Dreier --- net/ipv6/ip6_fib.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 710cafd..5895b1c 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1602,8 +1602,9 @@ static int fib6_age(struct rt6_info *rt, void *arg) } gc_args.more++; } else if (rt->rt6i_flags & RTF_CACHE) { - if (atomic_read(&rt->dst.__refcnt) == 0 && - time_after_eq(now, rt->dst.lastuse + gc_args.timeout)) { + if ((rt->n && rt->n->dead) || + (atomic_read(&rt->dst.__refcnt) == 0 && + time_after_eq(now, rt->dst.lastuse + gc_args.timeout))) { RT6_TRACE("aging clone %p\n", rt); return -1; } else if (rt->rt6i_flags & RTF_GATEWAY) {