diff mbox

[V3] neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"

Message ID 1496188346-83229-1-git-send-email-sowmini.varadhan@oracle.com
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan May 30, 2017, 11:52 p.m. UTC
The command
  # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2
adds an entry like the following (listed by "arp -an")
  ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2
but the symmetric deletion command
  # arp -i eth2 -d 62.2.0.1
does not remove the PERM entry from the table, and instead leaves behind
  ? (62.2.0.1) at <incomplete> on eth2

The reason is that there is a refcnt of 1 for the arp_tbl itself
(neigh_alloc starts off the entry with a refcnt of 1), thus
the neigh_release() call from arp_invalidate() will (at best) just
decrement the ref to 1, but will never actually free it from the
table.

To fix this, we need to do something like neigh_forced_gc: if
the refcnt is 1 (i.e., on the table's ref), remove the entry from
the table and free it. This patch refactors and shares common code
between neigh_forced_gc and the newly added neigh_remove_one.

A similar issue exists for IPv6 Neighbor Cache entries, and is fixed
in a similar manner by this patch.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: kbuild-test-robot compile error
v3: do not export_symbol neigh_remove_one (David Miller comment)

 include/net/neighbour.h |    1 +
 net/core/neighbour.c    |   62 ++++++++++++++++++++++++++++++++++++++--------
 net/ipv4/arp.c          |    1 +
 3 files changed, 53 insertions(+), 11 deletions(-)

Comments

Julian Anastasov May 31, 2017, 9:41 p.m. UTC | #1
Hello,

On Tue, 30 May 2017, Sowmini Varadhan wrote:

> @@ -1650,6 +1689,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
>  			   NEIGH_UPDATE_F_ADMIN,
>  			   NETLINK_CB(skb).portid);
>  	neigh_release(neigh);
> +	neigh_remove_one(neigh, tbl);

	So, we do not hold reference to neigh while accessing
its fields. I suspect we need to move the table lock from
neigh_remove_one here, for example:

+	write_lock_bh(&tbl->lock);
	neigh_release(neigh);
+	neigh_remove_one(neigh, tbl);
+	write_unlock_bh(&tbl->lock);

	Otherwise, neigh_forced_gc can call neigh_destroy
where neigh is freed (and replaced by another neigh entry) after
RCU grace period (which can end prematurely if there are no
running RCU read-side critical sections) and I'm not sure if
our thread can be delayed, so that such pointer replacement
can happen unnoticed by us.

	Another solution to cause faster removal would be
to cancel the gc_work and to schedule it after 1 jiffie.
It helps when many entries are deleted at once but the
work prefers to just sleep when gc_thresh1 is not reached,
so such solution is not good enough.

Regards

--
Julian Anastasov <ja@ssi.bg>
Sowmini Varadhan May 31, 2017, 9:46 p.m. UTC | #2
On (06/01/17 00:41), Julian Anastasov wrote:
> 
> 	So, we do not hold reference to neigh while accessing
> its fields. I suspect we need to move the table lock from
> neigh_remove_one here, for example:

good point, let me think over your suggestion carefully (it sounds
right, I want to make sure I dont miss any other race-windows)
and post patch v4 tomorrow..

> 	Another solution to cause faster removal would be
> to cancel the gc_work and to schedule it after 1 jiffie.
> It helps when many entries are deleted at once but the
> work prefers to just sleep when gc_thresh1 is not reached,
> so such solution is not good enough.

Right the other drawback of relying on gc for cleanup is
that it doesn't give higher preference to cleaning up FAILED
entries first- so it can end up cleaning up other useful entries
(as I was pointing out to David Ahern)

--Sowmini
Sowmini Varadhan May 31, 2017, 10:07 p.m. UTC | #3
On (06/01/17 00:41), Julian Anastasov wrote:
> 
> 	So, we do not hold reference to neigh while accessing
> its fields. I suspect we need to move the table lock from
> neigh_remove_one here, for example:

Another thought is to have neigh_remove_one to remove
a neigh only if it is NUD_FAILED - that may potentially remove
more than one entry, but that's probably harmless?

--Sowmini
Julian Anastasov May 31, 2017, 10:41 p.m. UTC | #4
Hello,

On Wed, 31 May 2017, Sowmini Varadhan wrote:

> On (06/01/17 00:41), Julian Anastasov wrote:
> > 
> > 	So, we do not hold reference to neigh while accessing
> > its fields. I suspect we need to move the table lock from
> > neigh_remove_one here, for example:
> 
> good point, let me think over your suggestion carefully (it sounds
> right, I want to make sure I dont miss any other race-windows)
> and post patch v4 tomorrow..

	Another problem is that neigh_update() changes the
state but before we go and unlink the entry another CPU
can reactivate the entry, i.e. NUD_INCOMPLETE entered
in __neigh_event_send(). So, there will be always some
small window where surprises can happen and the entry is
not really deleted.

	Checking for NUD_FAILED may not be needed, the
atomic_read(&n->refcnt) == 1 check under lock ensures
that we are in some state without timer, i.e. not in
NUD_INCOMPLETE, so it is ok to remove the entry.

> > 	Another solution to cause faster removal would be
> > to cancel the gc_work and to schedule it after 1 jiffie.
> > It helps when many entries are deleted at once but the
> > work prefers to just sleep when gc_thresh1 is not reached,
> > so such solution is not good enough.
> 
> Right the other drawback of relying on gc for cleanup is
> that it doesn't give higher preference to cleaning up FAILED
> entries first- so it can end up cleaning up other useful entries
> (as I was pointing out to David Ahern)

	I see, ok.

Regards

--
Julian Anastasov <ja@ssi.bg>
David Miller May 31, 2017, 11:26 p.m. UTC | #5
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 30 May 2017 16:52:26 -0700


Just as a side note in the future, if you could fix the date
and time on your computer, that would be awesome :-)
Sowmini Varadhan June 1, 2017, 1:21 a.m. UTC | #6
On (06/01/17 01:41), Julian Anastasov wrote:
> 	Another problem is that neigh_update() changes the
> state but before we go and unlink the entry another CPU
> can reactivate the entry, i.e. NUD_INCOMPLETE entered
> in __neigh_event_send(). So, there will be always some
> small window where surprises can happen and the entry is
> not really deleted.

but that would be ok- it's the same as if I did
   arp -d <addr>
   ping <addr>

I think the only danger is that, once we drop the ref with
neigh_release(), we are relying on the table's lock for the neigh
to not disappear under us. Thus as you correctly pointed out,
we need the tbl->lock to make sure we sync with all paths that
can pull the neigh out of the table (and release the table's ref
along the way).

Thanks for catching that, patch v4 (with correct smtp timestamp!)
will be sent out shortly.

--Sowmini
diff mbox

Patch

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e4dd3a2..639b675 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -317,6 +317,7 @@  struct neighbour *__neigh_create(struct neigh_table *tbl, const void *pkey,
 int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, u32 flags,
 		 u32 nlmsg_pid);
 void __neigh_set_probe_once(struct neighbour *neigh);
+bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
 int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d274f81..de8bf76 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -118,6 +118,54 @@  unsigned long neigh_rand_reach_time(unsigned long base)
 EXPORT_SYMBOL(neigh_rand_reach_time);
 
 
+static bool neigh_del(struct neighbour *n, __u8 state,
+		      struct neighbour __rcu **np, struct neigh_table *tbl)
+{
+	bool retval = false;
+
+	write_lock(&n->lock);
+	if (atomic_read(&n->refcnt) == 1 && !(n->nud_state & state)) {
+		rcu_assign_pointer(*np,
+				   rcu_dereference_protected(n->next,
+				   lockdep_is_held(&tbl->lock)));
+		n->dead = 1;
+		retval = true;
+	}
+	write_unlock(&n->lock);
+	if (retval)
+		neigh_cleanup_and_release(n);
+	return retval;
+}
+
+bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl)
+{
+	struct neigh_hash_table *nht;
+	void *pkey = ndel->primary_key;
+	u32 hash_val;
+	struct neighbour *n;
+	struct neighbour __rcu **np;
+	bool retval = false;
+
+	write_lock_bh(&tbl->lock);
+	nht = rcu_dereference_protected(tbl->nht,
+					lockdep_is_held(&tbl->lock));
+	hash_val = tbl->hash(pkey, ndel->dev, nht->hash_rnd);
+	hash_val = hash_val >> (32 - nht->hash_shift);
+
+	np = &nht->hash_buckets[hash_val];
+	while ((n = rcu_dereference_protected(*np,
+				lockdep_is_held(&tbl->lock))) != NULL) {
+		if (n == ndel) {
+			retval = neigh_del(n, 0, np, tbl);
+			write_unlock_bh(&tbl->lock);
+			return retval;
+		}
+		np = &n->next;
+	}
+	write_unlock_bh(&tbl->lock);
+	return retval;
+}
+
 static int neigh_forced_gc(struct neigh_table *tbl)
 {
 	int shrunk = 0;
@@ -140,19 +188,10 @@  static int neigh_forced_gc(struct neigh_table *tbl)
 			 * - nobody refers to it.
 			 * - it is not permanent
 			 */
-			write_lock(&n->lock);
-			if (atomic_read(&n->refcnt) == 1 &&
-			    !(n->nud_state & NUD_PERMANENT)) {
-				rcu_assign_pointer(*np,
-					rcu_dereference_protected(n->next,
-						  lockdep_is_held(&tbl->lock)));
-				n->dead = 1;
-				shrunk	= 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
+			if (neigh_del(n, NUD_PERMANENT, np, tbl)) {
+				shrunk = 1;
 				continue;
 			}
-			write_unlock(&n->lock);
 			np = &n->next;
 		}
 	}
@@ -1650,6 +1689,7 @@  static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
 			   NEIGH_UPDATE_F_ADMIN,
 			   NETLINK_CB(skb).portid);
 	neigh_release(neigh);
+	neigh_remove_one(neigh, tbl);
 
 out:
 	return err;
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index e9f3386..5264004 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1120,6 +1120,7 @@  static int arp_invalidate(struct net_device *dev, __be32 ip)
 					   NEIGH_UPDATE_F_OVERRIDE|
 					   NEIGH_UPDATE_F_ADMIN, 0);
 		neigh_release(neigh);
+		neigh_remove_one(neigh, &arp_tbl);
 	}
 
 	return err;