Message ID:   1496188346-83229-1-git-send-email-sowmini.varadhan@oracle.com
State:        Superseded, archived
Delegated to: David Miller
	Hello,

On Tue, 30 May 2017, Sowmini Varadhan wrote:

> @@ -1650,6 +1689,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
> 				   NEIGH_UPDATE_F_ADMIN,
> 				   NETLINK_CB(skb).portid);
> 		neigh_release(neigh);
> +		neigh_remove_one(neigh, tbl);

	So, we do not hold reference to neigh while accessing
its fields. I suspect we need to move the table lock from
neigh_remove_one here, for example:

+	write_lock_bh(&tbl->lock);
 	neigh_release(neigh);
+	neigh_remove_one(neigh, tbl);
+	write_unlock_bh(&tbl->lock);

	Otherwise, neigh_forced_gc can call neigh_destroy where
neigh is freed (and replaced by another neigh entry) after RCU
grace period (which can end prematurely if there are no running
RCU read-side critical sections) and I'm not sure if our thread
can be delayed, so that such pointer replacement can happen
unnoticed by us.

	Another solution to cause faster removal would be
to cancel the gc_work and to schedule it after 1 jiffie.
It helps when many entries are deleted at once but the
work prefers to just sleep when gc_thresh1 is not reached,
so such solution is not good enough.

Regards

--
Julian Anastasov <ja@ssi.bg>
On (06/01/17 00:41), Julian Anastasov wrote:
>
> 	So, we do not hold reference to neigh while accessing
> its fields. I suspect we need to move the table lock from
> neigh_remove_one here, for example:

good point, let me think over your suggestion carefully (it sounds
right, I want to make sure I don't miss any other race windows)
and post patch v4 tomorrow..

> 	Another solution to cause faster removal would be
> to cancel the gc_work and to schedule it after 1 jiffie.
> It helps when many entries are deleted at once but the
> work prefers to just sleep when gc_thresh1 is not reached,
> so such solution is not good enough.

Right, the other drawback of relying on gc for cleanup is
that it doesn't give higher preference to cleaning up FAILED
entries first, so it can end up cleaning up other useful entries
(as I was pointing out to David Ahern)

--Sowmini
On (06/01/17 00:41), Julian Anastasov wrote:
>
> 	So, we do not hold reference to neigh while accessing
> its fields. I suspect we need to move the table lock from
> neigh_remove_one here, for example:

Another thought is to have neigh_remove_one remove a neigh only
if it is NUD_FAILED - that may potentially remove more than one
entry, but that's probably harmless?

--Sowmini
	Hello,

On Wed, 31 May 2017, Sowmini Varadhan wrote:

> On (06/01/17 00:41), Julian Anastasov wrote:
> >
> > 	So, we do not hold reference to neigh while accessing
> > its fields. I suspect we need to move the table lock from
> > neigh_remove_one here, for example:
>
> good point, let me think over your suggestion carefully (it sounds
> right, I want to make sure I dont miss any other race-windows)
> and post patch v4 tomorrow..

	Another problem is that neigh_update() changes the
state but before we go and unlink the entry another CPU
can reactivate the entry, i.e. NUD_INCOMPLETE entered
in __neigh_event_send(). So, there will always be some
small window where surprises can happen and the entry is
not really deleted.

	Checking for NUD_FAILED may not be needed, the
atomic_read(&n->refcnt) == 1 check under lock ensures that
we are in some state without timer, i.e. not in NUD_INCOMPLETE,
so it is ok to remove the entry.

> > 	Another solution to cause faster removal would be
> > to cancel the gc_work and to schedule it after 1 jiffie.
> > It helps when many entries are deleted at once but the
> > work prefers to just sleep when gc_thresh1 is not reached,
> > so such solution is not good enough.
>
> Right the other drawback of relying on gc for cleanup is
> that it doesn't give higher preference to cleaning up FAILED
> entries first- so it can end up cleaning up other useful entries
> (as I was pointing out to David Ahern)

	I see, ok.

Regards

--
Julian Anastasov <ja@ssi.bg>
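[The refcnt-gated unlink Julian describes — remove the entry only when the table holds the sole reference and the state is not in an excluded mask — can be sketched as a self-contained userspace model. All names here are illustrative stand-ins, not the kernel API, and the locking is omitted to isolate the predicate itself.]

/* Userspace model of the neigh_del() test discussed above. */
#include <assert.h>
#include <stdio.h>

#define FAKE_NUD_PERMANENT 0x80

struct fake_neigh {
	int refcnt;               /* the table's own reference counts as 1 */
	unsigned char nud_state;
	int dead;
};

/* Unlink only when the table holds the sole reference and the
 * entry's state is not in the excluded mask. */
static int fake_neigh_del(struct fake_neigh *n, unsigned char state_mask)
{
	if (n->refcnt == 1 && !(n->nud_state & state_mask)) {
		n->dead = 1;
		return 1;
	}
	return 0;
}

int main(void)
{
	struct fake_neigh perm = { .refcnt = 1, .nud_state = FAKE_NUD_PERMANENT };
	struct fake_neigh busy = { .refcnt = 2, .nud_state = 0 };
	struct fake_neigh idle = { .refcnt = 1, .nud_state = 0 };

	/* the gc path excludes PERMANENT entries ... */
	assert(fake_neigh_del(&perm, FAKE_NUD_PERMANENT) == 0);
	/* ... while the admin-delete path passes 0 and may remove them */
	assert(fake_neigh_del(&perm, 0) == 1);
	/* an entry still referenced elsewhere is never unlinked */
	assert(fake_neigh_del(&busy, 0) == 0);
	assert(fake_neigh_del(&idle, 0) == 1);
	printf("ok\n");
	return 0;
}

[This mirrors Julian's point that the refcnt == 1 check alone already excludes NUD_INCOMPLETE entries, which hold an extra timer reference, so an explicit NUD_FAILED check is unnecessary.]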
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 30 May 2017 16:52:26 -0700
Just as a side note in the future, if you could fix the date
and time on your computer, that would be awesome :-)
On (06/01/17 01:41), Julian Anastasov wrote:
> 	Another problem is that neigh_update() changes the
> state but before we go and unlink the entry another CPU
> can reactivate the entry, i.e. NUD_INCOMPLETE entered
> in __neigh_event_send(). So, there will be always some
> small window where surprises can happen and the entry is
> not really deleted.

but that would be ok- it's the same as if I did

    arp -d <addr>
    ping <addr>

I think the only danger is that, once we drop the ref with
neigh_release(), we are relying on the table's lock for the neigh
to not disappear under us. Thus as you correctly pointed out, we
need the tbl->lock to make sure we sync with all paths that can
pull the neigh out of the table (and release the table's ref along
the way).

Thanks for catching that, patch v4 (with correct smtp timestamp!)
will be sent out shortly.

--Sowmini
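[The bracketing agreed on here — drop the lookup reference and unlink in one critical section under the table's write lock, so no other path can free or reuse the entry in between — can be sketched as a userspace model using a pthread rwlock in place of tbl->lock. All names are illustrative, not the kernel API.]

/* Userspace sketch of release + unlink done under the table lock. */
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static pthread_rwlock_t fake_tbl_lock = PTHREAD_RWLOCK_INITIALIZER;

struct fake_neigh {
	int refcnt;
	int unlinked;
};

static void fake_neigh_release(struct fake_neigh *n)
{
	n->refcnt--;
}

/* Unlink succeeds only once just the table's reference remains. */
static void fake_neigh_remove_one(struct fake_neigh *n)
{
	if (n->refcnt == 1)
		n->unlinked = 1;
}

int main(void)
{
	struct fake_neigh n = { .refcnt = 2 };	/* table ref + our lookup ref */

	pthread_rwlock_wrlock(&fake_tbl_lock);	/* stands in for write_lock_bh(&tbl->lock) */
	fake_neigh_release(&n);			/* drop our lookup reference */
	fake_neigh_remove_one(&n);		/* unlink while still holding the lock */
	pthread_rwlock_unlock(&fake_tbl_lock);

	assert(n.refcnt == 1 && n.unlinked);
	printf("ok\n");
	return 0;
}

[Holding the lock across both steps is what prevents a concurrent neigh_forced_gc from unlinking and freeing the entry between the release and the remove.]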
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e4dd3a2..639b675 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -317,6 +317,7 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, const void *pkey,
 int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, u32 flags,
 		 u32 nlmsg_pid);
 void __neigh_set_probe_once(struct neighbour *neigh);
+bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
 int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d274f81..de8bf76 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -118,6 +118,54 @@ unsigned long neigh_rand_reach_time(unsigned long base)
 EXPORT_SYMBOL(neigh_rand_reach_time);

+static bool neigh_del(struct neighbour *n, __u8 state,
+		      struct neighbour __rcu **np, struct neigh_table *tbl)
+{
+	bool retval = false;
+
+	write_lock(&n->lock);
+	if (atomic_read(&n->refcnt) == 1 && !(n->nud_state & state)) {
+		rcu_assign_pointer(*np,
+				   rcu_dereference_protected(n->next,
+					   lockdep_is_held(&tbl->lock)));
+		n->dead = 1;
+		retval = true;
+	}
+	write_unlock(&n->lock);
+	if (retval)
+		neigh_cleanup_and_release(n);
+	return retval;
+}
+
+bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl)
+{
+	struct neigh_hash_table *nht;
+	void *pkey = ndel->primary_key;
+	u32 hash_val;
+	struct neighbour *n;
+	struct neighbour __rcu **np;
+	bool retval = false;
+
+	write_lock_bh(&tbl->lock);
+	nht = rcu_dereference_protected(tbl->nht,
+					lockdep_is_held(&tbl->lock));
+	hash_val = tbl->hash(pkey, ndel->dev, nht->hash_rnd);
+	hash_val = hash_val >> (32 - nht->hash_shift);
+
+	np = &nht->hash_buckets[hash_val];
+	while ((n = rcu_dereference_protected(*np,
+					      lockdep_is_held(&tbl->lock))) != NULL) {
+		if (n == ndel) {
+			retval = neigh_del(n, 0, np, tbl);
+			write_unlock_bh(&tbl->lock);
+			return retval;
+		}
+		np = &n->next;
+	}
+	write_unlock_bh(&tbl->lock);
+	return retval;
+}
+
 static int neigh_forced_gc(struct neigh_table *tbl)
 {
 	int shrunk = 0;
@@ -140,19 +188,10 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 			 *	- nobody refers to it.
 			 *	- it is not permanent
 			 */
-			write_lock(&n->lock);
-			if (atomic_read(&n->refcnt) == 1 &&
-			    !(n->nud_state & NUD_PERMANENT)) {
-				rcu_assign_pointer(*np,
-					rcu_dereference_protected(n->next,
-						  lockdep_is_held(&tbl->lock)));
-				n->dead = 1;
-				shrunk	= 1;
-				write_unlock(&n->lock);
-				neigh_cleanup_and_release(n);
+			if (neigh_del(n, NUD_PERMANENT, np, tbl)) {
+				shrunk = 1;
 				continue;
 			}
-			write_unlock(&n->lock);
 			np = &n->next;
 		}
 	}
@@ -1650,6 +1689,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
 				   NEIGH_UPDATE_F_ADMIN,
 				   NETLINK_CB(skb).portid);
 		neigh_release(neigh);
+		neigh_remove_one(neigh, tbl);

 out:
 	return err;
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index e9f3386..5264004 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1120,6 +1120,7 @@ static int arp_invalidate(struct net_device *dev, __be32 ip)
 				NEIGH_UPDATE_F_OVERRIDE|
 				NEIGH_UPDATE_F_ADMIN, 0);
 		neigh_release(neigh);
+		neigh_remove_one(neigh, &arp_tbl);
 	}

 	return err;
The command

	# arp -s 62.2.0.1 a:b:c:d:e:f dev eth2

adds an entry like the following (listed by "arp -an")

	? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2

but the symmetric deletion command

	# arp -i eth2 -d 62.2.0.1

does not remove the PERM entry from the table, and instead leaves
behind

	? (62.2.0.1) at <incomplete> on eth2

The reason is that there is a refcnt of 1 for the arp_tbl itself
(neigh_alloc starts off the entry with a refcnt of 1), thus the
neigh_release() call from arp_invalidate() will (at best) just
decrement the ref to 1, but will never actually free the entry
from the table.

To fix this, we need to do something like neigh_forced_gc: if the
refcnt is 1 (i.e., only the table's ref is held), remove the entry
from the table and free it. This patch refactors and shares common
code between neigh_forced_gc and the newly added neigh_remove_one.

A similar issue exists for IPv6 Neighbor Cache entries, and is fixed
in a similar manner by this patch.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: kbuild-test-robot compile error
v3: do not export_symbol neigh_remove_one (David Miller comment)

 include/net/neighbour.h |    1 +
 net/core/neighbour.c    |   62 ++++++++++++++++++++++++++++++++--------
 net/ipv4/arp.c          |    1 +
 3 files changed, 53 insertions(+), 11 deletions(-)
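[The refcount bug the changelog describes can be reduced to a few lines of userspace C: neigh_alloc() starts the entry at refcnt 1 (the table's own reference), so arp_invalidate()'s neigh_release() alone can only drop it back to 1 and the entry is never freed. Names below are illustrative stand-ins, not the kernel API.]

/* Userspace model of the leak fixed by neigh_remove_one(). */
#include <assert.h>
#include <stdio.h>

struct fake_neigh {
	int refcnt;
	int freed;
};

static void fake_hold(struct fake_neigh *n)
{
	n->refcnt++;
}

static void fake_release(struct fake_neigh *n)
{
	if (--n->refcnt == 0)
		n->freed = 1;
}

int main(void)
{
	struct fake_neigh n = { .refcnt = 1 };	/* table's own reference */

	fake_hold(&n);		/* lookup in arp_invalidate() takes a ref */
	fake_release(&n);	/* pre-patch: back to 1, entry never freed */
	assert(n.refcnt == 1 && !n.freed);

	/* post-patch: neigh_remove_one() sees refcnt == 1, unlinks the
	 * entry and drops the table's reference, finally freeing it */
	fake_release(&n);
	assert(n.freed);
	printf("ok\n");
	return 0;
}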