diff mbox

[v2] net: fix a double free issue for neighbour entry

Message ID 1432085113-25063-1-git-send-email-ying.xue@windriver.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Ying Xue May 20, 2015, 1:25 a.m. UTC
Calling __ipv4_neigh_lookup_noref() inside rcu_read_lock_bh() can
guarantee that its searched neighbour entry is not freed before RCU
grace period, but it cannot ensure that its obtained neighbour will
be freed shortly. Exactly saying, it cannot prevent neigh_destroy()
from being executed on another context at the same time. For example,
if ip_finish_output2() continues to deliver a SKB with a neighbour
entry whose refcount is zero, neigh_add_timer() may be called in
neigh_resolve_output() subsequently. As a result, neigh_add_timer()
takes refcount on the neighbour that already had a refcount of zero.
When the neighbour refcount is put before the timer's handler is
exited, neigh_destroy() will be called again, meaning crash happens
at the moment.

To prevent the issue from occurring, we must check whether the refcount
of a neighbour searched by __ipv4_neigh_lookup_noref() is decremented
to zero or not. If it's zero, we should create a new one.

However, as reading neigh's refcount is unsafe through atomic_read()
like it doesn't imply any memory barrier and the cost is too expensive
if we enforce a proper implicit or explicit memory barrier on it,
another checking of identifying whether neigh's dead flag is set or not
is involved into __neigh_event_send() to further prevent neigh_add_timer()
from holding a neigh's refcount that already hit zero, thereby avoiding
what the issue cannot absolutely happen.

Reported-by: Joern Engel <joern@logfs.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 v2:
  - As Eric pointed that identifying whether neigh's refcnt is zero
    through atomic_read() is unsafe, another condition checking of
    verifying neigh's dead flag is set is involved into
    __neigh_event_send() to further prevent neigh_add_timer() from
    holding a neigh's refcnt that already hit zero.
  - Now the patch is created based on "net" tree considering it's
    a very fatal issue.

 net/core/neighbour.c |    3 +++
 net/ipv4/ip_output.c |    2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

Eric Dumazet May 20, 2015, 5:27 a.m. UTC | #1
On Wed, 2015-05-20 at 09:25 +0800, Ying Xue wrote:
> Calling __ipv4_neigh_lookup_noref() inside rcu_read_lock_bh() can
> guarantee that its searched neighbour entry is not freed before RCU
> grace period, but it cannot ensure that its obtained neighbour will
> be freed shortly. Exactly saying, it cannot prevent neigh_destroy()
> from being executed on another context at the same time. For example,
> if ip_finish_output2() continues to deliver a SKB with a neighbour
> entry whose refcount is zero, neigh_add_timer() may be called in
> neigh_resolve_output() subsequently. As a result, neigh_add_timer()
> takes refcount on the neighbour that already had a refcount of zero.
> When the neighbour refcount is put before the timer's handler is
> exited, neigh_destroy() will be called again, meaning crash happens
> at the moment.
> 
> To prevent the issue from occurring, we must check whether the refcount
> of a neighbour searched by __ipv4_neigh_lookup_noref() is decremented
> to zero or not. If it's zero, we should create a new one.
> 
> However, as reading neigh's refcount is unsafe through atomic_read()
> like it doesn't imply any memory barrier and the cost is too expensive
> if we enforce a proper implicit or explicit memory barrier on it,
> another checking of identifying whether neigh's dead flag is set or not
> is involved into __neigh_event_send() to further prevent neigh_add_timer()
> from holding a neigh's refcount that already hit zero, thereby avoiding
> what the issue cannot absolutely happen.
> 
> Reported-by: Joern Engel <joern@logfs.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> ---
>  v2:
>   - As Eric pointed that identifying whether neigh's refcnt is zero
>     through atomic_read() is unsafe, another condition checking of
>     verifying neigh's dead flag is set is involved into
>     __neigh_event_send() to further prevent neigh_add_timer() from
>     holding a neigh's refcnt that already hit zero.
>   - Now the patch is created based on "net" tree considering it's
>     a very fatal issue.
> 
>  net/core/neighbour.c |    3 +++
>  net/ipv4/ip_output.c |    2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 3de6542..c7a675c 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>  		goto out_unlock_bh;
>  
> +	if (neigh->dead)
> +		goto out_unlock_bh;
> +
>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65b93a..5889774 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>  	rcu_read_lock_bh();
>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
> -	if (unlikely(!neigh))
> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
>  		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
>  	if (!IS_ERR(neigh)) {
>  		int res = dst_neigh_output(dst, neigh, skb);

Sorry, this atomic_read() makes no sense to me.

When rcu is used, this is perfectly fine to use an object which refcount
is 0. If you believe the opposite, then point me to relevant
Documentation/RCU/ file.

refcount should only protect the object itself, not the use of it by a
rcu reader. This has nothing to do with rcu but standard refcounting of
objects.

The only forbidden thing would be to try to take a reference on it with
atomic_inc() instead of the more careful atomic_inc_not_zero(), if the
caller needs to exit the rcu_read_lock() section safely (as explained in
Documentation/RCU/rcuref.txt around line 52)

So far, dst_neigh_output() will not try to take a refcnt.


By the time you check atomic_read() the result is meaningless if another
cpu decrements the refcount to 0.

So what is the point, trying to reduce a race window but not properly
remove it ?

I repeat again : using atomic_read() in a rcu lookup is absolutely
useless and is a clear sign you missed a fundamental issue. 

So now, we are back to the patch I initially sent, but you did not
address Julian Anastasov feedback.

Really I am not very happy...


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ying Xue May 20, 2015, 7:01 a.m. UTC | #2
On 05/20/2015 01:27 PM, Eric Dumazet wrote:
> Sorry, this atomic_read() makes no sense to me.
> 
> When rcu is used, this is perfectly fine to use an object which refcount
> is 0. If you believe the opposite, then point me to relevant
> Documentation/RCU/ file.
> 

However, the reality for us is that rcu_read_lock() can guarantee that a neigh
object is not freed within the area covered by rcu read lock, but it cannot
prevent neigh_destroy() from being executed in another context at the same time.

> refcount should only protect the object itself, not the use of it by a
> rcu reader. This has nothing to do with rcu but standard refcounting of
> objects.
> 
> The only forbidden thing would be to try to take a reference on it with
> atomic_inc() instead of the more careful atomic_inc_not_zero(), if the
> caller needs to exit the rcu_read_lock() section safely (as explained in
> Documentation/RCU/rcuref.txt around line 52)
> 
> So far, dst_neigh_output() will not try to take a refcnt.
> 

Please let us take a look at the following log generated by the Joern Engel
written debug patch:

[ 4974.816012]  [<ffffffff8163baae>] dump_stack+0x19/0x1b
[ 4974.816017]  [<ffffffff8103f141>] warn_slowpath_common+0x61/0x80
[ 4974.816019]  [<ffffffff8103f21a>] warn_slowpath_null+0x1a/0x20
[ 4974.816021]  [<ffffffff8163de95>] neigh_hold.part.26+0x1e/0x27
[ 4974.816027]  [<ffffffff8155f08c>] neigh_add_timer+0x3c/0x60
[ 4974.816029]  [<ffffffff8155f1ab>] __neigh_event_send+0xfb/0x220
[ 4974.816031]  [<ffffffff8155f40b>] neigh_resolve_output+0x13b/0x220
[ 4974.816038]  [<ffffffff8158c950>] ip_finish_output+0x1b0/0x3b0
[ 4974.816040]  [<ffffffff8158ddd8>] ip_output+0x58/0x90
[ 4974.816042]  [<ffffffff8158d5a5>] ip_local_out+0x25/0x30
[ 4974.816045]  [<ffffffff8158d8f7>] ip_queue_xmit+0x137/0x380
[ 4974.816051]  [<ffffffff815a47a5>] tcp_transmit_skb+0x445/0x8a0
[ 4974.816054]  [<ffffffff815a4d40>] tcp_write_xmit+0x140/0xb00
[ 4974.816058]  [<ffffffff815a59ae>] __tcp_push_pending_frames+0x2e/0xc0
[ 4974.816062]  [<ffffffff81597198>] tcp_sendmsg+0x118/0xd90
[ 4974.816070]  [<ffffffff81278b55>] ? debug_object_deactivate+0x115/0x170
[ 4974.816076]  [<ffffffff815bf434>] inet_sendmsg+0x64/0xb0
[ 4974.816080]  [<ffffffff8153da56>] sock_sendmsg+0x76/0x90
[ 4974.816086]  [<ffffffff81046e89>] ? local_bh_enable_ip+0x89/0xf0

Above stack trace log tells us the following call chain happens:
ip_finish_output()
  ->ip_finish_output2()
    ->dst_neigh_output()
      ->neigh_resolve_output()
        ->__neigh_event_send()
          ->neigh_add_timer()
            ->neigh_hold()

Moreover, the below debug code is added in Joern Engel written debug patch :
+static inline void neigh_hold(struct neighbour *neigh)
+{
+	WARN_ON_ONCE(atomic_inc_return(&neigh->refcnt) == 1);
+}

So, we can identify above stack trace was printed by WARN_ON_ONCE(), which also
proves that dst_neigh_output() not only tries to take a refcnt of a neigh, but
also it the refcnt of neigh entry that dst_neigh_output() will be touched was
already decremented to zero.

Please consider the below scenario which is very similar to above case met by
Joern Engel:

Time 1:
CPU 0:
neigh_release()
  ->neigh_destroy() //it's called as neigh's refcnt is 0 now
    ->kfree_rcu(neigh, rcu);//freeing neigh object is delayed after a RCU
grace period

Time 2:
CPU 1:
ip_finish_output2()
  rcu_read_lock_bh()
  dst_neigh_output()
   __neigh_event_send()
    neigh_add_timer()
      neigh_hold() //refcnt of neigh is increased from 0 to 1;
     rcu_read_unlock_bh()

Time 3:
CPU 0:
On RCU_SOFTIRQ context:
rcu_process_callbacks()
  neigh object is really freed!

Time 4:
CPU 0:
Release the neigh whose refcnt was increased from 0 to zero.
neigh_release() is called by someone. As the neigh is already freed, panic
happens!!!

This is why I initially added the condition statement in ip_finish_output2() to
check whether the refcnt of neigh returned by __ipv4_neigh_lookup_noref() is
zero or not with atomic_read(). But as you pointed out, atomic_read() was not
safe for us, but we cannot use other atomic routines with implicit memory
barriers as they all increase or decease refcnt while checking refcnt is 0,
meanwhile, we cannot use explicit memory barrier as it seems too expensive for
the hot path. So I try to add another condition to prevent neigh_add_timer()
called by __neigh_event_send() from being take a refcnt which is already zero.

> 
> By the time you check atomic_read() the result is meaningless if another
> cpu decrements the refcount to 0.
> 

If you think checking if refcnt is zero is meaningless in ip_finish_output2(),
why does __ipv4_neigh_lookup() use atomic_inc_not_zero() instead of directly
using neigh_hold() in __ipv4_neigh_lookup()?

Regards,
Ying

> So what is the point, trying to reduce a race window but not properly
> remove it ?
> 
> I repeat again : using atomic_read() in a rcu lookup is absolutely
> useless and is a clear sign you missed a fundamental issue. 
> 
> So now, we are back to the patch I initially sent, but you did not
> address Julian Anastasov feedback.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov May 20, 2015, 7:35 a.m. UTC | #3
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>  		goto out_unlock_bh;
>  
> +	if (neigh->dead)
> +		goto out_unlock_bh;
> +

	Returning 0 in all cases is wrong. May be you can goto
to another new label where nud_state check can allow valid
address to be used. See my idea:

http://marc.info/?l=linux-netdev&m=142816363503402&w=2

	I.e. return 0 for NUD_STALE, drop skb and return 1
otherwise.

>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65b93a..5889774 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>  	rcu_read_lock_bh();
>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
> -	if (unlikely(!neigh))
> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))

	You should forget about refcnt. kfree_rcu in neigh_destroy
will not free neigh while RCU readers are present. Still,
neigh_destroy runs in parallel with readers and I hope they can
use the stored address safely. I mean, neigh_event_send allowing 
neigh_resolve_output to use address in NUD_STALE state on dead=1
by returning 0 and reaching the dev_queue_xmit call.

	Still, another inefficiency remains: how can
__neigh_event_send indicate to ip_finish_output2 that
dead=1 entry is [to be] removed and new entry needs to
be created. It seems the ->output method returns code
but skb is freed in all cases. The result is that we
drop single skb in such condition. But such optimization
should not be part of this bugfix.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov May 20, 2015, 8:07 a.m. UTC | #4
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> On 05/20/2015 01:27 PM, Eric Dumazet wrote:
> > Sorry, this atomic_read() makes no sense to me.
> > 
> > When rcu is used, this is perfectly fine to use an object which refcount
> > is 0. If you believe the opposite, then point me to relevant
> > Documentation/RCU/ file.
> > 
> 
> However, the reality for us is that rcu_read_lock() can guarantee that a neigh
> object is not freed within the area covered by rcu read lock, but it cannot
> prevent neigh_destroy() from being executed in another context at the same time.

	The situation is that one writer decided that
entry is to be removed. Reader comes and tries to become
second writer. It should check refcnt==0 or dead==1 as
in your last patch, always under write_lock. Second and next
writers should not try to change state, timers, etc.
Such writers are possible only if they were readers because
only they can find entry that is unlinked by another writer.

	And we want to keep the readers free of any memory
barriers as they can cost hundreds of clocks. We are lucky
that the neigh states allow RCU readers to run without any 
atomic_inc_not_zero calls because memory barriers are not
cheap.

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ying Xue May 20, 2015, 8:39 a.m. UTC | #5
On 05/20/2015 04:07 PM, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Wed, 20 May 2015, Ying Xue wrote:
> 
>> On 05/20/2015 01:27 PM, Eric Dumazet wrote:
>>> Sorry, this atomic_read() makes no sense to me.
>>>
>>> When rcu is used, this is perfectly fine to use an object which refcount
>>> is 0. If you believe the opposite, then point me to relevant
>>> Documentation/RCU/ file.
>>>
>>
>> However, the reality for us is that rcu_read_lock() can guarantee that a neigh
>> object is not freed within the area covered by rcu read lock, but it cannot
>> prevent neigh_destroy() from being executed in another context at the same time.
> 
> 	The situation is that one writer decided that
> entry is to be removed. Reader comes and tries to become
> second writer. It should check refcnt==0 or dead==1 as
> in your last patch, always under write_lock.

Yes, this way is absolutely safe for us! But, for example, if we check refcnt==0
in write_lock protection, the checking is a bit late. Instead, if the checking
is advanced to ip_finish_output2(), we can _early_ find the fact that we cannot
use the neigh entry looked up by __ipv4_neigh_lookup_noref(). Of course, doing
the check with atomic_read() is unsafe _really_, but once atomic_read() returned
value tells us neigh refcnt is zero, the result is absolutely true. This is
because refcnt is always decremented from a certain value which is bigger than 0
to 0. Therefore, if atomic_read() tells us a neigh's refcnt is 0, we should
definitely create a new one; on the contrary, if it tells us a neigh's refcnt is
not zero, it doesn't mean the refcnt is really equal to 0 because atomic_read()
might read a stale refcnt value. In this situation, the condition of
!atomic_read(&neigh->refcnt)) is really useless for us. This is why I try to
involve another condition check of dead==1 to prevent it from happening in
version 2. Meanwhile, as the checking of dead==1 is conducted under write lock,
this is absolutely safe for us.

 Second and next
> writers should not try to change state, timers, etc.
> Such writers are possible only if they were readers because
> only they can find entry that is unlinked by another writer.
> 
> 	And we want to keep the readers free of any memory
> barriers as they can cost hundreds of clocks. We are lucky
> that the neigh states allow RCU readers to run without any 
> atomic_inc_not_zero calls because memory barriers are not
> cheap.
> 

Yes, I agreed with you.

Regards,
Ying

> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ying Xue May 20, 2015, 9:22 a.m. UTC | #6
On 05/20/2015 03:35 PM, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Wed, 20 May 2015, Ying Xue wrote:
> 
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>>  		goto out_unlock_bh;
>>  
>> +	if (neigh->dead)
>> +		goto out_unlock_bh;
>> +
> 
> 	Returning 0 in all cases is wrong.

Good catch!
Yes, you are right. We should return 1 especially when neigh->dead == 1.

 May be you can goto
> to another new label where nud_state check can allow valid
> address to be used. See my idea:
> 
> http://marc.info/?l=linux-netdev&m=142816363503402&w=2
> 
> 	I.e. return 0 for NUD_STALE, drop skb and return 1
> otherwise.
> 
>>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index c65b93a..5889774 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>>  	rcu_read_lock_bh();
>>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
>> -	if (unlikely(!neigh))
>> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
> 
> 	You should forget about refcnt. kfree_rcu in neigh_destroy
> will not free neigh while RCU readers are present. Still,
> neigh_destroy runs in parallel with readers and I hope they can
> use the stored address safely.

I know touching neigh entry in ip_finish_output2() without identifying if
atomic_read(&neigh->refcnt)==0 under RCU lock is absolutely safe for us. But in
some extent we can say the issue is not much a closely relationship to RCU lock.
The key problem for us is how efficiently we can prevent neigh refcnt from being
increased from 0 to 1 on the path of dst_neigh_output().

Regards,
Ying

 I mean, neigh_event_send allowing
> neigh_resolve_output to use address in NUD_STALE state on dead=1
> by returning 0 and reaching the dev_queue_xmit call.
> 
> 	Still, another inefficiency remains: how can
> __neigh_event_send indicate to ip_finish_output2 that
> dead=1 entry is [to be] removed and new entry needs to
> be created. It seems the ->output method returns code
> but skb is freed in all cases. The result is that we
> drop single skb in such condition. But such optimization
> should not be part of this bugfix.
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 20, 2015, 10:45 a.m. UTC | #7
On Wed, 2015-05-20 at 15:01 +0800, Ying Xue wrote:

> Time 4:
> CPU 0:
> Release the neigh whose refcnt was increased from 0 to zero.
> neigh_release() is called by someone. As the neigh is already freed, panic
> happens!!!

There is a bug for sure, although I never saw it ever happening.

I am only saying your 'fix' is not appropriate.

You focus on refcnt which is the wrong thing to test in this context.

Even if we decided to remove RCU locking and go back to rwlock, refcnt
would not be the thing to test.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov May 20, 2015, 7:19 p.m. UTC | #8
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> Yes, this way is absolutely safe for us! But, for example, if we check refcnt==0
> in write_lock protection, the checking is a bit late. Instead, if the checking
> is advanced to ip_finish_output2(), we can _early_ find the fact that we cannot
> use the neigh entry looked up by __ipv4_neigh_lookup_noref(). Of course, doing
> the check with atomic_read() is unsafe _really_, but once atomic_read() returned
> value tells us neigh refcnt is zero, the result is absolutely true. This is

	The problem is that this atomic_read is just an
extra code that works in rare cases. Modern CPUs have
128-byte cache lines and 'dead', 'refcnt' and 'next'
can be in same cache line. If you take into account the
write-back write policy used by CPU caches and the cache
locking effects, in most of the cases when such reader
finds entry in list, the refcnt will be > 0 and dead
will be 0. The reason: write-back policy makes the whole
cache line public to other CPUs at once. So, without
explicit barriers other CPUs can see data after delay and the
order of stores in same cache line is not observed.
As result, pure reader can not see 0 in refcnt, even if you
put more atomic_read calls at this place. atomic_read contains
only hints to compiler.

	So, we must ensure that if a writer decides to
remove entry other writers should not change things.
More observations for the writers:

- neigh_lookup is initially a reader bur it can increase refcnt
at any time, even while another CPU is under write_lock.
This makes all refcnt checks in neigh_forced_gc,
neigh_flush_dev and neigh_periodic_work racy because
neigh_cleanup_and_release is called after write_unlock.
If a writer decides to remove the entry, other write
operations may try to modify the entry because they
rely on their own reference. What can happen is that
refcnt is 1 when it is decided to remove the entry but
the refcnt can be increased by concurrent writers
such as neigh_lookup callers.

	As result, neigh_lookup followed by neigh_update may
work with dead=1 entry. May be the fix is to add
check for dead flag in neigh_update to avoid modifications
for removed entry, after the both write_lock_bh calls.
This again happens later, i.e. __neigh_lookup calls
neigh_lookup (which does not notice the dead flag early),
we miss to call neigh_create and then neigh_update will fail
on dead==1.

- neigh_timer_handler looks safe, may be it does not
need check for dead flag because if neigh_flush_dev calls
neigh_del_timer and del_timer in neigh_del_timer fails
with 0 (due to other CPU stuck at write_lock in
neigh_timer_handler) the 'if (!(state & NUD_IN_TIMER))'
check in neigh_timer_handler should succeed because
neigh_flush_dev will change nud_state on success of the
'if (atomic_read(&n->refcnt) != 1)' check.

- __neigh_event_send: needs check for dead flag, as
already discussed

Regards

--
Julian Anastasov <ja@ssi.bg>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3de6542..c7a675c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -958,6 +958,9 @@  int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
 		goto out_unlock_bh;
 
+	if (neigh->dead)
+		goto out_unlock_bh;
+
 	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
 		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
 		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a..5889774 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -200,7 +200,7 @@  static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
 	rcu_read_lock_bh();
 	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
 	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
-	if (unlikely(!neigh))
+	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
 		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		int res = dst_neigh_output(dst, neigh, skb);