[net] packet: hold bind lock when rebinding to fanout hook

Message ID 20170914211441.67326-1-willemb@google.com
State Accepted
Delegated to: David Miller
Headers show
Series
  • [net] packet: hold bind lock when rebinding to fanout hook
Related show

Commit Message

Willem de Bruijn Sept. 14, 2017, 9:14 p.m.
Packet socket bind operations must hold the po->bind_lock. This keeps
po->running consistent with whether the socket is actually on a ptype
list to receive packets.

fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
binds the fanout object to receive through packet_rcv_fanout.

Make it hold the po->bind_lock when testing po->running and rebinding.
Else, it can race with other rebind operations, such as that in
packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
can result in a socket being added to a fanout group twice, causing
use-after-free KASAN bug reports, among others.

Reported independently by both trinity and syzkaller.
Verified that the syzkaller reproducer passes after this patch.

Reported-by: nixioaming <nixiaoming@huawei.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Sept. 14, 2017, 10:53 p.m. | #1
On Thu, 2017-09-14 at 17:14 -0400, Willem de Bruijn wrote:
> Packet socket bind operations must hold the po->bind_lock. This keeps
> po->running consistent with whether the socket is actually on a ptype
> list to receive packets.
> 
> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
> binds the fanout object to receive through packet_rcv_fanout.
> 
> Make it hold the po->bind_lock when testing po->running and rebinding.
> Else, it can race with other rebind operations, such as that in
> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
> can result in a socket being added to a fanout group twice, causing
> use-after-free KASAN bug reports, among others.
> 
> Reported independently by both trinity and syzkaller.
> Verified that the syzkaller reproducer passes after this patch.
> 
> Reported-by: nixioaming <nixiaoming@huawei.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  net/packet/af_packet.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index c26172995511..d288f52c53f7 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1684,10 +1684,6 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>  
>  	mutex_lock(&fanout_mutex);
>  
> -	err = -EINVAL;
> -	if (!po->running)
> -		goto out;
> -
>  	err = -EALREADY;
>  	if (po->fanout)
>  		goto out;
> @@ -1749,7 +1745,10 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>  		list_add(&match->list, &fanout_list);
>  	}
>  	err = -EINVAL;
> -	if (match->type == type &&
> +
> +	spin_lock(&po->bind_lock);
> +	if (po->running &&
> +	    match->type == type &&
>  	    match->prot_hook.type == po->prot_hook.type &&
>  	    match->prot_hook.dev == po->prot_hook.dev) {
>  		err = -ENOSPC;
> @@ -1761,6 +1760,13 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>  			err = 0;
>  		}
>  	}
> +	spin_unlock(&po->bind_lock);
> +
> +	if (err && !refcount_read(&match->sk_ref)) {

It seems sk_ref is always read/changed under 
mutex_lock(&fanout_mutex) protection.

Not sure why we use a refcount_t (or an atomic_t in older kernels)

All these atomic/spinlock/mutexes are a maze. 

> +		list_del(&match->list);
> +		kfree(match);
> +	}
> +
>  out:
>  	if (err && rollover) {
>  		kfree(rollover);
Willem de Bruijn Sept. 15, 2017, 2:07 p.m. | #2
On Thu, Sep 14, 2017 at 5:14 PM, Willem de Bruijn <willemb@google.com> wrote:
> Packet socket bind operations must hold the po->bind_lock. This keeps
> po->running consistent with whether the socket is actually on a ptype
> list to receive packets.
>
> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
> binds the fanout object to receive through packet_rcv_fanout.
>
> Make it hold the po->bind_lock when testing po->running and rebinding.
> Else, it can race with other rebind operations, such as that in
> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
> can result in a socket being added to a fanout group twice, causing
> use-after-free KASAN bug reports, among others.
>
> Reported independently by both trinity and syzkaller.
> Verified that the syzkaller reproducer passes after this patch.
>

I forgot to add the Fixes tag, sorry.

Fixes: dc99f600698d ("packet: Add fanout support.")

> Reported-by: nixioaming <nixiaoming@huawei.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
Cong Wang Sept. 15, 2017, 6:33 p.m. | #3
On Thu, Sep 14, 2017 at 2:14 PM, Willem de Bruijn <willemb@google.com> wrote:
> Packet socket bind operations must hold the po->bind_lock. This keeps
> po->running consistent with whether the socket is actually on a ptype
> list to receive packets.
>
> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
> binds the fanout object to receive through packet_rcv_fanout.
>
> Make it hold the po->bind_lock when testing po->running and rebinding.
> Else, it can race with other rebind operations, such as that in
> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
> can result in a socket being added to a fanout group twice, causing
> use-after-free KASAN bug reports, among others.
>
> Reported independently by both trinity and syzkaller.
> Verified that the syzkaller reproducer passes after this patch.
>
> Reported-by: nixioaming <nixiaoming@huawei.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  net/packet/af_packet.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index c26172995511..d288f52c53f7 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1684,10 +1684,6 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>
>         mutex_lock(&fanout_mutex);
>
> -       err = -EINVAL;
> -       if (!po->running)
> -               goto out;
> -
>         err = -EALREADY;
>         if (po->fanout)
>                 goto out;
> @@ -1749,7 +1745,10 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>                 list_add(&match->list, &fanout_list);
>         }
>         err = -EINVAL;
> -       if (match->type == type &&
> +
> +       spin_lock(&po->bind_lock);
> +       if (po->running &&


As you move the po->running check later after setting po->rollover, I wonder
if po->rollover possibly depends on po>running on other path?


> +           match->type == type &&
>             match->prot_hook.type == po->prot_hook.type &&
>             match->prot_hook.dev == po->prot_hook.dev) {
>                 err = -ENOSPC;
> @@ -1761,6 +1760,13 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>                         err = 0;
>                 }
>         }
> +       spin_unlock(&po->bind_lock);
> +
> +       if (err && !refcount_read(&match->sk_ref)) {
> +               list_del(&match->list);
> +               kfree(match);
> +       }

This looks correct but still seems odd, it smells you don't use refcnt in an
expected way.
Willem de Bruijn Sept. 15, 2017, 8:01 p.m. | #4
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index c26172995511..d288f52c53f7 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1684,10 +1684,6 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>>
>>         mutex_lock(&fanout_mutex);
>>
>> -       err = -EINVAL;
>> -       if (!po->running)
>> -               goto out;
>> -
>>         err = -EALREADY;
>>         if (po->fanout)
>>                 goto out;
>> @@ -1749,7 +1745,10 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>>                 list_add(&match->list, &fanout_list);
>>         }
>>         err = -EINVAL;
>> -       if (match->type == type &&
>> +
>> +       spin_lock(&po->bind_lock);
>> +       if (po->running &&
>
>
> As you move the po->running check later after setting po->rollover, I wonder
> if po->rollover possibly depends on po>running on other path?

The rollover code does not explicitly check po->running, if that is
what you are concerned about.

A newly allocated po->rollover structure will also only be accessed
from the datapath once the socket is added to a fanout group. Both
before and after this patch, that happens well after the po->running
check.

If the socket already had a po->rollover, then it must already have
had a po->fanout, too, so fanout_add does not reach this code.

>> +           match->type == type &&
>>             match->prot_hook.type == po->prot_hook.type &&
>>             match->prot_hook.dev == po->prot_hook.dev) {
>>                 err = -ENOSPC;
>> @@ -1761,6 +1760,13 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
>>                         err = 0;
>>                 }
>>         }
>> +       spin_unlock(&po->bind_lock);
>> +
>> +       if (err && !refcount_read(&match->sk_ref)) {
>> +               list_del(&match->list);
>> +               kfree(match);
>> +       }
>
> This looks correct but still seems odd, it smells you don't use refcnt in an
> expected way.

It tests whether the object has no references, in which case it must have
been newly allocated and the fanout join operation must have failed.

I don't see an obviously simpler patch.

The entire code could perhaps be restructured eventually. There is,
for instance, no reason to test match->{type, prot_hook.type, prot_hook.dev}
when having just allocated the structure. Nor to test whether sk_ref exceeds
PACKET_FANOUT_MAX. Conversely, this sk_ref == 0 test only makes sense
when having taken the (!match) branch earlier.

But a refactor is out of scope for a bug fix.
David Miller Sept. 20, 2017, 9:03 p.m. | #5
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Fri, 15 Sep 2017 10:07:46 -0400

> On Thu, Sep 14, 2017 at 5:14 PM, Willem de Bruijn <willemb@google.com> wrote:
>> Packet socket bind operations must hold the po->bind_lock. This keeps
>> po->running consistent with whether the socket is actually on a ptype
>> list to receive packets.
>>
>> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
>> binds the fanout object to receive through packet_rcv_fanout.
>>
>> Make it hold the po->bind_lock when testing po->running and rebinding.
>> Else, it can race with other rebind operations, such as that in
>> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
>> can result in a socket being added to a fanout group twice, causing
>> use-after-free KASAN bug reports, among others.
>>
>> Reported independently by both trinity and syzkaller.
>> Verified that the syzkaller reproducer passes after this patch.
>>
> 
> I forgot to add the Fixes tag, sorry.
> 
> Fixes: dc99f600698d ("packet: Add fanout support.")

Applied and queued up for stable as it fixes this race and I can't
see any new problems it introduces.

But boy is this one messy area.

The scariest thing to me now is the save/restore sequence done by
packet_set_ring(), for example.

	spin_lock(&po->bind_lock);
	was_running = po->running;
	num = po->num;
	if (was_running) {
		po->num = 0;
		__unregister_prot_hook(sk, false);
	}
	spin_unlock(&po->bind_lock);
 ...
	spin_lock(&po->bind_lock);
	if (was_running) {
		po->num = num;
		register_prot_hook(sk);
	}
	spin_unlock(&po->bind_lock);

The socket is also locked during this sequence but that doesn't
prevent parallel changes to the running state.

Since po->bind_lock is dropped, it's possible for another thread
to grab bind_lock and bind it meanwhile.

The above code seems to assume that can't happen, and that
register_prot_hook() will always see po->running set to zero
and rebind the socket.

If the race happens we'll have weird state, because we did not
rebind yet we modified po->num.

We seem to have a hierachy of sleeping and non-sleeping locks
that do not work well together.
Willem de Bruijn Sept. 21, 2017, 9:10 p.m. | #6
On Wed, Sep 20, 2017 at 5:03 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Fri, 15 Sep 2017 10:07:46 -0400
>
>> On Thu, Sep 14, 2017 at 5:14 PM, Willem de Bruijn <willemb@google.com> wrote:
>>> Packet socket bind operations must hold the po->bind_lock. This keeps
>>> po->running consistent with whether the socket is actually on a ptype
>>> list to receive packets.
>>>
>>> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
>>> binds the fanout object to receive through packet_rcv_fanout.
>>>
>>> Make it hold the po->bind_lock when testing po->running and rebinding.
>>> Else, it can race with other rebind operations, such as that in
>>> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
>>> can result in a socket being added to a fanout group twice, causing
>>> use-after-free KASAN bug reports, among others.
>>>
>>> Reported independently by both trinity and syzkaller.
>>> Verified that the syzkaller reproducer passes after this patch.
>>>
>>
>> I forgot to add the Fixes tag, sorry.
>>
>> Fixes: dc99f600698d ("packet: Add fanout support.")
>
> Applied and queued up for stable as it fixes this race and I can't
> see any new problems it introduces.
>
> But boy is this one messy area.

Yeah, I've been staring at this code for a while now. But I don't see
an obvious way to simplify it. packet_notifier does not hold the socket
lock, so even if all of bind, set_ring and fanout_add do hold that,
bind_lock is still needed. packet_mmap may not take the socket lock,
so pg_vec_lock must remain to synchronize with packet_set_ring
even if for no other reason. I had a look at using rcu pointers for rx_ring
and tx_ring, to avoid taking that lock in the datapath and possibly updating
without the unbind/bind dance. But that update needs to be atomic with
purging the socket queue, so the socket must be properly quiesced.

> The scariest thing to me now is the save/restore sequence done by
> packet_set_ring(), for example.
>
>         spin_lock(&po->bind_lock);
>         was_running = po->running;
>         num = po->num;
>         if (was_running) {
>                 po->num = 0;
>                 __unregister_prot_hook(sk, false);
>         }
>         spin_unlock(&po->bind_lock);
>  ...
>         spin_lock(&po->bind_lock);
>         if (was_running) {
>                 po->num = num;
>                 register_prot_hook(sk);
>         }
>         spin_unlock(&po->bind_lock);
>
> The socket is also locked during this sequence but that doesn't
> prevent parallel changes to the running state.
>
> Since po->bind_lock is dropped, it's possible for another thread
> to grab bind_lock and bind it meanwhile.
>
> The above code seems to assume that can't happen, and that
> register_prot_hook() will always see po->running set to zero
> and rebind the socket.
>
> If the race happens we'll have weird state, because we did not
> rebind yet we modified po->num.

It appears that the only path that may try to bind without holding the
socket lock is packet_notifier. That skips register_prot_hook if !po->num.

> We seem to have a hierachy of sleeping and non-sleeping locks
> that do not work well together.

Given the number recent bugs that were fixed by locking the socket
inside a particular setsockopt case, I think that we should lock that
entire function, similar to other protocol families (after verifying that
all cases can indeed sleep).

Another issue that looks fragile is the test for po->fanout in
packet_do_bind before taking the socket lock.

Patch

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c26172995511..d288f52c53f7 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1684,10 +1684,6 @@  static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 
 	mutex_lock(&fanout_mutex);
 
-	err = -EINVAL;
-	if (!po->running)
-		goto out;
-
 	err = -EALREADY;
 	if (po->fanout)
 		goto out;
@@ -1749,7 +1745,10 @@  static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 		list_add(&match->list, &fanout_list);
 	}
 	err = -EINVAL;
-	if (match->type == type &&
+
+	spin_lock(&po->bind_lock);
+	if (po->running &&
+	    match->type == type &&
 	    match->prot_hook.type == po->prot_hook.type &&
 	    match->prot_hook.dev == po->prot_hook.dev) {
 		err = -ENOSPC;
@@ -1761,6 +1760,13 @@  static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 			err = 0;
 		}
 	}
+	spin_unlock(&po->bind_lock);
+
+	if (err && !refcount_read(&match->sk_ref)) {
+		list_del(&match->list);
+		kfree(match);
+	}
+
 out:
 	if (err && rollover) {
 		kfree(rollover);