diff mbox series

[net] af_key: free SKBs under RCU protection

Message ID 1537402712-12875-1-git-send-email-stranche@codeaurora.org
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series [net] af_key: free SKBs under RCU protection | expand

Commit Message

Sean Tranchetti Sept. 20, 2018, 12:18 a.m. UTC
pfkey_broadcast() can make calls to pfkey_broadcast_one(), which
will clone or copy the passed-in SKB. This new SKB will be assigned
the sock_rfree() function as its destructor, which requires that
the socket referenced by the SKB is still valid when the SKB is freed.

If this SKB is ever passed to pfkey_broadcast() again by some other
function (such as pfkey_dump() or pfkey_promisc) it will then be
freed there. However, since this free occurs outside of RCU protection,
it is possible that userspace could close the socket and trigger
pfkey_release() to free the socket before sock_rfree() can run, creating
the following race condition:

1: An SKB belonging to the pfkey socket is passed to pfkey_broadcast().
   It performs the broadcast to any other sockets, and calls
   rcu_read_unlock(), but does not yet reach kfree_skb().
2: Userspace closes the socket, triggering pfkey_release(). Since no one
   holds the RCU lock, synchronize_rcu() returns and it is allowed to
   continue. It calls sock_put() to free the socket.
3: pfkey_broadcast() now calls kfree_skb() on the original SKB it was
   passed, triggering a call to sock_rfree(). This function now accesses
   the freed struct sock * via skb->sk, and attempts to update invalid
   memory (see the sketch below).
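
For reference, a minimal sketch of the window, paraphrasing the
pre-patch tail of pfkey_broadcast() as shown in the diff below:

	/* ... broadcast loop over net_pfkey->table, under RCU ... */
	rcu_read_unlock();	/* step 1 ends: sockets no longer RCU-protected */

	if (one_sk != NULL)
		err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);

	/* step 2 can run here: pfkey_release() -> sock_put() frees the socket */

	kfree_skb(skb2);
	kfree_skb(skb);		/* step 3: sock_rfree() dereferences freed skb->sk */
	return err;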

By making pfkey_broadcast() free the SKBs while it still holds the RCU
lock, we ensure that the socket remains valid when the SKB is freed,
avoiding crashes like the following:

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
[006b6b6b6b6b6c4b] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
task: fffffff78f65b380 task.stack: ffffff8049a88000
pc : sock_rfree+0x38/0x6c
lr : skb_release_head_state+0x6c/0xcc
Process repro (pid: 7117, stack limit = 0xffffff8049a88000)
Call trace:
	sock_rfree+0x38/0x6c
	skb_release_head_state+0x6c/0xcc
	skb_release_all+0x1c/0x38
	__kfree_skb+0x1c/0x30
	kfree_skb+0xd0/0xf4
	pfkey_broadcast+0x14c/0x18c
	pfkey_sendmsg+0x1d8/0x408
	sock_sendmsg+0x44/0x60
	___sys_sendmsg+0x1d0/0x2a8
	__sys_sendmsg+0x64/0xb4
	SyS_sendmsg+0x34/0x4c
	el0_svc_naked+0x34/0x38
Kernel panic - not syncing: Fatal exception

Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
---
 net/key/af_key.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet Sept. 20, 2018, 1:29 p.m. UTC | #1
On 09/19/2018 05:18 PM, Sean Tranchetti wrote:
> [changelog, oops, and diffstat trimmed; see the commit message above]
> diff --git a/net/key/af_key.c b/net/key/af_key.c
> index 9d61266..dd257c7 100644
> --- a/net/key/af_key.c
> +++ b/net/key/af_key.c
> @@ -275,13 +275,13 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
>  		if ((broadcast_flags & BROADCAST_REGISTERED) && err)
>  			err = err2;
>  	}
> -	rcu_read_unlock();
>  
>  	if (one_sk != NULL)
>  		err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
>  
>  	kfree_skb(skb2);
>  	kfree_skb(skb);
> +	rcu_read_unlock();
>  	return err;
>  }
>  
> 

I do not believe the changelog or the patch makes sense.

Having an skb still referencing a socket prevents this socket from being released.

If you think about it, what would prevent the freeing from happening
_before_ the rcu_read_lock() in pfkey_broadcast()?

Maybe the correct fix is that pfkey_broadcast_one() should ensure the socket is still valid.

I would suggest something like:

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2,
 {
        int err = -ENOBUFS;
 
-       sock_hold(sk);
+       if (!refcount_inc_not_zero(&sk->sk_refcnt))
+               return -ENOENT;
+
        if (*skb2 == NULL) {
                if (refcount_read(&skb->users) != 1) {
                        *skb2 = skb_clone(skb, allocation);
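
(refcount_inc_not_zero() is the standard way to take a reference to an
RCU-protected object that may be concurrently freed: unlike sock_hold(),
it fails instead of resurrecting a refcount that has already dropped to
zero. See the sketch under #3 below.)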
Sean Tranchetti Sept. 20, 2018, 7:25 p.m. UTC | #2
> 
> I do not believe the changelog or the patch makes sense.
> 
> Having skb still referencing a socket prevents this socket being 
> released.
> 
> If you think about it, what would prevent the freeing happening
> _before_ the rcu_read_lock() in pfkey_broadcast() ?
> 
> Maybe the correct fix is that pfkey_broadcast_one() should ensure the
> socket is still valid.
> 
> I would suggest something like :
> 
> diff --git a/net/key/af_key.c b/net/key/af_key.c
> index
> 9d61266526e767770d9a1ce184ac8cdd59de309a..5ce309d020dda5e46e4426c4a639bfb551e2260d
> 100644
> --- a/net/key/af_key.c
> +++ b/net/key/af_key.c
> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff
> *skb, struct sk_buff **skb2,
>  {
>         int err = -ENOBUFS;
> 
> -       sock_hold(sk);
> +       if (!refcount_inc_not_zero(&sk->sk_refcnt))
> +               return -ENOENT;
> +
>         if (*skb2 == NULL) {
>                 if (refcount_read(&skb->users) != 1) {
>                         *skb2 = skb_clone(skb, allocation);

Hi Eric,

I'm not sure that the socket getting freed before the rcu_read_lock()
would be an issue, since then it would no longer be in the
net_pfkey->table that we loop through (since we call pfkey_remove()
from pfkey_release()). Because of that, all the sockets processed in
pfkey_broadcast_one() have valid refcounts, so checking for zero there
doesn't prevent the crash that I'm seeing.

However, after going over the call flow again, I see that the actual
problem occurs because of pfkey_broadcast_one(). Specifically, because
of this check:

	if (*skb2 == NULL) {
		if (refcount_read(&skb->users) != 1) {
			*skb2 = skb_clone(skb, allocation);
		} else {
			*skb2 = skb;
			refcount_inc(&skb->users);
		}
	}

Since we always pass a freshly cloned SKB to this function, skb->users
is always 1, and skb2 just becomes skb. We then set skb2 (and thus skb)
to belong to the socket.

If the socket we queue skb2 to frees this SKB (thereby decrementing its
refcount to 1) and the socket is freed before pfkey_broadcast() can
execute the kfree_skb(skb) on line 284, we will then attempt to run
sock_rfree() on an SKB with a dangling reference to this socket.

Perhaps a cleaner solution here is to always clone the SKB in
pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
its destructor, and we can avoid this race condition.
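
A minimal sketch of that variant (hypothetical; compare Eric's fuller
rework in #8 below) would replace the aliasing branch in
pfkey_broadcast_one() with an unconditional clone:

	/* Always hand the receiving socket a private clone. skb_clone()
	 * does not propagate the destructor, so skb_set_owner_r() is only
	 * ever applied to the clone and the caller's skb never ends up
	 * with sock_rfree() attached. */
	if (*skb2 == NULL)
		*skb2 = skb_clone(skb, allocation);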
Eric Dumazet Sept. 20, 2018, 10:10 p.m. UTC | #3
On 09/20/2018 12:25 PM, stranche@codeaurora.org wrote:
>> [quote of #1 trimmed]
> 
> Hi Eric,
> 
> I'm not sure that the socket getting freed before the rcu_read_lock() would
> be an issue, since then it would no longer be in the net_pfkey->table that
> we loop through (since we call pfkey_remove() from pfkey_release()). Because of
> that, all the sockets processed in pfkey_broadcast_one() have valid refcounts,
> so checking for zero there doesn't prevent the crash that I'm seeing.
> 
> However, after going over the call flow again, I see that the actual problem
> occurs because of pfkey_broadcast_one(). Specifically, because of this check:
> 
>     if (*skb2 == NULL) {
>         if (refcount_read(&skb->users) != 1) {
>             *skb2 = skb_clone(skb, allocation);
>         } else {
>             *skb2 = skb;
>             refcount_inc(&skb->users);
>         }
>     }
> 
> Since we always pass a freshly cloned SKB to this function, skb->users is
> always 1, and skb2 just becomes skb. We then set skb2 (and thus skb) to
> belong to the socket.
> 
> If the socket we queue skb2 to frees this SKB (thereby decrementing its
> refcount to 1) and the socket is freed before pfkey_broadcast() can
> execute the kfree_skb(skb) on line 284, we will then attempt to run
> sock_rfree() on an SKB with a dangling reference to this socket.
> 
> Perhaps a cleaner solution here is to always clone the SKB in
> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
> its destructor, and we can avoid this race condition.

As long as one skb has sock_rfree as its destructor, the socket attached to
this skb cannot be released. There is no race here.

Note that skb_clone() does not propagate the destructor.

The issue here is that in the RCU lookup, we can find a socket that has been
dismantled, with a 0 refcount.

We must not use sock_hold() in this case, since we are not sure the socket refcount is not already 0.

pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules.

When, during an RCU lookup, one wants to increment an object's refcount, one
needs to be extra careful, as I did in my proposal.
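
The general pattern looks like this (a sketch of the usual RCU lookup
rule, not af_key-specific code; lookup() is a placeholder):

	struct sock *sk;

	rcu_read_lock();
	sk = lookup(...);	/* may find an object whose refcount has
				 * already dropped to zero */
	if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
		sk = NULL;	/* object is being freed: treat as not found */
	rcu_read_unlock();
	/* if sk is non-NULL we now own a reference and may use the socket
	 * beyond the RCU section; drop it later with sock_put(sk) */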

Note that the race could be automatically detected with CONFIG_REFCOUNT_FULL=y
Eric Dumazet Sept. 20, 2018, 10:29 p.m. UTC | #4
On 09/20/2018 03:10 PM, Eric Dumazet wrote:
> [full quote of #3 trimmed]

Bug was added in commit 7f6b9dbd5afb ("af_key: locking change")
Sean Tranchetti Sept. 21, 2018, 5:09 p.m. UTC | #5
>> [quote of #3 trimmed]
> 
> Bug was added in commit 7f6b9dbd5afb ("af_key: locking change")

Hi Eric,

I tried your refcount idea below, but it still results in the same crash.

>>>> --- a/net/key/af_key.c
>>>> +++ b/net/key/af_key.c
>>>> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff
>>>> *skb, struct sk_buff **skb2,
>>>>  {
>>>>         int err = -ENOBUFS;
>>>> 
>>>> -       sock_hold(sk);
>>>> +       if (!refcount_inc_not_zero(&sk->sk_refcnt))
>>>> +               return -ENOENT;
>>>> +
>>>>         if (*skb2 == NULL) {
>>>>                 if (refcount_read(&skb->users) != 1) {
>>>>                         *skb2 = skb_clone(skb, allocation);

I also tried reverting 7f6b9dbd5afb ("af_key: locking change") and
running the test there, and I still see the crash, so it doesn't seem
to be an RCU-specific issue.

Is there anything else that could be causing this?
Eric Dumazet Sept. 21, 2018, 5:40 p.m. UTC | #6
On 09/21/2018 10:09 AM, stranche@codeaurora.org wrote:

> I also tried reverting 7f6b9dbd5afb ("af_key: locking change") and running the
> test there and I still see the crash, so it doesn't seem to be an RCU specific
> issue.
> 
> Is there anything else that could be causing this?

What about sharing your repro?
Sean Tranchetti Sept. 21, 2018, 6:44 p.m. UTC | #7
On 2018-09-21 11:40, Eric Dumazet wrote:
> On 09/21/2018 10:09 AM, stranche@codeaurora.org wrote:
> 
>> I also tried reverting 7f6b9dbd5afb ("af_key: locking change") and 
>> running the
>> test there and I still see the crash, so it doesn't seem to be an RCU 
>> specific
>> issue.
>> 
>> Is there anything else that could be causing this?
> 
> What about you share your repro ?

Sure. Syzkaller reproducer source is attached.
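
(For reference: the repro's key calls, in execute_call() below, are
socket(0xf /* AF_KEY */, 3 /* SOCK_RAW */, 2 /* PF_KEY_V2 */) and
sendmsg() of a 16-byte sadb_msg whose leading bytes 02 0b encode
version PF_KEY_V2 and message type 11, i.e. SADB_X_PROMISC, which is
the pfkey_promisc path mentioned in the changelog.)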
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#define _GNU_SOURCE
#include <endian.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdlib.h>
#include <errno.h>
#include <signal.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <sys/prctl.h>
#include <dirent.h>
#include <sys/mount.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_ether.h>
#include <linux/if_tun.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <net/if_arp.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <linux/net.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/mount.h>

__attribute__((noreturn)) static void doexit(int status) {
  volatile unsigned i;
  syscall(__NR_exit_group, status);
  for (i = 0;; i++) {
  }
}
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

const int kFailStatus = 67;
const int kRetryStatus = 69;

  static void fail(const char* msg, ...) {
  int e = errno;
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit((e == ENOMEM || e == EAGAIN) ? kRetryStatus : kFailStatus);
}

  static void exitf(const char* msg, ...) {
  int e = errno;
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(kRetryStatus);
}

static __thread int skip_segv;
static __thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* uctx) {
  uintptr_t addr = (uintptr_t) info->si_addr;
  const uintptr_t prog_start = 1 << 20;
  const uintptr_t prog_end = 100 << 20;
  if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED) &&
      (addr < prog_start || addr > prog_end)) {
        _longjmp(segv_env, 1);
  }
    doexit(sig);
}

static void install_segv_handler() {
  struct sigaction sa;

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8);
  syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8);

  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_NODEFER | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...) { __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST); if (_setjmp(segv_env) == 0) { __VA_ARGS__; } __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST); }

static uint64_t current_time_ms() {
  struct timespec ts;

  if (clock_gettime(CLOCK_MONOTONIC, &ts)) fail("clock_gettime failed");
  return (uint64_t) ts.tv_sec * 1000 + (uint64_t) ts.tv_nsec / 1000000;
}

static void use_temporary_dir() {
  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir) fail("failed to mkdtemp");
  if (chmod(tmpdir, 0777)) fail("failed to chmod");
  if (chdir(tmpdir)) fail("failed to chdir");
}

static void vsnprintf_check(char* str, size_t size, const char* format,
                            va_list args) {
  int rv;

  rv = vsnprintf(str, size, format, args);
  if (rv < 0) fail("tun: snprintf failed");
  if ((size_t) rv >= size)
    fail("tun: string '%s...' doesn't fit into buffer", str);
}

static void snprintf_check(char* str, size_t size, const char* format, ...) {
  va_list args;

  va_start(args, format);
  vsnprintf_check(str, size, format, args);
  va_end(args);
}

#define COMMAND_MAX_LEN 128
#define PATH_PREFIX "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin "
#define PATH_PREFIX_LEN (sizeof(PATH_PREFIX) - 1)

static void execute_command(bool panic, const char* format, ...) {
  va_list args;
  char command[PATH_PREFIX_LEN + COMMAND_MAX_LEN];
  int rv;

  va_start(args, format);
  memcpy(command, PATH_PREFIX, PATH_PREFIX_LEN);
  vsnprintf_check(command + PATH_PREFIX_LEN, COMMAND_MAX_LEN, format, args);
  va_end(args);
  rv = system(command);
  if (rv) {
    if (panic) fail("command '%s' failed: %d", &command[0], rv);
      }
}

static int tunfd = -1;
static int tun_frags_enabled;

#define SYZ_TUN_MAX_PACKET_SIZE 1000

#define TUN_IFACE "syz_tun"

#define LOCAL_MAC "aa:aa:aa:aa:aa:aa"
#define REMOTE_MAC "aa:aa:aa:aa:aa:bb"

#define LOCAL_IPV4 "172.20.20.170"
#define REMOTE_IPV4 "172.20.20.187"

#define LOCAL_IPV6 "fe80::aa"
#define REMOTE_IPV6 "fe80::bb"

#define IFF_NAPI 0x0010
#define IFF_NAPI_FRAGS 0x0020

static void initialize_tun(void) {
  tunfd = open("/dev/net/tun", O_RDWR | O_NONBLOCK);
  if (tunfd == -1) {
    printf("tun: can't open /dev/net/tun: please enable CONFIG_TUN=y\n");
    printf("otherwise fuzzing or reproducing might not work as intended\n");
    return;
  }
  const int kTunFd = 252;
  if (dup2(tunfd, kTunFd) < 0) fail("dup2(tunfd, kTunFd) failed");
  close(tunfd);
  tunfd = kTunFd;

  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, TUN_IFACE, IFNAMSIZ);
  ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI | IFF_NAPI_FRAGS;
  if (ioctl(tunfd, TUNSETIFF, (void*)&ifr) < 0) {
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    if (ioctl(tunfd, TUNSETIFF, (void*)&ifr) < 0)
      fail("tun: ioctl(TUNSETIFF) failed");
  }
  if (ioctl(tunfd, TUNGETIFF, (void*)&ifr) < 0)
    fail("tun: ioctl(TUNGETIFF) failed");
  tun_frags_enabled = (ifr.ifr_flags & IFF_NAPI_FRAGS) != 0;

  execute_command(1, "sysctl -w net.ipv6.conf.%s.accept_dad=0", TUN_IFACE);

  execute_command(1, "sysctl -w net.ipv6.conf.%s.router_solicitations=0",
                  TUN_IFACE);

  execute_command(1, "ip link set dev %s address %s", TUN_IFACE, LOCAL_MAC);
  execute_command(1, "ip addr add %s/24 dev %s", LOCAL_IPV4, TUN_IFACE);
  execute_command(1, "ip -6 addr add %s/120 dev %s", LOCAL_IPV6, TUN_IFACE);
  execute_command(1, "ip neigh add %s lladdr %s dev %s nud permanent",
                  REMOTE_IPV4, REMOTE_MAC, TUN_IFACE);
  execute_command(1, "ip -6 neigh add %s lladdr %s dev %s nud permanent",
                  REMOTE_IPV6, REMOTE_MAC, TUN_IFACE);
  execute_command(1, "ip link set dev %s up", TUN_IFACE);
}

#define DEV_IPV4 "172.20.20.%d"
#define DEV_IPV6 "fe80::%02hx"
#define DEV_MAC "aa:aa:aa:aa:aa:%02hx"

static void initialize_netdevices(void) {
  unsigned i;
  const char* devtypes[] = { "ip6gretap", "bridge", "vcan", "bond", "team" };
  const char* devnames[] = {
    "lo", "sit0", "bridge0", "vcan0", "tunl0", "gre0", "gretap0", "ip_vti0",
    "ip6_vti0", "ip6tnl0", "ip6gre0", "ip6gretap0", "erspan0", "bond0", "veth0",
    "veth1", "team0", "veth0_to_bridge", "veth1_to_bridge", "veth0_to_bond",
    "veth1_to_bond", "veth0_to_team", "veth1_to_team"
  };
  const char* devmasters[] = { "bridge", "bond", "team" };

  for (i = 0; i < sizeof(devtypes) / (sizeof(devtypes[0])); i++)
    execute_command(0, "ip link add dev %s0 type %s", devtypes[i], devtypes[i]);
  execute_command(0, "ip link add type veth");

  for (i = 0; i < sizeof(devmasters) / (sizeof(devmasters[0])); i++) {
    execute_command(
        0, "ip link add name %s_slave_0 type veth peer name veth0_to_%s",
        devmasters[i], devmasters[i]);
    execute_command(
        0, "ip link add name %s_slave_1 type veth peer name veth1_to_%s",
        devmasters[i], devmasters[i]);
    execute_command(0, "ip link set %s_slave_0 master %s0", devmasters[i],
                    devmasters[i]);
    execute_command(0, "ip link set %s_slave_1 master %s0", devmasters[i],
                    devmasters[i]);
    execute_command(0, "ip link set veth0_to_%s up", devmasters[i]);
    execute_command(0, "ip link set veth1_to_%s up", devmasters[i]);
  }
  execute_command(0, "ip link set bridge_slave_0 up");
  execute_command(0, "ip link set bridge_slave_1 up");

  for (i = 0; i < sizeof(devnames) / (sizeof(devnames[0])); i++) {
    char addr[32];
    snprintf_check(addr, sizeof(addr), DEV_IPV4, i + 10);
    execute_command(0, "ip -4 addr add %s/24 dev %s", addr, devnames[i]);
    snprintf_check(addr, sizeof(addr), DEV_IPV6, i + 10);
    execute_command(0, "ip -6 addr add %s/120 dev %s", addr, devnames[i]);
    snprintf_check(addr, sizeof(addr), DEV_MAC, i + 10);
    execute_command(0, "ip link set dev %s address %s", devnames[i], addr);
    execute_command(0, "ip link set dev %s up", devnames[i]);
  }
}

static int read_tun(char* data, int size) {
  if (tunfd < 0) return -1;

  int rv = read(tunfd, data, size);
  if (rv < 0) {
    if (errno == EAGAIN) return -1;
    if (errno == EBADFD) return -1;
    fail("tun: read failed with %d", rv);
  }
  return rv;
}

static void flush_tun() {
  char data[SYZ_TUN_MAX_PACKET_SIZE];
  while (read_tun(&data[0], sizeof(data)) != -1)
    ;
}

static bool write_file(const char* file, const char* what, ...) {
  char buf[1024];
  va_list args;
  va_start(args, what);
  vsnprintf(buf, sizeof(buf), what, args);
  va_end(args);
  buf[sizeof(buf) - 1] = 0;
  int len = strlen(buf);

  int fd = open(file, O_WRONLY | O_CLOEXEC);
  if (fd == -1) return false;
  if (write(fd, buf, len) != len) {
    int err = errno;
    close(fd);
    errno = err;
    return false;
  }
  close(fd);
  return true;
}

static void setup_cgroups() {
  if (mkdir("/syzcgroup", 0777)) {
      }
  if (mkdir("/syzcgroup/unified", 0777)) {
      }
  if (mount("none", "/syzcgroup/unified", "cgroup2", 0, NULL)) {
      }
  if (chmod("/syzcgroup/unified", 0777)) {
      }
  if (!write_file("/syzcgroup/unified/cgroup.subtree_control",
                  "+cpu +memory +io +pids +rdma")) {
      }
  if (mkdir("/syzcgroup/cpu", 0777)) {
      }
  if (mount("none", "/syzcgroup/cpu", "cgroup", 0,
            "cpuset,cpuacct,perf_event,hugetlb")) {
      }
  if (!write_file("/syzcgroup/cpu/cgroup.clone_children", "1")) {
      }
  if (chmod("/syzcgroup/cpu", 0777)) {
      }
  if (mkdir("/syzcgroup/net", 0777)) {
      }
  if (mount("none", "/syzcgroup/net", "cgroup", 0,
            "net_cls,net_prio,devices,freezer")) {
      }
  if (chmod("/syzcgroup/net", 0777)) {
      }
}

static void setup_binfmt_misc() {
  if (!write_file("/proc/sys/fs/binfmt_misc/register",
                  ":syz0:M:0:syz0::./file0:")) {
      }
  if (!write_file("/proc/sys/fs/binfmt_misc/register",
                  ":syz1:M:1:yz1::./file0:POC")) {
      }
}

static void loop();

static void sandbox_common() {
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  setsid();

  struct rlimit rlim;
  rlim.rlim_cur = rlim.rlim_max = 160 << 20;
  setrlimit(RLIMIT_AS, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 8 << 20;
  setrlimit(RLIMIT_MEMLOCK, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 136 << 20;
  setrlimit(RLIMIT_FSIZE, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;
  setrlimit(RLIMIT_STACK, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 0;
  setrlimit(RLIMIT_CORE, &rlim);

  /*if (unshare(CLONE_NEWNS)) {
      }
  if (unshare(CLONE_NEWIPC)) {
      }
  if (unshare(0x02000000)) {
      }
  if (unshare(CLONE_NEWUTS)) {
      }
  if (unshare(CLONE_SYSVSEM)) {
      }*/
}

static int do_sandbox_none(void) {
  int pid = fork();
  if (pid < 0) fail("sandbox fork failed");
  if (pid) return pid;

  sandbox_common();
  initialize_tun();

  loop();
  doexit(1);
}

#define XT_TABLE_SIZE 1536
#define XT_MAX_ENTRIES 10

struct xt_counters {
  uint64_t pcnt, bcnt;
};

struct ipt_getinfo {
  char name[32];
  unsigned int valid_hooks;
  unsigned int hook_entry[5];
  unsigned int underflow[5];
  unsigned int num_entries;
  unsigned int size;
};

struct ipt_get_entries {
  char name[32];
  unsigned int size;
  void* entrytable[XT_TABLE_SIZE / sizeof(void*)];
};

struct ipt_replace {
  char name[32];
  unsigned int valid_hooks;
  unsigned int num_entries;
  unsigned int size;
  unsigned int hook_entry[5];
  unsigned int underflow[5];
  unsigned int num_counters;
  struct xt_counters* counters;
  char entrytable[XT_TABLE_SIZE];
};

struct ipt_table_desc {
  const char* name;
  struct ipt_getinfo info;
  struct ipt_replace replace;
};

static struct ipt_table_desc ipv4_tables[] = {
  {.name = "filter" }, {.name = "nat" }, {.name = "mangle" }, {.name = "raw" },
  {.name = "security" },
};

static struct ipt_table_desc ipv6_tables[] = {
  {.name = "filter" }, {.name = "nat" }, {.name = "mangle" }, {.name = "raw" },
  {.name = "security" },
};

#define IPT_BASE_CTL 64
#define IPT_SO_SET_REPLACE (IPT_BASE_CTL)
#define IPT_SO_GET_INFO (IPT_BASE_CTL)
#define IPT_SO_GET_ENTRIES (IPT_BASE_CTL + 1)

struct arpt_getinfo {
  char name[32];
  unsigned int valid_hooks;
  unsigned int hook_entry[3];
  unsigned int underflow[3];
  unsigned int num_entries;
  unsigned int size;
};

struct arpt_get_entries {
  char name[32];
  unsigned int size;
  void* entrytable[XT_TABLE_SIZE / sizeof(void*)];
};

struct arpt_replace {
  char name[32];
  unsigned int valid_hooks;
  unsigned int num_entries;
  unsigned int size;
  unsigned int hook_entry[3];
  unsigned int underflow[3];
  unsigned int num_counters;
  struct xt_counters* counters;
  char entrytable[XT_TABLE_SIZE];
};

struct arpt_table_desc {
  const char* name;
  struct arpt_getinfo info;
  struct arpt_replace replace;
};

static struct arpt_table_desc arpt_tables[] = { {.name = "filter" }, };

#define ARPT_BASE_CTL 96
#define ARPT_SO_SET_REPLACE (ARPT_BASE_CTL)
#define ARPT_SO_GET_INFO (ARPT_BASE_CTL)
#define ARPT_SO_GET_ENTRIES (ARPT_BASE_CTL + 1)

static void checkpoint_iptables(struct ipt_table_desc* tables, int num_tables,
                                int family, int level) {
  struct ipt_get_entries entries;
  socklen_t optlen;
  int fd, i;

  fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) {
    switch (errno) {
      case EAFNOSUPPORT:
      case ENOPROTOOPT:
        return;
    }
    fail("socket(%d, SOCK_STREAM, IPPROTO_TCP)", family);
  }
  for (i = 0; i < num_tables; i++) {
    struct ipt_table_desc* table = &tables[i];
    strcpy(table->info.name, table->name);
    strcpy(table->replace.name, table->name);
    optlen = sizeof(table->info);
    if (getsockopt(fd, level, IPT_SO_GET_INFO, &table->info, &optlen)) {
      switch (errno) {
        case EPERM:
        case ENOENT:
        case ENOPROTOOPT:
          continue;
      }
      fail("getsockopt(IPT_SO_GET_INFO)");
    }
  
    if (table->info.size > sizeof(table->replace.entrytable))
      fail("table size is too large: %u", table->info.size);
    if (table->info.num_entries > XT_MAX_ENTRIES)
      fail("too many counters: %u", table->info.num_entries);
    memset(&entries, 0, sizeof(entries));
    strcpy(entries.name, table->name);
    entries.size = table->info.size;
    optlen = sizeof(entries) - sizeof(entries.entrytable) + table->info.size;
    if (getsockopt(fd, level, IPT_SO_GET_ENTRIES, &entries, &optlen))
      fail("getsockopt(IPT_SO_GET_ENTRIES)");
    table->replace.valid_hooks = table->info.valid_hooks;
    table->replace.num_entries = table->info.num_entries;
    table->replace.size = table->info.size;
    memcpy(table->replace.hook_entry, table->info.hook_entry,
           sizeof(table->replace.hook_entry));
    memcpy(table->replace.underflow, table->info.underflow,
           sizeof(table->replace.underflow));
    memcpy(table->replace.entrytable, entries.entrytable, table->info.size);
  }
  close(fd);
}

static void reset_iptables(struct ipt_table_desc* tables, int num_tables,
                           int family, int level) {
  struct xt_counters counters[XT_MAX_ENTRIES];
  struct ipt_get_entries entries;
  struct ipt_getinfo info;
  socklen_t optlen;
  int fd, i;

  fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) {
    switch (errno) {
      case EAFNOSUPPORT:
      case ENOPROTOOPT:
        return;
    }
    fail("socket(%d, SOCK_STREAM, IPPROTO_TCP)", family);
  }
  for (i = 0; i < num_tables; i++) {
    struct ipt_table_desc* table = &tables[i];
    if (table->info.valid_hooks == 0) continue;
    memset(&info, 0, sizeof(info));
    strcpy(info.name, table->name);
    optlen = sizeof(info);
    if (getsockopt(fd, level, IPT_SO_GET_INFO, &info, &optlen))
      fail("getsockopt(IPT_SO_GET_INFO)");
    if (memcmp(&table->info, &info, sizeof(table->info)) == 0) {
      memset(&entries, 0, sizeof(entries));
      strcpy(entries.name, table->name);
      entries.size = table->info.size;
      optlen = sizeof(entries) - sizeof(entries.entrytable) + entries.size;
      if (getsockopt(fd, level, IPT_SO_GET_ENTRIES, &entries, &optlen))
        fail("getsockopt(IPT_SO_GET_ENTRIES)");
      if (memcmp(table->replace.entrytable, entries.entrytable,
                 table->info.size) == 0)
        continue;
    }
        table->replace.num_counters = info.num_entries;
    table->replace.counters = counters;
    optlen = sizeof(table->replace) - sizeof(table->replace.entrytable) +
             table->replace.size;
    if (setsockopt(fd, level, IPT_SO_SET_REPLACE, &table->replace, optlen))
      fail("setsockopt(IPT_SO_SET_REPLACE)");
  }
  close(fd);
}

static void checkpoint_arptables(void) {
  struct arpt_get_entries entries;
  socklen_t optlen;
  unsigned i;
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) fail("socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)");
  for (i = 0; i < sizeof(arpt_tables) / sizeof(arpt_tables[0]); i++) {
    struct arpt_table_desc* table = &arpt_tables[i];
    strcpy(table->info.name, table->name);
    strcpy(table->replace.name, table->name);
    optlen = sizeof(table->info);
    if (getsockopt(fd, SOL_IP, ARPT_SO_GET_INFO, &table->info, &optlen)) {
      switch (errno) {
        case EPERM:
        case ENOENT:
        case ENOPROTOOPT:
          continue;
      }
      fail("getsockopt(ARPT_SO_GET_INFO)");
    }
    if (table->info.size > sizeof(table->replace.entrytable))
      fail("table size is too large: %u", table->info.size);
    if (table->info.num_entries > XT_MAX_ENTRIES)
      fail("too many counters: %u", table->info.num_entries);
    memset(&entries, 0, sizeof(entries));
    strcpy(entries.name, table->name);
    entries.size = table->info.size;
    optlen = sizeof(entries) - sizeof(entries.entrytable) + table->info.size;
    if (getsockopt(fd, SOL_IP, ARPT_SO_GET_ENTRIES, &entries, &optlen))
      fail("getsockopt(ARPT_SO_GET_ENTRIES)");
    table->replace.valid_hooks = table->info.valid_hooks;
    table->replace.num_entries = table->info.num_entries;
    table->replace.size = table->info.size;
    memcpy(table->replace.hook_entry, table->info.hook_entry,
           sizeof(table->replace.hook_entry));
    memcpy(table->replace.underflow, table->info.underflow,
           sizeof(table->replace.underflow));
    memcpy(table->replace.entrytable, entries.entrytable, table->info.size);
  }
  close(fd);
}

static void reset_arptables() {
  struct xt_counters counters[XT_MAX_ENTRIES];
  struct arpt_get_entries entries;
  struct arpt_getinfo info;
  socklen_t optlen;
  unsigned i;
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) fail("socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)");
  for (i = 0; i < sizeof(arpt_tables) / sizeof(arpt_tables[0]); i++) {
    struct arpt_table_desc* table = &arpt_tables[i];
    if (table->info.valid_hooks == 0) continue;
    memset(&info, 0, sizeof(info));
    strcpy(info.name, table->name);
    optlen = sizeof(info);
    if (getsockopt(fd, SOL_IP, ARPT_SO_GET_INFO, &info, &optlen))
      fail("getsockopt(ARPT_SO_GET_INFO)");
    if (memcmp(&table->info, &info, sizeof(table->info)) == 0) {
      memset(&entries, 0, sizeof(entries));
      strcpy(entries.name, table->name);
      entries.size = table->info.size;
      optlen = sizeof(entries) - sizeof(entries.entrytable) + entries.size;
      if (getsockopt(fd, SOL_IP, ARPT_SO_GET_ENTRIES, &entries, &optlen))
        fail("getsockopt(ARPT_SO_GET_ENTRIES)");
      if (memcmp(table->replace.entrytable, entries.entrytable,
                 table->info.size) == 0)
        continue;
    }
        table->replace.num_counters = info.num_entries;
    table->replace.counters = counters;
    optlen = sizeof(table->replace) - sizeof(table->replace.entrytable) +
             table->replace.size;
    if (setsockopt(fd, SOL_IP, ARPT_SO_SET_REPLACE, &table->replace, optlen))
      fail("setsockopt(ARPT_SO_SET_REPLACE)");
  }
  close(fd);
}
#include <linux/if.h>
#include <linux/netfilter_bridge/ebtables.h>

struct ebt_table_desc {
  const char* name;
  struct ebt_replace replace;
  char entrytable[XT_TABLE_SIZE];
};

static struct ebt_table_desc ebt_tables[] = {
  {.name = "filter" }, {.name = "nat" }, {.name = "broute" },
};

static void checkpoint_ebtables(void) {
  socklen_t optlen;
  unsigned i;
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) fail("socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)");
  for (i = 0; i < sizeof(ebt_tables) / sizeof(ebt_tables[0]); i++) {
    struct ebt_table_desc* table = &ebt_tables[i];
    strcpy(table->replace.name, table->name);
    optlen = sizeof(table->replace);
    if (getsockopt(fd, SOL_IP, EBT_SO_GET_INIT_INFO, &table->replace,
                   &optlen)) {
      switch (errno) {
        case EPERM:
        case ENOENT:
        case ENOPROTOOPT:
          continue;
      }
      fail("getsockopt(EBT_SO_GET_INIT_INFO)");
    }
      if (table->replace.entries_size > sizeof(table->entrytable))
      fail("table size is too large: %u", table->replace.entries_size);
    table->replace.num_counters = 0;
    table->replace.entries = table->entrytable;
    optlen = sizeof(table->replace) + table->replace.entries_size;
    if (getsockopt(fd, SOL_IP, EBT_SO_GET_INIT_ENTRIES, &table->replace,
                   &optlen))
      fail("getsockopt(EBT_SO_GET_INIT_ENTRIES)");
  }
  close(fd);
}

static void reset_ebtables() {
  struct ebt_replace replace;
  char entrytable[XT_TABLE_SIZE];
  socklen_t optlen;
  unsigned i, j, h;
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (fd == -1) fail("socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)");
  for (i = 0; i < sizeof(ebt_tables) / sizeof(ebt_tables[0]); i++) {
    struct ebt_table_desc* table = &ebt_tables[i];
    if (table->replace.valid_hooks == 0) continue;
    memset(&replace, 0, sizeof(replace));
    strcpy(replace.name, table->name);
    optlen = sizeof(replace);
    if (getsockopt(fd, SOL_IP, EBT_SO_GET_INFO, &replace, &optlen))
      fail("getsockopt(EBT_SO_GET_INFO)");
    replace.num_counters = 0;
    table->replace.entries = 0;
    for (h = 0; h < NF_BR_NUMHOOKS; h++)
      table->replace.hook_entry[h] = 0;
    if (memcmp(&table->replace, &replace, sizeof(table->replace)) == 0) {
      memset(&entrytable, 0, sizeof(entrytable));
      replace.entries = entrytable;
      optlen = sizeof(replace) + replace.entries_size;
      if (getsockopt(fd, SOL_IP, EBT_SO_GET_ENTRIES, &replace, &optlen))
        fail("getsockopt(EBT_SO_GET_ENTRIES)");
      if (memcmp(table->entrytable, entrytable, replace.entries_size) == 0)
        continue;
    }
        for (j = 0, h = 0; h < NF_BR_NUMHOOKS; h++) {
      if (table->replace.valid_hooks & (1 << h)) {
        table->replace.hook_entry[h] =
            (struct ebt_entries*)table->entrytable + j;
        j++;
      }
    }
    table->replace.entries = table->entrytable;
    optlen = sizeof(table->replace) + table->replace.entries_size;
    if (setsockopt(fd, SOL_IP, EBT_SO_SET_ENTRIES, &table->replace, optlen))
      fail("setsockopt(EBT_SO_SET_ENTRIES)");
  }
  close(fd);
}

static void checkpoint_net_namespace(void) {
  checkpoint_ebtables();
  checkpoint_arptables();
  checkpoint_iptables(ipv4_tables, sizeof(ipv4_tables) / sizeof(ipv4_tables[0]),
                      AF_INET, SOL_IP);
  checkpoint_iptables(ipv6_tables, sizeof(ipv6_tables) / sizeof(ipv6_tables[0]),
                      AF_INET6, SOL_IPV6);
}

static void reset_net_namespace(void) {
  reset_ebtables();
  reset_arptables();
  reset_iptables(ipv4_tables, sizeof(ipv4_tables) / sizeof(ipv4_tables[0]),
                 AF_INET, SOL_IP);
  reset_iptables(ipv6_tables, sizeof(ipv6_tables) / sizeof(ipv6_tables[0]),
                 AF_INET6, SOL_IPV6);
}

static void remove_dir(const char* dir) {
  DIR* dp;
  struct dirent* ep;
  int iter = 0;
retry:
  while (umount2(dir, MNT_DETACH) == 0) {
      }
  dp = opendir(dir);
  if (dp == NULL) {
    if (errno == EMFILE) {
      exitf("opendir(%s) failed due to NOFILE, exiting", dir);
    }
    exitf("opendir(%s) failed", dir);
  }
  while ((ep = readdir(dp))) {
    if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0) continue;
    char filename[FILENAME_MAX];
    snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
    struct stat st;
    if (lstat(filename, &st)) exitf("lstat(%s) failed", filename);
    if (S_ISDIR(st.st_mode)) {
      remove_dir(filename);
      continue;
    }
    int i;
    for (i = 0;; i++) {
            if (unlink(filename) == 0) break;
      if (errno == EROFS) {
                break;
      }
      if (errno != EBUSY || i > 100) exitf("unlink(%s) failed", filename);
            if (umount2(filename, MNT_DETACH)) exitf("umount(%s) failed", filename);
    }
  }
  closedir(dp);
  int i;
  for (i = 0;; i++) {
        if (rmdir(dir) == 0) break;
    if (i < 100) {
      if (errno == EROFS) {
                break;
      }
      if (errno == EBUSY) {
                if (umount2(dir, MNT_DETACH)) exitf("umount(%s) failed", dir);
        continue;
      }
      if (errno == ENOTEMPTY) {
        if (iter < 100) {
          iter++;
          goto retry;
        }
      }
    }
    exitf("rmdir(%s) failed", dir);
  }
}

static void execute_one();
extern unsigned long long procid;

static void loop() {
  if(0)
        checkpoint_net_namespace();
  char cgroupdir[64];
  snprintf(cgroupdir, sizeof(cgroupdir), "/syzcgroup/unified/syz%llu", procid);
  char cgroupdir_cpu[64];
  snprintf(cgroupdir_cpu, sizeof(cgroupdir_cpu), "/syzcgroup/cpu/syz%llu",
           procid);
  char cgroupdir_net[64];
  snprintf(cgroupdir_net, sizeof(cgroupdir_net), "/syzcgroup/net/syz%llu",
           procid);
  if (mkdir(cgroupdir, 0777)) {
      }
  if (mkdir(cgroupdir_cpu, 0777)) {
      }
  if (mkdir(cgroupdir_net, 0777)) {
      }
  int pid = getpid();
  char procs_file[128];
  snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", cgroupdir);
  if (!write_file(procs_file, "%d", pid)) {
      }
  snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", cgroupdir_cpu);
  if (!write_file(procs_file, "%d", pid)) {
      }
  snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", cgroupdir_net);
  if (!write_file(procs_file, "%d", pid)) {
      }
  int iter;
  for (iter = 0;; iter++) {
    char cwdbuf[32];
    sprintf(cwdbuf, "./%d", iter);
    if (mkdir(cwdbuf, 0777)) fail("failed to mkdir");
    int pid = fork();
    if (pid < 0) fail("clone failed");
    if (pid == 0) {
      prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
      setpgrp();
      if (chdir(cwdbuf)) fail("failed to chdir");
      if (symlink(cgroupdir, "./cgroup")) {
              }
      if (symlink(cgroupdir_cpu, "./cgroup.cpu")) {
              }
      if (symlink(cgroupdir_net, "./cgroup.net")) {
              }
      flush_tun();
      execute_one();
            doexit(0);
    }

    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      int res = waitpid(-1, &status, __WALL | WNOHANG);
      if (res == pid) {
                break;
      }
      usleep(1000);
      if (current_time_ms() - start < 3 * 1000) continue;
                  kill(-pid, SIGKILL);
      kill(pid, SIGKILL);
      while (waitpid(-1, &status, __WALL) != pid) {
      }
      break;
    }
    remove_dir(cwdbuf);
    if(0)
        reset_net_namespace();
  }
}

struct thread_t {
  int created, running, call;
  pthread_t th;
};

static struct thread_t threads[16];
static void execute_call(int call);
static int running;
static int collide;

static void* thr(void* arg) {
  struct thread_t* th = (struct thread_t*)arg;
  for (;;) {
    while (!__atomic_load_n(&th->running, __ATOMIC_ACQUIRE))
      syscall(SYS_futex, &th->running, FUTEX_WAIT, 0, 0);
    execute_call(th->call);
    __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
    __atomic_store_n(&th->running, 0, __ATOMIC_RELEASE);
    syscall(SYS_futex, &th->running, FUTEX_WAKE);
  }
  return 0;
}

static void execute(int num_calls) {
  int call, thread;
  running = 0;
  for (call = 0; call < num_calls; call++) {
    for (thread = 0; thread < sizeof(threads) / sizeof(threads[0]); thread++) {
      struct thread_t* th = &threads[thread];
      if (!th->created) {
        th->created = 1;
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 128 << 10);
        pthread_create(&th->th, &attr, thr, th);
      }
      if (!__atomic_load_n(&th->running, __ATOMIC_ACQUIRE)) {
        th->call = call;
        __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
        __atomic_store_n(&th->running, 1, __ATOMIC_RELEASE);
        syscall(SYS_futex, &th->running, FUTEX_WAKE);
        if (collide && call % 2) break;
        struct timespec ts;
        ts.tv_sec = 0;
        ts.tv_nsec = 20 * 1000 * 1000;
        syscall(SYS_futex, &th->running, FUTEX_WAIT, 1, &ts);
        if (running) usleep((call == num_calls - 1) ? 10000 : 1000);
        break;
      }
    }
  }
}

#ifndef __NR_mmap
#define __NR_mmap 222
#endif
#ifndef __NR_socket
#define __NR_socket 198
#endif
#ifndef __NR_sendmsg
#define __NR_sendmsg 211
#endif

uint64_t r[1] = {0xffffffffffffffff};
unsigned long long procid;
void execute_call(int call)
{
        long res;
        switch (call) {
        case 0:
                res = syscall(__NR_socket, 0xf, 3, 2);
                if (res != -1)
                                r[0] = res;
                break;
        case 1:
                NONFAILING(*(uint64_t*)0x20000180 = 0);
                NONFAILING(*(uint32_t*)0x20000188 = 0);
                NONFAILING(*(uint64_t*)0x20000190 = 0x20000340);
                NONFAILING(*(uint64_t*)0x20000340 = 0x20000080);
                NONFAILING(memcpy((void*)0x20000080, "\x02\x0b\x80\x01\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 16));
                NONFAILING(*(uint64_t*)0x20000348 = 0x10);
                NONFAILING(*(uint64_t*)0x20000198 = 1);
                NONFAILING(*(uint64_t*)0x200001a0 = 0);
                NONFAILING(*(uint64_t*)0x200001a8 = 0);
                NONFAILING(*(uint32_t*)0x200001b0 = 0);
                syscall(__NR_sendmsg, r[0], 0x20000180, 0);
                break;
        }
}

void execute_one()
{
        syscall(SYS_write, 1, "executing program\n", strlen("executing program\n"));
        execute(2);
        collide = 1;
        execute(2);
}

int main()
{
        syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
        char *cwd = get_current_dir_name();
        for (procid = 0; procid < 18; procid++) {
                if (fork() == 0) {
                        install_segv_handler();
                        for (;;) {
                                if (chdir(cwd))
                                        fail("failed to chdir");
                                use_temporary_dir();
                                int pid = do_sandbox_none();
                                int status = 0;
                                while (waitpid(pid, &status, __WALL) != pid) {}
                        }
                }
        }
        sleep(1000000);
        return 0;
}
Eric Dumazet Sept. 23, 2018, 5:15 p.m. UTC | #8
On 09/20/2018 12:25 PM, stranche@codeaurora.org wrote:

> Perhaps a cleaner solution here is to always clone the SKB in
> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
> its destructor, and we can avoid this race condition.

Yes, this whole idea of avoiding the cloning is brain dead.

Better to play it safe and have a straightforward implementation.

I suggest something like this (I could not reproduce the bug with the syzkaller repro).

Note that I removed the sock_hold(sk)/sock_put(sk) pair, as it is useless:
the only time GFP_KERNEL might be used is when the sk is already owned by the caller.


 net/key/af_key.c |   40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d61266526e767770d9a1ce184ac8cdd59de309a..7da629d5971712d5219528c55bad869bb084a343 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -196,30 +196,22 @@ static int pfkey_release(struct socket *sock)
 	return 0;
 }
 
-static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2,
-			       gfp_t allocation, struct sock *sk)
+static int pfkey_broadcast_one(struct sk_buff *skb, gfp_t allocation,
+			       struct sock *sk)
 {
 	int err = -ENOBUFS;
 
-	sock_hold(sk);
-	if (*skb2 == NULL) {
-		if (refcount_read(&skb->users) != 1) {
-			*skb2 = skb_clone(skb, allocation);
-		} else {
-			*skb2 = skb;
-			refcount_inc(&skb->users);
-		}
-	}
-	if (*skb2 != NULL) {
-		if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf) {
-			skb_set_owner_r(*skb2, sk);
-			skb_queue_tail(&sk->sk_receive_queue, *skb2);
-			sk->sk_data_ready(sk);
-			*skb2 = NULL;
-			err = 0;
-		}
+	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+		return err;
+
+	skb = skb_clone(skb, allocation);
+
+	if (skb) {
+		skb_set_owner_r(skb, sk);
+		skb_queue_tail(&sk->sk_receive_queue, skb);
+		sk->sk_data_ready(sk);
+		err = 0;
 	}
-	sock_put(sk);
 	return err;
 }
 
@@ -234,7 +226,6 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 {
 	struct netns_pfkey *net_pfkey = net_generic(net, pfkey_net_id);
 	struct sock *sk;
-	struct sk_buff *skb2 = NULL;
 	int err = -ESRCH;
 
 	/* XXX Do we need something like netlink_overrun?  I think
@@ -253,7 +244,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 		 * socket.
 		 */
 		if (pfk->promisc)
-			pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk);
+			pfkey_broadcast_one(skb, GFP_ATOMIC, sk);
 
 		/* the exact target will be processed later */
 		if (sk == one_sk)
@@ -268,7 +259,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 				continue;
 		}
 
-		err2 = pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk);
+		err2 = pfkey_broadcast_one(skb, GFP_ATOMIC, sk);
 
 		/* Error is cleared after successful sending to at least one
 		 * registered KM */
@@ -278,9 +269,8 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 	rcu_read_unlock();
 
 	if (one_sk != NULL)
-		err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
+		err = pfkey_broadcast_one(skb, allocation, one_sk);
 
-	kfree_skb(skb2);
 	kfree_skb(skb);
 	return err;
 }
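
(Two side effects of this rework worth noting: the sk_rcvbuf check now
runs before cloning, so a full receive queue no longer costs an
allocation, and every queued skb is a private clone. skb_set_owner_r()
is only ever applied to the clone, so the caller's skb keeps its
original destructor and the final kfree_skb(skb) in pfkey_broadcast()
stays safe.)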
Sean Tranchetti Sept. 24, 2018, 6:46 p.m. UTC | #9
On 2018-09-23 11:15, Eric Dumazet wrote:
> On 09/20/2018 12:25 PM, stranche@codeaurora.org wrote:
> 
>> Perhaps a cleaner solution here is to always clone the SKB in
>> pfkey_broadcast_one(). That will ensure that the two kfree_skb() calls
>> in pfkey_broadcast() will never be passed an SKB with sock_rfree() as
>> its destructor, and we can avoid this race condition.
> 
> Yes, this whole idea of avoiding the cloning is brain dead.
> 
> Better play safe and having a straightforward implementation.
> 
> I suggest something like this (I could not reproduce the bug with the
> syzkaller repro)
> 
> Note that I removed the sock_hold(sk)/sock_put(sk) pair as this is 
> useless.
> The only time GFP_KERNEL might be used is when the sk is already owned
> by the caller.
> 
> 
>  net/key/af_key.c |   40 +++++++++++++++-------------------------
>  1 file changed, 15 insertions(+), 25 deletions(-)

Hi Eric,

That patch works like a charm. Could you upload that as a formal patch?
Thanks for all your help with this.
diff mbox series

Patch

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d61266..dd257c7 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -275,13 +275,13 @@  static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 		if ((broadcast_flags & BROADCAST_REGISTERED) && err)
 			err = err2;
 	}
-	rcu_read_unlock();
 
 	if (one_sk != NULL)
 		err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
 
 	kfree_skb(skb2);
 	kfree_skb(skb);
+	rcu_read_unlock();
 	return err;
 }