diff mbox

unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

Message ID 87ziydvasn.fsf_-_@doppelsaurus.mobileactivedefense.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Rainer Weikusat Nov. 16, 2015, 10:28 p.m. UTC
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of the server socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll routine does a second sock_poll_wait call with the
peer_wait wait queue of the server socket as queue argument
(unix_dgram_recvmsg does a wake up on this queue after a datagram was
received). This is inherently problematic because the server socket is
only guaranteed to remain alive for as long as the client still holds a
reference to it. In case the connection is dissolved via connect or by
the dead peer detection logic in unix_dgram_sendmsg, the server socket
may be freed even though "the polling mechanism" (in particular, epoll)
still has a pointer to the corresponding peer_wait queue. There's no way
to forcibly deregister a wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via its wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
behavior for EPOLLOUT event": This shouldn't be an issue anymore with
this change even though it restores the "only when writable" behaviour,
as the wake up relay will also be set up once _dgram_sendmsg returned
EAGAIN for a send attempt on an n:1 connected socket.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jason Baron Nov. 17, 2015, 4:13 p.m. UTC | #1
On 11/16/2015 05:28 PM, Rainer Weikusat wrote:
> [...]

Hi,

My only comment was about potentially avoiding the double lock in the
write path, otherwise this looks ok to me.

Thanks,

-Jason

David Miller Nov. 17, 2015, 8:14 p.m. UTC | #2
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Date: Mon, 16 Nov 2015 22:28:40 +0000

> [...]

So because of a corner case of epoll handling and sender socket release,
every single datagram sendmsg has to do a double lock now?

I do not dispute the correctness of your fix at this point, but that
added cost in the fast path is really too high.

Rainer Weikusat Nov. 17, 2015, 9:37 p.m. UTC | #3
David Miller <davem@davemloft.net> writes:
> From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

>> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
>> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
>
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

This leaves only the option of a somewhat incorrect solution and what is
or isn't acceptable in this respect is somewhat difficult to decide. The
basic options would be

	- return EAGAIN even if sending became possible (Jason's most
          recent suggestions)

	- retry sending a limited number of times, eg, once, before
          returning EAGAIN, on the grounds that this is nicer to the
          application and that redoing all the stuff up to the _lock in
          dgram_sendmsg can possibly/ likely be avoided

Which one do you prefer?

Rainer Weikusat Nov. 17, 2015, 10:09 p.m. UTC | #4
Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:

[...]

> The basic options would be
>
> 	- return EAGAIN even if sending became possible (Jason's most
>           recent suggestions)
>
> 	- retry sending a limited number of times, eg, once, before
>           returning EAGAIN, on the grounds that this is nicer to the
>           application and that redoing all the stuff up to the _lock in
>           dgram_sendmsg can possibly/ likely be avoided

A third option: Use trylock to acquire the sk lock. If this succeeds,
there's no risk of deadlocking anyone even if acquiring the locks in the
wrong order. This could look as follows (NB: I didn't even compile this,
I just wrote the code to get an idea how complicated it would be):

		int need_wakeup;

[...]

		need_wakeup = 0;
		err = 0;
		if (spin_trylock(&unix_sk(sk)->lock)) {
			if (unix_peer(sk) != other ||
				unix_dgram_peer_wake_me(sk, other))
				err = -EAGAIN;
		} else {
			err = -EAGAIN;
			
			unix_state_unlock(other);
			unix_state_lock(sk);
			
			need_wakeup = unix_peer(sk) == other &&
				      unix_dgram_peer_wake_connect(sk, other) &&
				      sk_receive_queue_len(other) == 0;
		}
		
		unix_state_unlock(sk);
		
		if (err) {
			if (need_wakeup)
				wake_up_interruptible_poll(sk_sleep(sk),
							   POLLOUT |
							   POLLWRNORM |
							   POLLWRBAND);

			goto out_free;
		}

Rainer Weikusat Nov. 17, 2015, 10:48 p.m. UTC | #5
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

[...]

> This leaves only the option of a somewhat incorrect solution and what is
> or isn't acceptable in this respect is somewhat difficult to decide. The
> basic options would be

[...]
> 	- retry sending a limited number of times, eg, once, before
>           returning EAGAIN, on the grounds that this is nicer to the
>           application and that redoing all the stuff up to the _lock in
>           dgram_sendmsg can possibly/ likely be avoided

Since it's better to have a specific example of something: Here's
another 'code sketch' of this option (hopefully with fewer errors this
time; there's an int restart = 0 above):

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		int need_wakeup;
		

[...]

		need_wakeup = 0;
		err = 0;
		unix_state_unlock(other);
		unix_state_lock(sk);

		if (unix_peer(sk) == other) {
			if (++restart == 2) {
				need_wakeup = unix_dgram_peer_wake_connect(sk, other) &&
					      sk_receive_queue_len(other) == 0;
				err = -EAGAIN;
			} else if (unix_dgram_peer_wake_me(sk, other))
				err = -EAGAIN;
		} else
			err = -EAGAIN;

		unix_state_unlock(sk);

		if (err || !restart) {
			if (need_wakeup)
				wake_up_interruptible_poll(sk_sleep(sk),
							   POLLOUT |
							   POLLWRNORM |
							   POLLWRBAND);
			
			goto out_free;
		}
		
		goto restart;
	}

I don't particularly like that, either, and to me, the best option seems
to be to return the spurious EAGAIN if taking both locks unconditionally
is not an option as that's the simplest choice.

Rainer Weikusat Nov. 18, 2015, 6:15 p.m. UTC | #6
David Miller <davem@davemloft.net> writes:
> From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1

[...]

> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

Some more information on this: Running the test program included below
20 times on my 'work' system (quadcore AMD A10-5700, 3393.984; otherwise
idle, after logging in via VT with no GUI running) with a patched 4.3
kernel resulted in the following throughput statistics[*]:

avg		13.617  M/s
median		13.393  M/s
max		17.14   M/s
min		13.047  M/s
deviation	0.85

I'll try to post the results for 'unpatched' later as I'm also working
on a couple of other things.

[*] I do not use my fingers for counting, hence, these are binary and
not decimal units.

------------
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
    MSG_SZ =	16,
    MSGS =	1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
    uint64_t u;

    u = tv->tv_sec;
    u *= 1000000;
    return u + tv->tv_usec;
}

int main(void)
{
    struct timeval start, stop;
    uint64_t t_diff;
    double rate;
    int sks[2];
    unsigned remain;
    char buf[MSG_SZ];

    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

    if (fork() == 0) {
	close(*sks);
	
	gettimeofday(&start, 0);
	while (read(sks[1], buf, sizeof(buf)) > 0);
	gettimeofday(&stop, 0);

	t_diff = tv2u(&stop);
	t_diff -= tv2u(&start);
	rate = MSG_SZ * MSGS;
	rate /= t_diff;
	rate *= 1000000;
	printf("rate %fM/s\n", rate / (1 << 20));

	fflush(stdout);
	_exit(0);
    }

    close(sks[1]);
    
    remain = MSGS;
    do write(*sks, msg, sizeof(msg)); while (--remain);
    close(*sks);

    wait(NULL);
    return 0;
}

Rainer Weikusat Nov. 18, 2015, 11:39 p.m. UTC | #7
Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:

[...]

> Some more information on this: Running the test program included below
> on my 'work' system (otherwise idle, after logging in via VT with no GUI
> running)/ quadcore AMD A10-5700, 3393.984 for 20 times/ patched 4.3 resulted in the
> following throughput statistics[*]:

Since the results were too variable with only 20 runs, I've also tested
this with 100 runs for three kernels: stock 4.3, 4.3 plus the published
patch, and 4.3 plus the published patch plus the "just return EAGAIN"
modification. The 1st and the 3rd perform about identically for the
test program I used (slightly modified version included below), the 2nd
is markedly slower. This is most easily visible when grouping the
printed data rates (B/s) 'by millions':

stock 4.3
---------
13000000.000-13999999.000       3       (3%)
14000000.000-14999999.000       82      (82%)
15000000.000-15999999.000       15      (15%)


4.3 + patch
-----------
13000000.000-13999999.000       54      (54%)
14000000.000-14999999.000       35      (35%)
15000000.000-15999999.000       7       (7%)
16000000.000-16999999.000       1       (1%)
18000000.000-18999999.000       1       (1%)
22000000.000-22999999.000       2       (2%)


4.3 + modified patch
--------------------
13000000.000-13999999.000       3       (3%)
14000000.000-14999999.000       82      (82%)
15000000.000-15999999.000       14      (14%)
24000000.000-24999999.000       1       (1%)


IMHO, the 3rd option would be the way to go if this was considered an
acceptable option (ie, even though it returns spurious errors in 'rare
cases').


modified test program
=====================
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
    MSG_SZ =	16,
    MSGS =	1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
    uint64_t u;

    u = tv->tv_sec;
    u *= 1000000;
    return u + tv->tv_usec;
}

int main(void)
{
    struct timeval start, stop;
    uint64_t t_diff;
    double rate;
    int sks[2];
    unsigned remain;
    char buf[MSG_SZ];

    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

    if (fork() == 0) {
	close(*sks);
	
	gettimeofday(&start, 0);
	while (read(sks[1], buf, sizeof(buf)) > 0);
	gettimeofday(&stop, 0);

	t_diff = tv2u(&stop);
	t_diff -= tv2u(&start);
	rate = MSG_SZ * MSGS;
	rate /= t_diff;
	rate *= 1000000;
	printf("%f\n", rate);

	fflush(stdout);
	_exit(0);
    }

    close(sks[1]);
    
    remain = MSGS;
    do write(*sks, msg, sizeof(msg)); while (--remain);
    close(*sks);

    wait(NULL);
    return 0;
}

Rainer Weikusat Nov. 19, 2015, 11:48 p.m. UTC | #8
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
> Rainer Weikusat <rw@doppelsaurus.mobileactivedefense.com> writes:
>
> [...]
>
>> The basic options would be
>>
>> 	- return EAGAIN even if sending became possible (Jason's most
>>           recent suggestions)
>>
>> 	- retry sending a limited number of times, eg, once, before
>>           returning EAGAIN, on the grounds that this is nicer to the
>>           application and that redoing all the stuff up to the _lock in
>>           dgram_sendmsg can possibly/ likely be avoided
>
> A third option:

A fourth and even one that's reasonably simple to implement: In case
other became ready during the checks, drop other lock, do a double-lock
sk, other, set a flag variable indicating this and restart the procedure
after the unix_state_lock_other[*], using the value of the flag to lock/
unlock sk as needed. Should other still be ready to receive data,
execution can then continue with the 'queue it' code as the other lock
was held all the time this time. Combined with a few unlikely
annotations in place where they're IMHO appropriate, this is speed-wise
comparable to the stock kernel.

Patch

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@  struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@  found:
 	return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+	struct unix_sock *u;
+	wait_queue_head_t *u_sleep;
+
+	u = container_of(q, struct unix_sock, peer_wake);
+
+	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+			    q);
+	u->peer_wake.private = NULL;
+
+	/* relaying can only happen while the wq still exists */
+	u_sleep = sk_sleep(&u->sk);
+	if (u_sleep)
+		wake_up_interruptible_poll(u_sleep, key);
+
+	return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (!u->peer_wake.private) {
+		u->peer_wake.private = other;
+		__add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+		rc = 1;
+	}
+
+	spin_unlock(&u_other->peer_wait.lock);
+	return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+	int rc;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	rc = 0;
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (u->peer_wake.private == other) {
+		__remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+		u->peer_wake.private = NULL;
+
+		rc = 1;
+	}
+
+	spin_unlock(&u_other->peer_wait.lock);
+	return rc;
+}
+
+/* preconditions:
+ *	- unix_peer(sk) == other
+ *	- association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+	int connected;
+
+	connected = unix_dgram_peer_wake_connect(sk, other);
+
+	if (unix_recvq_full(other))
+		return 1;
+
+	if (connected)
+		unix_dgram_peer_wake_disconnect(sk, other);
+
+	return 0;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +536,8 @@  static void unix_release_sock(struct sock *sk, int embrion)
 			skpair->sk_state_change(skpair);
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
 		}
+
+		unix_dgram_peer_wake_disconnect(sk, skpair);
 		sock_put(skpair); /* It may now die */
 		unix_peer(sk) = NULL;
 	}
@@ -664,6 +772,7 @@  static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1031,6 +1140,13 @@  restart:
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
 		unix_peer(sk) = other;
+
+		if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+			wake_up_interruptible_poll(sk_sleep(sk),
+						   POLLOUT |
+						   POLLWRNORM |
+						   POLLWRBAND);
+
 		unix_state_double_unlock(sk, other);
 
 		if (other != old_peer)
@@ -1548,7 +1664,7 @@  restart:
 		goto out_free;
 	}
 
-	unix_state_lock(other);
+	unix_state_double_lock(sk, other);
 	err = -EPERM;
 	if (!unix_may_send(sk, other))
 		goto out_unlock;
@@ -1562,9 +1678,15 @@  restart:
 		sock_put(other);
 
 		err = 0;
-		unix_state_lock(sk);
 		if (unix_peer(sk) == other) {
 			unix_peer(sk) = NULL;
+
+			if (unix_dgram_peer_wake_disconnect(sk, other))
+				wake_up_interruptible_poll(sk_sleep(sk),
+							   POLLOUT |
+							   POLLWRNORM |
+							   POLLWRBAND);
+
 			unix_state_unlock(sk);
 
 			unix_dgram_disconnected(sk, other);
@@ -1591,20 +1713,27 @@  restart:
 	}
 
 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
-		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
+		if (timeo) {
+			unix_state_unlock(sk);
 
-		timeo = unix_wait_for_peer(other, timeo);
+			timeo = unix_wait_for_peer(other, timeo);
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
 
-		goto restart;
+			goto restart;
+		}
+
+		if (unix_peer(sk) != other ||
+		    unix_dgram_peer_wake_me(sk, other)) {
+			err = -EAGAIN;
+			goto out_unlock;
+		}
 	}
 
+	unix_state_unlock(sk);
+
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
@@ -1618,7 +1747,7 @@  restart:
 	return len;
 
 out_unlock:
-	unix_state_unlock(other);
+	unix_state_double_unlock(sk, other);
 out_free:
 	kfree_skb(skb);
 out:
@@ -2453,14 +2582,16 @@  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 		return mask;
 
 	writable = unix_writable(sk);
-	other = unix_peer_get(sk);
-	if (other) {
-		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
-		sock_put(other);
+	if (writable) {
+		unix_state_lock(sk);
+
+		other = unix_peer(sk);
+		if (other && unix_peer(other) != sk &&
+		    unix_recvq_full(other) &&
+		    unix_dgram_peer_wake_me(sk, other))
+			writable = 0;
+
+		unix_state_unlock(sk);
 	}
 
 	if (writable)