[net-next,v2] unix: add read side socket memory accounting for dgram sockets

Message ID 20140217130353.GA22833@order.stressinduktion.org
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Hannes Frederic Sowa Feb. 17, 2014, 1:03 p.m. UTC
The current situation:

We allocate memory with sock_alloc_send_pskb, which accounts the skb to
the sending socket's sk_wmem_alloc. The skb->destructor does a sock_wfree
as soon as the recipient retrieves the data and the skb is destroyed.

The problem with this implementation is that an unconnected unix dgram
server could get blocked from sending further data if a client does not
retrieve its frames from its sk_receive_queue, because sk_wmem_alloc is
then under the control of the client.
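A minimal userspace sketch of the scenario (paths, datagram size and
client count are purely illustrative, not an actual reproducer):

/*
 * Sketch of the problem on an unpatched kernel: queued datagrams are
 * charged to the *sender's* sk_wmem_alloc, so clients that never read
 * can pin the server's whole send budget.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define NLAZY 64	/* illustrative number of non-reading clients */

int main(void)
{
	struct sockaddr_un addr[NLAZY];
	char buf[4096] = { 0 };
	int srv = socket(AF_UNIX, SOCK_DGRAM, 0);	/* the "server" */
	int i, n = 0;

	/* Clients that bind a name but never call recv(). */
	for (i = 0; i < NLAZY; i++) {
		int c = socket(AF_UNIX, SOCK_DGRAM, 0);

		memset(&addr[i], 0, sizeof(addr[i]));
		addr[i].sun_family = AF_UNIX;
		snprintf(addr[i].sun_path, sizeof(addr[i].sun_path),
			 "/tmp/lazy-%d.sock", i);
		unlink(addr[i].sun_path);
		bind(c, (struct sockaddr *)&addr[i], sizeof(addr[i]));
	}

	/*
	 * Each queued datagram stays accounted to srv's sk_wmem_alloc
	 * until a client reads it.  Each per-client queue is also capped
	 * by net.unix.max_dgram_qlen, but with enough lazy clients the
	 * sender's sk_sndbuf is exhausted first, and sendto() then fails
	 * with EAGAIN for every peer, well-behaved or not.
	 */
	for (i = 0; ; i = (i + 1) % NLAZY, n++) {
		if (sendto(srv, buf, sizeof(buf), MSG_DONTWAIT,
			   (struct sockaddr *)&addr[i],
			   sizeof(addr[i])) < 0)
			break;
	}
	printf("sender wedged after %d datagrams\n", n);
	return 0;
}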

This patch makes the following improvement:

We still allocate dgram packets with sock_alloc_send_pskb, which now
normally does not block if the socket has too many packets in flight.

While delivering the skb to the client in unix_dgram_sendmsg, we check
the recipient socket's rmem and block there if sk_rcvbuf would be surpassed.

As the sending socket is no longer attached to the skb when it is
retrieved, we must attach the unix dgram peer's address to the skb
directly; this is done via a new field in unix_skb_parms.

Note that socket buffer limitation is already in place in the generic
socket layer: the client cannot request more than sysctl_rmem_max
(/proc/sys/net/core/rmem_max) bytes of receive buffer.
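For instance, the clamp is visible from userspace (a sketch; the doubled
readback is the kernel's allowance for bookkeeping overhead):

/* Sketch: SO_RCVBUF requests are capped at net.core.rmem_max; the
 * value read back is doubled by the kernel to cover skb overhead. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	int req = 1 << 30, eff;		/* ask for 1 GiB */
	socklen_t len = sizeof(eff);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &eff, &len);

	/* eff == 2 * min(req, rmem_max): the request was clamped to
	 * /proc/sys/net/core/rmem_max before being doubled. */
	printf("requested %d, effective %d\n", req, eff);
	return 0;
}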

The (unix_peer(other) != sk) checks are removed because we now no longer
block in sock_alloc_send_pskb if we try to send data to ourselves, as the
send buffer will always be free. To avoid a DoS we need to bring a sender
that keeps delivering skbs to its own socket receive queue to a stop here
instead. The same logic applies to the unix_dgram_poll change.

So, the following protections are now in place to avoid becoming the
victim of a DoS:
* sk_receive_queue length limitation by sk_max_ack_backlog
* sk_rcvbuf size limiting on the receiving socket
* send buffer limitations in case of too many concurrent send requests

With this patch it may become possible to relax the unix_recvq_full check
or to make sk_max_ack_backlog configurable on a per-socket basis in the
future.

Reported-by: Yannick Koehler <yannick@koehler.name>
CC: Yannick Koehler <yannick@koehler.name>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dan Ballard <dan@mindstab.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
Changelog v2:

In unix_wait_for_peer I messed up the unix_other_writable condition that
determines whether we should call schedule_timeout:

Instead of

                sched = sched && (unix_recvq_full(other) ||
                                  unix_other_writable(other));

it is now

                sched = sched && (unix_recvq_full(other) ||
                                  !unix_other_writable(other));

 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 79 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 17 deletions(-)

Comments

David Miller Feb. 19, 2014, 7:49 p.m. UTC | #1
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 17 Feb 2014 14:03:53 +0100

> We still allocate dgram packets with sock_alloc_send_pskb, which now
> normally does not block if the socket has too many packets in flight.

It seems like it does to me, it does sock_wait_for_wmem(), right?

Or are you trying to say that usually this is not the point at which
we block, but rather it's when we check the peer's receive queue?
Hannes Frederic Sowa Feb. 19, 2014, 8:08 p.m. UTC | #2
On Wed, Feb 19, 2014 at 02:49:27PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 17 Feb 2014 14:03:53 +0100
> 
> > We still allocate dgram packets with sock_alloc_send_pskb, which now
> > normally does not block if the socket has too many packets in flight.
> 
> It seems like it does to me, it does sock_wait_for_wmem(), right?

We still do the sock_wait_for_wmem in sock_alloc_send_pskb, correct,
but we rarely block there, maybe in highly concurrent cases with big
payloads where one process got interrupted to account the memory to the
receiving socket.

> Or are you trying to say that usually this is not the point at which
> we block, but rather it's when we check the peer's receive queue?

Exactly.

Bye,

  Hannes

David Miller Feb. 25, 2014, 10:17 p.m. UTC | #3
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 19 Feb 2014 21:08:11 +0100

> On Wed, Feb 19, 2014 at 02:49:27PM -0500, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Mon, 17 Feb 2014 14:03:53 +0100
>> 
>> > We still allocate dgram packets with sock_alloc_send_pskb, which now
>> > normally does not block if the socket has too many packets in flight.
>> 
>> It seems like it does to me, it does sock_wait_for_wmem(), right?
> 
> We still do the sock_wait_for_wmem in sock_alloc_send_pskb, correct,
> but we rarely block there, maybe in highly concurrent cases with big
> payloads where one process got interrupted to account the memory to the
> receiving socket.
> 
>> Or are you trying to say that usually this is not the point at which
>> we block, but rather it's when we check the peer's receive queue?
> 
> Exactly.

Just FYI, I'm not ignoring this patch, I'm just thinking about it a lot.
David Miller Feb. 26, 2014, 7:55 p.m. UTC | #4
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 19 Feb 2014 21:08:11 +0100

> On Wed, Feb 19, 2014 at 02:49:27PM -0500, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Mon, 17 Feb 2014 14:03:53 +0100
>> 
>> > We still allocate dgram packets with sock_alloc_send_pskb, which now
>> > normally does not block if the socket has too many packets in flight.
>> 
>> It seems like it does to me, it does sock_wait_for_wmem(), right?
> 
> We still do the sock_wait_for_wmem in sock_alloc_send_pskb, correct,
> but we rarely block there, maybe in highly concurrent cases with big
> payloads where one process got interrupted to account the memory to the
> receiving socket.
> 
>> Or are you trying to say that usually this is not the point at which
>> we block, but rather it's when we check the peer's receive queue?
> 
> Exactly.

So my only major objection is that we're now going to incur two atomics
on every single datagram UNIX packet sent between two peers.

I think you can safely put the 'other' sock pointer into the control
block.  Because when we disassociate from a peer any pending packets
in flight to him will be consumed/dropped, so there can't be any
lingering references, right?
Hannes Frederic Sowa Feb. 26, 2014, 8:21 p.m. UTC | #5
On Wed, Feb 26, 2014 at 02:55:59PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 19 Feb 2014 21:08:11 +0100
> 
> > On Wed, Feb 19, 2014 at 02:49:27PM -0500, David Miller wrote:
> >> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >> Date: Mon, 17 Feb 2014 14:03:53 +0100
> >> 
> >> > We still allocate dgram packets with sock_alloc_send_pskb, which now
> >> > normally does not block if the socket has too many packets in flight.
> >> 
> >> It seems like it does to me, it does sock_wait_for_wmem(), right?
> > 
> > We still do the sock_wait_for_wmem in sock_alloc_send_pskb, correct,
> > but we rarely block there, maybe in highly concurrent cases with big
> > payloads where one process got interrupted to account the memory to the
> > receiving socket.
> > 
> >> Or are you trying to say that usually this is not the point at which
> >> we block, but rather it's when we check the peer's receive queue?
> > 
> > Exactly.
> 
> So my only major objection is that we're now going to incur two atomics
> on every single datagram UNIX packet sent between two peers.

Ok, I see. Daniel Borkmann pointed me to hackbench off-list and I will
do some before/after benchmarks.

> I think you can safely put the 'other' sock pointer into the control
> block.  Because when we disassociate from a peer any pending packets
> in flight to him will be consumed/dropped, so there can't be any
> lingering references, right?

I am afraid lingering socket references will happen if the sockets are
used in unconnected mode.

Greetings,

  Hannes

David Miller Feb. 26, 2014, 8:42 p.m. UTC | #6
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 26 Feb 2014 21:21:54 +0100

> On Wed, Feb 26, 2014 at 02:55:59PM -0500, David Miller wrote:
>> I think you can safely put the 'other' sock pointer into the control
>> block.  Because when we disassociate from a peer any pending packets
>> in flight to him will be consumed/dropped, so there can't be any
>> lingering references, right?
> 
> I am afraid lingering socket references will happen if the sockets are
> used in unconnected mode.

Sigh...

Ok, let's wait for your hackbench results and work from there.
Hannes Frederic Sowa March 4, 2014, 8:13 p.m. UTC | #7
On Wed, Feb 26, 2014 at 03:42:32PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 26 Feb 2014 21:21:54 +0100
> 
> > On Wed, Feb 26, 2014 at 02:55:59PM -0500, David Miller wrote:
> >> I think you can safely put the 'other' sock pointer into the control
> >> block.  Because when we disassociate from a peer any pending packets
> >> in flight to him will be consumed/dropped, so there can't be any
> >> lingering references, right?
> > 
> > I am afraid lingering socket references will happen if the sockets are
> > used in unconnected mode.
> 
> Sigh...
> 
> Ok, let's wait for your hackbench results and work from there.

Just a small followup:

Because of the additional sock_wfree during skb handover in unix_dgram_sendmsg
the spinlock in __wake_up_sync_key is hit more often, causing more cacheline
bouncing.

Deferring the POLLOUT notification until after the sock_rfree in
unix_dgram_recvmsg already halved the number of context switches with the
new patch.

I'll try some other ideas this week and will submit a new patch with
performance numbers.

Haven't noticed any problems with the additional atomic operations
performance-wise so far.

Bye,

  Hannes

David Miller March 4, 2014, 8:37 p.m. UTC | #8
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 4 Mar 2014 21:13:36 +0100

> Just a small followup:
> 
> Because of the additional sock_wfree during skb handover in unix_dgram_sendmsg
> the spinlock in __wake_up_sync_key is hit more often, causing more cacheline
> bouncing.
> 
> Deferring the POLLOUT notification until after the sock_rfree in
> unix_dgram_recvmsg already halved the number of context switches with the
> new patch.
> 
> I'll try some other ideas this week and will submit a new patch with
> performance numbers.
> 
> Haven't noticed any problems with the additional atomic operations
> performance-wise so far.

Ok, thanks for the update.
Hannes Frederic Sowa March 10, 2014, 1:07 a.m. UTC | #9
On Tue, Mar 04, 2014 at 03:37:32PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Tue, 4 Mar 2014 21:13:36 +0100
> 
> > Just a small followup:
> > 
> > Because of the additional sock_wfree during skb handover in unix_dgram_sendmsg
> > the spinlock in __wake_up_sync_key is hit more often, causing more cacheline
> > bouncing.
> > 
> > Deferring the POLLOUT notification until after the sock_rfree in
> > unix_dgram_recvmsg already halved the number of context switches with the
> > new patch.
> > 
> > I'll try some other ideas this week and will submit a new patch with
> > performance numbers.
> > 
> > Haven't noticed any problems with the additional atomic operations
> > performance-wise so far.
> 
> Ok, thanks for the update.

I need to have some feedback regarding backwards compatibility, so still no
final patch:

First the unconnected send/reply benchmark, which looks good.

/home/hannes/netperf-2.6.0/src/netperf -t  DG_RR -- -r 1,1

=== Unpatched net-next: ===
DG REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    23587.28
4608   2304

=== Patched net-next: ===
DG REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

212992 212992 1        1       10.00    21607.50
4608   2304

I guess it doesn't matter much. As soon as I start profiling, the
difference vanishes even more; perf diff shows no difference in
unix_dgram_sendmsg.

I couldn't do the parameterless netperf DG_STREAM benchmarks, because
netserver unconditionally sets SO_RCVBUF to 0, which the kernel clamps
to SOCK_MIN_RCVBUF. The sender cannot send netperf's normal packet size
through the socket. In case there are other applications out there,
are we allowed to break them?
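For reference, the clamp can be seen with a few lines of userspace code
(a sketch; the exact floor depends on the kernel's SOCK_MIN_RCVBUF):

/* Sketch: requesting SO_RCVBUF of 0 leaves the socket at the kernel's
 * floor, SOCK_MIN_RCVBUF (presumably the 2304 netperf reports above).
 * Under the patch, a datagram whose truesize exceeds this floor could
 * never be delivered, which is what breaks the default DG_STREAM run. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	int zero = 0, eff;
	socklen_t len = sizeof(eff);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &zero, sizeof(zero));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &eff, &len);

	printf("SO_RCVBUF after requesting 0: %d\n", eff);
	return 0;
}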

Otherwise, benchmark results for the stream tests don't look that good
because of the rescheduling from sock_wfree, and I didn't find an easy way
to reduce it besides playing dirty tricks with SOCK_USE_WRITE_QUEUE, for
which I could not come up with a proper, race-free approach.

/home/hannes/netperf-2.6.0/src/netperf -t  DG_STREAM -- -m 1024

=== Unpatched net-next: ===
DG UNIDIRECTIONAL SEND TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1024    10.02      963846      0     787.68
 2304            10.02      963846            787.68

=== Patched net-next: ===
DG UNIDIRECTIONAL SEND TEST
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1024    10.00      347507      0     284.68
 2304            10.00      347507            284.68

The important question for me is whether we can treat applications that
already set SO_RCVBUF to some minimal value, and therefore stop receiving
packets, as buggy and ignore them, or whether we need to introduce some
way around this?

Greetings,

  Hannes

David Miller March 10, 2014, 8:27 p.m. UTC | #10
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 10 Mar 2014 02:07:54 +0100

> On Tue, Mar 04, 2014 at 03:37:32PM -0500, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Tue, 4 Mar 2014 21:13:36 +0100
>> 
>> > Just a small followup:
>> > 
>> > Because of the additional sock_wfree during skb handover in unix_dgram_sendmsg
>> > the spinlock in __wake_up_sync_key is hit more often, causing more cacheline
>> > bouncing.
>> > 
>> > Deferring the POLLOUT notification until after the sock_rfree in
>> > unix_dgram_recvmsg already halved the number of context switches with the
>> > new patch.
>> > 
>> > I'll try some other ideas this week and will submit a new patch with
>> > performance numbers.
>> > 
>> > Haven't noticed any problems with the additional atomic operations
>> > performance-wise so far.
>> 
>> Ok, thanks for the update.
> 
> I need to have some feedback regarding backwards compatibility, so still no
> final patch:
> 
> First the unconnected send/reply benchmark, which looks good.
> 
> /home/hannes/netperf-2.6.0/src/netperf -t  DG_RR -- -r 1,1
> 
> === Unpatched net-next: ===
> DG REQUEST/RESPONSE TEST
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
> 
> 212992 212992 1        1       10.00    23587.28
> 4608   2304
> 
> === Patched net-next: ===
> DG REQUEST/RESPONSE TEST
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
> 
> 212992 212992 1        1       10.00    21607.50
> 4608   2304

If I read those transaction rate numbers correctly, it's slowing
down by more than 8%.  I'm not so sure that looks "good" to me.

> I couldn't do the parameterless netperf DG_STREAM benchmarks, because
> netserver unconditionally sets SO_RCVBUF to 0, which the kernel clamps
> to SOCK_MIN_RCVBUF. The sender cannot send netperf's normal packet size
> through the socket. In case there are other applications out there,
> are we allowed to break them?
 ...
> The important question for me is whether we can treat applications that
> already set SO_RCVBUF to some minimal value, and therefore stop receiving
> packets, as buggy and ignore them, or whether we need to introduce some
> way around this?

I think if it works currently, the risk of breaking a lot of things is simply
too high to change this.
Hannes Frederic Sowa March 10, 2014, 9:04 p.m. UTC | #11
Hi!

On Mon, Mar 10, 2014 at 04:27:56PM -0400, David Miller wrote:
> > First the unconnected send/reply benchmark, which looks good.
> > 
> > /home/hannes/netperf-2.6.0/src/netperf -t  DG_RR -- -r 1,1
> > 
> > === Unpatched net-next: ===
> > DG REQUEST/RESPONSE TEST
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> > 
> > 212992 212992 1        1       10.00    23587.28
> > 4608   2304
> > 
> > === Patched net-next: ===
> > DG REQUEST/RESPONSE TEST
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> > 
> > 212992 212992 1        1       10.00    21607.50
> > 4608   2304
> 
> If I read those transaction rate numbers correctly, it's slowing
> down by more than 8%.  I'm not so sure that looks "good" to me.

You haven't seen the other benchmarks before I fixed the wakeup. ;)

(I should really have done more benchmarks before; I was only concerned
about the "correctness" issue.)

Ok, sure, a performance drop is clearly visible, I have to admit, even
though I don't know if lots of people use this style of communication.

> > I couldn't do the parameterless netperf DG_STREAM benchmarks, because
> > netserver unconditionally sets SO_RCVBUF to 0, which the kernel clamps
> > to SOCK_MIN_RCVBUF. The sender cannot send netperf's normal packet size
> > through the socket. In case there are other applications out there,
> > are we allowed to break them?
>  ...
> > The important question for me is whether we can treat applications that
> > already set SO_RCVBUF to some minimal value, and therefore stop receiving
> > packets, as buggy and ignore them, or whether we need to introduce some
> > way around this?
> 
> I think if it works currently, the risk of breaking a lot of things is simply
> too high to change this.

Yes, it currently works.

I have to think more about this, but the only solution I came up with
is adding more special cases for connected dgram sockets etc., and that
cannot be the answer either... (especially because even connected
sockets can communicate with other sockets if msg->msg_name is specified).

Thanks,

  Hannes


Patch

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index a175ba4..52fbabd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -36,6 +36,7 @@  struct unix_skb_parms {
 	u32			secid;		/* Security ID		*/
 #endif
 	u32			consumed;
+	struct unix_address	*dgram_peer;
 };
 
 #define UNIXCB(skb) 	(*(struct unix_skb_parms *)&((skb)->cb))
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 29fc8be..103338e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -187,6 +187,12 @@  static inline int unix_recvq_full(struct sock const *sk)
 	return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
 }
 
+static bool unix_rmem_full(const struct sock *sk,
+			   const struct sk_buff *skb)
+{
+	return sk_rmem_alloc_get(sk) + skb->truesize > sk->sk_rcvbuf;
+}
+
 struct sock *unix_peer_get(struct sock *s)
 {
 	struct sock *peer;
@@ -322,6 +328,11 @@  static inline int unix_writable(struct sock *sk)
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
 
+static int unix_other_writable(const struct sock *other)
+{
+	return (atomic_read(&other->sk_rmem_alloc) << 2) <= other->sk_rcvbuf;
+}
+
 static void unix_write_space(struct sock *sk)
 {
 	struct socket_wq *wq;
@@ -1048,8 +1059,13 @@  static long unix_wait_for_peer(struct sock *other, long timeo)
 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
 
 	sched = !sock_flag(other, SOCK_DEAD) &&
-		!(other->sk_shutdown & RCV_SHUTDOWN) &&
-		unix_recvq_full(other);
+		!(other->sk_shutdown & RCV_SHUTDOWN);
+
+	if (other->sk_type == SOCK_STREAM)
+		sched = sched && unix_recvq_full(other);
+	else
+		sched = sched && (unix_recvq_full(other) ||
+				  !unix_other_writable(other));
 
 	unix_state_unlock(other);
 
@@ -1367,9 +1383,33 @@  static void unix_destruct_scm(struct sk_buff *skb)
 	/* Alas, it calls VFS */
 	/* So fscking what? fput() had been SMP-safe since the last Summer */
 	scm_destroy(&scm);
+}
+
+static void unix_skb_destruct_r(struct sk_buff *skb)
+{
+	unix_destruct_scm(skb);
+	if (UNIXCB(skb).dgram_peer)
+		unix_release_addr(UNIXCB(skb).dgram_peer);
+	sock_rfree(skb);
+}
+
+static void unix_skb_destruct_w(struct sk_buff *skb)
+{
+	unix_destruct_scm(skb);
+	if (UNIXCB(skb).dgram_peer)
+		unix_release_addr(UNIXCB(skb).dgram_peer);
 	sock_wfree(skb);
 }
 
+static void unix_skb_set_owner_r(struct sk_buff *skb, struct sock *sk,
+				   struct sock *other)
+{
+	sock_wfree(skb);
+	skb->sk = other;
+	skb->destructor = unix_skb_destruct_r;
+	atomic_add(skb->truesize, &other->sk_rmem_alloc);
+}
+
 #define MAX_RECURSION_LEVEL 4
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
@@ -1417,7 +1457,6 @@  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 	if (scm->fp && send_fds)
 		err = unix_attach_fds(scm, skb);
 
-	skb->destructor = unix_destruct_scm;
 	return err;
 }
 
@@ -1504,6 +1543,12 @@  static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
+	if (u->addr) {
+		UNIXCB(skb).dgram_peer = u->addr;
+		atomic_inc(&UNIXCB(skb).dgram_peer->refcnt);
+	}
+	skb->destructor = unix_skb_destruct_w;
+
 	err = unix_scm_to_skb(siocb->scm, skb, true);
 	if (err < 0)
 		goto out_free;
@@ -1579,7 +1624,7 @@  restart:
 			goto out_unlock;
 	}
 
-	if (unix_peer(other) != sk && unix_recvq_full(other)) {
+	if (unix_recvq_full(other) || unix_rmem_full(other, skb)) {
 		if (!timeo) {
 			err = -EAGAIN;
 			goto out_unlock;
@@ -1597,6 +1642,7 @@  restart:
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
+	unix_skb_set_owner_r(skb, sk, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
 		unix_sk(other)->recursion_level = max_level;
@@ -1677,6 +1723,8 @@  static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		if (!skb)
 			goto out_err;
 
+		skb->destructor = unix_skb_destruct_w;
+
 		/* Only send the fds in the first buffer */
 		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent);
 		if (err < 0) {
@@ -1760,13 +1808,12 @@  static int unix_seqpacket_recvmsg(struct kiocb *iocb, struct socket *sock,
 	return unix_dgram_recvmsg(iocb, sock, msg, size, flags);
 }
 
-static void unix_copy_addr(struct msghdr *msg, struct sock *sk)
+static void unix_copy_addr(struct msghdr *msg,
+			   struct unix_address *ua)
 {
-	struct unix_sock *u = unix_sk(sk);
-
-	if (u->addr) {
-		msg->msg_namelen = u->addr->len;
-		memcpy(msg->msg_name, u->addr->name, u->addr->len);
+	if (ua) {
+		msg->msg_namelen = ua->len;
+		memcpy(msg->msg_name, ua->name, ua->len);
 	}
 }
 
@@ -1810,7 +1857,7 @@  static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 					POLLOUT | POLLWRNORM | POLLWRBAND);
 
 	if (msg->msg_name)
-		unix_copy_addr(msg, skb->sk);
+		unix_copy_addr(msg, UNIXCB(skb).dgram_peer);
 
 	if (size > skb->len - skip)
 		size = skb->len - skip;
@@ -2013,7 +2060,7 @@  again:
 
 		/* Copy address just once */
 		if (sunaddr) {
-			unix_copy_addr(msg, skb->sk);
+			unix_copy_addr(msg, unix_sk(skb->sk)->addr);
 			sunaddr = NULL;
 		}
 
@@ -2237,11 +2284,9 @@  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
 	if (other) {
-		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
+		sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
+		if (unix_recvq_full(other) || !unix_other_writable(other))
+			writable = 0;
 		sock_put(other);
 	}