diff mbox

tcp: tcp_release_cb() should release socket ownership

Message ID 1393972752.26794.145.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet March 4, 2014, 10:39 p.m. UTC
On Tue, 2014-03-04 at 13:48 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 04 Mar 2014 06:16:23 -0800
> 
> > Could you try following hack/patch ?
> 
> Eric, if you do fix it like this (I don't mind), please use mnenomics
> for the various lock state values.

Sure this was kind of a hack...

I was thinking of following patch instead.

Note I would prefer waiting Lars tests.

Thanks !

[PATCH] tcp: tcp_release_cb() should release socket ownership

Lars Persson reported following deadlock :

-000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock 
-001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
-002 |ip_local_deliver_finish(skb = 0x8BD527A0)
-003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
-004 |netif_receive_skb(skb = 0x8BD527A0)
-005 |elk_poll(napi = 0x8C770500, budget = 64)
-006 |net_rx_action(?)
-007 |__do_softirq()
-008 |do_softirq()
-009 |local_bh_enable()
-010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
-011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
-012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
-013 |tcp_release_cb(sk = 0x8BE6B2A0)
-014 |release_sock(sk = 0x8BE6B2A0)
-015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
-016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
-017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
-018 |smb_send_kvec(server = 0x87C4D400, iov = 0x87D8DB10, n_vec = 1, sent = 0x87D8DB1C)
-019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
-020 |cifs_call_async(server = 0x87C4D400, rqst = 0x87D8DBA0, receive = 0x0, callback = 0x80219514,
-021 |cifs_async_writev(wdata = 0x87FD6580)
-022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
-023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
-024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
-025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
-026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
-027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
-028 |bdi_writeback_workfn(work = 0x87E4A9CC)
-029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
-030 |worker_thread(__worker = 0x8B045880)
-031 |kthread(_create = 0x87CADD90)
-032 |ret_from_kernel_thread(asm)

Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
it is running from softirq context.

Lars trace involved a NIC without RX checksum support but other points
are problematic as well, like the prequeue stuff.

Problem is triggered by a timer, that found socket being owned by user.

tcp_release_cb() should call tcp_write_timer_handler() or
tcp_delack_timer_handler() in the appropriate context :

BH disabled and socket lock held, but 'owned' field cleared,
as if they were running from timer handlers.

Fixes: 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
Reported-by: Lars Persson <lars.persson@axis.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h    |    5 +++++
 net/core/sock.c       |    5 ++++-
 net/ipv4/tcp_output.c |   11 +++++++++++
 3 files changed, 20 insertions(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller March 6, 2014, 1:59 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 04 Mar 2014 14:39:12 -0800

> @@ -767,6 +767,17 @@ void tcp_release_cb(struct sock *sk)
>  	if (flags & (1UL << TCP_TSQ_DEFERRED))
>  		tcp_tsq_handler(sk);
>  
> +	/* Here begins the tricky part :
> +	 * We are called from release_sock() with :
> +	 * 1) BH disabled
> +	 * 2) sk_lock.slock spinlock held
> +	 * 3) socket owned by us (sk->sk_lock.owned == 1)
> +	 *
> +	 * But following code is meant to be called from BH handlers,
> +	 * so we should keep BH disabled, but early release socket ownership
> +	 */
> +	sock_release_ownership(sk);
> +

It really means that sk_lock.owned cannot ever be accessed without the
sk_lock spinlock held.

Most of this is easy to hand audit, except sock_owned_by_user() which
has call sites everywhere.

Consider adding a locking assertion to it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 6, 2014, 2:06 a.m. UTC | #2
On Wed, 2014-03-05 at 20:59 -0500, David Miller wrote:

> It really means that sk_lock.owned cannot ever be accessed without the
> sk_lock spinlock held.
> 
> Most of this is easy to hand audit, except sock_owned_by_user() which
> has call sites everywhere.
> 
> Consider adding a locking assertion to it.

We can do that, but would it be a stable candidate ?

What about I send a followup for net-next ?

Thanks



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 6, 2014, 2:15 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 Mar 2014 18:06:41 -0800

> On Wed, 2014-03-05 at 20:59 -0500, David Miller wrote:
> 
>> It really means that sk_lock.owned cannot ever be accessed without the
>> sk_lock spinlock held.
>> 
>> Most of this is easy to hand audit, except sock_owned_by_user() which
>> has call sites everywhere.
>> 
>> Consider adding a locking assertion to it.
> 
> We can do that, but would it be a stable candidate ?
> 
> What about I send a followup for net-next ?

Targetting net-next for the assertion is fine.

Did you get test results back yet?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 6, 2014, 2:20 a.m. UTC | #4
On Wed, 2014-03-05 at 21:15 -0500, David Miller wrote:

> Targetting net-next for the assertion is fine.
> 
> Did you get test results back yet?

Lars gave the results this morning for the first patch.

He said he was now testing the official patch.

Lets wait a bit more ;)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 7, 2014, 4:01 p.m. UTC | #5
On Fri, 2014-03-07 at 16:45 +0100, Lars Persson wrote:
> So far no deadlock on the socket lock is seen with this patch. I have seen two out-of-memory crashes which leaves the system deadlocked trying to allocate memory.
> 
> Stack traces and kernel log from one of the oom crashes is attached. I was thinking this may be caused by having the dirty_ratio set too high given our product's heavy use of memory. Currently re-testing with dirty_ratio reduced to 5%.
> 
> 
> [11515.530000] Recording maint invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> [11515.540000] CPU: 0 PID: 1304 Comm: Recording maint Tainted: G           O 3.10.32 #10
> [11515.550000] Stack : 80666e52 00000049 80660000 00000000 805563dc 8b1876f8 8054af80 805d2227
> 	  00000518 00000000 80656d40 8b1876f8 000007ce 00000002 00000000 8047d858
> 	  00000000 8003bb64 00000006 00000000 8054d024
>  [11515.580000]  8b189a84 8b189a84 8054af80
> 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 8b189a18
> 	  ...


Well, this is 3.10.32, with local changes to dirty_ratio....

so I will simply ignore mm issues for the moment.

Thanks a lot Lars

Tested-by: Lars Persson <lars.persson@axis.com>



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 10, 2014, 4:43 p.m. UTC | #6
On Fri, 2014-03-07 at 08:01 -0800, Eric Dumazet wrote:
> On Fri, 2014-03-07 at 16:45 +0100, Lars Persson wrote:
> > So far no deadlock on the socket lock is seen with this patch. I have seen two out-of-memory crashes which leaves the system deadlocked trying to allocate memory.
> > 
> > Stack traces and kernel log from one of the oom crashes is attached. I was thinking this may be caused by having the dirty_ratio set too high given our product's heavy use of memory. Currently re-testing with dirty_ratio reduced to 5%.
> > 
> > 
> > [11515.530000] Recording maint invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> > [11515.540000] CPU: 0 PID: 1304 Comm: Recording maint Tainted: G           O 3.10.32 #10
> > [11515.550000] Stack : 80666e52 00000049 80660000 00000000 805563dc 8b1876f8 8054af80 805d2227
> > 	  00000518 00000000 80656d40 8b1876f8 000007ce 00000002 00000000 8047d858
> > 	  00000000 8003bb64 00000006 00000000 8054d024
> >  [11515.580000]  8b189a84 8b189a84 8054af80
> > 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 8b189a18
> > 	  ...
> 
> 
> Well, this is 3.10.32, with local changes to dirty_ratio....
> 
> so I will simply ignore mm issues for the moment.
> 
> Thanks a lot Lars
> 
> Tested-by: Lars Persson <lars.persson@axis.com>

David, what happened with this patch ?

Should I re-submit it ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 10, 2014, 8:40 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 10 Mar 2014 09:43:51 -0700

> On Fri, 2014-03-07 at 08:01 -0800, Eric Dumazet wrote:
>> On Fri, 2014-03-07 at 16:45 +0100, Lars Persson wrote:
>> > So far no deadlock on the socket lock is seen with this patch. I have seen two out-of-memory crashes which leaves the system deadlocked trying to allocate memory.
>> > 
>> > Stack traces and kernel log from one of the oom crashes is attached. I was thinking this may be caused by having the dirty_ratio set too high given our product's heavy use of memory. Currently re-testing with dirty_ratio reduced to 5%.
>> > 
>> > 
>> > [11515.530000] Recording maint invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>> > [11515.540000] CPU: 0 PID: 1304 Comm: Recording maint Tainted: G           O 3.10.32 #10
>> > [11515.550000] Stack : 80666e52 00000049 80660000 00000000 805563dc 8b1876f8 8054af80 805d2227
>> > 	  00000518 00000000 80656d40 8b1876f8 000007ce 00000002 00000000 8047d858
>> > 	  00000000 8003bb64 00000006 00000000 8054d024
>> >  [11515.580000]  8b189a84 8b189a84 8054af80
>> > 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> > 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 8b189a18
>> > 	  ...
>> 
>> 
>> Well, this is 3.10.32, with local changes to dirty_ratio....
>> 
>> so I will simply ignore mm issues for the moment.
>> 
>> Thanks a lot Lars
>> 
>> Tested-by: Lars Persson <lars.persson@axis.com>
> 
> David, what happened with this patch ?
> 
> Should I re-submit it ?

Yes, please do.  Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 5c3f7c3624aa..2100d5e8809a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1488,6 +1488,11 @@  static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
  */
 #define sock_owned_by_user(sk)	((sk)->sk_lock.owned)
 
+static inline void sock_release_ownership(struct sock *sk)
+{
+	sk->sk_lock.owned = 0;
+}
+
 /*
  * Macro so as to not evaluate some arguments when
  * lockdep is not enabled.
diff --git a/net/core/sock.c b/net/core/sock.c
index 5b6a9431b017..c0fc6bdad1e3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2357,10 +2357,13 @@  void release_sock(struct sock *sk)
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 
+	/* Warning : release_cb() might need to release sk ownership,
+	 * ie call sock_release_ownership(sk) before us.
+	 */
 	if (sk->sk_prot->release_cb)
 		sk->sk_prot->release_cb(sk);
 
-	sk->sk_lock.owned = 0;
+	sock_release_ownership(sk);
 	if (waitqueue_active(&sk->sk_lock.wq))
 		wake_up(&sk->sk_lock.wq);
 	spin_unlock_bh(&sk->sk_lock.slock);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f0eb4e337ec8..17a11e65e57f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -767,6 +767,17 @@  void tcp_release_cb(struct sock *sk)
 	if (flags & (1UL << TCP_TSQ_DEFERRED))
 		tcp_tsq_handler(sk);
 
+	/* Here begins the tricky part :
+	 * We are called from release_sock() with :
+	 * 1) BH disabled
+	 * 2) sk_lock.slock spinlock held
+	 * 3) socket owned by us (sk->sk_lock.owned == 1)
+	 *
+	 * But following code is meant to be called from BH handlers,
+	 * so we should keep BH disabled, but early release socket ownership
+	 */
+	sock_release_ownership(sk);
+
 	if (flags & (1UL << TCP_WRITE_TIMER_DEFERRED)) {
 		tcp_write_timer_handler(sk);
 		__sock_put(sk);