diff mbox

kernel BUG at kernel/timer.c:748!

Message ID 1348092082.31352.51.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 19, 2012, 10:01 p.m. UTC
On Wed, 2012-09-19 at 17:10 -0400, Dave Jones wrote:
> On Sat, Sep 15, 2012 at 11:16:52AM -0700, Yuchung Cheng wrote:
>  > On Fri, Sep 14, 2012 at 2:29 PM, Dave Jones <davej@redhat.com> wrote:
>  > > On Wed, Sep 05, 2012 at 11:48:29PM +0300, Julian Anastasov wrote:
>  > >
>  > >  > > kernel BUG at kernel/timer.c:748!
>  > >  > > Call Trace:
>  > >  > >  ? lock_sock_nested+0x8d/0xa0
>  > >  > >  sk_reset_timer+0x1c/0x30
>  > >  > >  ? sock_setsockopt+0x8c/0x960
>  > >  > >  inet_csk_reset_keepalive_timer+0x20/0x30
>  > >  > >  tcp_set_keepalive+0x3d/0x50
>  > >  > >  sock_setsockopt+0x923/0x960
>  > >  > >  ? trace_hardirqs_on_caller+0x16/0x1e0
>  > >  > >  ? fget_light+0x24c/0x520
>  > >  > >  sys_setsockopt+0xc6/0xe0
>  > >  > >  system_call_fastpath+0x1a/0x1f
>  > >  >
>  > >  >      Can this help? In case you see ICMPV6_PKT_TOOBIG...
>  > >  >
>  > >  > [PATCH] tcp: fix possible socket refcount problem for ipv6
>  > >
>  > > I just managed to reproduce this bug on rc5 with this patch,
>  > > so it doesn't seem to help.
>  > Could you post some tcpdump traces?
> 
> It's likely that there aren't any packets.  The fuzzer isn't smart
> enough (yet) to do anything too clever to the sockets it creates.
> 
> More likely is that this is some race where thread A is doing a setsockopt
> while thread B is doing a tear-down of the same socket.

I spent some time trying to track this bug, but found nothing so far.

The timer->function are never cleared by TCP stack at tear down, and
should be set before fd is installed and can be caught by other threads.

Most likely its a refcounting issue...

Following debugging patch might trigger a bug sooner ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Jones Sept. 20, 2012, 2:02 a.m. UTC | #1
On Thu, Sep 20, 2012 at 12:01:22AM +0200, Eric Dumazet wrote:

 > I spent some time trying to track this bug, but found nothing so far.
 > 
 > The timer->function are never cleared by TCP stack at tear down, and
 > should be set before fd is installed and can be caught by other threads.
 > 
 > Most likely its a refcounting issue...
 > 
 > Following debugging patch might trigger a bug sooner ?

4 hours so far, nothing.. I'll leave it run overnight, but it doesn't seem to have
made much difference if any.

	Dave
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Jones Sept. 24, 2012, 3:39 p.m. UTC | #2
On Wed, Sep 19, 2012 at 10:02:23PM -0400, Dave Jones wrote:
 > On Thu, Sep 20, 2012 at 12:01:22AM +0200, Eric Dumazet wrote:
 > 
 >  > I spent some time trying to track this bug, but found nothing so far.
 >  > 
 >  > The timer->function are never cleared by TCP stack at tear down, and
 >  > should be set before fd is installed and can be caught by other threads.
 >  > 
 >  > Most likely its a refcounting issue...
 >  > 
 >  > Following debugging patch might trigger a bug sooner ?
 > 
 > 4 hours so far, nothing.. I'll leave it run overnight, but it doesn't seem to have
 > made much difference if any.

One of my over-weekend runs hit this again, but it didn't trigger the WARN that
your patch added :-/

	Dave

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 24, 2012, 4:34 p.m. UTC | #3
On Mon, 2012-09-24 at 11:39 -0400, Dave Jones wrote:
> On Wed, Sep 19, 2012 at 10:02:23PM -0400, Dave Jones wrote:
>  > On Thu, Sep 20, 2012 at 12:01:22AM +0200, Eric Dumazet wrote:
>  > 
>  >  > I spent some time trying to track this bug, but found nothing so far.
>  >  > 
>  >  > The timer->function are never cleared by TCP stack at tear down, and
>  >  > should be set before fd is installed and can be caught by other threads.
>  >  > 
>  >  > Most likely its a refcounting issue...
>  >  > 
>  >  > Following debugging patch might trigger a bug sooner ?
>  > 
>  > 4 hours so far, nothing.. I'll leave it run overnight, but it doesn't seem to have
>  > made much difference if any.
> 
> One of my over-weekend runs hit this again, but it didn't trigger the WARN that
> your patch added :-/
> 
> 	Dave
> 

OK, I believe I found the reason. I Will post a patch.

open a raw socket AF_INET, TCP_PROTO
+ connect() ->sk_state set to TCP_ESTABLISHED
+ setsockopt( SO_KEEPALIVE, &on)  -> crash


Thanks



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 84bdaec..5d3ad5b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -511,18 +511,18 @@  static inline void sock_hold(struct sock *sk)
  */
 static inline void __sock_put(struct sock *sk)
 {
-	atomic_dec(&sk->sk_refcnt);
+	int newcnt = atomic_dec_return(&sk->sk_refcnt);
+
+	WARN_ON(newcnt <= 0);
 }
 
 static inline bool sk_del_node_init(struct sock *sk)
 {
 	bool rc = __sk_del_node_init(sk);
 
-	if (rc) {
-		/* paranoid for a while -acme */
-		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
+	if (rc)
 		__sock_put(sk);
-	}
+
 	return rc;
 }
 #define sk_del_node_init_rcu(sk)	sk_del_node_init(sk)
@@ -1620,7 +1620,10 @@  static inline void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
 /* Ungrab socket and destroy it, if it was the last reference. */
 static inline void sock_put(struct sock *sk)
 {
-	if (atomic_dec_and_test(&sk->sk_refcnt))
+	int newcnt = atomic_dec_return(&sk->sk_refcnt);
+
+	WARN_ON(newcnt < 0);
+	if (newcnt == 0)
 		sk_free(sk);
 }