[Bugme-new,Bug,14749] New: Kernel locks up after a few minutes of heavy surfing

Message ID 4B1DC202.20607@gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Dec. 8, 2009, 3:03 a.m. UTC
Chris Rankin wrote:
> 
> I saw something interesting in 2.6.31.7 about a crash due to fragmentation:
> 
> ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPS
> 
> I'll try applying that patch too, to see if it makes any difference. Along with that other UDP-related thing I noticed:
> 
> udp: Fix udp_poll() and ioctl()
> 

These are all two-year-old UDP bugs (I spotted another one a few hours ago), and very rare.
I run heavy-duty servers with a lot of UDP traffic and have never caught a _single_ error,
so I am quite surprised it could happen on your machine on demand.

1) Do you have another NIC to try? It might be a buggy driver.
  (Neil Horman found an error in the Intel drivers a few hours ago that can corrupt skbs.)

2) Could you add the following debugging aid?

3) Any chance you can do a git bisect?

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Chris Rankin Dec. 8, 2009, 9:03 a.m. UTC | #1
--- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Its all two years old UDP bugs (I spot another one some
> hours ago), and very rare.

> I am quite suprised it could happen on your machine on
> demand.

Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!

For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).

> 1) Do you have another NIC adapter to try ? It might be a
> buggy driver. (Neil Horman found an error on Intel drivers some
> hours ago, that can corrupt skbs)

I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.

> 2) Could you add following debugging aid ?

Not a problem; I do have a serial console attached.

> 3) Any chance you can do a git bisect ?

How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.

Cheers,
Chris



      
Chris Rankin Dec. 8, 2009, 9:17 a.m. UTC | #2
One other thing: this is an SMP machine with 2 physical hyper-threaded CPUs in. And all its IP traffic is routed through a UP 200MHz Pentium MMX machine that is also running 2.6.31.6 via an e100 card.

The Pentium MMX machine has been rock-solid so far.

Cheers,
Chris


      
Eric Dumazet Dec. 8, 2009, 11:21 a.m. UTC | #3
Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Its all two years old UDP bugs (I spot another one some
>> hours ago), and very rare.
> 
>> I am quite suprised it could happen on your machine on
>> demand.
> 
> Who said anything about "on demand"? It took about 30 minutes to freeze last time; 
> I was starting to think that a complete recompile had fixed it!
> 

30 minutes is pretty fast; this is why I said 'on demand'...

> For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).
> 
>> 1) Do you have another NIC adapter to try ? It might be a
>> buggy driver. (Neil Horman found an error on Intel drivers some
>> hours ago, that can corrupt skbs)
> 
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.
> 
>> 2) Could you add following debugging aid ?
> 
> Not a problem; I do have a serial console attached.
> 
>> 3) Any chance you can do a git bisect ?
> 
> How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.
> 

Please wrap your lines at fewer than 70 characters.

If the Fedora kernel works, either it's just pure luck, or they found
a bug and didn't send the fix to mainline (unlikely).


Jarek Poplawski Dec. 8, 2009, 11:36 a.m. UTC | #4
On 08-12-2009 12:21, Eric Dumazet wrote:
> If Fedora kernel works, either its just pure luck, or they found
> a bug and they didnt sent the fix to mainline (unlikely)

Is it the same .config?

Jarek P.
Neil Horman Dec. 8, 2009, noon UTC | #5
On Tue, Dec 08, 2009 at 01:03:15AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Its all two years old UDP bugs (I spot another one some
> > hours ago), and very rare.
> 
> > I am quite suprised it could happen on your machine on
> > demand.
> 
> Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!
> 
> For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).
> 
30 minutes isn't too long to wait for an error to appear, I think.

> > 1) Do you have another NIC adapter to try ? It might be a
> > buggy driver. (Neil Horman found an error on Intel drivers some
> > hours ago, that can corrupt skbs)
> 
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.
> 
The patches I posted for the Intel drivers will apply cleanly pretty far back
in git, as that code hasn't changed much.  You might also consider turning on
slab debugging.  Many of the errors I encountered leading up to a fatal oops
weren't themselves fatal, and stayed hidden until we used slab debugging to
catch a bunch of redzone violations.
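
To make the redzone idea concrete, here is a minimal userspace sketch in C.
It is not the kernel's slab debugging code; the names rz_alloc, rz_check and
rz_free, the guard size and the guard byte are all hypothetical choices made
for this illustration. Each allocation is padded with a known byte pattern
on both sides, and a later check reports whether an out-of-bounds write has
damaged the padding, even though the write itself caused no crash.

```c
#include <stdlib.h>
#include <string.h>

#define REDZONE_SIZE 8     /* guard bytes on each side (arbitrary for this sketch) */
#define REDZONE_BYTE 0xBB  /* guard pattern (arbitrary for this sketch) */

/* Allocate `size` usable bytes surrounded by two guard regions. */
static void *rz_alloc(size_t size)
{
	unsigned char *raw = malloc(size + 2 * REDZONE_SIZE);

	if (!raw)
		return NULL;
	memset(raw, REDZONE_BYTE, REDZONE_SIZE);                       /* leading guard */
	memset(raw + REDZONE_SIZE + size, REDZONE_BYTE, REDZONE_SIZE); /* trailing guard */
	return raw + REDZONE_SIZE;
}

/* Return 1 if both guard regions are intact, 0 if either was overwritten. */
static int rz_check(const void *ptr, size_t size)
{
	const unsigned char *raw = (const unsigned char *)ptr - REDZONE_SIZE;
	size_t i;

	for (i = 0; i < REDZONE_SIZE; i++) {
		if (raw[i] != REDZONE_BYTE)
			return 0;
		if (raw[REDZONE_SIZE + size + i] != REDZONE_BYTE)
			return 0;
	}
	return 1;
}

/* Release a buffer obtained from rz_alloc(). */
static void rz_free(void *ptr)
{
	if (ptr)
		free((unsigned char *)ptr - REDZONE_SIZE);
}
```

A caller would run rz_check() before rz_free(); a 0 result means some earlier
write went past the buffer, which is the kind of silent corruption that only
shows up once checking is switched on.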

> > 2) Could you add following debugging aid ?
> 
> Not a problem; I do have a serial console attached.
> 
> > 3) Any chance you can do a git bisect ?
> 
> How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.
Well, it sounds like your longest time to failure is about 30 minutes.  Why not
write a script that runs your test for an hour at a stretch, plug that into
git bisect, and walk away?  You should have results in a day or so.

Regards
Neil
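
The trade-off behind the one-hour figure can be made explicit. As a rough
sketch, assume (this is an assumption, not something measured in this
thread) that freezes on a bad kernel arrive at a constant rate with mean
time to failure \tau, so the time to the first freeze is exponentially
distributed. A bad kernel tested for a time T then survives the run, and is
wrongly marked good, with probability:

```latex
P(\text{false good}) = e^{-T/\tau},
\qquad\text{so}\qquad
T \ge \tau \ln\frac{1}{\varepsilon}
\;\Longrightarrow\;
P(\text{false good}) \le \varepsilon .
```

Taking \tau = 30 minutes purely for illustration, a one-hour run gives
e^{-2}, roughly 0.135, per bisect step; pushing the error below 1% would
need T of at least 30 ln(100), roughly 138 minutes. A single false good
sends the bisect down the wrong half, so longer runs buy reliability at the
cost of total time.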

Chris Rankin Dec. 8, 2009, 1:35 p.m. UTC | #6
--- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> Is it the same .config?

Similar, but no. I'll attach the .config to the bug tonight.

Chris


      
Chris Rankin Dec. 8, 2009, 1:39 p.m. UTC | #7
--- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> 30 minutes isn't too long to wait for an error to appear, I think.

Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.

Chris


      
Neil Horman Dec. 8, 2009, 1:41 p.m. UTC | #8
On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> > 30 minutes isn't too long to wait for an error to appear, I think.
> 
> Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.
> 
Sure you can: generate a list of the sites you visited and access them all with
a curl or wget script.  I would imagine that's a reasonable test to trigger the
problem.

Neil

Jarek Poplawski Dec. 8, 2009, 1:47 p.m. UTC | #9
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
> 
> Similar, but no. I'll attach the .config to the bug tonight.

...And a diff against Fedora's .config; plus, if possible, test whether
those differences matter.

Jarek P.
Eric Dumazet Dec. 8, 2009, 2:39 p.m. UTC | #10
Neil Horman wrote:
> On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
>> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> 30 minutes isn't too long to wait for an error to appear, I think.
>> Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.
>>
> Sure you can, generate a list of sites that you visited and access them all with
> a curl or wget script.  I would imagine thats a reasonable test to trigger the
> reproducer.

Yes, but I suspect a multithreading bug, or VM, or X11, or something.

Andi posted a futex patch that is worth trying if the machine is swapping a bit.

Chris, please provide as much information as you can:

# cat /proc/cpuinfo
# cat /proc/meminfo
# ps aux
# scripts/ver_linux
Jarek Poplawski Dec. 15, 2009, 7:54 a.m. UTC | #11
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
> 
> Similar, but no. I'll attach the .config to the bug tonight.

I can see quite a lot of differences, and some could matter here, e.g.
these:

-# CONFIG_PREEMPT_RCU is not set
+# CONFIG_TREE_RCU is not set
+CONFIG_PREEMPT_RCU=y
...
-CONFIG_PREEMPT_VOLUNTARY=y
-# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_VOLUNTARY is not set
+CONFIG_PREEMPT=y

It's hard to guess, but at least this second patch mentioned by you
(ipv4: additional update of dev_net(dev) to struct *net in
ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be
preferred for testing (if possible).

Jarek P.

Patch

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7d12c6a..5a7a456 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -147,10 +147,15 @@  void inet_sock_destruct(struct sock *sk)
 		return;
 	}
 
-	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
-	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
-	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN((atomic_read(&sk->sk_rmem_alloc) | atomic_read(&sk->sk_wmem_alloc) |
+	     sk->sk_wmem_queued | sk->sk_forward_alloc) != 0,
+	     "%s socket sk_rmem_alloc=%d sk_wmem_alloc=%d "
+	     "sk_wmem_queued=%d sk_forward_alloc=%d\n",
+	     sk->sk_prot->name,
+	     atomic_read(&sk->sk_rmem_alloc),
+	     atomic_read(&sk->sk_wmem_alloc),
+	     sk->sk_wmem_queued,
+	     sk->sk_forward_alloc);
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
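
The patch above folds four separate WARN_ON() checks into one WARN() that
also prints the protocol name and the counter values. The following
userspace sketch in C models that logic so it can be tried outside the
kernel. The struct sock_counters type and the report_leftover() function are
hypothetical stand-ins invented for this illustration; only the message
format is taken from the patch.

```c
#include <stdio.h>

/* Hypothetical stand-in for the four struct sock counters checked in the patch. */
struct sock_counters {
	int rmem_alloc;
	int wmem_alloc;
	int wmem_queued;
	int forward_alloc;
};

/*
 * Return 0 and leave `buf` empty if every counter is zero.
 * Otherwise write a one-line report into `buf` and return 1.
 * OR-ing the values is nonzero exactly when at least one of them is nonzero,
 * so a single test replaces four separate ones.
 */
static int report_leftover(const char *proto, const struct sock_counters *c,
			   char *buf, size_t len)
{
	if ((c->rmem_alloc | c->wmem_alloc |
	     c->wmem_queued | c->forward_alloc) == 0) {
		if (len)
			buf[0] = '\0';
		return 0;
	}

	snprintf(buf, len,
		 "%s socket sk_rmem_alloc=%d sk_wmem_alloc=%d "
		 "sk_wmem_queued=%d sk_forward_alloc=%d",
		 proto, c->rmem_alloc, c->wmem_alloc,
		 c->wmem_queued, c->forward_alloc);
	return 1;
}
```

The gain over the original four checks is diagnostic rather than functional:
one report names the protocol and shows all four values together, so a
single serial-console capture says which counter was left nonzero and by how
much.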