Message ID | 4B1DC202.20607@gmail.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
--- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> It's all two years old UDP bugs (I spot another one some
> hours ago), and very rare.

> I am quite surprised it could happen on your machine on
> demand.

Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!

For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).

> 1) Do you have another NIC adapter to try ? It might be a
> buggy driver. (Neil Horman found an error on Intel drivers some
> hours ago, that can corrupt skbs)

I can test any patches for an e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.

> 2) Could you add following debugging aid ?

Not a problem; I do have a serial console attached.

> 3) Any chance you can do a git bisect ?

How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.

Cheers,
Chris
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
One other thing: this is an SMP machine with 2 physical hyper-threaded CPUs in. And all its IP traffic is routed through a UP 200MHz Pentium MMX machine that is also running 2.6.31.6 via an e100 card. The Pentium MMX machine has been rock-solid so far.

Cheers,
Chris
Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> It's all two years old UDP bugs (I spot another one some
>> hours ago), and very rare.
>
>> I am quite surprised it could happen on your machine on
>> demand.
>
> Who said anything about "on demand"? It took about 30 minutes to freeze last time;
> I was starting to think that a complete recompile had fixed it!
>

30 minutes is pretty fast, this is why I said 'on demand'...

> For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).
>
>> 1) Do you have another NIC adapter to try ? It might be a
>> buggy driver. (Neil Horman found an error on Intel drivers some
>> hours ago, that can corrupt skbs)
>
> I can test any patches for an e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.
>
>> 2) Could you add following debugging aid ?
>
> Not a problem; I do have a serial console attached.
>
>> 3) Any chance you can do a git bisect ?
>
> How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.
>

Please fold your lines to fewer than 70 characters.

If the Fedora kernel works, either it's just pure luck, or they found
a bug and didn't send the fix to mainline (unlikely).
On 08-12-2009 12:21, Eric Dumazet wrote:
> If the Fedora kernel works, either it's just pure luck, or they found
> a bug and didn't send the fix to mainline (unlikely).

Is it the same .config?

Jarek P.
On Tue, Dec 08, 2009 at 01:03:15AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > It's all two years old UDP bugs (I spot another one some
> > hours ago), and very rare.
>
> > I am quite surprised it could happen on your machine on
> > demand.
>
> Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!
>
> For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).
>

30 minutes isn't too long to wait for an error to appear, I think.

> > 1) Do you have another NIC adapter to try ? It might be a
> > buggy driver. (Neil Horman found an error on Intel drivers some
> > hours ago, that can corrupt skbs)
>
> I can test any patches for an e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.
>

Those patches I posted for the Intel drivers will apply cleanly pretty far
back in git, as that code hasn't changed much. You might also consider
turning on slab debugging. Many of the errors I encountered leading up to
a fatal oops weren't themselves fatal, and were hidden until such time as
we used slab debugging to catch a bunch of redzone violations.

> > 2) Could you add following debugging aid ?
>
> Not a problem; I do have a serial console attached.
>
> > 3) Any chance you can do a git bisect ?
>
> How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.

Well, it sounds like your longest time to failure is about 30 minutes.
Why not write a script that runs your test for an hour at a stretch,
plug that into git bisect, and walk away? You should have results in a
day or so.

Regards
Neil
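Neil's "run the test for an hour and plug that into git bisect" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the failure is taken to be observable as the dmesg warning printed by the debugging patch in this thread (a hard freeze would never let the script report anything), `dmesg` is assumed to be readable by the user, and the names `soak`, `classify`, `read_dmesg` and `MARKER` are hypothetical. Building and booting each candidate kernel is left out entirely.

```python
import subprocess
import time

# Exit codes that `git bisect run` interprets: 0 = good, 125 = skip,
# any other code from 1 to 127 = bad.
GOOD, BAD, SKIP = 0, 1, 125

# Substring of the warning text printed by the debugging patch in this thread.
MARKER = "socket sk_rmem_alloc="


def read_dmesg() -> str:
    """Return the current kernel log (assumes `dmesg` is readable by this user)."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def classify(kernel_log: str, marker: str = MARKER) -> int:
    """BAD if the warning marker shows up in the log, otherwise GOOD."""
    return BAD if marker in kernel_log else GOOD


def soak(workload, budget_s=3600.0, read_log=read_dmesg, clock=time.monotonic):
    """Repeat `workload()` until the warning appears or the time budget runs out."""
    deadline = clock() + budget_s
    while clock() < deadline:
        try:
            workload()
        except Exception:
            # The test itself broke, so this kernel proves nothing either way.
            return SKIP
        if classify(read_log()) == BAD:
            return BAD
    # Survived the whole budget: report whatever the log says at the end.
    return classify(read_log())
```

A wrapper script would pass its own workload function to `soak()` and hand the result to `sys.exit()`, so that `git bisect run` can read the verdict from the exit status. Surviving the hour only suggests a good kernel; with an intermittent bug a longer budget lowers the chance of a false "good".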
--- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> Is it the same .config?
Similar, but no. I'll attach the .config to the bug tonight.
Chris
--- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> 30 minutes isn't too long to wait for an error to appear, I think.
Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.
Chris
On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> > 30 minutes isn't too long to wait for an error to appear, I think.
>
> Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.
>

Sure you can: generate a list of the sites that you visited and access
them all with a curl or wget script. I would imagine that's a reasonable
test to trigger the reproducer.

Neil
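A replay of visited sites along the lines Neil describes might look like the sketch below. It swaps curl/wget for Python's standard `urllib` and adds a small thread pool so that several requests are in flight at once, which is an assumption about what makes browsing "busy" rather than something established in this thread. The helper names (`load_urls`, `fetch_size`, `replay`) and the one-URL-per-line file format are hypothetical.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def load_urls(text: str) -> list:
    """Parse one URL per line, ignoring blank lines and '#' comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def fetch_size(url: str, timeout: float = 10.0) -> int:
    """Download one URL and return the byte count, or -1 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return len(resp.read())
    except (OSError, ValueError, http.client.HTTPException):
        return -1


def replay(urls, fetch=fetch_size, workers=8) -> dict:
    """Fetch every URL through a small thread pool; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

Calling `replay(load_urls(open("sites.txt").read()))` in a loop would give a repeatable load, where `sites.txt` is a placeholder for whatever list is exported from the browser history. It will not reproduce everything a browser does (DNS prefetching, plugins, X11 activity), so a clean run does not rule the network path out.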
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
>
> Similar, but no. I'll attach the .config to the bug tonight.

...And a diff against Fedora's .config, plus, if possible, test whether
any of the differences matter.

Jarek P.
Neil Horman wrote:
> On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
>> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> 30 minutes isn't too long to wait for an error to appear, I think.
>> Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.
>>
> Sure you can: generate a list of the sites that you visited and access
> them all with a curl or wget script. I would imagine that's a reasonable
> test to trigger the reproducer.

Yes, but I suspect a multi-threading bug, or vm, or X11, or something.

Andi posted a futex patch that is worth trying if the machine is
swapping a bit.

Chris, please provide as much information as you can:

# cat /proc/cpuinfo
# cat /proc/meminfo
# ps aux
# scripts/ver_linux
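The four commands Eric lists could be gathered into a single report with something like the sketch below. It assumes it is run from the top of a kernel source tree, since `scripts/ver_linux` is a relative path; the `collect` helper and its report layout are hypothetical.

```python
import subprocess

# The commands from Eric's message; scripts/ver_linux is relative to a
# kernel source tree, so `cwd` should point at one.
COMMANDS = [
    ["cat", "/proc/cpuinfo"],
    ["cat", "/proc/meminfo"],
    ["ps", "aux"],
    ["scripts/ver_linux"],
]


def collect(commands=COMMANDS, cwd=None) -> str:
    """Run each command and return one text report with a '# command' header per section."""
    sections = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                  text=True, timeout=60)
            body = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Keep going: a missing tool should not lose the other sections.
            body = f"(could not run: {exc})\n"
        sections.append(f"# {' '.join(cmd)}\n{body}")
    return "\n".join(sections)
```

Writing the result of `collect(cwd="/path/to/linux")` to a file gives one attachment for the bug report; the path is a placeholder.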
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
>
> Similar, but no. I'll attach the .config to the bug tonight.

I can see quite a lot of differences, and some could matter here,
e.g. like these:

-# CONFIG_PREEMPT_RCU is not set
+# CONFIG_TREE_RCU is not set
+CONFIG_PREEMPT_RCU=y
...
-CONFIG_PREEMPT_VOLUNTARY=y
-# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_VOLUNTARY is not set
+CONFIG_PREEMPT=y

It's hard to guess, but at least this second patch mentioned by you
(ipv4: additional update of dev_net(dev) to struct *net in
ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be
preferred for testing (if possible).

Jarek P.
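The kind of .config comparison Jarek describes can be sketched as follows. A plain `diff` of two .config files is sensitive to option ordering, so this version compares options by name instead. It is a minimal illustration that assumes the usual `CONFIG_X=value` and `# CONFIG_X is not set` line forms; the helper names `parse_config` and `diff_config` are hypothetical.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict:
    """Map each option name to its value; explicitly unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_config(a: str, b: str) -> dict:
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    None means the option does not appear in that file at all.
    """
    pa, pb = parse_config(a), parse_config(b)
    return {
        name: (pa.get(name), pb.get(name))
        for name in sorted(pa.keys() | pb.keys())
        if pa.get(name) != pb.get(name)
    }
```

Fed the two sides of the excerpt above, `diff_config` would report `CONFIG_PREEMPT` as `('n', 'y')` and `CONFIG_PREEMPT_VOLUNTARY` as `('y', 'n')`, which makes it easier to pick out candidates such as the preemption model for a targeted rebuild.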
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7d12c6a..5a7a456 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -147,10 +147,15 @@ void inet_sock_destruct(struct sock *sk)
 		return;
 	}
 
-	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
-	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
-	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN((atomic_read(&sk->sk_rmem_alloc) | atomic_read(&sk->sk_wmem_alloc) |
+	      sk->sk_wmem_queued | sk->sk_forward_alloc) != 0,
+	     "%s socket sk_rmem_alloc=%d sk_wmem_alloc=%d "
+	     "sk_wmem_queued=%d sk_forward_alloc=%d\n",
+	     sk->sk_prot->name,
+	     atomic_read(&sk->sk_rmem_alloc),
+	     atomic_read(&sk->sk_wmem_alloc),
+	     sk->sk_wmem_queued,
+	     sk->sk_forward_alloc);
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);