From patchwork Tue Mar 31 19:12:52 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ilpo Järvinen
X-Patchwork-Id: 25439
X-Patchwork-Delegate: davem@davemloft.net
Date: Tue, 31 Mar 2009 22:12:52 +0300 (EEST)
From: Ilpo Järvinen
X-X-Sender: ijjarvin@wrl-59.cs.helsinki.fi
To: Markus Trippelsdorf
cc: Netdev, LKML
Subject: Re: WARNING: at net/ipv4/tcp_input.c:2927 tcp_ack+0xd55/0x1991()
In-Reply-To: <20090331184959.GA2725@gentoox2.trippelsdorf.de>
References: <20090327211202.GA10014@gentoox2.trippelsdorf.de>
 <20090328045056.GA2394@gentoox2.trippelsdorf.de>
 <20090328095514.GA2599@gentoox2.trippelsdorf.de>
 <20090330164035.GA2652@gentoox2.trippelsdorf.de>
 <20090331071018.GA2641@gentoox2.trippelsdorf.de>
 <20090331184959.GA2725@gentoox2.trippelsdorf.de>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

On Tue, 31 Mar 2009, Markus Trippelsdorf wrote:

> On Tue, Mar 31, 2009 at 12:16:51PM +0300, Ilpo Järvinen wrote:
> > On Tue, 31 Mar 2009, Markus Trippelsdorf wrote:
> >
> > > On Mon, Mar 30, 2009 at 09:52:55PM +0300, Ilpo Järvinen wrote:
> > > > On Mon, 30 Mar 2009, Markus Trippelsdorf wrote:
> > > >
> > > > > On Mon, Mar 30, 2009 at 07:01:22PM +0300, Ilpo Järvinen wrote:
> > > > > > On Sat, 28 Mar 2009, Markus Trippelsdorf wrote:
> > > > > > > On Sat, Mar 28, 2009 at 10:29:58AM +0200, Ilpo Järvinen wrote:
> > > > > >
> > > > > > ...And, let me guess, you're in X and therefore unable to catch a final
> > > > > > oops if any would be printed? It would be nice to get around that as well,
> > > > > > either use serial/netconsole or hang in text mode while waiting for the
> > > > > > crash (shouldn't be too hard if you are able to set up the workload first
> > > > > > and then switch away from X and if reproducing takes about an hour)...
> > > > >
> > > > > OK, I will try this later.
> > > >
> > > > Let's hope that gives some clue where it ends up going boom (if it is
> > > > caused by TCP we certainly should see something more sensible in console
> > > > than just a hang)... ...I once again read through tcp commits but just
> > > > cannot find anything that could cause fackets_out miscount, not to speak
> > > > of crash prone changes so we'll just have to wait and see...
> > >
> > > The machine hung again this night and I took two pictures:
> > > http://www.mypicx.com/uploadimg/1055813374_03302009_2.jpg
> > > http://www.mypicx.com/uploadimg/1543678904_03302009_1.jpg
> > >
> > > But this time there was no tcp related warning in the logs.
> >
> > Right. If that oops would be hit often enough one can easily mix the
> > warning with that hang though there is no relation (the fact that final
> > oops always goes unnoticed in X amplifies the effect).
> >
> > > I then pulled the latest git changes, rebuilt, rebooted and set up the
> > > workload again. The machine was still up and running in the morning
> > > (~4 hours uptime). So it may well be that the issue was fixed with
> > > the latest changes.
> >
> > Let's hope so, in any case if you still see hangs please get the final oops.
> >
> > > If it ever occurs again I will notify you.
>
> It happened again. In this case it took ~10 minutes from the warning to
> the final crash.

Quite many RTTs in between already then...

> I'm pretty sure there must be some kind of relation
> between the two. How else could one explain that the machine crashes just
> minutes after _each_ instance of that WARNING?

Below is a quick counter printout patch. I'd have a more expensive patch
which locates the place where the miscount happens for the first time, but
it needs to be brought up to date before I can let you try with it :-)
(that WARN_ON triggers quite late but is cheap to calculate, whereas my
debug patch adds a considerable amount of calculations all around TCP code
to catch even minor inconsistencies early)...

> (Unfortunately I was in X again, because I thought this problem was
> solved)

Thus again somewhat inconclusive where it crashes... :-( Considering this
is a very early-stage kernel I'm not surprised if there are still some
unrelated things remaining.

How long did you run with 2.6.29 btw, and what did you run before
that...? ...No need to remember exact dates but some rough idea would be
nice to know as this looks, if related to the warning, to be older than
the last merged stuff...

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bc8e27..179b2cb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2924,8 +2924,13 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
-	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
+	if (WARN_ON(!tp->sacked_out && tp->fackets_out)) {
+		printk(KERN_ERR "TCP s%u l%u r%u f%u p%u sack%d\n",
+			tp->sacked_out, tp->lost_out, tp->retrans_out,
+			tp->fackets_out, tp->packets_out,
+			tp->rx_opt.sack_ok);
 		tp->fackets_out = 0;
+	}
 
 	/* Now state machine starts.
 	 * A. ECE, hence prohibit cwnd undoing, the reduction is required. */
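
For anyone who wants to poke at the two consistency checks in the hunk
above outside the kernel, here is a rough user-space sketch. It is only
an illustration: struct tcp_counters and check_counters() are
hypothetical names made up for this example (the kernel keeps these
fields in struct tcp_sock), and fprintf() stands in for
WARN_ON()/printk().

```c
#include <stdio.h>

/* Hypothetical stand-in for the struct tcp_sock fields the patch prints. */
struct tcp_counters {
	unsigned int packets_out;
	unsigned int sacked_out;
	unsigned int lost_out;
	unsigned int retrans_out;
	unsigned int fackets_out;
	int sack_ok;
};

/*
 * Applies the same two checks as the hunk and resets the offending
 * counter. Returns how many inconsistencies were found (assumption:
 * the return value is purely for this sketch, the kernel code has none).
 */
static int check_counters(struct tcp_counters *tp)
{
	int fixed = 0;

	/* SACKed segments cannot exist when nothing is outstanding. */
	if (!tp->packets_out && tp->sacked_out) {
		tp->sacked_out = 0;
		fixed++;
	}

	/* A FACK count without any SACKed segments is a miscount. */
	if (!tp->sacked_out && tp->fackets_out) {
		fprintf(stderr, "TCP s%u l%u r%u f%u p%u sack%d\n",
			tp->sacked_out, tp->lost_out, tp->retrans_out,
			tp->fackets_out, tp->packets_out, tp->sack_ok);
		tp->fackets_out = 0;
		fixed++;
	}

	return fixed;
}
```

Calling check_counters() on a struct with packets_out = 5, sacked_out = 0
and fackets_out = 3 hits only the second check: it prints the counter
line, clears fackets_out and returns 1.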