From patchwork Mon Sep 7 00:52:59 2009 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Miller X-Patchwork-Id: 33063 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@bilbo.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from ozlabs.org (ozlabs.org [203.10.76.45]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mx.ozlabs.org", Issuer "CA Cert Signing Authority" (verified OK)) by bilbo.ozlabs.org (Postfix) with ESMTPS id 6E91CB7B88 for ; Mon, 7 Sep 2009 10:52:51 +1000 (EST) Received: by ozlabs.org (Postfix) id 5E409DDD0B; Mon, 7 Sep 2009 10:52:51 +1000 (EST) Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by ozlabs.org (Postfix) with ESMTP id EE865DDD01 for ; Mon, 7 Sep 2009 10:52:50 +1000 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758459AbZIGAwo (ORCPT ); Sun, 6 Sep 2009 20:52:44 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1758470AbZIGAwo (ORCPT ); Sun, 6 Sep 2009 20:52:44 -0400 Received: from 74-93-104-97-Washington.hfc.comcastbusiness.net ([74.93.104.97]:48544 "EHLO sunset.davemloft.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758421AbZIGAwo convert rfc822-to-8bit (ORCPT ); Sun, 6 Sep 2009 20:52:44 -0400 Received: from localhost (localhost [127.0.0.1]) by sunset.davemloft.net (Postfix) with ESMTP id F0932C8C413; Sun, 6 Sep 2009 17:52:59 -0700 (PDT) Date: Sun, 06 Sep 2009 17:52:59 -0700 (PDT) Message-Id: <20090906.175259.243140550.davem@davemloft.net> To: sbernard@nerim.net Cc: debian-sparc@lists.debian.org, sparclinux@vger.kernel.org, jurij@wooyd.org Subject: Re: Sparc release requalification From: David Miller In-Reply-To: <4AA427E6.5000801@nerim.net> References: <20090822203243.GA14511@droopy.oc.cox.net> <4AA427E6.5000801@nerim.net> X-Mailer: Mew version 6.2.51 on Emacs 22.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Sender: sparclinux-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: sparclinux@vger.kernel.org From: Sébastien Bernard Date: Sun, 06 Sep 2009 23:21:42 +0200 > David said, he'll look this bug later. I'll need to remind him. In Linus's tree is the following fix for this. I'll submit it to -stable when I get a chance. sparc64: Kill spurious NMI watchdog triggers by increasing limit to 30 seconds. This is a compromise and a temporary workaround for bootup NMI watchdog triggers some people see with qla2xxx devices present. This happens when, for example: CPU 0 is in the driver init and looping submitting mailbox commands to load the firmware, then waiting for completion. CPU 1 is receiving the device interrupts. CPU 1 is where the NMI watchdog triggers. CPU 0 is submitting mailbox commands fast enough that by the time CPU 1 returns from the device interrupt handler, a new one is pending. This sequence runs for more than 5 seconds. The problematic case is CPU 1's timer interrupt running when the barrage of device interrupts begin. Then we have: timer interrupt return for softirq checking pending, thus enable interrupts qla2xxx interrupt return qla2xxx interrupt return ... 5+ seconds pass final qla2xxx interrupt for fw load return run timer softirq return At some point in the multi-second qla2xxx interrupt storm we trigger the NMI watchdog on CPU 1 from the NMI interrupt handler. The timer softirq, once we get back to running it, is smart enough to run the timer work enough times to make up for the missed timer interrupts. However, the NMI watchdogs (both x86 and sparc) use the timer interrupt count to notice the cpu is wedged. But in the above scenerio we'll receive only one such timer interrupt even if we last all the way back to running the timer softirq. The default watchdog trigger point is only 5 seconds, which is pretty low (the softwatchdog triggers at 60 seconds). So increase it to 30 seconds for now. Signed-off-by: David S. Miller --- arch/sparc/kernel/nmi.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c index 2c0cc72..b75bf50 100644 --- a/arch/sparc/kernel/nmi.c +++ b/arch/sparc/kernel/nmi.c @@ -103,7 +103,7 @@ notrace __kprobes void perfctr_irq(int irq, struct pt_regs *regs) } if (!touched && __get_cpu_var(last_irq_sum) == sum) { local_inc(&__get_cpu_var(alert_counter)); - if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz) + if (local_read(&__get_cpu_var(alert_counter)) == 30 * nmi_hz) die_nmi("BUG: NMI Watchdog detected LOCKUP", regs, panic_on_timeout); } else {