From patchwork Tue Mar 21 07:34:53 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: AKASHI Takahiro
X-Patchwork-Id: 741442
Date: Tue, 21 Mar 2017 16:34:53 +0900
From: AKASHI Takahiro
To: Mark Rutland, David Woodhouse
Subject: Re: [PATCH v33 00/14] add kdump support
Message-ID: <20170321073452.GA17298@linaro.org>
References: <20170315095656.24992-1-takahiro.akashi@linaro.org>
 <1489750991.17202.40.camel@infradead.org>
 <1489759373.17202.44.camel@infradead.org>
 <20170317153358.GI5940@leverpostej>
 <1489765628.17202.59.camel@infradead.org>
 <20170317162421.GK5940@leverpostej>
In-Reply-To: <20170317162421.GK5940@leverpostej>
User-Agent: Mutt/1.5.24 (2015-08-30)
Cc: marc.zyngier@arm.com, catalin.marinas@arm.com, will.deacon@arm.com,
 geoff@infradead.org, james.morse@arm.com, bauerman@linux.vnet.ibm.com,
 dyoung@redhat.com, kexec@lists.infradead.org,
 linux-arm-kernel@lists.infradead.org
Sender: "linux-arm-kernel"
Errors-To: linux-arm-kernel-bounces+incoming-imx=patchwork.ozlabs.org@lists.infradead.org
List-Id: linux-imx-kernel.lists.patchwork.ozlabs.org

On Fri, Mar 17, 2017 at 04:24:21PM +0000, Mark Rutland wrote:
> On Fri, Mar 17, 2017 at 03:47:08PM +0000, David Woodhouse wrote:
> > On Fri, 2017-03-17 at 15:33 +0000, Mark Rutland wrote:
> > No, in this case the CPUs *were* offlined correctly, or at least "as
> > designed", by smp_send_crash_stop(). And if that hadn't worked, as
> > verified by *its* synchronisation method based on the atomic_t
> > waiting_for_crash_ipi, then *it* would have complained for itself:
> >
> >         if (atomic_read(&waiting_for_crash_ipi) > 0)
> >                 pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
> >                            cpumask_pr_args(cpu_online_mask));
> >
> > It's just that smp_send_crash_stop() (or more specifically
> > ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
> > ARM32 equivalent function machine_crash_nonpanic_core(), which does.
> >
> > It wasn't clear if that was *intentional*, to allow the original
> > contents of the online mask before the crash to be seen in the
> > resulting vmcore... or purely an accident.

Yes, it is intentional. I removed the 'offline' code in my v14 (2016/3/4).
As you assumed, I expect the 'online' status of all CPUs to be kept
unchanged in the core dump.

If you agree, I would like to change the disputed warning code as shown
in the patch at the end of this message.

If offlining fails, this generates messages like:

  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 0,2-7
  Starting crashdump kernel...
  Some CPUs may be stale, kdump will be unreliable.
  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 1141 at /home/akashi/arm/armv8/linaro/linux-aarch64/arch/arm64/kernel/machine_kexec.c:157 machine_kexec+0x44/0x280

> Looking at this, there's a larger mess.
>
> The waiting_for_crash_ipi dance only tells us if CPUs have taken the
> IPI, not whether they've been offlined (i.e.
> actually left the kernel).
> We need something closer to the usual cpu_{disable,die,kill} dance,
> clearing online as appropriate.

First, I don't think there is any sure way to confirm whether CPUs have
successfully left the kernel. Even if we do something like this in
ipi_cpu_crash_stop():

	atomic_dec(&waiting_for_crash_ipi);
	cpu_die(cpu);
	atomic_inc(&waiting_for_crash_ipi);

there is no guarantee that we reach the second update_cpu_boot_status()
if cpu_die() fails.

Second, while a "graceful" CPU shutdown would be fine, the basic idea of
the kdump design, I believe, is to do the minimum necessary and tear down
all the CPUs as quickly as possible, not only to make the reboot more
likely to succeed but also to keep the kernel state (memory contents) as
close as possible to the moment of the panic. (The latter is arguable.)

That said, I would appreciate any suggestions about what should be added
for a safer shutdown here.

Thanks,
-Takahiro AKASHI

> If CPUs haven't left the kernel, we still need to warn about that.
>
> > FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
> > I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
> >
> > [    0.000000] Booting Linux on physical CPU 0x1
> > ...
> > [    0.017125] Detected PIPT I-cache on CPU1
> > [    0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
> > [    0.017147] CPU1: Booted secondary processor [411fd073]
> > [    0.017339] Detected PIPT I-cache on CPU2
> > [    0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
> > [    0.017354] CPU2: Booted secondary processor [411fd073]
> > [    0.017537] Detected PIPT I-cache on CPU3
> > [    0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
> > [    0.017551] CPU3: Booted secondary processor [411fd073]
> > [    0.017576] Brought up 4 CPUs
> > [    0.017587] SMP: Total of 4 processors activated.
> > ...
> > [   31.745809] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0
> > [   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0
> > [   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0
> > [   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
> > [   31.776161] Task dump for CPU 1:
> > [   31.779381] swapper/1       R  running task        0     0      1 0x00000080
> > [   31.786446] Task dump for CPU 2:
> > [   31.789666] swapper/2       R  running task        0     0      1 0x00000080
> > [   31.796725] Task dump for CPU 3:
> > [   31.799945] swapper/3       R  running task        0     0      1 0x00000080
> >
> > Is some of that platform-specific?
>
> That sounds like timer interrupts aren't being taken.
>
> Given that the CPUs have come up, my suspicion would be that the GIC's
> been left in some odd state, that the kdump kernel hasn't managed to
> recover from.
>
> Marc may have an idea.
>
> Thanks,
> Mark.

===8<===
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index cea009f2657d..55f08c5acfad 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -149,6 +149,7 @@ static inline void cpu_panic_kernel(void)
 bool cpus_are_stuck_in_kernel(void);
 
 extern void smp_send_crash_stop(void);
+extern bool smp_crash_stop_failed(void);
 
 #endif /* ifndef __ASSEMBLY__ */
 
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 68b96ea13b4c..29e1cf8cca95 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -146,12 +146,15 @@ void machine_kexec(struct kimage *kimage)
 {
 	phys_addr_t reboot_code_buffer_phys;
 	void *reboot_code_buffer;
+	bool in_kexec_crash = (kimage == kexec_crash_image);
+	bool stuck_cpus = cpus_are_stuck_in_kernel();
 
 	/*
 	 * New cpus may have become stuck_in_kernel after we loaded the image.
 	 */
-	BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
-			!WARN_ON(kimage == kexec_crash_image));
+	BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
+	WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
+		"Some CPUs may be stale, kdump will be unreliable.\n");
 
 	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
 	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index a7e2921143c4..8016914591d2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -833,7 +833,7 @@ static void ipi_cpu_stop(unsigned int cpu)
 }
 
 #ifdef CONFIG_KEXEC_CORE
-static atomic_t waiting_for_crash_ipi;
+static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
 #endif
 
 static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
@@ -990,7 +990,12 @@ void smp_send_crash_stop(void)
 
 	if (atomic_read(&waiting_for_crash_ipi) > 0)
 		pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
-			   cpumask_pr_args(cpu_online_mask));
+			   cpumask_pr_args(&mask));
+}
+
+bool smp_crash_stop_failed(void)
+{
+	return (atomic_read(&waiting_for_crash_ipi) > 0);
 }
 #endif
 
===>8===
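
For anyone who wants to experiment with the accounting outside the kernel, here is a minimal user-space sketch of the logic in the patch above. It is only a model under stated assumptions, not kernel code: every name prefixed with model_ is a hypothetical stand-in, CPUs are bits in a 32-bit mask, and the IPI is "delivered" by a plain function call with no timeout loop. It illustrates two points from the discussion: the counter only records which CPUs acknowledged the IPI (not whether they left the kernel), and the warning should report the mask of targeted CPUs, since the online mask is deliberately left untouched.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
static atomic_int waiting_for_crash_ipi;	/* static storage, starts at 0 */
static uint32_t online_mask;			/* bit n set => CPU n is online */

/* Count the CPUs present in a mask. */
static int model_count_cpus(uint32_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Models ipi_cpu_crash_stop(): acknowledge the IPI, leave online_mask alone. */
static void model_ipi_cpu_crash_stop(void)
{
	atomic_fetch_sub(&waiting_for_crash_ipi, 1);
}

/*
 * Models smp_send_crash_stop(). 'self' is the crashing CPU and 'responding'
 * is the set of CPUs that actually take the IPI in this simulation.
 * Returns the mask of targeted CPUs, i.e. what the warning would print.
 */
static uint32_t model_smp_send_crash_stop(unsigned int self, uint32_t responding)
{
	uint32_t mask;
	unsigned int cpu;

	assert(self < 32 && (online_mask & (UINT32_C(1) << self)));

	/* Target every online CPU except the one that crashed. */
	mask = online_mask & ~(UINT32_C(1) << self);
	atomic_store(&waiting_for_crash_ipi, model_count_cpus(mask));

	for (cpu = 0; cpu < 32; cpu++)
		if (mask & responding & (UINT32_C(1) << cpu))
			model_ipi_cpu_crash_stop();

	return mask;
}

/* Models smp_crash_stop_failed(): true if some CPU never acknowledged. */
static bool model_smp_crash_stop_failed(void)
{
	return atomic_load(&waiting_for_crash_ipi) > 0;
}
```

With eight CPUs online (online_mask = 0xff) and a crash on CPU 1, the returned mask is 0xfd, which corresponds to the "0,2-7" in the sample message; if no CPU responds, model_smp_crash_stop_failed() returns true while online_mask still reads 0xff.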