From patchwork Thu Jul 6 17:56:11 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nicholas Piggin
X-Patchwork-Id: 785240
Date: Fri, 7 Jul 2017 03:56:11 +1000
From: Nicholas Piggin
To: linuxppc-dev@lists.ozlabs.org
Cc: skiboot@lists.ozlabs.org, Mahesh Jagannath Salgaonkar
Subject: Re: [PATCH 1/4] powerpc/powernv: handle the platform error reboot in ppc_md.restart
Message-ID: <20170707035611.6a9e60b6@roar.ozlabs.ibm.com>
In-Reply-To: <20170705040422.20933-2-npiggin@gmail.com>
References: <20170705040422.20933-1-npiggin@gmail.com> <20170705040422.20933-2-npiggin@gmail.com>
Organization: IBM
List-Id: Linux on PowerPC Developers Mail List

On Wed, 5 Jul 2017 14:04:19 +1000
Nicholas Piggin wrote:

> Unrecovered MCE and HMI errors are sent through a special restart
> OPAL call to log the platform error. The downside is that they don't
> go through normal crash paths, so they don't give much information
> to the Linux console.
>
> Change this by allowing them to set an error which then causes the
> normal restart handler to use the platform error call. Have MCE and HMI
> handlers set this and then use the normal panic path for unrecoverable
> cases.
>
> Signed-off-by: Nicholas Piggin

Mahesh brought up a couple of good points about this offline. First,
some HMI errors will stop the TB; second, if crash dumps are registered
then we will not get to the platform reboot code from panic.

So it was a nice idea, but it seems to be just a bit too hard to do
exactly what we want in the panic path.

So the other option is to put some of the printk and console flushing
into the opal platform error handler. It's not really ideal to duplicate
this code here... but it's better than not printing anything.

Patch 2 won't be able to just call die() for kernel context now; it
will have to check in_interrupt(), panic_on_oops, etc. to make sure
die() doesn't panic. But that should be okay.

This is what I have. If there are no great objections I'll repost a
v2 series with this.

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 588fb1c23af9..182dab435aad 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -50,7 +50,7 @@ int64_t opal_tpo_write(uint64_t token, uint32_t year_mon_day,
 				uint32_t hour_min);
 int64_t opal_cec_power_down(uint64_t request);
 int64_t opal_cec_reboot(void);
-int64_t opal_cec_reboot2(uint32_t reboot_type, char *diag);
+int64_t opal_cec_reboot2(uint32_t reboot_type, const char *diag);
 int64_t opal_read_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
 int64_t opal_write_nvram(uint64_t buffer, uint64_t size, uint64_t offset);
 int64_t opal_handle_interrupt(uint64_t isn, __be64 *outstanding_event_mask);
diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
index 88f3c61eec95..d78fed728cdf 100644
--- a/arch/powerpc/platforms/powernv/opal-hmi.c
+++ b/arch/powerpc/platforms/powernv/opal-hmi.c
@@ -30,6 +30,8 @@
 #include
 #include
 
+#include "powernv.h"
+
 static int opal_hmi_handler_nb_init;
 struct OpalHmiEvtNode {
 	struct list_head list;
@@ -267,8 +269,6 @@ static void hmi_event_handler(struct work_struct *work)
 	spin_unlock_irqrestore(&opal_hmi_evt_lock, flags);
 
 	if (unrecoverable) {
-		int ret;
-
 		/* Pull all HMI events from OPAL before we panic. */
 		while (opal_get_msg(__pa(&msg), sizeof(msg)) == OPAL_SUCCESS) {
 			u32 type;
@@ -284,23 +284,7 @@ static void hmi_event_handler(struct work_struct *work)
 			print_hmi_event_info(hmi_evt);
 		}
 
-		/*
-		 * Unrecoverable HMI exception. We need to inform BMC/OCC
-		 * about this error so that it can collect relevant data
-		 * for error analysis before rebooting.
-		 */
-		ret = opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR,
-			"Unrecoverable HMI exception");
-		if (ret == OPAL_UNSUPPORTED) {
-			pr_emerg("Reboot type %d not supported\n",
-						OPAL_REBOOT_PLATFORM_ERROR);
-		}
-
-		/*
-		 * Fall through and panic if opal_cec_reboot2() returns
-		 * OPAL_UNSUPPORTED.
-		 */
-		panic("Unrecoverable HMI exception");
+		pnv_platform_error_reboot(NULL, "Unrecoverable HMI exception");
 	}
 }
 
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 59684b4af4d1..ccbcfa22bacf 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -25,6 +25,10 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
 
 #include
 #include
@@ -421,10 +425,57 @@ static int opal_recover_mce(struct pt_regs *regs,
 	return recovered;
 }
 
+void pnv_platform_error_reboot(struct pt_regs *regs, const char *msg)
+{
+	/*
+	 * This is mostly taken from kernel/panic.c, but tries to do
+	 * relatively minimal work. Don't use delay functions (TB may
+	 * be broken), don't crash dump (need to set a firmware log),
+	 * don't run notifiers. We do want to get some information to
+	 * Linux console.
+	 */
+	smp_send_stop();
+
+	console_verbose();
+	bust_spinlocks(1);
+	pr_emerg("Hardware platform error: %s\n", msg);
+	if (regs)
+		show_regs(regs);
+	printk_safe_flush_on_panic();
+	kmsg_dump(KMSG_DUMP_PANIC);
+	bust_spinlocks(0);
+	debug_locks_off();
+	console_flush_on_panic();
+
+	/*
+	 * Don't bother to shut things down because this will
+	 * xstop the system.
+	 */
+	if (opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR, msg)
+						== OPAL_UNSUPPORTED) {
+		pr_emerg("Reboot type %d not supported for %s\n",
+				OPAL_REBOOT_PLATFORM_ERROR, msg);
+	}
+
+	/*
+	 * We reached here. There can be three possibilities:
+	 * 1. We are running on a firmware level that do not support
+	 *    opal_cec_reboot2()
+	 * 2. We are running on a firmware level that do not support
+	 *    OPAL_REBOOT_PLATFORM_ERROR reboot type.
+	 * 3. We are running on FSP based system that does not need
+	 *    opal to trigger checkstop explicitly for error analysis.
+	 *    The FSP PRD component would have already got notified
+	 *    about this error through other channels.
+	 */
+
+	for (;;)
+		;
+}
+
 int opal_machine_check(struct pt_regs *regs)
 {
 	struct machine_check_event evt;
-	int ret;
 
 	if (!get_mce_event(&evt, MCE_EVENT_RELEASE))
 		return 0;
@@ -440,43 +491,7 @@ int opal_machine_check(struct pt_regs *regs)
 	if (opal_recover_mce(regs, &evt))
 		return 1;
 
-	/*
-	 * Unrecovered machine check, we are heading to panic path.
-	 *
-	 * We may have hit this MCE in very early stage of kernel
-	 * initialization even before opal-prd has started running. If
-	 * this is the case then this MCE error may go un-noticed or
-	 * un-analyzed if we go down panic path. We need to inform
-	 * BMC/OCC about this error so that they can collect relevant
-	 * data for error analysis before rebooting.
-	 * Use opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR) to do so.
-	 * This function may not return on BMC based system.
-	 */
-	ret = opal_cec_reboot2(OPAL_REBOOT_PLATFORM_ERROR,
-			"Unrecoverable Machine Check exception");
-	if (ret == OPAL_UNSUPPORTED) {
-		pr_emerg("Reboot type %d not supported\n",
-					OPAL_REBOOT_PLATFORM_ERROR);
-	}
-
-	/*
-	 * We reached here. There can be three possibilities:
-	 * 1. We are running on a firmware level that do not support
-	 *    opal_cec_reboot2()
-	 * 2. We are running on a firmware level that do not support
-	 *    OPAL_REBOOT_PLATFORM_ERROR reboot type.
-	 * 3. We are running on FSP based system that does not need opal
-	 *    to trigger checkstop explicitly for error analysis. The FSP
-	 *    PRD component would have already got notified about this
-	 *    error through other channels.
-	 *
-	 * If hardware marked this as an unrecoverable MCE, we are
-	 * going to panic anyway. Even if it didn't, it's not safe to
-	 * continue at this point, so we should explicitly panic.
-	 */
-
-	panic("PowerNV Unrecovered Machine Check");
-	return 0;
+	pnv_platform_error_reboot(regs, "Unrecoverable Machine Check exception");
 }
 
 /* Early hmi handler called in real mode. */
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index 6dbc0a1da1f6..a159d48573d7 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -7,6 +7,8 @@ extern void pnv_smp_init(void);
 static inline void pnv_smp_init(void) { }
 #endif
 
+extern void pnv_platform_error_reboot(struct pt_regs *regs, const char *msg) __noreturn;
+
 struct pci_dev;
 
 #ifdef CONFIG_PCI
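
For reference, the patch 2 check described above (in_interrupt(),
panic_on_oops, etc. before calling die()) might look roughly like the
following. This is only a userspace model, not code from the series:
struct oops_ctx, enum mce_action, die_would_panic() and
choose_mce_action() are made-up names for illustration, the two flags
stand in for the kernel's in_interrupt() and panic_on_oops, and the
"etc." conditions are not modelled.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state that would be checked before die(). */
struct oops_ctx {
	bool in_interrupt;	/* stands in for in_interrupt() */
	bool panic_on_oops;	/* stands in for the panic_on_oops setting */
};

/* Hypothetical outcome for an unrecovered MCE in kernel context. */
enum mce_action {
	MCE_ACTION_DIE,			/* die() should not escalate to panic */
	MCE_ACTION_PLATFORM_REBOOT,	/* use pnv_platform_error_reboot() */
};

/* Assumption: die() ends up in panic() if either condition holds. */
static bool die_would_panic(const struct oops_ctx *ctx)
{
	return ctx->in_interrupt || ctx->panic_on_oops;
}

/* Pick the platform error reboot whenever die() would have panicked. */
static enum mce_action choose_mce_action(const struct oops_ctx *ctx)
{
	return die_would_panic(ctx) ? MCE_ACTION_PLATFORM_REBOOT
				    : MCE_ACTION_DIE;
}
```

The point of the split is that the platform error reboot, and so the
firmware log, is still reached in every case where die() would
otherwise have gone down the panic path.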