From patchwork Fri Feb 28 00:42:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nicholas Piggin X-Patchwork-Id: 1246227 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 48T9rv371Dz9sRl for ; Fri, 28 Feb 2020 11:48:47 +1100 (AEDT) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=X3ZR+kKm; dkim-atps=neutral Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 48T9rt2mdczDqc3 for ; Fri, 28 Feb 2020 11:48:46 +1100 (AEDT) X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gmail.com (client-ip=2607:f8b0:4864:20::642; helo=mail-pl1-x642.google.com; envelope-from=npiggin@gmail.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=X3ZR+kKm; dkim-atps=neutral Received: from mail-pl1-x642.google.com (mail-pl1-x642.google.com [IPv6:2607:f8b0:4864:20::642]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 48T9kS417QzDrPX for ; Fri, 28 Feb 2020 11:43:11 +1100 (AEDT) Received: by mail-pl1-x642.google.com with SMTP id b22so494622pls.12 for ; Thu, 27 Feb 2020 16:43:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=ncoJ1tU8HPKyiEOkTM0ylvE8LMGV/km1LL72VfWK/qA=; b=X3ZR+kKm+Ob2VUrUImvW0n1yClcUPAOQlIBYLvJ14EX7bPcvHuTkUWxuV1TvL26e0M zjU6UBNpg/p0G/KvcxwEt/Q6vYYnBvXrUL7AH9oLUznyJeJT6ZpJu+LSZ6v4tEmibSPy 5alQcRk9eFxDWbML2xONGK/IXOe92vEJ7Geh/RHCpAYrJ+mSYQBLNWQ7CLipNeH9ntfK SAISZjDw1ChlBBAWNRUzgivxAhEb5wqLL1V2wQF2NY4UnRPoiEzzbddgyfsiNDFSzpzK YZUp+L7kAr2ofdQ/cJRaFqEmZvn8YH03iihDxaz5B0ps2HNCTXouqDIoVFk6/QCdLbZF nndw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=ncoJ1tU8HPKyiEOkTM0ylvE8LMGV/km1LL72VfWK/qA=; b=l19klsZN2wn/drePcgBVaKEvLG+JMccVuwACGMdxLEIpvO44x+aoEmulFWKVMXamV4 aKqDErPFtET+7BOMZNPGWSWfmVCwDONg6N/GEijwcfx7B5J/wqzSo8Hqc14NM8j/Fv6j 6SRMZdo3bRLG2OIiMaTWShmqBhIKzfwIFj6M4HUxArHsOad+j+yuGB4pppisGhEMGH6D gN21l+0G4GfTqO4Gc+dItNwzBqn1yNLGYcAZwIhf3pJNRM2rQmIzdLmLgVwnSpdXsqd/ /mJyrYLiBjA1sdHuTADDeGQS6Tr4EwvkvdbQcxkmfMxZQAhnrCjzTN0xJ5/xHYhxLQs7 fLcA== X-Gm-Message-State: APjAAAX45IY1SE1EoQ61Qvsu5egP5437KBXNCCyBuRGn+fiDMVlrytNv 5KQxJxmf5LwgczNLTDACy5eO0VQN X-Google-Smtp-Source: APXvYqxZp8oKODLKliMT5CfJ/4JWj0JfXAA8vZC6g1qln/7mjup4CGRdfZeqT8X2KyxFMg/hiLEoTQ== X-Received: by 2002:a17:902:d906:: with SMTP id c6mr1425403plz.93.1582850588989; Thu, 27 Feb 2020 16:43:08 -0800 (PST) Received: from bobo.ozlabs.ibm.com (193-116-109-34.tpgi.com.au. [193.116.109.34]) by smtp.gmail.com with ESMTPSA id x7sm4614821pgp.0.2020.02.27.16.43.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Feb 2020 16:43:08 -0800 (PST) From: Nicholas Piggin To: skiboot@lists.ozlabs.org Date: Fri, 28 Feb 2020 10:42:21 +1000 Message-Id: <20200228004221.220683-1-npiggin@gmail.com> X-Mailer: git-send-email 2.23.0 MIME-Version: 1.0 Subject: [Skiboot] [PATCH] core/mce: add support for decoding and handling machine checks X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" This provides an initial facility to decode machine checks into human readable strings, plus a minimum amount of metadata that a handler has to understand in order to deal with the machine check. For now this is only used by skiboot to make MCE reporting nicer, and an ERAT flush recovery attempt which is more about code coverage than really being helpful. *********************************************** Fatal MCE at 00000000300c9c0c .memcmp+0x3c MSR 9000000000141002 Cause: instruction fetch TLB multi-hit error Effective address: 0x00000000300c9c0c ... The intention is to subsequently provide an OPAL API with this information that will enable an OS to implement a machine independent OPAL machine check driver. The code and data tables are derived from Linux code that I wrote. Signed-off-by: Nicholas Piggin --- core/Makefile.inc | 2 +- core/exceptions.c | 57 ++++++++++++-- core/mce.c | 187 ++++++++++++++++++++++++++++++++++++++++++++ include/processor.h | 6 ++ include/ras.h | 47 +++++++++++ 5 files changed, 293 insertions(+), 6 deletions(-) create mode 100644 core/mce.c create mode 100644 include/ras.h diff --git a/core/Makefile.inc b/core/Makefile.inc index fddff50e9..057434bf1 100644 --- a/core/Makefile.inc +++ b/core/Makefile.inc @@ -7,7 +7,7 @@ CORE_OBJS = relocate.o console.o stack.o init.o chip.o mem_region.o CORE_OBJS += malloc.o lock.o cpu.o utils.o fdt.o opal.o interrupts.o timebase.o CORE_OBJS += opal-msg.o pci.o pci-virt.o pci-slot.o pcie-slot.o CORE_OBJS += pci-opal.o fast-reboot.o device.o exceptions.o trace.o affinity.o -CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o +CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o mce.o CORE_OBJS += console-log.o ipmi.o time-utils.o pel.o pool.o errorlog.o CORE_OBJS += timer.o i2c.o rtc.o flash.o sensor.o ipmi-opal.o CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o powercap.o psr.o diff --git a/core/exceptions.c b/core/exceptions.c index 7cb30c18c..833a436a8 100644 --- a/core/exceptions.c +++ b/core/exceptions.c @@ -10,6 +10,7 @@ #include #include #include +#include #define REG "%016llx" #define REG32 "%08x" @@ -32,6 +33,54 @@ static void dump_regs(struct stack_frame *stack) #define EXCEPTION_MAX_STR 320 +static void handle_mce(struct stack_frame *stack, uint64_t nip, uint64_t msr, bool *fatal) +{ + uint64_t mce_flags, mce_addr; + const char *mce_err; + const char *mce_fix = NULL; + char buf[EXCEPTION_MAX_STR]; + size_t l; + + decode_mce(stack->srr0, stack->srr1, stack->dsisr, stack->dar, + &mce_flags, &mce_err, &mce_addr); + + /* Try to recover. */ + if (mce_flags & MCE_ERAT_ERROR) { + /* Real-mode still uses ERAT, flush transient bitflips */ + flush_erat(); + mce_fix = "ERAT flush"; + + } else { + *fatal = true; + } + + prerror("***********************************************\n"); + l = 0; + l += snprintf(buf + l, EXCEPTION_MAX_STR - l, + "%s MCE at "REG" ", *fatal ? "Fatal" : "Non-fatal", nip); + l += snprintf_symbol(buf + l, EXCEPTION_MAX_STR - l, nip); + l += snprintf(buf + l, EXCEPTION_MAX_STR - l, " MSR "REG, msr); + prerror("%s\n", buf); + + l = 0; + l += snprintf(buf + l, EXCEPTION_MAX_STR - l, + "Cause: %s", mce_err); + prerror("%s\n", buf); + if (mce_flags & MCE_INVOLVED_EA) { + l = 0; + l += snprintf(buf + l, EXCEPTION_MAX_STR - l, + "Effective address: 0x%016llx", mce_addr); + prerror("%s\n", buf); + } + + if (!*fatal) { + l = 0; + l += snprintf(buf + l, EXCEPTION_MAX_STR - l, + "Attempting recovery: %s", mce_fix); + prerror("%s\n", buf); + } +} + void exception_entry(struct stack_frame *stack) { bool fatal = false; @@ -85,11 +134,8 @@ void exception_entry(struct stack_frame *stack) break; case 0x200: - fatal = true; - prerror("***********************************************\n"); - l += snprintf(buf + l, EXCEPTION_MAX_STR - l, - "Fatal MCE at "REG" ", nip); - break; + handle_mce(stack, nip, msr, &fatal); + goto no_symbol; case 0x700: { struct trap_table_entry *tte; @@ -130,6 +176,7 @@ void exception_entry(struct stack_frame *stack) l += snprintf_symbol(buf + l, EXCEPTION_MAX_STR - l, nip); l += snprintf(buf + l, EXCEPTION_MAX_STR - l, " MSR "REG, msr); prerror("%s\n", buf); +no_symbol: dump_regs(stack); backtrace_r1((uint64_t)stack); if (fatal) { diff --git a/core/mce.c b/core/mce.c new file mode 100644 index 000000000..0391d008e --- /dev/null +++ b/core/mce.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: Apache-2.0 +/* + * Machine Check Exceptions + * + * Copyright 2020 IBM Corp. + */ + +#define pr_fmt(fmt) "MCE: " fmt + +#include +#include +#include + +#define SRR1_MC_LOADSTORE(srr1) ((srr1) & PPC_BIT(42)) + +struct mce_ierror_table { + unsigned long srr1_mask; + unsigned long srr1_value; + uint64_t type; + const char *error_str; +}; + +static const struct mce_ierror_table mce_p9_ierror_table[] = { +{ 0x00000000081c0000, 0x0000000000040000, + MCE_INSNFETCH | MCE_MEMORY_ERROR | MCE_INVOLVED_EA, + "instruction fetch memory uncorrectable error", }, +{ 0x00000000081c0000, 0x0000000000080000, + MCE_INSNFETCH | MCE_SLB_ERROR | MCE_INVOLVED_EA, + "instruction fetch SLB parity error", }, +{ 0x00000000081c0000, 0x00000000000c0000, + MCE_INSNFETCH | MCE_SLB_ERROR | MCE_INVOLVED_EA, + "instruction fetch SLB multi-hit error", }, +{ 0x00000000081c0000, 0x0000000000100000, + MCE_INSNFETCH | MCE_INVOLVED_EA | MCE_ERAT_ERROR, + "instruction fetch ERAT multi-hit error", }, +{ 0x00000000081c0000, 0x0000000000140000, + MCE_INSNFETCH | MCE_INVOLVED_EA | MCE_TLB_ERROR, + "instruction fetch TLB multi-hit error", }, +{ 0x00000000081c0000, 0x0000000000180000, + MCE_INSNFETCH | MCE_MEMORY_ERROR | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "instruction fetch page table access memory uncorrectable error", }, +{ 0x00000000081c0000, 0x00000000001c0000, + MCE_INSNFETCH | MCE_INVOLVED_EA, + "instruction fetch to foreign address", }, +{ 0x00000000081c0000, 0x0000000008000000, + MCE_INSNFETCH | MCE_INVOLVED_EA, + "instruction fetch foreign link time-out", }, +{ 0x00000000081c0000, 0x0000000008040000, + MCE_INSNFETCH | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "instruction fetch page table access foreign link time-out", }, +{ 0x00000000081c0000, 0x00000000080c0000, + MCE_INSNFETCH | MCE_INVOLVED_EA, + "instruction fetch real address error", }, +{ 0x00000000081c0000, 0x0000000008100000, + MCE_INSNFETCH | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "instruction fetch page table access real address error", }, +{ 0x00000000081c0000, 0x0000000008140000, + MCE_LOADSTORE | MCE_IMPRECISE, + "store real address asynchronous error", }, +{ 0x00000000081c0000, 0x0000000008180000, + MCE_LOADSTORE | MCE_IMPRECISE, + "store foreign link time-out asynchronous error", }, +{ 0x00000000081c0000, 0x00000000081c0000, + MCE_INSNFETCH | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "instruction fetch page table access to foreign address", }, +{ 0 } }; + +struct mce_derror_table { + unsigned long dsisr_value; + uint64_t type; + const char *error_str; +}; + +static const struct mce_derror_table mce_p9_derror_table[] = { +{ 0x00008000, + MCE_LOADSTORE | MCE_MEMORY_ERROR, + "load/store memory uncorrectable error", }, +{ 0x00004000, + MCE_LOADSTORE | MCE_MEMORY_ERROR | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "load/store page table access memory uncorrectable error", }, +{ 0x00002000, + MCE_LOADSTORE | MCE_INVOLVED_EA, + "load/store foreign link time-out", }, +{ 0x00001000, + MCE_LOADSTORE | MCE_TABLE_WALK | MCE_INVOLVED_EA, + "load/store page table access foreign link time-out", }, +{ 0x00000800, + MCE_LOADSTORE | MCE_INVOLVED_EA | MCE_ERAT_ERROR, + "load/store ERAT multi-hit error", }, +{ 0x00000400, + MCE_LOADSTORE | MCE_INVOLVED_EA | MCE_TLB_ERROR, + "load/store TLB multi-hit error", }, +{ 0x00000200, + MCE_LOADSTORE | MCE_TLBIE_ERROR, + "TLBIE or TLBIEL instruction programming error", }, +{ 0x00000100, + MCE_LOADSTORE | MCE_INVOLVED_EA | MCE_SLB_ERROR, + "load/store SLB parity error", }, +{ 0x00000080, + MCE_LOADSTORE | MCE_INVOLVED_EA | MCE_SLB_ERROR, + "load/store SLB multi-hit error", }, +{ 0x00000040, + MCE_LOADSTORE | MCE_INVOLVED_EA, + "load real address error", }, +{ 0x00000020, + MCE_LOADSTORE | MCE_TABLE_WALK, + "load/store page table access real address error", }, +{ 0x00000010, + MCE_LOADSTORE, + "load/store to foreign address", }, +{ 0x00000008, + MCE_LOADSTORE | MCE_TABLE_WALK, + "load/store page table access to foreign address", }, +{ 0 } }; + +static void decode_ierror(const struct mce_ierror_table table[], + uint64_t srr1, + uint64_t *type, + const char **error_str) +{ + int i; + + for (i = 0; table[i].srr1_mask; i++) { + if ((srr1 & table[i].srr1_mask) != table[i].srr1_value) + continue; + + *type = table[i].type; + *error_str = table[i].error_str; + } +} + +static void decode_derror(const struct mce_derror_table table[], + uint32_t dsisr, + uint64_t *type, + const char **error_str) +{ + int i; + + for (i = 0; table[i].dsisr_value; i++) { + if (!(dsisr & table[i].dsisr_value)) + continue; + + *type = table[i].type; + *error_str = table[i].error_str; + } +} + +void decode_mce(uint64_t srr0, uint64_t srr1, + uint32_t dsisr, uint64_t dar, + uint64_t *type, const char **error_str, + uint64_t *address) +{ + *type = MCE_UNKNOWN; + *error_str = "unknown error"; + *address = 0; + + if (proc_gen != proc_gen_p9) { + *error_str = "unknown error (processor not supported)"; + return; + } + + /* + * On POWER9 DD2.1 and below, it's possible to get a machine check + * caused by a paste instruction where only DSISR bit 25 is set. This + * will result in the MCE handler seeing an unknown event and the + * kernel crashing. An MCE that occurs like this is spurious, so we + * don't need to do anything in terms of servicing it. If there is + * something that needs to be serviced, the CPU will raise the MCE + * again with the correct DSISR so that it can be serviced properly. + * So detect this case and mark it as handled. + */ + if (SRR1_MC_LOADSTORE(srr1) && dsisr == 0x02000000) { + *type = MCE_NO_ERROR; + *error_str = "no error (superfluous machine check)"; + return; + } + + if (SRR1_MC_LOADSTORE(srr1)) { + decode_derror(mce_p9_derror_table, dsisr, type, error_str); + if (*type & MCE_INVOLVED_EA) + *address = dar; + } else { + decode_ierror(mce_p9_ierror_table, srr1, type, error_str); + if (*type & MCE_INVOLVED_EA) + *address = srr0; + } +} diff --git a/include/processor.h b/include/processor.h index a0c2864a8..2860116f6 100644 --- a/include/processor.h +++ b/include/processor.h @@ -230,6 +230,12 @@ static inline bool is_power9n(uint32_t version) #ifndef __TEST__ +/* POWER9 and above only */ +static inline void flush_erat(void) +{ + asm volatile("slbia 7"); +} + /* * SMT priority */ diff --git a/include/ras.h b/include/ras.h new file mode 100644 index 000000000..d0faaeefb --- /dev/null +++ b/include/ras.h @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: Apache-2.0 +/* Copyright 2020 IBM Corp. */ + +#ifndef __RAS_H +#define __RAS_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include + +#define MCE_NO_ERROR 0x0001 +#define MCE_UNKNOWN 0x0002 +#define MCE_INSNFETCH 0x0004 +#define MCE_LOADSTORE 0x0008 +#define MCE_TABLE_WALK 0x0010 +#define MCE_IMPRECISE 0x0020 +#define MCE_MEMORY_ERROR 0x0040 +#define MCE_SLB_ERROR 0x0080 +#define MCE_ERAT_ERROR 0x0100 +#define MCE_TLB_ERROR 0x0200 +#define MCE_TLBIE_ERROR 0x0400 +#define MCE_INVOLVED_EA 0x0800 +#define MCE_INVOLVED_PA 0x1000 + +void decode_mce(uint64_t srr0, uint64_t srr1, + uint32_t dsisr, uint64_t dar, + uint64_t *type, const char **error_str, + uint64_t *address); + +#endif /* __RAS_H */