From patchwork Mon Apr 16 17:32:57 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898817
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2])
	(using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 40PwTP2Tzbz9s3G
	for ; Tue, 17 Apr 2018 03:33:29 +1000 (AEST)
Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	by lists.ozlabs.org (Postfix) with ESMTP id 40PwTP0vlCzF1s3
	for ; Tue, 17 Apr 2018 03:33:29 +1000 (AEST)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
X-Original-To: skiboot@lists.ozlabs.org
Delivered-To: skiboot@lists.ozlabs.org
Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.158.5; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 40PwT01p08zF1sY
	for ; Tue, 17 Apr 2018 03:33:08 +1000 (AEST)
Received: from pps.filterd (m0098417.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHVKSh009923
	for ; Mon, 16 Apr 2018 13:33:05 -0400
Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109])
	by mx0a-001b2d01.pphosted.com with ESMTP id 2hcxhb5jj0-1
	(version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT)
	for ; Mon, 16 Apr 2018 13:33:05 -0400
Received: from localhost
	by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for from ; Mon, 16 Apr 2018 18:33:02 +0100
Received: from b06cxnps4075.portsmouth.uk.ibm.com (9.149.109.197)
	by e06smtp13.uk.ibm.com (192.168.101.143) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;
	Mon, 16 Apr 2018 18:32:59 +0100
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com [9.149.105.61])
	by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHWxvh55508992;
	Mon, 16 Apr 2018 17:32:59 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id D17CD11C04C;
	Mon, 16 Apr 2018 18:24:56 +0100 (BST)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 59B1511C04A;
	Mon, 16 Apr 2018 18:24:56 +0100 (BST)
Received: from jupiter.in.ibm.com (unknown [9.102.1.147])
	by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP;
	Mon, 16 Apr 2018 18:24:56 +0100 (BST)
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:02:57 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
X-TM-AS-GCONF: 00
x-cbid: 18041617-0012-0000-0000-000005CB7DA6
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 18041617-0013-0000-0000-00001947C31B
Message-Id: <152389997774.2566.16737043105317918808.stgit@jupiter.in.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0
	spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0
	classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000
	definitions=main-1804160155
Subject: [Skiboot] [PATCH v2 01/15] opal/hmi: Don't re-read HMER multiple times
X-BeenThere: skiboot@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Mailing list for skiboot development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Benjamin Herrenschmidt

We want to make sure all reporting and actions are based
upon the same snapshot of HMER in case bits get added
by HW while we are in OPAL.

Signed-off-by: Benjamin Herrenschmidt
---
 core/hmi.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index 162dd8a11..8c100ade5 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -911,16 +911,13 @@ static int get_split_core_mode(void)
  * - SPR_TFMR_TB_RESIDUE_ERR
  * - SPR_TFMR_HDEC_PARITY_ERROR
  */
-static void pre_recovery_cleanup_p8(void)
+static void pre_recovery_cleanup_p8(uint64_t hmer)
 {
-	uint64_t hmer;
 	uint64_t tfmr;
 	uint32_t sibling_thread_mask;
 	int split_core_mode, subcore_id, thread_id, threads_per_core;
 	int i;
 
-	hmer = mfspr(SPR_HMER);
-
 	/* exit if it is not Time facility error. */
 	if (!(hmer & SPR_HMER_TFAC_ERROR))
 		return;
@@ -1018,15 +1015,12 @@ static void pre_recovery_cleanup_p8(void)
  * - SPR_TFMR_TB_RESIDUE_ERR
  * - SPR_TFMR_HDEC_PARITY_ERROR
  */
-static void pre_recovery_cleanup_p9(void)
+static void pre_recovery_cleanup_p9(uint64_t hmer)
 {
-	uint64_t hmer;
 	uint64_t tfmr;
 	int threads_per_core = cpu_thread_count;
 	int i;
 
-	hmer = mfspr(SPR_HMER);
-
 	/* exit if it is not Time facility error. */
 	if (!(hmer & SPR_HMER_TFAC_ERROR))
 		return;
@@ -1104,12 +1098,12 @@ static void pre_recovery_cleanup_p9(void)
 	wait_for_cleanup_complete();
 }
 
-static void pre_recovery_cleanup(void)
+static void pre_recovery_cleanup(uint64_t hmer)
 {
 	if (proc_gen == proc_gen_p9)
-		return pre_recovery_cleanup_p9();
+		return pre_recovery_cleanup_p9(hmer);
 	else
-		return pre_recovery_cleanup_p8();
+		return pre_recovery_cleanup_p8(hmer);
 }
 
 static void hmi_exit(void)
@@ -1118,9 +1112,8 @@ static void hmi_exit(void)
 	*(this_cpu()->core_hmi_state_ptr) &= ~(this_cpu()->thread_mask);
 }
 
-static void hmi_print_debug(const uint8_t *msg)
+static void hmi_print_debug(const uint8_t *msg, uint64_t hmer)
 {
-	uint64_t hmer = mfspr(SPR_HMER);
 	const char *loc;
 	uint32_t core_id, thread_index;
 
@@ -1152,7 +1145,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	 * In case of split core, some of the Timer facility errors need
 	 * cleanup to be done before we proceed with the error recovery.
 	 */
-	pre_recovery_cleanup();
+	pre_recovery_cleanup(hmer);
 
 	lock(&hmi_lock);
 	/*
@@ -1169,7 +1162,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		uint32_t core_id = pir_to_core_id(cpu->pir);
 		uint64_t core_wof;
 
-		hmi_print_debug("Processor recovery occurred.");
+		hmi_print_debug("Processor recovery occurred.", hmer);
 		if (!read_core_wof(chip_id, core_id, &core_wof)) {
 			int i;
 
@@ -1195,7 +1188,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_MASKED;
 			queue_hmi_event(hmi_evt, recover);
 		}
-		hmi_print_debug("Processor recovery Done (masked).");
+		hmi_print_debug("Processor recovery Done (masked).", hmer);
 	}
 	if (hmer & SPR_HMER_PROC_RECV_AGAIN) {
 		hmer &= ~SPR_HMER_PROC_RECV_AGAIN;
@@ -1205,13 +1198,13 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 			queue_hmi_event(hmi_evt, recover);
 		}
 		hmi_print_debug("Processor recovery occurred again before"
-				"bit2 was cleared\n");
+				"bit2 was cleared\n", hmer);
 	}
 	/* Assert if we see malfunction alert, we can not continue. */
 	if (hmer & SPR_HMER_MALFUNCTION_ALERT) {
 		hmer &= ~SPR_HMER_MALFUNCTION_ALERT;
 
-		hmi_print_debug("Malfunction Alert");
+		hmi_print_debug("Malfunction Alert", hmer);
 		if (hmi_evt)
 			decode_malfunction(hmi_evt);
 	}
@@ -1220,7 +1213,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	if (hmer & SPR_HMER_HYP_RESOURCE_ERR) {
 		hmer &= ~SPR_HMER_HYP_RESOURCE_ERR;
 
-		hmi_print_debug("Hypervisor resource error");
+		hmi_print_debug("Hypervisor resource error", hmer);
 		recover = 0;
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_FATAL;
@@ -1236,7 +1229,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	if (hmer & SPR_HMER_TFAC_ERROR) {
 		tfmr = mfspr(SPR_TFMR); /* save original TFMR */
 
-		hmi_print_debug("Timer Facility Error");
+		hmi_print_debug("Timer Facility Error", hmer);
 
 		hmer &= ~SPR_HMER_TFAC_ERROR;
 		recover = chiptod_recover_tb_errors();
@@ -1251,7 +1244,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		tfmr = mfspr(SPR_TFMR); /* save original TFMR */
 		hmer &= ~SPR_HMER_TFMR_PARITY_ERROR;
 
-		hmi_print_debug("TFMR parity Error");
+		hmi_print_debug("TFMR parity Error", hmer);
 		recover = chiptod_recover_tb_errors();
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_FATAL;

From patchwork Mon Apr 16 17:33:04 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898823
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2])
	(using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 40PwV82vn2z9s3B
	for ; Tue, 17 Apr 2018 03:34:08 +1000 (AEST)
Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from
	lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	by lists.ozlabs.org (Postfix) with ESMTP id 40PwV81l3MzF1wG
	for ; Tue, 17 Apr 2018 03:34:08 +1000 (AEST)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
X-Original-To: skiboot@lists.ozlabs.org
Delivered-To: skiboot@lists.ozlabs.org
Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.158.5; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 40PwT41P5mzF1x4
	for ; Tue, 17 Apr 2018 03:33:11 +1000 (AEST)
Received: from pps.filterd (m0098417.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHVNke010119
	for ; Mon, 16 Apr 2018 13:33:09 -0400
Received: from e06smtp10.uk.ibm.com (e06smtp10.uk.ibm.com [195.75.94.106])
	by mx0a-001b2d01.pphosted.com with ESMTP id 2hcxhb5jqf-1
	(version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT)
	for ; Mon, 16 Apr 2018 13:33:09 -0400
Received: from localhost
	by e06smtp10.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for from ; Mon, 16 Apr 2018 18:33:07 +0100
Received: from b06cxnps4076.portsmouth.uk.ibm.com (9.149.109.198)
	by e06smtp10.uk.ibm.com (192.168.101.140) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;
	Mon, 16 Apr 2018 18:33:06 +0100
Received: from d06av26.portsmouth.uk.ibm.com (d06av26.portsmouth.uk.ibm.com [9.149.105.62])
	by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHX5D943253938;
	Mon, 16 Apr 2018 17:33:06 GMT
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id E1942AE057;
	Mon, 16 Apr 2018 18:22:56 +0100 (BST)
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 662CDAE055;
	Mon, 16 Apr 2018 18:22:56 +0100 (BST)
Received: from jupiter.in.ibm.com (unknown [9.102.1.147])
	by d06av26.portsmouth.uk.ibm.com (Postfix) with ESMTP;
	Mon, 16 Apr 2018 18:22:56 +0100 (BST)
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:03:04 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
X-TM-AS-GCONF: 00
x-cbid: 18041617-0040-0000-0000-0000042FA0B4
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 18041617-0041-0000-0000-00002633A695
Message-Id: <152389998444.2566.14745965086804941860.stgit@jupiter.in.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0
	spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0
	classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000
	definitions=main-1804160155
Subject: [Skiboot] [PATCH v2 02/15] opal/hmi: Remove races in clearing HMER
X-BeenThere: skiboot@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Mailing list for skiboot development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Benjamin Herrenschmidt

Writing to HMER acts as an "AND". The current code writes back the
value we originally read with the bits we handled cleared. This is
racy, if a new bit gets set in HW after the original read, we'll end
up clearing it without handling it.

Instead, use an all 1's mask with only the bit handled cleared.

Signed-off-by: Benjamin Herrenschmidt
---
 core/hmi.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index 8c100ade5..f4cdbd57f 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -1139,7 +1139,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 {
 	struct cpu_thread *cpu = this_cpu();
 	int recover = 1;
-	uint64_t tfmr;
+	uint64_t tfmr, handled = 0;
 
 	/*
 	 * In case of split core, some of the Timer facility errors need
@@ -1174,7 +1174,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 			}
 		}
 
-		hmer &= ~SPR_HMER_PROC_RECV_DONE;
+		handled |= SPR_HMER_PROC_RECV_DONE;
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_DONE;
@@ -1182,7 +1182,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		}
 	}
 	if (hmer & SPR_HMER_PROC_RECV_ERROR_MASKED) {
-		hmer &= ~SPR_HMER_PROC_RECV_ERROR_MASKED;
+		handled |= SPR_HMER_PROC_RECV_ERROR_MASKED;
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_MASKED;
@@ -1191,7 +1191,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		hmi_print_debug("Processor recovery Done (masked).", hmer);
 	}
 	if (hmer & SPR_HMER_PROC_RECV_AGAIN) {
-		hmer &= ~SPR_HMER_PROC_RECV_AGAIN;
+		handled |= SPR_HMER_PROC_RECV_AGAIN;
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_DONE_AGAIN;
@@ -1202,7 +1202,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	}
 	/* Assert if we see malfunction alert, we can not continue. */
 	if (hmer & SPR_HMER_MALFUNCTION_ALERT) {
-		hmer &= ~SPR_HMER_MALFUNCTION_ALERT;
+		handled |= SPR_HMER_MALFUNCTION_ALERT;
 
 		hmi_print_debug("Malfunction Alert", hmer);
 		if (hmi_evt)
@@ -1211,7 +1211,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 
 	/* Assert if we see Hypervisor resource error, we can not continue. */
 	if (hmer & SPR_HMER_HYP_RESOURCE_ERR) {
-		hmer &= ~SPR_HMER_HYP_RESOURCE_ERR;
+		handled |= SPR_HMER_HYP_RESOURCE_ERR;
 
 		hmi_print_debug("Hypervisor resource error", hmer);
 		recover = 0;
@@ -1228,10 +1228,10 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	 */
 	if (hmer & SPR_HMER_TFAC_ERROR) {
 		tfmr = mfspr(SPR_TFMR); /* save original TFMR */
+		handled |= SPR_HMER_TFAC_ERROR;
 
 		hmi_print_debug("Timer Facility Error", hmer);
 
-		hmer &= ~SPR_HMER_TFAC_ERROR;
 		recover = chiptod_recover_tb_errors();
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC;
@@ -1242,7 +1242,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	}
 	if (hmer & SPR_HMER_TFMR_PARITY_ERROR) {
 		tfmr = mfspr(SPR_TFMR); /* save original TFMR */
-		hmer &= ~SPR_HMER_TFMR_PARITY_ERROR;
+		handled |= SPR_HMER_TFMR_PARITY_ERROR;
 
 		hmi_print_debug("TFMR parity Error", hmer);
 		recover = chiptod_recover_tb_errors();
@@ -1259,9 +1259,11 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	/*
 	 * HMER bits are sticky, once set to 1 they remain set to 1 until
 	 * they are set to 0. Reset the error source bit to 0, otherwise
-	 * we keep getting HMI interrupt again and again.
+	 * we keep getting HMI interrupt again and again. Writing to HMER
+	 * acts as an AND, so we write mask of all 1's except for the bits
+	 * we want to clear.
 	 */
-	mtspr(SPR_HMER, hmer);
+	mtspr(SPR_HMER, ~handled);
 	hmi_exit();
 	/* Set the TB state looking at TFMR register before we head out. */
 	cpu->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID);

From patchwork Mon Apr 16 17:33:11 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898824
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	(using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 40PwVC2krvz9s3G
	for ; Tue, 17 Apr 2018 03:34:11 +1000 (AEST)
Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	by lists.ozlabs.org (Postfix) with ESMTP id 40PwVC16CJzF1t4
	for ; Tue, 17 Apr 2018 03:34:11 +1000 (AEST)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
X-Original-To: skiboot@lists.ozlabs.org
Delivered-To: skiboot@lists.ozlabs.org
Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.158.5; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=)
Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTD6TgZzF1wJ
	for ; Tue, 17 Apr 2018 03:33:20 +1000 (AEST)
Received: from pps.filterd (m0098414.ppops.net [127.0.0.1])
	by mx0b-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHVsK4035346
	for ; Mon, 16 Apr 2018 13:33:18 -0400
Received: from e06smtp15.uk.ibm.com (e06smtp15.uk.ibm.com [195.75.94.111])
	by mx0b-001b2d01.pphosted.com with ESMTP
	id 2hcwe30x1y-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT)
	for ; Mon, 16 Apr 2018 13:33:17 -0400
Received: from localhost
	by e06smtp15.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
	for from ; Mon, 16 Apr 2018 18:33:16 +0100
Received: from b06cxnps4074.portsmouth.uk.ibm.com (9.149.109.196)
	by e06smtp15.uk.ibm.com (192.168.101.145) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;
	Mon, 16 Apr 2018 18:33:13 +0100
Received: from d06av22.portsmouth.uk.ibm.com (d06av22.portsmouth.uk.ibm.com [9.149.105.58])
	by b06cxnps4074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXC6Y33816706;
	Mon, 16 Apr 2018 17:33:12 GMT
Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id D927A4C040;
	Mon, 16 Apr 2018 18:25:44 +0100 (BST)
Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1])
	by IMSVA (Postfix) with ESMTP id 347324C04A;
	Mon, 16 Apr 2018 18:25:44 +0100 (BST)
Received: from jupiter.in.ibm.com (unknown [9.102.1.147])
	by d06av22.portsmouth.uk.ibm.com (Postfix) with ESMTP;
	Mon, 16 Apr 2018 18:25:44 +0100 (BST)
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:03:11 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
X-TM-AS-GCONF: 00
x-cbid: 18041617-0020-0000-0000-000004128BF8
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 18041617-0021-0000-0000-000042A6C77A
Message-Id: <152389999106.2566.4813979158528468398.stgit@jupiter.in.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0
	spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0
	classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000
	definitions=main-1804160155
Subject: [Skiboot] [PATCH v2 03/15] opal/hmi: Add a new opal_handle_hmi2 that returns direct info to Linux
X-BeenThere: skiboot@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Mailing list for skiboot development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Benjamin Herrenschmidt

It returns a 64-bit flags mask currently set to provide info
about which timer facilities were lost, and whether an event
was generated.

Signed-off-by: Benjamin Herrenschmidt
---
 core/hmi.c              | 127 ++++++++++++++++++++++++++++++-----------------
 include/opal-api.h      |   8 +++
 include/opal-internal.h |   1 
 3 files changed, 90 insertions(+), 46 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index f4cdbd57f..186ff75d7 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -291,7 +291,7 @@ static int setup_scom_addresses(void)
 	return 0;
 }
 
-static int queue_hmi_event(struct OpalHMIEvent *hmi_evt, int recover)
+static int queue_hmi_event(struct OpalHMIEvent *hmi_evt, int recover, uint64_t *out_flags)
 {
 	size_t num_params;
 
@@ -314,6 +314,8 @@ static int queue_hmi_event(struct OpalHMIEvent *hmi_evt, int recover)
 	 */
 	num_params = ALIGN_UP(sizeof(*hmi_evt), sizeof(u64)) / sizeof(u64);
 
+	*out_flags |= OPAL_HMI_FLAGS_NEW_EVENT;
+
 	/* queue up for delivery to host. */
 	return _opal_queue_msg(OPAL_MSG_HMI_EVT, NULL, NULL,
 				num_params, (uint64_t *)hmi_evt);
@@ -409,7 +411,7 @@ static bool decode_core_fir(struct cpu_thread *cpu,
 }
 
 static void find_core_checkstop_reason(struct OpalHMIEvent *hmi_evt,
-					bool *event_generated)
+					uint64_t *out_flags)
 {
 	struct cpu_thread *cpu;
 
@@ -435,16 +437,14 @@ static void find_core_checkstop_reason(struct OpalHMIEvent *hmi_evt,
 		hmi_evt->u.xstop_error.xstop_reason = 0;
 		hmi_evt->u.xstop_error.u.pir = cpu->pir;
 
-		if (decode_core_fir(cpu, hmi_evt)) {
-			queue_hmi_event(hmi_evt, 0);
-			*event_generated = 1;
-		}
+		if (decode_core_fir(cpu, hmi_evt))
+			queue_hmi_event(hmi_evt, 0, out_flags);
 	}
 }
 
 static void find_capp_checkstop_reason(int flat_chip_id,
 				       struct OpalHMIEvent *hmi_evt,
-				       bool *event_generated)
+				       uint64_t *out_flags)
 {
 	struct capp_info info;
 	struct phb *phb;
@@ -496,8 +496,7 @@ static void find_capp_checkstop_reason(int flat_chip_id,
 
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_CAPP_RECOVERY;
-			queue_hmi_event(hmi_evt, 1);
-			*event_generated = true;
+			queue_hmi_event(hmi_evt, 1, out_flags);
 
 			return;
 		}
@@ -506,7 +505,7 @@ static void find_capp_checkstop_reason(int flat_chip_id,
 
 static void find_nx_checkstop_reason(int flat_chip_id,
 				     struct OpalHMIEvent *hmi_evt,
-				     bool *event_generated)
+				     uint64_t *out_flags)
 {
 	uint64_t nx_status;
 	uint64_t nx_dma_fir;
@@ -564,8 +563,7 @@ static void find_nx_checkstop_reason(int flat_chip_id,
 	xscom_write(flat_chip_id, nx_dma_engine_fir, PPC_BIT(38));
 
 	/* Send an HMI event. */
-	queue_hmi_event(hmi_evt, 0);
-	*event_generated = true;
+	queue_hmi_event(hmi_evt, 0, out_flags);
 }
 
 /*
@@ -623,7 +621,7 @@ static void dump_scoms(int flat_chip_id, const char *unit, uint32_t *scoms)
 
 static void find_npu2_checkstop_reason(int flat_chip_id,
 				       struct OpalHMIEvent *hmi_evt,
-				       bool *event_generated)
+				       uint64_t *out_flags)
 {
 	struct phb *phb;
 	struct npu *p = NULL;
@@ -714,13 +712,12 @@ static void find_npu2_checkstop_reason(int flat_chip_id,
 	hmi_evt->u.xstop_error.u.chip_id = flat_chip_id;
 
 	/* Marking the event as recoverable so that we don't crash */
-	queue_hmi_event(hmi_evt, 1);
-	*event_generated = true;
+	queue_hmi_event(hmi_evt, 1, out_flags);
 }
 
 static void find_npu_checkstop_reason(int flat_chip_id,
 				      struct OpalHMIEvent *hmi_evt,
-				      bool *event_generated)
+				      uint64_t *out_flags)
 {
 	struct phb *phb;
 	struct npu *p = NULL;
@@ -733,7 +730,7 @@ static void find_npu_checkstop_reason(int flat_chip_id,
 
 	/* Only check for NPU errors if the chip has a NPU */
 	if (PVR_TYPE(mfspr(SPR_PVR)) != PVR_TYPE_P8NVL)
-		return find_npu2_checkstop_reason(flat_chip_id, hmi_evt, event_generated);
+		return find_npu2_checkstop_reason(flat_chip_id, hmi_evt, out_flags);
 
 	/* Find the NPU on the chip associated with the HMI. */
 	for_each_phb(phb) {
@@ -783,22 +780,22 @@ static void find_npu_checkstop_reason(int flat_chip_id,
 	hmi_evt->u.xstop_error.u.chip_id = flat_chip_id;
 
 	/* The HMI is "recoverable" because it shouldn't crash the system */
-	queue_hmi_event(hmi_evt, 1);
-	*event_generated = true;
+	queue_hmi_event(hmi_evt, 1, out_flags);
 }
 
-static void decode_malfunction(struct OpalHMIEvent *hmi_evt)
+static void decode_malfunction(struct OpalHMIEvent *hmi_evt, uint64_t *out_flags)
 {
 	int i;
-	uint64_t malf_alert;
-	bool event_generated = false;
+	uint64_t malf_alert, flags;
+
+	flags = 0;
 
 	if (!setup_scom_addresses()) {
 		prerror("Failed to setup scom addresses\n");
 		/* Send an unknown HMI event. */
 		hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN;
 		hmi_evt->u.xstop_error.xstop_reason = 0;
-		queue_hmi_event(hmi_evt, false);
+		queue_hmi_event(hmi_evt, false, out_flags);
 		return;
 	}
 
@@ -811,22 +808,23 @@ static void decode_malfunction(struct OpalHMIEvent *hmi_evt)
 		if (malf_alert & PPC_BIT(i)) {
 			xscom_write(this_cpu()->chip_id, malf_alert_scom,
 								~PPC_BIT(i));
-			find_capp_checkstop_reason(i, hmi_evt, &event_generated);
-			find_nx_checkstop_reason(i, hmi_evt, &event_generated);
-			find_npu_checkstop_reason(i, hmi_evt, &event_generated);
+			find_capp_checkstop_reason(i, hmi_evt, &flags);
+			find_nx_checkstop_reason(i, hmi_evt, &flags);
+			find_npu_checkstop_reason(i, hmi_evt, &flags);
 		}
 	}
 
-	find_core_checkstop_reason(hmi_evt, &event_generated);
+	find_core_checkstop_reason(hmi_evt, &flags);
 
 	/*
 	 * If we fail to find checkstop reason, send an unknown HMI event.
 	 */
-	if (!event_generated) {
+	if (!(flags & OPAL_HMI_FLAGS_NEW_EVENT)) {
 		hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN;
 		hmi_evt->u.xstop_error.xstop_reason = 0;
-		queue_hmi_event(hmi_evt, false);
+		queue_hmi_event(hmi_evt, false, &flags);
 	}
+	*out_flags |= flags;
 }
 
 static void wait_for_cleanup_complete(void)
@@ -911,7 +909,7 @@ static int get_split_core_mode(void)
  * - SPR_TFMR_TB_RESIDUE_ERR
  * - SPR_TFMR_HDEC_PARITY_ERROR
  */
-static void pre_recovery_cleanup_p8(uint64_t hmer)
+static void pre_recovery_cleanup_p8(uint64_t hmer, uint64_t *out_flags)
 {
 	uint64_t tfmr;
 	uint32_t sibling_thread_mask;
@@ -940,11 +938,19 @@ static void pre_recovery_cleanup_p8(uint64_t hmer)
 	 */
 	lock(&hmi_lock);
 	tfmr = mfspr(SPR_TFMR);
+	if (!(tfmr & SPR_TFMR_TB_VALID))
+		*out_flags |= OPAL_HMI_FLAGS_TB_RESYNC;
+	if (tfmr & SPR_TFMR_DEC_PARITY_ERR)
+		*out_flags |= OPAL_HMI_FLAGS_DEC_LOST;
 	if (!(tfmr & (SPR_TFMR_TB_RESIDUE_ERR | SPR_TFMR_HDEC_PARITY_ERROR))) {
 		unlock(&hmi_lock);
 		return;
 	}
 
+	/* Tell OS about a possible loss of HDEC */
+	if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR)
+		*out_flags |= OPAL_HMI_FLAGS_HDEC_LOST;
+
 	/* Gather split core information. */
 	split_core_mode = get_split_core_mode();
 	threads_per_core = cpu_thread_count / split_core_mode;
@@ -1015,7 +1021,7 @@ static void pre_recovery_cleanup_p8(uint64_t hmer)
  * - SPR_TFMR_TB_RESIDUE_ERR
  * - SPR_TFMR_HDEC_PARITY_ERROR
  */
-static void pre_recovery_cleanup_p9(uint64_t hmer)
+static void pre_recovery_cleanup_p9(uint64_t hmer, uint64_t *out_flags)
 {
 	uint64_t tfmr;
 	int threads_per_core = cpu_thread_count;
@@ -1043,6 +1049,10 @@ static void pre_recovery_cleanup_p9(uint64_t hmer)
 	 */
 	lock(&hmi_lock);
 	tfmr = mfspr(SPR_TFMR);
+	if (!(tfmr & SPR_TFMR_TB_VALID))
+		*out_flags |= OPAL_HMI_FLAGS_TB_RESYNC;
+	if (tfmr & SPR_TFMR_DEC_PARITY_ERR)
+		*out_flags |= OPAL_HMI_FLAGS_DEC_LOST;
 	if (!(tfmr & (SPR_TFMR_TB_RESIDUE_ERR | SPR_TFMR_HDEC_PARITY_ERROR))) {
 		unlock(&hmi_lock);
 		return;
@@ -1068,6 +1078,10 @@ static void pre_recovery_cleanup_p9(uint64_t hmer)
 	if ((*(this_cpu()->core_hmi_state_ptr) & CORE_THREAD_MASK) == 0)
 		*(this_cpu()->core_hmi_state_ptr) &= ~HMI_STATE_CLEANUP_DONE;
 
+	/* Tell OS about a possible loss of HDEC */
+	if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR)
+		*out_flags |= OPAL_HMI_FLAGS_HDEC_LOST;
+
 	/*
 	 * Clear TB and wait for other threads to finish its cleanup work.
 	 */
@@ -1098,12 +1112,12 @@ static void pre_recovery_cleanup_p9(uint64_t hmer)
 	wait_for_cleanup_complete();
 }
 
-static void pre_recovery_cleanup(uint64_t hmer)
+static void pre_recovery_cleanup(uint64_t hmer, uint64_t *out_flags)
 {
 	if (proc_gen == proc_gen_p9)
-		return pre_recovery_cleanup_p9(hmer);
+		return pre_recovery_cleanup_p9(hmer, out_flags);
 	else
-		return pre_recovery_cleanup_p8(hmer);
+		return pre_recovery_cleanup_p8(hmer, out_flags);
 }
 
 static void hmi_exit(void)
@@ -1135,7 +1149,8 @@ static void hmi_print_debug(const uint8_t *msg, uint64_t hmer)
 	}
 }
 
-int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
+static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
+				uint64_t *out_flags)
 {
 	struct cpu_thread *cpu = this_cpu();
 	int recover = 1;
@@ -1145,7 +1160,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 	 * In case of split core, some of the Timer facility errors need
 	 * cleanup to be done before we proceed with the error recovery.
 	 */
-	pre_recovery_cleanup(hmer);
+	pre_recovery_cleanup(hmer, out_flags);
 
 	lock(&hmi_lock);
 	/*
@@ -1178,7 +1193,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_DONE;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 	}
 	if (hmer & SPR_HMER_PROC_RECV_ERROR_MASKED) {
@@ -1186,7 +1201,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_MASKED;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 		hmi_print_debug("Processor recovery Done (masked).", hmer);
 	}
@@ -1195,7 +1210,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_NO_ERROR;
 			hmi_evt->type = OpalHMI_ERROR_PROC_RECOV_DONE_AGAIN;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 		hmi_print_debug("Processor recovery occurred again before"
 				"bit2 was cleared\n", hmer);
@@ -1206,7 +1221,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 
 		hmi_print_debug("Malfunction Alert", hmer);
 		if (hmi_evt)
-			decode_malfunction(hmi_evt);
+			decode_malfunction(hmi_evt, out_flags);
 	}
 
 	/* Assert if we see Hypervisor resource error, we can not continue. */
@@ -1218,7 +1233,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 		if (hmi_evt) {
 			hmi_evt->severity = OpalHMI_SEV_FATAL;
 			hmi_evt->type = OpalHMI_ERROR_HYP_RESOURCE;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 	}
 
@@ -1237,7 +1252,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 			hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC;
 			hmi_evt->type = OpalHMI_ERROR_TFAC;
 			hmi_evt->tfmr = tfmr;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 	}
 	if (hmer & SPR_HMER_TFMR_PARITY_ERROR) {
@@ -1250,7 +1265,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 			hmi_evt->severity = OpalHMI_SEV_FATAL;
 			hmi_evt->type = OpalHMI_ERROR_TFMR_PARITY;
 			hmi_evt->tfmr = tfmr;
-			queue_hmi_event(hmi_evt, recover);
+			queue_hmi_event(hmi_evt, recover, out_flags);
 		}
 	}
 
@@ -1273,7 +1288,7 @@ int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt)
 
 static int64_t opal_handle_hmi(void)
 {
-	uint64_t hmer;
+	uint64_t hmer, dummy_flags;
 	struct OpalHMIEvent hmi_evt;
 
 	/*
@@ -1286,8 +1301,30 @@ static int64_t opal_handle_hmi(void)
 	hmi_evt.version = OpalHMIEvt_V2;
 
 	hmer = mfspr(SPR_HMER); /* Get HMER register value */
-	handle_hmi_exception(hmer, &hmi_evt);
+	handle_hmi_exception(hmer, &hmi_evt, &dummy_flags);
 
 	return OPAL_SUCCESS;
 }
 opal_call(OPAL_HANDLE_HMI, opal_handle_hmi, 0);
+
+static int64_t opal_handle_hmi2(__be64 *out_flags)
+{
+	uint64_t hmer, flags;
+	struct OpalHMIEvent hmi_evt;
+
+	/*
+	 * Compiled time check to see size of OpalHMIEvent do not exceed
+	 * that of struct opal_msg.
+	 */
+	BUILD_ASSERT(sizeof(struct opal_msg) >= sizeof(struct OpalHMIEvent));
+
+	memset(&hmi_evt, 0, sizeof(struct OpalHMIEvent));
+	hmi_evt.version = OpalHMIEvt_V2;
+
+	hmer = mfspr(SPR_HMER); /* Get HMER register value */
+	handle_hmi_exception(hmer, &hmi_evt, &flags);
+	*out_flags = cpu_to_be64(flags);
+
+	return OPAL_SUCCESS;
+}
+opal_call(OPAL_HANDLE_HMI2, opal_handle_hmi2, 1);
diff --git a/include/opal-api.h b/include/opal-api.h
index df71cf2d7..09c77c18e 100644
--- a/include/opal-api.h
+++ b/include/opal-api.h
@@ -769,6 +769,14 @@ struct OpalHMIEvent {
 	} u;
 };
 
+/* OPAL_HANDLE_HMI2 out_flags */
+enum {
+	OPAL_HMI_FLAGS_TB_RESYNC = (1ull << 0), /* Timebase has been resynced */
+	OPAL_HMI_FLAGS_DEC_LOST = (1ull << 1), /* DEC lost, needs to be reprogrammed */
+	OPAL_HMI_FLAGS_HDEC_LOST = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
+	OPAL_HMI_FLAGS_NEW_EVENT = (1ull << 63), /* An event has been created */
+};
+
 enum {
 	OPAL_P7IOC_DIAG_TYPE_NONE = 0,
 	OPAL_P7IOC_DIAG_TYPE_RGC = 1,
diff --git a/include/opal-internal.h b/include/opal-internal.h
index 8d3d0a177..40bad4572 100644
--- a/include/opal-internal.h
+++ b/include/opal-internal.h
@@ -82,7 +82,6 @@ extern void opal_del_host_sync_notifier(bool (*notify)(void *data));
  * Opal internal function prototype
  */
 struct OpalHMIEvent;
-extern int handle_hmi_exception(__be64 hmer, struct OpalHMIEvent *hmi_evt);
 extern int occ_msg_queue_occ_reset(void);
 
 extern unsigned long top_of_ram;

From patchwork Mon Apr 16 17:33:17 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898826
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2])
	(using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 40PwVd0Yr4z9s3B
	for ; Tue, 17
Apr 2018 03:34:33 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwVc32nbzF1t4 for ; Tue, 17 Apr 2018 03:34:32 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.158.5; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTN0HHWzF1w1 for ; Tue, 17 Apr 2018 03:33:27 +1000 (AEST) Received: from pps.filterd (m0098413.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHUvAf073478 for ; Mon, 16 Apr 2018 13:33:24 -0400 Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109]) by mx0b-001b2d01.pphosted.com with ESMTP id 2hd0760bt4-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:33:24 -0400 Received: from localhost by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:33:22 +0100 Received: from b06cxnps3074.portsmouth.uk.ibm.com (9.149.109.194) by e06smtp13.uk.ibm.com (192.168.101.143) with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted; Mon, 16 Apr 2018 18:33:20 +0100 Received: from d06av23.portsmouth.uk.ibm.com (d06av23.portsmouth.uk.ibm.com [9.149.105.59]) by b06cxnps3074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXJIj3670334; Mon, 16 Apr 2018 17:33:19 GMT Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8C3B5A4055; Mon, 16 Apr 2018 18:25:29 +0100 (BST) Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0CDC7A4040; Mon, 16 Apr 2018 18:25:29 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av23.portsmouth.uk.ibm.com (Postfix) with ESMTP; Mon, 16 Apr 2018 18:25:28 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:03:17 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0012-0000-0000-000005CB7DAD X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18041617-0013-0000-0000-00001947C322 Message-Id: <152389999788.2566.4471105424361806612.stgit@jupiter.in.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155 Subject: [Skiboot] [PATCH v2 04/15] opal/hmi: Move timer related error handling to a separate function X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" From: Benjamin Herrenschmidt Currently no functional change. This is a first step to completely rewriting how these things are handled. Signed-off-by: Benjamin Herrenschmidt --- core/hmi.c | 106 +++++++++++++++++++++++++++++++++--------------------------- 1 file changed, 58 insertions(+), 48 deletions(-) diff --git a/core/hmi.c b/core/hmi.c index 186ff75d7..68583bb1d 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -1120,12 +1120,6 @@ static void pre_recovery_cleanup(uint64_t hmer, uint64_t *out_flags) return pre_recovery_cleanup_p8(hmer, out_flags); } -static void hmi_exit(void) -{ - /* unconditionally unset the thread bit */ - *(this_cpu()->core_hmi_state_ptr) &= ~(this_cpu()->thread_mask); -} - static void hmi_print_debug(const uint8_t *msg, uint64_t hmer) { const char *loc; @@ -1149,18 +1143,70 @@ static void hmi_print_debug(const uint8_t *msg, uint64_t hmer) } } +static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt, + uint64_t *out_flags) +{ + int recover = 1; + uint64_t tfmr; + + pre_recovery_cleanup(hmer, out_flags); + + lock(&hmi_lock); + this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); + + /* + * Assert for now for all TOD errors. In future we need to decode + * TFMR and take corrective action wherever required. 
+ */ + if (hmer & SPR_HMER_TFAC_ERROR) { + tfmr = mfspr(SPR_TFMR); /* save original TFMR */ + + hmi_print_debug("Timer Facility Error", hmer); + + recover = chiptod_recover_tb_errors(); + if (hmi_evt) { + hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC; + hmi_evt->type = OpalHMI_ERROR_TFAC; + hmi_evt->tfmr = tfmr; + queue_hmi_event(hmi_evt, recover, out_flags); + } + } + if (hmer & SPR_HMER_TFMR_PARITY_ERROR) { + tfmr = mfspr(SPR_TFMR); /* save original TFMR */ + + hmi_print_debug("TFMR parity Error", hmer); + recover = chiptod_recover_tb_errors(); + if (hmi_evt) { + hmi_evt->severity = OpalHMI_SEV_FATAL; + hmi_evt->type = OpalHMI_ERROR_TFMR_PARITY; + hmi_evt->tfmr = tfmr; + queue_hmi_event(hmi_evt, recover, out_flags); + } + } + /* Unconditionally unset the thread bit */ + *(this_cpu()->core_hmi_state_ptr) &= ~(this_cpu()->thread_mask); + + /* Set the TB state looking at TFMR register before we head out. */ + this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); + unlock(&hmi_lock); + + return recover; +} + static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, uint64_t *out_flags) { struct cpu_thread *cpu = this_cpu(); int recover = 1; - uint64_t tfmr, handled = 0; + uint64_t handled = 0; - /* - * In case of split core, some of the Timer facility errors need - * cleanup to be done before we proceed with the error recovery. - */ - pre_recovery_cleanup(hmer, out_flags); + /* Handle Timer/TOD errors separately */ + if (hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR)) { + handled = hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR); + mtspr(SPR_HMER, ~handled); + recover = handle_tfac_errors(hmer, hmi_evt, out_flags); + handled = 0; + } lock(&hmi_lock); /* @@ -1168,7 +1214,6 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, * looking at TFMR register. TFMR will tell us correct state of * TB register. 
*/ - cpu->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); prlog(PR_DEBUG, "Received HMI interrupt: HMER = 0x%016llx\n", hmer); if (hmi_evt) hmi_evt->hmer = hmer; @@ -1237,38 +1282,6 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, } } - /* - * Assert for now for all TOD errors. In future we need to decode - * TFMR and take corrective action wherever required. - */ - if (hmer & SPR_HMER_TFAC_ERROR) { - tfmr = mfspr(SPR_TFMR); /* save original TFMR */ - handled |= SPR_HMER_TFAC_ERROR; - - hmi_print_debug("Timer Facility Error", hmer); - - recover = chiptod_recover_tb_errors(); - if (hmi_evt) { - hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC; - hmi_evt->type = OpalHMI_ERROR_TFAC; - hmi_evt->tfmr = tfmr; - queue_hmi_event(hmi_evt, recover, out_flags); - } - } - if (hmer & SPR_HMER_TFMR_PARITY_ERROR) { - tfmr = mfspr(SPR_TFMR); /* save original TFMR */ - handled |= SPR_HMER_TFMR_PARITY_ERROR; - - hmi_print_debug("TFMR parity Error", hmer); - recover = chiptod_recover_tb_errors(); - if (hmi_evt) { - hmi_evt->severity = OpalHMI_SEV_FATAL; - hmi_evt->type = OpalHMI_ERROR_TFMR_PARITY; - hmi_evt->tfmr = tfmr; - queue_hmi_event(hmi_evt, recover, out_flags); - } - } - if (recover == 0) disable_fast_reboot("Unrecoverable HMI"); /* @@ -1279,9 +1292,6 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, * we want to clear. */ mtspr(SPR_HMER, ~handled); - hmi_exit(); - /* Set the TB state looking at TFMR register before we head out. 
*/ - cpu->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); unlock(&hmi_lock); return recover; } From patchwork Mon Apr 16 17:33:24 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mahesh J Salgaonkar X-Patchwork-Id: 898827 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40PwW31mRYz9s3B for ; Tue, 17 Apr 2018 03:34:55 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwW30RkgzF1vC for ; Tue, 17 Apr 2018 03:34:55 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.156.1; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTS6wXvzF1w1 for ; Tue, 17 Apr 2018 03:33:32 +1000 (AEST) Received: from pps.filterd (m0098410.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHVSs9080297 for ; Mon, 16 Apr 2018 13:33:30 -0400 Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109]) by 
mx0a-001b2d01.pphosted.com with ESMTP id 2hcxyfkyy5-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:33:30 -0400 Received: from localhost by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:33:28 +0100 Received: from b06cxnps3074.portsmouth.uk.ibm.com (9.149.109.194) by e06smtp13.uk.ibm.com (192.168.101.143) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Mon, 16 Apr 2018 18:33:26 +0100 Received: from d06av24.portsmouth.uk.ibm.com (mk.ibm.com [9.149.105.60]) by b06cxnps3074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXQxI7799108; Mon, 16 Apr 2018 17:33:26 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A02B842041; Mon, 16 Apr 2018 18:25:00 +0100 (BST) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 23F0542049; Mon, 16 Apr 2018 18:25:00 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Mon, 16 Apr 2018 18:24:59 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:03:24 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0012-0000-0000-000005CB7DB2 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18041617-0013-0000-0000-00001947C327 Message-Id: <152390000456.2566.7629025032566643862.stgit@jupiter.in.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 
clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155 Subject: [Skiboot] [PATCH v2 05/15] opal/hmi: Don't bother passing HMER to pre-recovery cleanup X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" From: Benjamin Herrenschmidt The test for TFAC error is now redundant so we remove it and remove the HMER argument. Signed-off-by: Benjamin Herrenschmidt --- core/hmi.c | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/core/hmi.c b/core/hmi.c index 68583bb1d..c2b44b9d1 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -909,17 +909,13 @@ static int get_split_core_mode(void) * - SPR_TFMR_TB_RESIDUE_ERR * - SPR_TFMR_HDEC_PARITY_ERROR */ -static void pre_recovery_cleanup_p8(uint64_t hmer, uint64_t *out_flags) +static void pre_recovery_cleanup_p8(uint64_t *out_flags) { uint64_t tfmr; uint32_t sibling_thread_mask; int split_core_mode, subcore_id, thread_id, threads_per_core; int i; - /* exit if it is not Time facility error. */ - if (!(hmer & SPR_HMER_TFAC_ERROR)) - return; - /* * Exit if it is not the error that leaves dirty data in timebase * or HDEC register. OR this may be the thread which came in very @@ -1021,16 +1017,12 @@ static void pre_recovery_cleanup_p8(uint64_t hmer, uint64_t *out_flags) * - SPR_TFMR_TB_RESIDUE_ERR * - SPR_TFMR_HDEC_PARITY_ERROR */ -static void pre_recovery_cleanup_p9(uint64_t hmer, uint64_t *out_flags) +static void pre_recovery_cleanup_p9(uint64_t *out_flags) { uint64_t tfmr; int threads_per_core = cpu_thread_count; int i; - /* exit if it is not Time facility error. 
*/ - if (!(hmer & SPR_HMER_TFAC_ERROR)) - return; - /* * Exit if it is not the error that leaves dirty data in timebase * or HDEC register. OR this may be the thread which came in very @@ -1112,12 +1104,12 @@ static void pre_recovery_cleanup_p9(uint64_t hmer, uint64_t *out_flags) wait_for_cleanup_complete(); } -static void pre_recovery_cleanup(uint64_t hmer, uint64_t *out_flags) +static void pre_recovery_cleanup(uint64_t *out_flags) { if (proc_gen == proc_gen_p9) - return pre_recovery_cleanup_p9(hmer, out_flags); + return pre_recovery_cleanup_p9(out_flags); else - return pre_recovery_cleanup_p8(hmer, out_flags); + return pre_recovery_cleanup_p8(out_flags); } static void hmi_print_debug(const uint8_t *msg, uint64_t hmer) @@ -1149,7 +1141,7 @@ static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt, int recover = 1; uint64_t tfmr; - pre_recovery_cleanup(hmer, out_flags); + pre_recovery_cleanup(out_flags); lock(&hmi_lock); this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); From patchwork Mon Apr 16 17:33:31 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Mahesh J Salgaonkar X-Patchwork-Id: 898828 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40PwWF0FqHz9s3B for ; Tue, 17 Apr 2018 03:35:05 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwWD4q1lzF1xV for ; Tue, 17 Apr 2018 03:35:04 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: 
skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.158.5; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTb6NKjzF1t4 for ; Tue, 17 Apr 2018 03:33:39 +1000 (AEST) Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHX8ee092439 for ; Mon, 16 Apr 2018 13:33:36 -0400 Received: from e06smtp15.uk.ibm.com (e06smtp15.uk.ibm.com [195.75.94.111]) by mx0b-001b2d01.pphosted.com with ESMTP id 2hcwc19a79-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:33:36 -0400 Received: from localhost by e06smtp15.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:33:34 +0100 Received: from b06cxnps4075.portsmouth.uk.ibm.com (9.149.109.197) by e06smtp15.uk.ibm.com (192.168.101.145) with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted; Mon, 16 Apr 2018 18:33:33 +0100 Received: from d06av21.portsmouth.uk.ibm.com (d06av21.portsmouth.uk.ibm.com [9.149.105.232]) by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXWun56950804; Mon, 16 Apr 2018 17:33:32 GMT Received: from d06av21.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2179A52041; Mon, 16 Apr 2018 17:24:23 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av21.portsmouth.uk.ibm.com (Postfix) with ESMTP id 725AE52043; Mon, 16 Apr 2018 17:24:22 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:03:31 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0020-0000-0000-000004128C01 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18041617-0021-0000-0000-000042A6C783 Message-Id: <152390001117.2566.4806291625437855943.stgit@jupiter.in.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155 Subject: [Skiboot] [PATCH v2 06/15] opal/hmi: Rework HMI handling of TFAC errors X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" From: Benjamin Herrenschmidt This patch reworks the HMI handling for 
TFAC errors by introducing 4 rendez-vous points to improve thread synchronization while handling timebase errors, which require all threads to clear dirty data from the TB/HDEC registers before clearing the errors. Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Mahesh Salgaonkar --- core/cpu.c | 2 core/hmi.c | 523 +++++++++++++++++++++++------------------------------ hw/chiptod.c | 118 ++++-------- include/chiptod.h | 6 + include/cpu.h | 3 5 files changed, 280 insertions(+), 372 deletions(-) diff --git a/core/cpu.c b/core/cpu.c index 6826fee0a..b8d31e215 100644 --- a/core/cpu.c +++ b/core/cpu.c @@ -1107,7 +1107,6 @@ void init_all_cpus(void) #endif t->core_hmi_state = 0; t->core_hmi_state_ptr = &t->core_hmi_state; - t->thread_mask = 1; /* Add associativity properties */ add_core_associativity(t); @@ -1131,7 +1130,6 @@ void init_all_cpus(void) t->node = cpu; t->chip_id = chip_id; t->core_hmi_state_ptr = &pt->core_hmi_state; - t->thread_mask = 1 << thread; } prlog(PR_INFO, "CPU: %d secondary threads\n", thread); } diff --git a/core/hmi.c b/core/hmi.c index c2b44b9d1..f51512989 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -184,8 +184,12 @@ */ #define NX_HMI_ACTIVE PPC_BIT(54) -/* Number of iterations for the various timeouts */ -#define TIMEOUT_LOOPS 20000000 +/* + * Number of iterations for the various timeouts. We can't use the timebase + * as it might be broken. We measured experimentally that 40 million loops + * of cpu_relax() give us more than 1s. The margin is comfortable enough. + */ +#define TIMEOUT_LOOPS 40000000 /* TFMR other errors.
(other than bit 26 and 45) */ #define SPR_TFMR_OTHER_ERRORS \ @@ -195,6 +199,18 @@ SPR_TFMR_DEC_PARITY_ERR | SPR_TFMR_TFMR_CORRUPT | \ SPR_TFMR_CHIP_TOD_INTERRUPT) +/* TFMR "all core" errors (sent to all threads) */ +#define SPR_TFMR_CORE_ERRORS \ + (SPR_TFMR_TBST_CORRUPT | SPR_TFMR_TB_MISSING_SYNC | \ + SPR_TFMR_TB_MISSING_STEP | SPR_TFMR_FW_CONTROL_ERR | \ + SPR_TFMR_TFMR_CORRUPT | SPR_TFMR_TB_RESIDUE_ERR | \ + SPR_TFMR_HDEC_PARITY_ERROR | SPR_TFMR_CHIP_TOD_INTERRUPT) + +/* TFMR "thread" errors */ +#define SPR_TFMR_THREAD_ERRORS \ + (SPR_TFMR_PURR_PARITY_ERR | SPR_TFMR_SPURR_PARITY_ERR | \ + SPR_TFMR_DEC_PARITY_ERR) + static const struct core_xstop_bit_info { uint8_t bit; /* CORE FIR bit number */ enum OpalHMI_CoreXstopReason reason; @@ -827,360 +843,283 @@ static void decode_malfunction(struct OpalHMIEvent *hmi_evt, uint64_t *out_flags *out_flags |= flags; } -static void wait_for_cleanup_complete(void) -{ - uint64_t timeout = 0; - - smt_lowest(); - while (!(*(this_cpu()->core_hmi_state_ptr) & HMI_STATE_CLEANUP_DONE)) { - /* - * We use a fixed number of TIMEOUT_LOOPS rather - * than using the timebase to do a pseudo-wall time - * timeout due to the fact that timebase may not actually - * work at this point in time. - */ - if (++timeout >= (TIMEOUT_LOOPS*3)) { - /* - * Break out the loop here and fall through - * recovery code. If recovery fails, kernel will get - * informed about the failure. This way we can avoid - * looping here if other threads are stuck. - */ - prlog(PR_DEBUG, "TB pre-recovery timeout\n"); - break; - } - barrier(); - } - smt_medium(); -} - /* - * For successful recovery of TB residue error, remove dirty data - * from TB/HDEC register in each active partition (subcore). Writing - * zero's to TB/HDEC will achieve the same. + * This will "rendez-vous" all threads on the core to the rendez-vous + * id "sig". You need to make sure that "sig" is different from the + * previous rendez vous. 
The sig value must be between 0 and 7 with + * boot time being set to 0. + * + * Note: in theory, we could just use a flip flop "sig" in the thread + * structure (binary rendez-vous with no argument). This is a bit more + * debuggable and better at handling timeouts (arguably). + * + * This should be called with no lock held */ -static void timer_facility_do_cleanup(uint64_t tfmr) +static void hmi_rendez_vous(uint32_t sig) { + struct cpu_thread *t = this_cpu(); + uint32_t my_id = cpu_get_thread_index(t); + uint32_t my_shift = my_id << 2; + uint32_t *sptr = t->core_hmi_state_ptr; + uint32_t val, prev, shift, i; + uint64_t timeout; + + assert(sig <= 0x7); + /* - * Workaround for HW logic bug in Power9. Do not reset the - * TB register if TB is valid and running. + * Mark ourselves as having reached the rendez vous point with + * the exit bit cleared */ - if ((tfmr & SPR_TFMR_TB_RESIDUE_ERR) && !(tfmr & SPR_TFMR_TB_VALID)) { + do { + val = prev = *sptr; + val &= ~(0xfu << my_shift); + val |= sig << my_shift; + } while (cmpxchg32(sptr, prev, val) != prev); - /* Reset the TB register to clear the dirty data. */ - mtspr(SPR_TBWU, 0); - mtspr(SPR_TBWL, 0); + /* + * Wait for everybody else to reach that point, ignore the + * exit bit as another thread could have already set it. + */ + for (i = 0; i < cpu_thread_count; i++) { + shift = i << 2; + + timeout = TIMEOUT_LOOPS; + while (((*sptr >> shift) & 0x7) != sig && --timeout) + cpu_relax(); + if (!timeout) + prlog(PR_ERR, "Rendez-vous stage 1 timeout, CPU 0x%x" + " waiting for thread %d\n", t->pir, i); } - if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) { - /* Reset HDEC register */ - mtspr(SPR_HDEC, 0); + /* Set the exit bit */ + do { + val = prev = *sptr; + val &= ~(0xfu << my_shift); + val |= (sig | 8) << my_shift; + } while (cmpxchg32(sptr, prev, val) != prev); + + /* At this point, we need to wait for everybody else to have a value + * that is *not* sig. IE.
they either have set the exit bit *or* they + * have changed the rendez-vous (meaning they have moved on to another + * rendez vous point). + */ + for (i = 0; i < cpu_thread_count; i++) { + shift = i << 2; + + timeout = TIMEOUT_LOOPS; + while (((*sptr >> shift) & 0xf) == sig && --timeout) + cpu_relax(); + if (!timeout) + prlog(PR_ERR, "Rendez-vous stage 2 timeout, CPU 0x%x" + " waiting for thread %d\n", t->pir, i); } } -static int get_split_core_mode(void) +static void hmi_print_debug(const uint8_t *msg, uint64_t hmer) { - uint64_t hid0; + const char *loc; + uint32_t core_id, thread_index; - hid0 = mfspr(SPR_HID0); - if (hid0 & SPR_HID0_POWER8_2LPARMODE) - return 2; - else if (hid0 & SPR_HID0_POWER8_4LPARMODE) - return 4; + core_id = pir_to_core_id(this_cpu()->pir); + thread_index = cpu_get_thread_index(this_cpu()); - return 1; -} + loc = chip_loc_code(this_cpu()->chip_id); + if (!loc) + loc = "Not Available"; + if (hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR)) { + prlog(PR_DEBUG, "[Loc: %s]: P:%d C:%d T:%d: TFMR(%016lx) %s\n", + loc, this_cpu()->chip_id, core_id, thread_index, + mfspr(SPR_TFMR), msg); + } else { + prlog(PR_DEBUG, "[Loc: %s]: P:%d C:%d T:%d: %s\n", + loc, this_cpu()->chip_id, core_id, thread_index, + msg); + } +} -/* - * Certain TB/HDEC errors leaves dirty data in timebase and hdec register - * which need to cleared before we initiate clear_tb_errors through TFMR[24]. - * The cleanup has to be done by once by any one thread from core or subcore. - * - * In split core mode, it is required to clear the dirty data from TB/HDEC - * register by all subcores (active partitions) before we clear tb errors - * through TFMR[24]. The HMI recovery would fail even if one subcore do - * not cleanup the respective TB/HDEC register. - * - * For un-split core, any one thread can do the cleanup. - * For split core, any one thread from each subcore can do the cleanup. 
- * - * Errors that required pre-recovery cleanup: - * - SPR_TFMR_TB_RESIDUE_ERR - * - SPR_TFMR_HDEC_PARITY_ERROR - */ -static void pre_recovery_cleanup_p8(uint64_t *out_flags) +static int handle_thread_tfac_error(uint64_t tfmr, uint64_t *out_flags) { - uint64_t tfmr; - uint32_t sibling_thread_mask; - int split_core_mode, subcore_id, thread_id, threads_per_core; - int i; + int recover = 1; - /* - * Exit if it is not the error that leaves dirty data in timebase - * or HDEC register. OR this may be the thread which came in very - * late and recovery is been already done. - * - * TFMR is per [sub]core register. If any one thread on the [sub]core - * does the recovery it reflects in TFMR register and applicable to - * all threads in that [sub]core. Hence take a lock before checking - * TFMR errors. Once a thread from a [sub]core completes the - * recovery, all other threads on that [sub]core will return from - * here. - * - * If TFMR does not show error that we are looking for, return - * from here. We would just fall through recovery code which would - * check for other errors on TFMR and fix them. - */ - lock(&hmi_lock); - tfmr = mfspr(SPR_TFMR); - if (!(tfmr & SPR_TFMR_TB_VALID)) - *out_flags |= OPAL_HMI_FLAGS_TB_RESYNC; if (tfmr & SPR_TFMR_DEC_PARITY_ERR) *out_flags |= OPAL_HMI_FLAGS_DEC_LOST; - if (!(tfmr & (SPR_TFMR_TB_RESIDUE_ERR | SPR_TFMR_HDEC_PARITY_ERROR))) { - unlock(&hmi_lock); - return; - } + if (!tfmr_recover_local_errors(tfmr)) + recover = 0; + tfmr &= ~(SPR_TFMR_PURR_PARITY_ERR | + SPR_TFMR_SPURR_PARITY_ERR | + SPR_TFMR_DEC_PARITY_ERR); + return recover; +} - /* Tell OS about a possible loss of HDEC */ - if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) - *out_flags |= OPAL_HMI_FLAGS_HDEC_LOST; +static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags) +{ + struct cpu_thread *t, *t0; + int recover = 1; - /* Gather split core information. 
*/ - split_core_mode = get_split_core_mode(); - threads_per_core = cpu_thread_count / split_core_mode; + t = this_cpu(); + t0 = find_cpu_by_pir(cpu_get_thread0(t)); - /* Prepare core/subcore sibling mask */ - thread_id = cpu_get_thread_index(this_cpu()); - subcore_id = thread_id / threads_per_core; - sibling_thread_mask = SUBCORE_THREAD_MASK(subcore_id, threads_per_core); + /* Rendez vous all threads */ + hmi_rendez_vous(1); - /* - * First thread on the core ? - * if yes, setup the hmi cleanup state to !DONE + /* We use a lock here as some of the TFMR bits are shared and I + * prefer avoiding doing the cleanup simultaneously. */ - if ((*(this_cpu()->core_hmi_state_ptr) & CORE_THREAD_MASK) == 0) - *(this_cpu()->core_hmi_state_ptr) &= ~HMI_STATE_CLEANUP_DONE; + lock(&hmi_lock); - /* - * First thread on subcore ? - * if yes, do cleanup. - * - * Clear TB and wait for other threads (one from each subcore) to - * finish its cleanup work. + /* First handle corrupt TFMR otherwise we can't trust anything. + * We'll use a lock here so that the threads don't try to do it at + * the same time */ + if (tfmr & SPR_TFMR_TFMR_CORRUPT) { + /* Check if it's still in error state */ + if (mfspr(SPR_TFMR) & SPR_TFMR_TFMR_CORRUPT) + if (!recover_corrupt_tfmr()) + recover = 0; - if ((*(this_cpu()->core_hmi_state_ptr) & sibling_thread_mask) == 0) - timer_facility_do_cleanup(tfmr); + if (!recover) + goto error_out; - /* - * Mark this thread bit. This bit will stay on until this thread - * exit from handle_hmi_exception(). - */ - *(this_cpu()->core_hmi_state_ptr) |= this_cpu()->thread_mask; + tfmr = mfspr(SPR_TFMR); - /* - * Check if each subcore has completed the cleanup work. - * if yes, then notify all the threads that we are done with cleanup. 
- */ - for (i = 0; i < split_core_mode; i++) { - uint32_t subcore_thread_mask = - SUBCORE_THREAD_MASK(i, threads_per_core); - if (!(*(this_cpu()->core_hmi_state_ptr) & subcore_thread_mask)) - break; + /* We could have got new thread errors in the meantime */ + if (tfmr & SPR_TFMR_THREAD_ERRORS) { + recover = handle_thread_tfac_error(tfmr, out_flags); + tfmr &= ~SPR_TFMR_THREAD_ERRORS; + } + if (!recover) + goto error_out; } - if (i == split_core_mode) - *(this_cpu()->core_hmi_state_ptr) |= HMI_STATE_CLEANUP_DONE; - - unlock(&hmi_lock); + /* Tell the OS ... */ + if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) + *out_flags |= OPAL_HMI_FLAGS_HDEC_LOST; - /* Wait for other subcore to complete the cleanup. */ - wait_for_cleanup_complete(); -} + /* Cleanup bad HDEC or TB on all threads or subcures before we clear + * the error conditions + */ + tfmr_cleanup_core_errors(tfmr); -/* - * Certain TB/HDEC errors leaves dirty data in timebase and hdec register - * which need to cleared before we initiate clear_tb_errors through TFMR[24]. - * The cleanup has to be done by all the threads from core in p9. - * - * On TB/HDEC errors, all 4 threads on the affected receives HMI. On power9, - * every thread on the core has its own copy of TB and hence every thread - * has to clear the dirty data from its own TB register before we clear tb - * errors through TFMR[24]. The HMI recovery would fail even if one thread - * do not cleanup the respective TB/HDEC register. - * - * There is no split core mode in power9. - * - * Errors that required pre-recovery cleanup: - * - SPR_TFMR_TB_RESIDUE_ERR - * - SPR_TFMR_HDEC_PARITY_ERROR - */ -static void pre_recovery_cleanup_p9(uint64_t *out_flags) -{ - uint64_t tfmr; - int threads_per_core = cpu_thread_count; - int i; + /* Unlock before next rendez-vous */ + unlock(&hmi_lock); - /* - * Exit if it is not the error that leaves dirty data in timebase - * or HDEC register. OR this may be the thread which came in very - * late and recovery is been already done. 
- * - * TFMR is per core register. Ideally if any one thread on the core - * does the recovery it should reflect in TFMR register and - * applicable to all threads in that core. Hence take a lock before - * checking TFMR errors. Once a thread from a core completes the - * recovery, all other threads on that core will return from - * here. - * - * If TFMR does not show error that we are looking for, return - * from here. We would just fall through recovery code which would - * check for other errors on TFMR and fix them. + /* Second rendez vous, ensure the above cleanups are all done before + * we proceed further */ - lock(&hmi_lock); - tfmr = mfspr(SPR_TFMR); - if (!(tfmr & SPR_TFMR_TB_VALID)) - *out_flags |= OPAL_HMI_FLAGS_TB_RESYNC; - if (tfmr & SPR_TFMR_DEC_PARITY_ERR) - *out_flags |= OPAL_HMI_FLAGS_DEC_LOST; - if (!(tfmr & (SPR_TFMR_TB_RESIDUE_ERR | SPR_TFMR_HDEC_PARITY_ERROR))) { - unlock(&hmi_lock); - return; - } + hmi_rendez_vous(2); - /* - * Due to a HW logic bug in p9, TFMR bit 26 and 45 always set - * once TB residue or HDEC errors occurs at first time. Hence for HMI - * on subsequent TB errors add additional check as workaround to - * identify validity of the errors and decide whether pre-recovery - * is required or not. Exit pre-recovery if there are other TB - * errors also present on TFMR. - */ - if (tfmr & SPR_TFMR_OTHER_ERRORS) { - unlock(&hmi_lock); - return; + /* We can now clear the error conditions in the core. */ + if (!tfmr_clear_core_errors(tfmr)) { + recover = 0; + goto error_out; } - /* - * First thread on the core ? - * if yes, setup the hmi cleanup state to !DONE + /* Third rendez-vous. We could in theory do the timebase resync as + * part of the previous one, but I prefer having all the error + * conditions cleared before we start trying. 
*/ - if ((*(this_cpu()->core_hmi_state_ptr) & CORE_THREAD_MASK) == 0) - *(this_cpu()->core_hmi_state_ptr) &= ~HMI_STATE_CLEANUP_DONE; + hmi_rendez_vous(3); - /* Tell OS about a possible loss of HDEC */ - if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) - *out_flags |= OPAL_HMI_FLAGS_HDEC_LOST; + /* Now perform the actual TB recovery on thread 0 */ + if (t == t0) + recover = chiptod_recover_tb_errors(tfmr, + &this_cpu()->tb_resynced); - /* - * Clear TB and wait for other threads to finish its cleanup work. - */ - timer_facility_do_cleanup(tfmr); - - /* - * Mark this thread bit. This bit will stay on until this thread - * exit from handle_hmi_exception(). - */ - *(this_cpu()->core_hmi_state_ptr) |= this_cpu()->thread_mask; +error_out: + /* Last rendez-vous */ + hmi_rendez_vous(4); - /* - * Check if each thread has completed the cleanup work. - * if yes, then notify all the threads that we are done with cleanup. + /* Now all threads have gone past rendez-vous 3 and not yet past another + * rendez-vous 1, so the value of tb_resynced of thread 0 of the core + * contains an accurate indication as to whether the timebase was lost. */ - for (i = 0; i < threads_per_core; i++) { - uint32_t thread_mask = SINGLE_THREAD_MASK(i); - if (!(*(this_cpu()->core_hmi_state_ptr) & thread_mask)) - break; - } - - if (i == threads_per_core) - *(this_cpu()->core_hmi_state_ptr) |= HMI_STATE_CLEANUP_DONE; - - unlock(&hmi_lock); - - /* Wait for other threads to complete the cleanup. 
*/ - wait_for_cleanup_complete(); -} + if (t0->tb_resynced) + *out_flags |= OPAL_HMI_FLAGS_TB_RESYNC; -static void pre_recovery_cleanup(uint64_t *out_flags) -{ - if (proc_gen == proc_gen_p9) - return pre_recovery_cleanup_p9(out_flags); - else - return pre_recovery_cleanup_p8(out_flags); + return recover; } -static void hmi_print_debug(const uint8_t *msg, uint64_t hmer) +static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt, + uint64_t *out_flags) { - const char *loc; - uint32_t core_id, thread_index; + int recover = 1; + uint64_t tfmr = mfspr(SPR_TFMR); - core_id = pir_to_core_id(this_cpu()->pir); - thread_index = cpu_get_thread_index(this_cpu()); + /* A TFMR parity error makes us ignore all the local stuff */ + if ((hmer & SPR_HMER_TFMR_PARITY_ERROR) || (tfmr & SPR_TFMR_TFMR_CORRUPT)) { + /* Mark TB as invalid for now as we don't trust TFMR, we'll fix + * it up later + */ + this_cpu()->tb_invalid = true; + goto bad_tfmr; + } - loc = chip_loc_code(this_cpu()->chip_id); - if (!loc) - loc = "Not Available"; + this_cpu()->tb_invalid = !(tfmr & SPR_TFMR_TB_VALID); - if (hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR)) { - prlog(PR_DEBUG, "[Loc: %s]: P:%d C:%d T:%d: TFMR(%016lx) %s\n", - loc, this_cpu()->chip_id, core_id, thread_index, - mfspr(SPR_TFMR), msg); - } else { - prlog(PR_DEBUG, "[Loc: %s]: P:%d C:%d T:%d: %s\n", - loc, this_cpu()->chip_id, core_id, thread_index, - msg); + /* P9 errata: In theory, an HDEC error is sent to all threads. However, + * due to an errata on P9 where TFMR bit 26 (HDEC parity) cannot be + * cleared on thread 1..3, I am not confident we can do a rendez-vous + * in all cases. + * + * Our current approach is to ignore that error unless no other TFAC + * error is present in the TFMR. The error will be re-detected and + * re-reported if necessary. 
+ */ + if (proc_gen == proc_gen_p9 && (tfmr & SPR_TFMR_HDEC_PARITY_ERROR)) { + if (this_cpu()->tb_invalid || (tfmr & SPR_TFMR_OTHER_ERRORS)) + tfmr &= ~SPR_TFMR_HDEC_PARITY_ERROR; } -} -static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt, - uint64_t *out_flags) -{ - int recover = 1; - uint64_t tfmr; + /* The TB residue error is ignored if TB is valid due to a similar + * errata as above + */ + if ((tfmr & SPR_TFMR_TB_RESIDUE_ERR) && !this_cpu()->tb_invalid) + tfmr &= ~SPR_TFMR_TB_RESIDUE_ERR; - pre_recovery_cleanup(out_flags); + /* First, handle thread local errors */ + if (tfmr & SPR_TFMR_THREAD_ERRORS) { + recover = handle_thread_tfac_error(tfmr, out_flags); + tfmr &= ~SPR_TFMR_THREAD_ERRORS; + } - lock(&hmi_lock); - this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); + bad_tfmr: - /* - * Assert for now for all TOD errors. In future we need to decode - * TFMR and take corrective action wherever required. + /* Let's see if we still have a all-core error to deal with, if + * not, we just bail out */ - if (hmer & SPR_HMER_TFAC_ERROR) { - tfmr = mfspr(SPR_TFMR); /* save original TFMR */ + if (tfmr & SPR_TFMR_CORE_ERRORS) { + int recover2; - hmi_print_debug("Timer Facility Error", hmer); - - recover = chiptod_recover_tb_errors(); - if (hmi_evt) { - hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC; - hmi_evt->type = OpalHMI_ERROR_TFAC; - hmi_evt->tfmr = tfmr; - queue_hmi_event(hmi_evt, recover, out_flags); - } + /* Only update "recover" if it's not already 0 (non-recovered) + */ + recover2 = handle_all_core_tfac_error(tfmr, out_flags); + if (recover != 0) + recover = recover2; + } else if (this_cpu()->tb_invalid) { + /* This shouldn't happen, TB is invalid and no global error + * was reported. We just return for now assuming one will + * be. We can't do a rendez vous without a core-global HMI. + */ + prlog(PR_ERR, "HMI: TB invalid without core error reported ! 
" + "CPU=%x, TFMR=0x%016lx\n", this_cpu()->pir, + mfspr(SPR_TFMR)); } - if (hmer & SPR_HMER_TFMR_PARITY_ERROR) { - tfmr = mfspr(SPR_TFMR); /* save original TFMR */ - hmi_print_debug("TFMR parity Error", hmer); - recover = chiptod_recover_tb_errors(); - if (hmi_evt) { - hmi_evt->severity = OpalHMI_SEV_FATAL; - hmi_evt->type = OpalHMI_ERROR_TFMR_PARITY; - hmi_evt->tfmr = tfmr; - queue_hmi_event(hmi_evt, recover, out_flags); - } + if (hmi_evt) { + hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC; + hmi_evt->type = OpalHMI_ERROR_TFAC; + hmi_evt->tfmr = tfmr; + queue_hmi_event(hmi_evt, recover, out_flags); } - /* Unconditionally unset the thread bit */ - *(this_cpu()->core_hmi_state_ptr) &= ~(this_cpu()->thread_mask); /* Set the TB state looking at TFMR register before we head out. */ this_cpu()->tb_invalid = !(mfspr(SPR_TFMR) & SPR_TFMR_TB_VALID); - unlock(&hmi_lock); + + if (this_cpu()->tb_invalid) + prlog(PR_WARNING, "Failed to get TB in running state! " + "CPU=%x, TFMR=%016lx\n", this_cpu()->pir, + mfspr(SPR_TFMR)); return recover; } diff --git a/hw/chiptod.c b/hw/chiptod.c index cacc27340..a160e5a10 100644 --- a/hw/chiptod.c +++ b/hw/chiptod.c @@ -1370,17 +1370,10 @@ static bool tfmr_recover_tb_errors(uint64_t tfmr) return true; } -static bool tfmr_recover_non_tb_errors(uint64_t tfmr) +bool tfmr_recover_local_errors(uint64_t tfmr) { uint64_t tfmr_reset_errors = 0; - /* - * write 1 to bit 26 to clear TFMR HDEC parity error. - * HDEC register has already been reset to zero as part pre-recovery. - */ - if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) - tfmr_reset_errors |= SPR_TFMR_HDEC_PARITY_ERROR; - if (tfmr & SPR_TFMR_DEC_PARITY_ERR) { /* Set DEC with all ones */ mtspr(SPR_DEC, ~0); @@ -1390,11 +1383,11 @@ static bool tfmr_recover_non_tb_errors(uint64_t tfmr) } /* - * Reset PURR/SPURR to recover. We also need help from KVM - * layer to handle this change in PURR/SPURR. That needs - * to be handled in kernel KVM layer. For now, to recover just - * reset it. 
- */ + * Reset PURR/SPURR to recover. We also need help from KVM + * layer to handle this change in PURR/SPURR. That needs + * to be handled in kernel KVM layer. For now, to recover just + * reset it. + */ if (tfmr & SPR_TFMR_PURR_PARITY_ERR) { /* set PURR register with sane value or reset it. */ mtspr(SPR_PURR, 0); @@ -1432,7 +1425,7 @@ static bool tfmr_recover_non_tb_errors(uint64_t tfmr) * MT(TFMR) bits 11 and 60 are b’1’ * MT(HMER) all bits 1 except for bits 4,5 */ -static bool chiptod_recover_tfmr_error(void) +bool recover_corrupt_tfmr(void) { uint64_t tfmr; @@ -1468,6 +1461,37 @@ static bool chiptod_recover_tfmr_error(void) return true; } +void tfmr_cleanup_core_errors(uint64_t tfmr) +{ + /* If HDEC is bad, clean it on all threads before we clear the + * error condition. + */ + if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) + mtspr(SPR_HDEC, 0); + + /* If TB is invalid, clean it on all threads as well, it will be + * restored after the next rendez-vous + */ + if (!(tfmr & SPR_TFMR_TB_VALID)) { + mtspr(SPR_TBWU, 0); + mtspr(SPR_TBWU, 0); + } +} + +bool tfmr_clear_core_errors(uint64_t tfmr) +{ + uint64_t tfmr_reset_errors = 0; + + if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR) + tfmr_reset_errors |= SPR_TFMR_HDEC_PARITY_ERROR; + + /* Write TFMR twice to clear the error */ + mtspr(SPR_TFMR, base_tfmr | tfmr_reset_errors); + mtspr(SPR_TFMR, base_tfmr | tfmr_reset_errors); + + return true; +} + /* * Recover from TB and TOD errors. * Timebase register is per core and first thread that gets chance to @@ -1481,46 +1505,17 @@ static bool chiptod_recover_tfmr_error(void) * 1 <= Successfully recovered from errors * -1 <= No errors found. Errors are already been fixed. 
*/ -int chiptod_recover_tb_errors(void) +int chiptod_recover_tb_errors(uint64_t tfmr, bool *out_resynced) { - uint64_t tfmr; int rc = -1; - int thread_id; + + *out_resynced = false; if (chiptod_primary < 0) return 0; lock(&chiptod_lock); - /* Get fresh copy of TFMR */ - tfmr = mfspr(SPR_TFMR); - - /* - * Check for TFMR parity error and recover from it. - * We can not trust any other bits in TFMR If it is corrupt. Fix this - * before we do anything. - */ - if (tfmr & SPR_TFMR_TFMR_CORRUPT) { - if (!chiptod_recover_tfmr_error()) { - rc = 0; - goto error_out; - } - } - - /* Get fresh copy of TFMR */ - tfmr = mfspr(SPR_TFMR); - - /* - * Workaround for HW logic bug in Power9 - * Even after clearing TB residue error by one thread it does not - * get reflected to other threads on same core. - * Check if TB is already valid and skip the checking of TB errors. - */ - - if ((proc_gen == proc_gen_p9) && (tfmr & SPR_TFMR_TB_RESIDUE_ERR) - && (tfmr & SPR_TFMR_TB_VALID)) - goto skip_tb_error_clear; - /* * Check for TB errors. * On Sync check error, bit 44 of TFMR is set. Check for it and @@ -1544,7 +1539,6 @@ int chiptod_recover_tb_errors(void) } } -skip_tb_error_clear: /* * Check for TOD sync check error. * On TOD errors, bit 51 of TFMR is set. If this bit is on then we @@ -1574,35 +1568,9 @@ skip_tb_error_clear: if (!chiptod_to_tb()) goto error_out; - /* We have successfully able to get TB running. */ - rc = 1; - } + *out_resynced = true; - /* - * Workaround for HW logic bug in power9. - * In idea case (without the HW bug) only one thread from the core - * would have fallen through tfmr_recover_non_tb_errors() to clear - * HDEC parity error on TFMR. - * - * Hence to achieve same behavior, allow only thread 0 to clear the - * HDEC parity error. And for rest of the threads just reset the bit - * to avoid other threads to fall through tfmr_recover_non_tb_errors(). 
- */ - thread_id = cpu_get_thread_index(this_cpu()); - if ((proc_gen == proc_gen_p9) && thread_id) - tfmr &= ~SPR_TFMR_HDEC_PARITY_ERROR; - - /* - * Now that TB is running, check for TFMR non-TB errors. - */ - if ((tfmr & SPR_TFMR_HDEC_PARITY_ERROR) || - (tfmr & SPR_TFMR_PURR_PARITY_ERR) || - (tfmr & SPR_TFMR_SPURR_PARITY_ERR) || - (tfmr & SPR_TFMR_DEC_PARITY_ERR)) { - if (!tfmr_recover_non_tb_errors(tfmr)) { - rc = 0; - goto error_out; - } + /* We have successfully able to get TB running. */ rc = 1; } diff --git a/include/chiptod.h b/include/chiptod.h index fd5cd9644..7708d4899 100644 --- a/include/chiptod.h +++ b/include/chiptod.h @@ -29,7 +29,11 @@ enum chiptod_topology { extern void chiptod_init(void); extern bool chiptod_wakeup_resync(void); -extern int chiptod_recover_tb_errors(void); +extern int chiptod_recover_tb_errors(uint64_t tfmr, bool *out_resynced); +extern bool tfmr_recover_local_errors(uint64_t tfmr); +extern bool recover_corrupt_tfmr(void); +extern void tfmr_cleanup_core_errors(uint64_t tfmr); +extern bool tfmr_clear_core_errors(uint64_t tfmr); extern void chiptod_reset_tb(void); extern bool chiptod_adjust_topology(enum chiptod_topology topo, bool enable); extern bool chiptod_capp_timebase_sync(unsigned int chip_id, uint32_t tfmr_addr, diff --git a/include/cpu.h b/include/cpu.h index b7cd588d5..68f246396 100644 --- a/include/cpu.h +++ b/include/cpu.h @@ -97,9 +97,8 @@ struct cpu_thread { */ uint32_t core_hmi_state; /* primary only */ uint32_t *core_hmi_state_ptr; - /* Mask to indicate thread id in core. 
*/ - uint8_t thread_mask; bool tb_invalid; + bool tb_resynced; /* For use by XICS emulation on XIVE */ struct xive_cpu_state *xstate; From patchwork Mon Apr 16 17:33:43 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mahesh J Salgaonkar X-Patchwork-Id: 898829 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40PwWY2ltdz9s3B for ; Tue, 17 Apr 2018 03:35:21 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwWY13cNzF1x5 for ; Tue, 17 Apr 2018 03:35:21 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.156.1; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTs152SzF1wG for ; Tue, 17 Apr 2018 03:33:52 +1000 (AEST) Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHWBgt095877 for ; Mon, 16 Apr 2018 13:33:50 -0400 Received: from e06smtp10.uk.ibm.com 
(e06smtp10.uk.ibm.com [195.75.94.106]) by mx0a-001b2d01.pphosted.com with ESMTP id 2hcy58kca3-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:33:50 -0400 Received: from localhost by e06smtp10.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:33:48 +0100 Received: from b06cxnps3075.portsmouth.uk.ibm.com (9.149.109.195) by e06smtp10.uk.ibm.com (192.168.101.140) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Mon, 16 Apr 2018 18:33:44 +0100 Received: from d06av22.portsmouth.uk.ibm.com (d06av22.portsmouth.uk.ibm.com [9.149.105.58]) by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXiG630408884; Mon, 16 Apr 2018 17:33:44 GMT Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9AE4D4C044; Mon, 16 Apr 2018 18:26:16 +0100 (BST) Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 202E44C04E; Mon, 16 Apr 2018 18:26:16 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av22.portsmouth.uk.ibm.com (Postfix) with ESMTP; Mon, 16 Apr 2018 18:26:15 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:03:43 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0040-0000-0000-0000042FA0C1 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18041617-0041-0000-0000-00002633A6A4 Message-Id: <152390001795.2566.10820986169925152182.stgit@jupiter.in.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 
suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155
Subject: [Skiboot] [PATCH v2 07/15] opal/hmi: Initialize the hmi event with old value of HMER.
X-BeenThere: skiboot@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Mailing list for skiboot development
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: ,
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Mahesh Salgaonkar

Do this before we check for TFAC errors. Otherwise the event on the host
console shows no error reported in the HMER register, because the TFAC
handler queues the event before hmi_evt->hmer has been filled in.

Without this patch the console event shows HMER as all zeros:

[ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
[ 216.753498] Error detail: Timer facility experienced an error
[ 216.753509] HMER: 0000000000000000
[ 216.753518] TFMR: 3c12000870e04000

After this patch it shows the old HMER value on the host console:

[ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
[ 2237.652651] Error detail: Timer facility experienced an error
[ 2237.652766] HMER: 0840000000000000
[ 2237.652837] TFMR: 3c12000870e04000

Signed-off-by: Mahesh Salgaonkar
--- core/hmi.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/core/hmi.c b/core/hmi.c index f51512989..95ab96cde 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -1131,8 +1131,14 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, int recover = 1; uint64_t handled = 0; + prlog(PR_DEBUG, "Received HMI interrupt: HMER = 0x%016llx\n", hmer); + /* Initialize the hmi event with old value of HMER */ + if (hmi_evt) + hmi_evt->hmer = hmer; + /* Handle Timer/TOD errors separately */ if (hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR)) { + hmi_print_debug("Timer Facility Error", hmer); handled = hmer & (SPR_HMER_TFAC_ERROR | 
SPR_HMER_TFMR_PARITY_ERROR); mtspr(SPR_HMER, ~handled); recover = handle_tfac_errors(hmer, hmi_evt, out_flags); @@ -1145,9 +1151,6 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt, * looking at TFMR register. TFMR will tell us correct state of * TB register. */ - prlog(PR_DEBUG, "Received HMI interrupt: HMER = 0x%016llx\n", hmer); - if (hmi_evt) - hmi_evt->hmer = hmer; if (hmer & SPR_HMER_PROC_RECV_DONE) { uint32_t chip_id = pir_to_chip_id(cpu->pir); uint32_t core_id = pir_to_core_id(cpu->pir); From patchwork Mon Apr 16 17:33:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mahesh J Salgaonkar X-Patchwork-Id: 898830 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40PwWv3rdDz9s3G for ; Tue, 17 Apr 2018 03:35:39 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwWv2XszzF1xW for ; Tue, 17 Apr 2018 03:35:39 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.156.1; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwTy46pKzF1x2 for ; Tue, 17 Apr 2018 03:33:58 +1000 (AEST) Received: from pps.filterd (m0098394.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHV1IQ110392 for ; Mon, 16 Apr 2018 13:33:56 -0400 Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109]) by mx0a-001b2d01.pphosted.com with ESMTP id 2hcxcxnxr8-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:33:56 -0400 Received: from localhost by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:33:53 +0100 Received: from b06cxnps4074.portsmouth.uk.ibm.com (9.149.109.196) by e06smtp13.uk.ibm.com (192.168.101.143) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Mon, 16 Apr 2018 18:33:51 +0100 Received: from d06av23.portsmouth.uk.ibm.com (d06av23.portsmouth.uk.ibm.com [9.149.105.59]) by b06cxnps4074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHXpNm45416650; Mon, 16 Apr 2018 17:33:51 GMT Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3CEDFA4040; Mon, 16 Apr 2018 18:26:01 +0100 (BST) Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BD328A4057; Mon, 16 Apr 2018 18:26:00 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av23.portsmouth.uk.ibm.com (Postfix) with ESMTP; Mon, 16 Apr 2018 18:26:00 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:03:49 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0012-0000-0000-000005CB7DB4 
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 18041617-0013-0000-0000-00001947C329
Message-Id: <152390002962.2566.13673260110761320050.stgit@jupiter.in.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155
Subject: [Skiboot] [PATCH v2 08/15] opal/hmi: Do not send HMI event if no errors are found.
X-BeenThere: skiboot@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Mailing list for skiboot development
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: ,
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Mahesh Salgaonkar

For TOD errors, all the cores in the chip get HMIs. Any one thread from
any core can fix the issue, after which TFMR will have the error
conditions cleared. The rest of the threads need not take any action if
the TOD errors are already cleared. Hence thread 0 of every core should
get a fresh copy of TFMR before going down the recovery path.

Initialize recover = -1, so that a thread that finds no errors need not
send an HMI event to Linux. This stops every thread from flooding the
host with HMI events even when no errors are found.
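As a rough illustration of the tri-state return value described above,
the sketch below gates event reporting on recover != -1. It is not the
skiboot code: report_hmi_event() and events_sent are hypothetical
stand-ins for queue_hmi_event() and its host-visible side effect, and
the RECOVER_* names are invented here for readability.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three return values of the recovery code. */
enum {
	RECOVER_NONE   = -1,	/* no errors found, already fixed elsewhere */
	RECOVER_FAILED = 0,	/* recovery was attempted and failed */
	RECOVER_OK     = 1,	/* errors were found and recovered */
};

/* Hypothetical stand-in for the real HMI event structure. */
struct hmi_event {
	int disposition;
};

/* Counts the events that would have been sent to the host. */
static int events_sent;

/* Hypothetical stand-in for queue_hmi_event(). */
static void report_hmi_event(struct hmi_event *evt, int recover)
{
	evt->disposition = recover;
	events_sent++;
}

/* Returns true if an event was reported on behalf of this thread. */
static bool maybe_report(struct hmi_event *evt, int recover)
{
	/* Same shape as the patched check: stay silent if nothing was found. */
	if (recover != RECOVER_NONE && evt != NULL) {
		report_hmi_event(evt, recover);
		return true;
	}
	return false;
}
```

With this gating, a thread that found nothing to fix (recover == -1)
stays silent, while both failure (0) and success (1) are still reported.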
Signed-off-by: Mahesh Salgaonkar --- core/hmi.c | 21 +++++++++++++-------- hw/chiptod.c | 6 +++++- include/chiptod.h | 2 +- 3 files changed, 19 insertions(+), 10 deletions(-) diff --git a/core/hmi.c b/core/hmi.c index 95ab96cde..eadb75be4 100644 --- a/core/hmi.c +++ b/core/hmi.c @@ -955,7 +955,7 @@ static int handle_thread_tfac_error(uint64_t tfmr, uint64_t *out_flags) static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags) { struct cpu_thread *t, *t0; - int recover = 1; + int recover = -1; t = this_cpu(); t0 = find_cpu_by_pir(cpu_get_thread0(t)); @@ -975,11 +975,15 @@ static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags) if (tfmr & SPR_TFMR_TFMR_CORRUPT) { /* Check if it's still in error state */ if (mfspr(SPR_TFMR) & SPR_TFMR_TFMR_CORRUPT) - if (!recover_corrupt_tfmr()) + if (!recover_corrupt_tfmr()) { + unlock(&hmi_lock); recover = 0; + } - if (!recover) + if (!recover) { + unlock(&hmi_lock); goto error_out; + } tfmr = mfspr(SPR_TFMR); @@ -988,8 +992,10 @@ static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags) recover = handle_thread_tfac_error(tfmr, out_flags); tfmr &= ~SPR_TFMR_THREAD_ERRORS; } - if (!recover) + if (!recover) { + unlock(&hmi_lock); goto error_out; + } } /* Tell the OS ... 
 	 */
@@ -1023,8 +1029,7 @@ static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags)
 
 	/* Now perform the actual TB recovery on thread 0 */
 	if (t == t0)
-		recover = chiptod_recover_tb_errors(tfmr,
-						    &this_cpu()->tb_resynced);
+		recover = chiptod_recover_tb_errors(&this_cpu()->tb_resynced);
 
 error_out:
 	/* Last rendez-vous */
@@ -1043,7 +1048,7 @@ error_out:
 static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
 			      uint64_t *out_flags)
 {
-	int recover = 1;
+	int recover = -1;
 	uint64_t tfmr = mfspr(SPR_TFMR);
 
 	/* A TFMR parity error makes us ignore all the local stuff */
@@ -1106,7 +1111,7 @@ static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
 		      mfspr(SPR_TFMR));
 	}
 
-	if (hmi_evt) {
+	if (recover != -1 && hmi_evt) {
 		hmi_evt->severity = OpalHMI_SEV_ERROR_SYNC;
 		hmi_evt->type = OpalHMI_ERROR_TFAC;
 		hmi_evt->tfmr = tfmr;
diff --git a/hw/chiptod.c b/hw/chiptod.c
index a160e5a10..f6ef9a469 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -1505,8 +1505,9 @@ bool tfmr_clear_core_errors(uint64_t tfmr)
  * 1 <= Successfully recovered from errors
  * -1 <= No errors found. Errors are already been fixed.
  */
-int chiptod_recover_tb_errors(uint64_t tfmr, bool *out_resynced)
+int chiptod_recover_tb_errors(bool *out_resynced)
 {
+	uint64_t tfmr;
 	int rc = -1;
 
 	*out_resynced = false;
@@ -1516,6 +1517,9 @@ int chiptod_recover_tb_errors(uint64_t tfmr, bool *out_resynced)
 
 	lock(&chiptod_lock);
 
+	/* Get fresh copy of TFMR */
+	tfmr = mfspr(SPR_TFMR);
+
 	/*
 	 * Check for TB errors.
 	 * On Sync check error, bit 44 of TFMR is set. Check for it and
diff --git a/include/chiptod.h b/include/chiptod.h
index 7708d4899..667e6fd83 100644
--- a/include/chiptod.h
+++ b/include/chiptod.h
@@ -29,7 +29,7 @@ enum chiptod_topology {
 
 extern void chiptod_init(void);
 extern bool chiptod_wakeup_resync(void);
-extern int chiptod_recover_tb_errors(uint64_t tfmr, bool *out_resynced);
+extern int chiptod_recover_tb_errors(bool *out_resynced);
 extern bool tfmr_recover_local_errors(uint64_t tfmr);
 extern bool recover_corrupt_tfmr(void);
 extern void tfmr_cleanup_core_errors(uint64_t tfmr);

From patchwork Mon Apr 16 17:33:56 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898831
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:03:56 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390003628.2566.8734510062345536451.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 09/15] opal/hmi: Fix soft lockups during TOD errors

From: Mahesh Salgaonkar

Some TOD errors do not affect the working of the TOD and TB; both stay
in a valid state. Hence we don't need a rendez-vous for TOD errors that
do not affect the TB.

TOD errors that affect the TOD/TB report a global error on TFMR[44]
along with bit 51, and they go through the rendez-vous path as expected.
TOD errors that do not affect the TB register set only TFMR bit 51.
TFMR bit 51 is cleared when any single thread clears the TOD error, and
the cleared bit is then reflected to all the cores on that chip. Any
thread that reads the TFMR after the error is cleared sees TFMR bit 51
reset. Hence the threads that see TFMR[51]=1 fall into the rendez-vous
path, while the threads that see TFMR[51]=0 return without doing
anything. The threads in the rendez-vous path then wait for threads that
never join it, which ends up as soft lockups in the host kernel.

This patch fixes the issue by not treating the TOD interrupt (TFMR[51])
as a core-global error, thereby avoiding the rendez-vous path completely.
Instead, threads that see TFMR[51]=1 now take a different path that only
does the TOD error recovery.

Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c        | 16 +++++++++++++++-
 hw/chiptod.c      | 14 ++++++++++++--
 include/chiptod.h |  1 +
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index eadb75be4..d9dd83c62 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -204,7 +204,7 @@
 	(SPR_TFMR_TBST_CORRUPT | SPR_TFMR_TB_MISSING_SYNC |	\
 	 SPR_TFMR_TB_MISSING_STEP | SPR_TFMR_FW_CONTROL_ERR |	\
 	 SPR_TFMR_TFMR_CORRUPT | SPR_TFMR_TB_RESIDUE_ERR |	\
-	 SPR_TFMR_HDEC_PARITY_ERROR | SPR_TFMR_CHIP_TOD_INTERRUPT)
+	 SPR_TFMR_HDEC_PARITY_ERROR)
 
 /* TFMR "thread" errors */
 #define SPR_TFMR_THREAD_ERRORS \
@@ -1101,6 +1101,20 @@ static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
 		recover2 = handle_all_core_tfac_error(tfmr, out_flags);
 		if (recover != 0)
 			recover = recover2;
+	} else if (tfmr & SPR_TFMR_CHIP_TOD_INTERRUPT) {
+		int recover2;
+
+		/*
+		 * There are some TOD errors which do not affect working of
+		 * TOD and TB. They stay in valid state. Hence we don't need
+		 * rendez vous.
+		 *
+		 * TOD errors that affects TOD/TB will report a global error
+		 * on TFMR alongwith bit 51, and they will go in rendez vous.
+		 */
+		recover2 = chiptod_recover_tod_errors();
+		if (recover != 0)
+			recover = recover2;
 	} else if (this_cpu()->tb_invalid) {
 		/* This shouldn't happen, TB is invalid and no global error
 		 * was reported. We just return for now assuming one will
diff --git a/hw/chiptod.c b/hw/chiptod.c
index f6ef9a469..33d553956 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -970,7 +970,7 @@ bool chiptod_wakeup_resync(void)
 	return false;
 }
 
-static int chiptod_recover_tod_errors(void)
+static int __chiptod_recover_tod_errors(void)
 {
 	uint64_t terr;
 	uint64_t treset = 0;
@@ -1026,6 +1026,16 @@ static int chiptod_recover_tod_errors(void)
 	return 1;
 }
 
+int chiptod_recover_tod_errors(void)
+{
+	int rc;
+
+	lock(&chiptod_lock);
+	rc = __chiptod_recover_tod_errors();
+	unlock(&chiptod_lock);
+	return rc;
+}
+
 static int32_t chiptod_get_active_master(void)
 {
 	if (current_topology < 0)
@@ -1550,7 +1560,7 @@ int chiptod_recover_tb_errors(bool *out_resynced)
 	 * Bit 33 of TOD error register indicates sync check error.
 	 */
 	if (tfmr & SPR_TFMR_CHIP_TOD_INTERRUPT)
-		rc = chiptod_recover_tod_errors();
+		rc = __chiptod_recover_tod_errors();
 
 	/* Check if TB is running. If not then we need to get it running. */
 	if (!(tfmr & SPR_TFMR_TB_VALID)) {
diff --git a/include/chiptod.h b/include/chiptod.h
index 667e6fd83..5860e34d2 100644
--- a/include/chiptod.h
+++ b/include/chiptod.h
@@ -38,5 +38,6 @@ extern void chiptod_reset_tb(void);
 extern bool chiptod_adjust_topology(enum chiptod_topology topo, bool enable);
 extern bool chiptod_capp_timebase_sync(unsigned int chip_id,
 		uint32_t tfmr_addr, uint32_t tb_addr, uint32_t offset);
+extern int chiptod_recover_tod_errors(void);
 
 #endif /* __CHIPTOD_H */

From patchwork Mon Apr 16 17:34:02 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898832
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:02 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390004290.2566.6881086010339053229.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 10/15] opal/hmi: Stop flooding HMI event for TOD errors.

From: Mahesh Salgaonkar

Fix the issue where every thread on the chip sends an HMI event to the
host for TOD errors. TOD errors are reported to all the cores/threads on
the chip. Any one thread can fix the error and send the event; the rest
of the threads don't need to send an HMI event as well.

This patch fixes this by modifying __chiptod_recover_tod_errors() to
return -1 if no errors are found. Without this change, every thread that
sees TFMR[51]=1 sends an HMI event to the host kernel.

Signed-off-by: Mahesh Salgaonkar
---
 hw/chiptod.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/chiptod.c b/hw/chiptod.c
index 33d553956..28ed8973a 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -974,7 +974,7 @@ static int __chiptod_recover_tod_errors(void)
 {
 	uint64_t terr;
 	uint64_t treset = 0;
-	int i;
+	int i, rc = -1;
 	int32_t chip_id = this_cpu()->chip_id;
 
 	/* Read TOD error register */
@@ -990,6 +990,7 @@ static int __chiptod_recover_tod_errors(void)
 		(terr & TOD_ERR_DELAY_COMPL_PARITY) ||
 		(terr & TOD_ERR_TOD_REGISTER_PARITY)) {
 		chiptod_reset_tod_errors();
+		rc = 1;
 	}
 
 	/*
@@ -1023,7 +1024,9 @@ static int __chiptod_recover_tod_errors(void)
 		return 0;
 	}
 	/* We have handled all the TOD errors routed to hypervisor */
-	return 1;
+	if (treset)
+		rc = 1;
+	return rc;
 }
 
 int chiptod_recover_tod_errors(void)

From patchwork Mon Apr 16 17:34:09 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898833
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:09 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390004962.2566.14004738930530913719.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 11/15] opal/hmi: Fix handling of TFMR parity/corrupt error.

From: Mahesh Salgaonkar

While testing the TFMR parity/corrupt error, it has been observed that
HMIs are delivered twice for this error:
 - The first time, the HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
 - The second time, the HMI is delivered with HMER[4,5]=1 and TFMR[60]=0
   with a valid TB.

On the second HMI we end up printing the error message below even though
the TB is in a valid state:

"HMI: TB invalid without core error reported"

This patch fixes the issue by ignoring HMER[5] and checking only
TFMR[60] before setting this_cpu()->tb_invalid to true.

Suggested-by: Benjamin Herrenschmidt
Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index d9dd83c62..b01a2bf32 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -1045,14 +1045,13 @@ error_out:
 	return recover;
 }
 
-static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
-			      uint64_t *out_flags)
+static int handle_tfac_errors(struct OpalHMIEvent *hmi_evt, uint64_t *out_flags)
 {
 	int recover = -1;
 	uint64_t tfmr = mfspr(SPR_TFMR);
 
-	/* A TFMR parity error makes us ignore all the local stuff */
-	if ((hmer & SPR_HMER_TFMR_PARITY_ERROR) || (tfmr & SPR_TFMR_TFMR_CORRUPT)) {
+	/* A TFMR parity/corrupt error makes us ignore all the local stuff.*/
+	if (tfmr & SPR_TFMR_TFMR_CORRUPT) {
 		/* Mark TB as invalid for now as we don't trust TFMR, we'll fix
 		 * it up later
 		 */
@@ -1160,7 +1159,7 @@ static int handle_hmi_exception(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
 		hmi_print_debug("Timer Facility Error", hmer);
 		handled = hmer & (SPR_HMER_TFAC_ERROR | SPR_HMER_TFMR_PARITY_ERROR);
 		mtspr(SPR_HMER, ~handled);
-		recover = handle_tfac_errors(hmer, hmi_evt, out_flags);
+		recover = handle_tfac_errors(hmi_evt, out_flags);
 		handled = 0;
 	}
 

From patchwork Mon Apr 16 17:34:16 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898834
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:16 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390005624.2566.15624466819277867835.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 12/15] opal/hmi: Print additional debug information in rendezvous.

From: Mahesh Salgaonkar

Print the value of *sptr in the rendez-vous stage 1 and stage 2 timeout
messages. This helps in debugging rendez-vous timeouts.

Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index b01a2bf32..b062428a3 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -888,7 +888,8 @@ static void hmi_rendez_vous(uint32_t sig)
 			cpu_relax();
 		if (!timeout)
 			prlog(PR_ERR, "Rendez-vous stage 1 timeout, CPU 0x%x"
-			      " waiting for thread %d\n", t->pir, i);
+			      " waiting for thread %d (sptr=%08x)\n",
+			      t->pir, i, *sptr);
 	}
 
 	/* Set the exit bit */
@@ -911,7 +912,8 @@ static void hmi_rendez_vous(uint32_t sig)
 			cpu_relax();
 		if (!timeout)
 			prlog(PR_ERR, "Rendez-vous stage 2 timeout, CPU 0x%x"
-			      " waiting for thread %d\n", t->pir, i);
+			      " waiting for thread %d (sptr=%08x)\n",
+			      t->pir, i, *sptr);
 	}
 }
 

From patchwork Mon Apr 16 17:34:23 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898835
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:23 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390006303.2566.6880660025273037550.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 13/15] opal/hmi: check thread 0 tfmr to validate latched tfmr errors.

From: Mahesh Salgaonkar

Due to a P9 erratum, HDEC parity and TB residue errors stay latched on
threads 1-3 even after they are cleared, but they are not latched on
thread 0. Hence, use the xscom SCOMC/SCOMD registers to read the thread 0
TFMR value, and ignore these errors on the non-zero threads if they are
not present on thread 0.

Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c              | 61 ++++++++++++++++++++++++++++++++---------------
 include/xscom-p9-regs.h |  8 ++++++
 2 files changed, 50 insertions(+), 19 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index b062428a3..9b98fbd98 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1047,6 +1048,45 @@ error_out:
 	return recover;
 }
 
+static uint64_t read_tfmr_t0(void)
+{
+	uint64_t tfmr_t0;
+	uint32_t chip_id = this_cpu()->chip_id;
+	uint32_t core_id = pir_to_core_id(this_cpu()->pir);
+
+	lock(&hmi_lock);
+
+	xscom_write(chip_id, XSCOM_ADDR_P9_EC(core_id, P9_SCOM_SPRC),
+			SETFIELD(P9_SCOMC_SPR_SELECT, 0, P9_SCOMC_TFMR_T0));
+	xscom_read(chip_id, XSCOM_ADDR_P9_EC(core_id, P9_SCOM_SPRD),
+				&tfmr_t0);
+	unlock(&hmi_lock);
+	return tfmr_t0;
+}
+
+/* P9 errata: In theory, an HDEC error is sent to all threads. However,
+ * due to an errata on P9 where TFMR bit 26 (HDEC parity) cannot be
+ * cleared on thread 1..3, I am not confident we can do a rendez-vous
+ * in all cases.
+ *
+ * Our current approach is to ignore that error unless it is present
+ * on thread 0 TFMR. Also, ignore TB residue error due to a similar
+ * errata as above.
+ */
+static void validate_latched_errors(uint64_t *tfmr)
+{
+	if ((*tfmr & (SPR_TFMR_HDEC_PARITY_ERROR | SPR_TFMR_TB_RESIDUE_ERR))
+		&& this_cpu()->is_secondary) {
+		uint64_t tfmr_t0 = read_tfmr_t0();
+
+		if (!(tfmr_t0 & SPR_TFMR_HDEC_PARITY_ERROR))
+			*tfmr &= ~SPR_TFMR_HDEC_PARITY_ERROR;
+
+		if (!(tfmr_t0 & SPR_TFMR_TB_RESIDUE_ERR))
+			*tfmr &= ~SPR_TFMR_TB_RESIDUE_ERR;
+	}
+}
+
 static int handle_tfac_errors(struct OpalHMIEvent *hmi_evt, uint64_t *out_flags)
 {
 	int recover = -1;
@@ -1063,25 +1103,8 @@ static int handle_tfac_errors(struct OpalHMIEvent *hmi_evt, uint64_t *out_flags)
 
 	this_cpu()->tb_invalid = !(tfmr & SPR_TFMR_TB_VALID);
 
-	/* P9 errata: In theory, an HDEC error is sent to all threads. However,
-	 * due to an errata on P9 where TFMR bit 26 (HDEC parity) cannot be
-	 * cleared on thread 1..3, I am not confident we can do a rendez-vous
-	 * in all cases.
-	 *
-	 * Our current approach is to ignore that error unless no other TFAC
-	 * error is present in the TFMR. The error will be re-detected and
-	 * re-reported if necessary.
-	 */
-	if (proc_gen == proc_gen_p9 && (tfmr & SPR_TFMR_HDEC_PARITY_ERROR)) {
-		if (this_cpu()->tb_invalid || (tfmr & SPR_TFMR_OTHER_ERRORS))
-			tfmr &= ~SPR_TFMR_HDEC_PARITY_ERROR;
-	}
-
-	/* The TB residue error is ignored if TB is valid due to a similar
-	 * errata as above
-	 */
-	if ((tfmr & SPR_TFMR_TB_RESIDUE_ERR) && !this_cpu()->tb_invalid)
-		tfmr &= ~SPR_TFMR_TB_RESIDUE_ERR;
+	if (proc_gen == proc_gen_p9)
+		validate_latched_errors(&tfmr);
 
 	/* First, handle thread local errors */
 	if (tfmr & SPR_TFMR_THREAD_ERRORS) {
diff --git a/include/xscom-p9-regs.h b/include/xscom-p9-regs.h
index 4738e812c..c3322499f 100644
--- a/include/xscom-p9-regs.h
+++ b/include/xscom-p9-regs.h
@@ -21,4 +21,12 @@
 #define P9_GPIO_DATA_OUT_ENABLE 0x00000000000B0054ull
 #define P9_GPIO_DATA_OUT 0x00000000000B0051ull
 
+/* xscom address for SCOM Control and data Register */
+/* bits 54:60 of SCOM SPRC register is used for core specific SPR selection. */
+#define P9_SCOM_SPRC 0x20010A80
+#define P9_SCOMC_SPR_SELECT PPC_BITMASK(54, 60)
+#define P9_SCOMC_TFMR_T0 0x8 /* 0b0001000 TFMR */
+
+#define P9_SCOM_SPRD 0x20010A81
+
 #endif /* __XSCOM_P9_REGS_H__ */

From patchwork Mon Apr 16 17:34:29 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898842
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:29 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390006973.2566.12036488100857139008.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 14/15] opal/hmi: Generate hmi event for recovered HDEC parity error.

From: Mahesh Salgaonkar

Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c        |  5 ++---
 hw/chiptod.c      | 11 +++++++----
 include/chiptod.h |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index 9b98fbd98..53cb84712 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -1019,10 +1019,9 @@ static int handle_all_core_tfac_error(uint64_t tfmr, uint64_t *out_flags)
 	hmi_rendez_vous(2);
 
 	/* We can now clear the error conditions in the core. */
-	if (!tfmr_clear_core_errors(tfmr)) {
-		recover = 0;
+	recover = tfmr_clear_core_errors(tfmr);
+	if (recover == 0)
 		goto error_out;
-	}
 
 	/* Third rendez-vous. We could in theory do the timebase resync as
 	 * part of the previous one, but I prefer having all the error
diff --git a/hw/chiptod.c b/hw/chiptod.c
index 28ed8973a..df1274ca8 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -1491,18 +1491,21 @@ void tfmr_cleanup_core_errors(uint64_t tfmr)
 	}
 }
 
-bool tfmr_clear_core_errors(uint64_t tfmr)
+int tfmr_clear_core_errors(uint64_t tfmr)
 {
 	uint64_t tfmr_reset_errors = 0;
 
-	if (tfmr & SPR_TFMR_HDEC_PARITY_ERROR)
-		tfmr_reset_errors |= SPR_TFMR_HDEC_PARITY_ERROR;
+	/* return -1 if there is nothing to be fixed.
*/ + if (!(tfmr & SPR_TFMR_HDEC_PARITY_ERROR)) + return -1; + + tfmr_reset_errors |= SPR_TFMR_HDEC_PARITY_ERROR; /* Write TFMR twice to clear the error */ mtspr(SPR_TFMR, base_tfmr | tfmr_reset_errors); mtspr(SPR_TFMR, base_tfmr | tfmr_reset_errors); - return true; + return 1; } /* diff --git a/include/chiptod.h b/include/chiptod.h index 5860e34d2..3717e6674 100644 --- a/include/chiptod.h +++ b/include/chiptod.h @@ -33,7 +33,7 @@ extern int chiptod_recover_tb_errors(bool *out_resynced); extern bool tfmr_recover_local_errors(uint64_t tfmr); extern bool recover_corrupt_tfmr(void); extern void tfmr_cleanup_core_errors(uint64_t tfmr); -extern bool tfmr_clear_core_errors(uint64_t tfmr); +extern int tfmr_clear_core_errors(uint64_t tfmr); extern void chiptod_reset_tb(void); extern bool chiptod_adjust_topology(enum chiptod_topology topo, bool enable); extern bool chiptod_capp_timebase_sync(unsigned int chip_id, uint32_t tfmr_addr, From patchwork Mon Apr 16 17:34:36 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mahesh J Salgaonkar X-Patchwork-Id: 898843 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 40PwYt4phtz9s3G for ; Tue, 17 Apr 2018 03:37:22 +1000 (AEST) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 40PwYt3bkgzF1w6 for ; Tue, 17 Apr 2018 03:37:22 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Authentication-Results: 
lists.ozlabs.org; spf=none (mailfrom) smtp.mailfrom=linux.vnet.ibm.com (client-ip=148.163.156.1; helo=mx0a-001b2d01.pphosted.com; envelope-from=mahesh@linux.vnet.ibm.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.vnet.ibm.com Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 40PwVs05qbzF1x2 for ; Tue, 17 Apr 2018 03:34:44 +1000 (AEST) Received: from pps.filterd (m0098394.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w3GHWJ8F115310 for ; Mon, 16 Apr 2018 13:34:43 -0400 Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109]) by mx0a-001b2d01.pphosted.com with ESMTP id 2hcxcxnysh-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Mon, 16 Apr 2018 13:34:43 -0400 Received: from localhost by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 16 Apr 2018 18:34:40 +0100 Received: from b06cxnps3074.portsmouth.uk.ibm.com (9.149.109.194) by e06smtp13.uk.ibm.com (192.168.101.143) with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted; Mon, 16 Apr 2018 18:34:38 +0100 Received: from d06av22.portsmouth.uk.ibm.com (d06av22.portsmouth.uk.ibm.com [9.149.105.58]) by b06cxnps3074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w3GHYbPY12583222; Mon, 16 Apr 2018 17:34:37 GMT Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id E2A4C4C04E; Mon, 16 Apr 2018 18:27:09 +0100 (BST) Received: from d06av22.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 71AF44C058; Mon, 16 Apr 2018 18:27:09 +0100 (BST) Received: from jupiter.in.ibm.com (unknown [9.102.1.147]) by d06av22.portsmouth.uk.ibm.com (Postfix) with ESMTP; Mon, 16 Apr 2018 18:27:09 +0100 (BST) From: Mahesh J Salgaonkar To: skiboot list Date: Mon, 16 Apr 2018 23:04:36 +0530 In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-TM-AS-GCONF: 00 x-cbid: 18041617-0012-0000-0000-000005CB7DC2 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18041617-0013-0000-0000-00001947C339 Message-Id: <152390007634.2566.10635363133250653347.stgit@jupiter.in.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:, , definitions=2018-04-16_09:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1804160155 Subject: [Skiboot] [PATCH v2 15/15] opal/hmi: Add documentation for opal_handle_hmi2 call X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: 
skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"

From: Mahesh Salgaonkar

Signed-off-by: Mahesh Salgaonkar
---
 doc/opal-api/opal-handle-hmi-98-166.rst | 126 +++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 doc/opal-api/opal-handle-hmi-98-166.rst

diff --git a/doc/opal-api/opal-handle-hmi-98-166.rst b/doc/opal-api/opal-handle-hmi-98-166.rst
new file mode 100644
index 000000000..05f707d71
--- /dev/null
+++ b/doc/opal-api/opal-handle-hmi-98-166.rst
@@ -0,0 +1,126 @@
+Hypervisor Maintenance Interrupt (HMI)
+======================================
+
+  A Hypervisor Maintenance Interrupt usually reports errors related to processor
+  recovery/checkstop, NX/NPU checkstop and the Timer facility. The hypervisor then
+  takes this opportunity to analyze and recover from some of these errors.
+  The hypervisor takes assistance from the OPAL layer to handle and recover from an HMI.
+  After handling the HMI, the OPAL layer sends a summary of the error report and the
+  status of the recovery action using an HMI event. See ref: `opal-messages.rst` for the
+  HMI event structure under the ``OPAL_MSG_HMI_EVT`` section.
+
+  HMI is thread specific. The reason for an HMI is available in a per-thread
+  Hypervisor Maintenance Exception Register (HMER). The Hypervisor Maintenance
+  Exception Enable Register (HMEER) is per core. Bits in the HMER need to
+  be enabled by the corresponding bits in the HMEER in order to cause an HMI.
+
+  Several interrupt reasons are routed in parallel to each of the
+  thread-specific copies. Each thread can only clear bits in its own HMER. The OPAL
+  handler on each thread clears the respective bit in the HMER register
+  after handling the error.
+
+List of errors that cause HMI
+=============================
+
+  - CPU Errors
+
+    - Processor Core checkstop
+    - Processor retry recovery
+    - NX/NPU/CAPP checkstop.
+
+  - Timer facility Errors
+
+    - ChipTOD Errors
+
+      - ChipTOD sync check and parity errors
+      - ChipTOD configuration register parity errors
+      - ChipTOD topology failover
+
+    - Timebase (TB) errors
+
+      - TB parity/residue error
+      - TFMR parity and firmware control error
+      - DEC/HDEC/PURR/SPURR parity errors
+
+HMI handling
+============
+
+  Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
+  The OPAL handler scans through the Fault Isolation Register (FIR) of each
+  core/NX/NPU to detect the exact reason for the checkstop and reports it back
+  to the host along with the disposition.
+
+  A processor recovery is reported through HMER bits 2, 3 and 11. These are
+  just informational messages and no extra recovery is required.
+
+  Timer facility errors are reported through HMER bit 4. These are all
+  recoverable errors. The exact reason for the error is stored in the
+  Timer Facility Management Register (TFMR). Some of the Timer facility
+  errors affect the TB and some of them affect the TOD. The TOD is per-chip
+  Time-Of-Day logic that holds the actual time value of the chip and
+  communicates with every TOD in the system to achieve a synchronized
+  timer value within the system. The TB is a per-core 64-bit register that derives
+  its value from the ChipTOD at startup and is then periodically incremented
+  by the STEP signal provided by the TOD. In a multi-socket system, TODs are
+  always configured as master/backup TOD under primary/secondary
+  topology configuration respectively.
+
+  A TB error generates an HMI on all threads of the affected core. TB errors,
+  except DEC/HDEC/PURR/SPURR parity errors, cause the TB to stop running,
+  making it invalid. As part of TB recovery, the OPAL HMI handler synchronizes
+  with all threads, clears the TB errors and then re-syncs the TB with the TOD
+  value, putting it back in the running state.
+
+  TOD errors generate an HMI on every core/thread of the affected chip. The reason
+  for a TOD error is stored in the TOD ERROR register (0x40030). As part of the
+  recovery, the OPAL HMI handler clears the TOD error and then requests a new TOD
+  value from another running ChipTOD in the system. Sometimes, if a primary
+  ChipTOD is in error, it may need a TOD topology switch to recover from the
+  error. A TOD topology switch makes the backup the new active master.
+
+OPAL_HANDLE_HMI and OPAL_HANDLE_HMI2
+====================================
+::
+
+  #define OPAL_HANDLE_HMI   98
+  #define OPAL_HANDLE_HMI2  166
+
+``OPAL_HANDLE_HMI``
+
+``OPAL_HANDLE_HMI2``
+  When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must call
+  ``OPAL_HANDLE_HMI`` or ``OPAL_HANDLE_HMI2``. ``OPAL_HANDLE_HMI``
+  is the old interface. ``OPAL_HANDLE_HMI2`` is a newly introduced OPAL call
+  that returns information directly to Linux. It returns a 64-bit flag mask, currently
+  set to provide information about which timer facilities were lost and whether an
+  event was generated.
+
+OPAL_HANDLE_HMI
+---------------
+Syntax: ::
+
+  int64_t opal_handle_hmi(void)
+
+OPAL_HANDLE_HMI2
+----------------
+Syntax: ::
+
+  int64_t opal_handle_hmi2(__be64 *out_flags)
+
+parameters
+^^^^^^^^^^
+
+  ``__be64 *out_flags``
+
+  Returns the 64-bit flag mask that provides information about which timer
+  facilities were lost and whether an event was generated.
+
+::
+
+  /* OPAL_HANDLE_HMI2 out_flags */
+  enum {
+        OPAL_HMI_FLAGS_TB_RESYNC = (1ull << 0), /* Timebase has been resynced */
+        OPAL_HMI_FLAGS_DEC_LOST  = (1ull << 1), /* DEC lost, needs to be reprogrammed */
+        OPAL_HMI_FLAGS_HDEC_LOST = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
+        OPAL_HMI_FLAGS_NEW_EVENT = (1ull << 63), /* An event has been created */
+  };
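
For illustration only, a minimal sketch of how a caller might decode the big-endian out_flags value documented above. The flag values are copied from the documentation in this patch; struct hmi_actions, load_be64() and decode_hmi_flags() are hypothetical names that exist in neither skiboot nor Linux. The flags are written as macros rather than an enum here only so the sketch stays within strict ISO C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the out_flags documentation in this patch. */
#define OPAL_HMI_FLAGS_TB_RESYNC  (1ull << 0)   /* Timebase has been resynced */
#define OPAL_HMI_FLAGS_DEC_LOST   (1ull << 1)   /* DEC needs to be reprogrammed */
#define OPAL_HMI_FLAGS_HDEC_LOST  (1ull << 2)   /* HDEC needs to be reprogrammed */
#define OPAL_HMI_FLAGS_NEW_EVENT  (1ull << 63)  /* An event has been created */

/* Hypothetical summary of what the caller has to do after the call. */
struct hmi_actions {
	bool tb_resynced;
	bool reprogram_dec;
	bool reprogram_hdec;
	bool event_pending;
};

/*
 * Read a big-endian 64-bit value byte by byte, so the result is the
 * same on little-endian and big-endian hosts.
 */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode the 8 bytes that opal_handle_hmi2() stored through out_flags. */
static struct hmi_actions decode_hmi_flags(const void *out_flags_be)
{
	struct hmi_actions a;
	uint64_t flags;

	assert(out_flags_be != NULL);
	flags = load_be64(out_flags_be);

	a.tb_resynced    = (flags & OPAL_HMI_FLAGS_TB_RESYNC) != 0;
	a.reprogram_dec  = (flags & OPAL_HMI_FLAGS_DEC_LOST) != 0;
	a.reprogram_hdec = (flags & OPAL_HMI_FLAGS_HDEC_LOST) != 0;
	a.event_pending  = (flags & OPAL_HMI_FLAGS_NEW_EVENT) != 0;
	return a;
}
```

A caller would pass the address of the __be64 filled in by opal_handle_hmi2() and then reprogram DEC or HDEC only when the corresponding field is set. Reading the value byte by byte avoids depending on the host byte order.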