From patchwork Fri Dec 8 10:49:35 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 846141
From: Mahesh J Salgaonkar
To: skiboot list
Date: Fri, 08 Dec 2017 16:19:35 +0530
User-Agent: StGit/0.17.1-dirty
Message-Id: <151273017576.22104.17324894016679784398.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 1/2] opal/xscom: Move the delay inside xscom_reset() function.
List-Id: Mailing list for skiboot development

From: Mahesh Salgaonkar

Move the delay into xscom_reset() so that callers do not have to add
it separately. The caller controls whether the delay is applied through
the second argument to xscom_reset().

Signed-off-by: Mahesh Salgaonkar
Reviewed-by: Nicholas Piggin
Reviewed-by: Vaidyanathan Srinivasan
---
Changes in V2:
- Add a bool argument to xscom_reset() so that the caller controls
  whether a delay is required after the reset.
---
 hw/xscom.c |   39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/hw/xscom.c b/hw/xscom.c
index 5b3bd88..d98f5ef 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -92,10 +92,11 @@ static uint64_t xscom_wait_done(void)
 	return mfspr(SPR_HMER);
 }
 
-static void xscom_reset(uint32_t gcid)
+static void xscom_reset(uint32_t gcid, bool need_delay)
 {
 	u64 hmer;
 	uint32_t recv_status_reg, log_reg, err_reg;
+	struct timespec ts;
 
 	/* Clear errors in HMER */
 	mtspr(SPR_HMER, HMER_CLR_MASK);
@@ -126,6 +127,21 @@ static void xscom_reset(uint32_t gcid)
 	hmer = xscom_wait_done();
 	if (hmer & SPR_HMER_XSCOM_FAIL)
 		goto fail;
+
+	if (need_delay) {
+		/*
+		 * Its observed that sometimes immediate retry of
+		 * XSCOM operation returns wrong data. Adding a
+		 * delay for XSCOM reset to be effective. Delay of
+		 * 10 ms is found to be working fine experimentally.
+		 * FIXME: Replace 10ms delay by exact delay needed
+		 * or other alternate method to confirm XSCOM reset
+		 * completion, after checking from HW folks.
+		 */
+		ts.tv_sec = 0;
+		ts.tv_nsec = 10 * 1000;
+		nanosleep_nopoll(&ts, NULL);
+	}
 	return;
 fail:
 	/* Fatal error resetting XSCOM */
@@ -140,7 +156,6 @@ static void xscom_reset(uint32_t gcid)
 static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_addr,
 			      bool is_write, int64_t retries)
 {
-	struct timespec ts;
 	unsigned int stat = GETFIELD(SPR_HMER_XSCOM_STATUS, hmer);
 	int64_t rc = OPAL_HARDWARE;
 
@@ -158,20 +173,8 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
 			prlog(PR_NOTICE, "XSCOM: Busy even after %d retries, "
 				"resetting XSCOM now. Total retries = %lld\n",
 				XSCOM_BUSY_RESET_THRESHOLD, retries);
-			xscom_reset(gcid);
-
-			/*
-			 * Its observed that sometimes immediate retry of
-			 * XSCOM operation returns wrong data. Adding a
-			 * delay for XSCOM reset to be effective. Delay of
-			 * 10 ms is found to be working fine experimentally.
-			 * FIXME: Replace 10ms delay by exact delay needed
-			 * or other alternate method to confirm XSCOM reset
-			 * completion, after checking from HW folks.
-			 */
-			ts.tv_sec = 0;
-			ts.tv_nsec = 10 * 1000;
-			nanosleep_nopoll(&ts, NULL);
+			xscom_reset(gcid, true);
+
 		}
 
 		/* Log error if we have retried enough and its still busy */
@@ -183,7 +186,7 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
 		return OPAL_XSCOM_BUSY;
 
 	case 2: /* CPU is asleep, reset XSCOM engine and return */
-		xscom_reset(gcid);
+		xscom_reset(gcid, false);
 		return OPAL_XSCOM_CHIPLET_OFF;
 	case 3: /* Partial good */
 		rc = OPAL_XSCOM_PARTIAL_GOOD;
@@ -208,7 +211,7 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
 		is_write ? "write" : "read", gcid, pcb_addr, stat);
 
 	/* We need to reset the XSCOM or we'll hang on the next access */
-	xscom_reset(gcid);
+	xscom_reset(gcid, false);
 
 	/* Non recovered ... just fail */
 	return rc;

From patchwork Fri Dec 8 10:49:42 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 846143
From: Mahesh J Salgaonkar
To: skiboot list
Date: Fri, 08 Dec 2017 16:19:42 +0530
In-Reply-To: <151273017576.22104.17324894016679784398.stgit@jupiter.in.ibm.com>
References: <151273017576.22104.17324894016679784398.stgit@jupiter.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
Message-Id: <151273018250.22104.15383536921417693262.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 2/2] opal/xscom: Add recovery for lost core wakeup scom failures.
List-Id: Mailing list for skiboot development

From: Mahesh Salgaonkar

A hardware issue can delay a core's response to a scom while its
threads are being reconfigured. This leaves the SCOM logic in a state
where subsequent scoms to that core can get errors. Core PC scom
registers in the range 20010A80-20010ABF are affected.

The workaround: if an xscom timeout occurs on one of the Core PC scom
registers in the range 20010A80-20010ABF, do a clearing scom write to
0x20010800 with data '0x00000000'. This write also times out, but it
clears the scom logic errors. After the clearing write, the original
scom operation can be retried.

The scom timeout is reported as status 0x4 (Invalid address) in
HMER[21-23].

Signed-off-by: Mahesh Salgaonkar
Reviewed-by: Nicholas Piggin
---
Changes in V2:
- Increase retries to 10. With 3 retries we still saw a failure about
  once every 20 minutes under the stress test. Bumping the count to 10
  makes recovery more robust.
---
 hw/xscom.c      |   80 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/xscom.h |    8 ++++++
 2 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/hw/xscom.c b/hw/xscom.c
index d98f5ef..de5a27e 100644
--- a/hw/xscom.c
+++ b/hw/xscom.c
@@ -153,8 +153,69 @@ static void xscom_reset(uint32_t gcid, bool need_delay)
 	 */
 }
 
+static int xscom_clear_error(uint32_t gcid, uint32_t pcb_addr)
+{
+	u64 hmer;
+	uint32_t base_xscom_addr;
+	uint32_t xscom_clear_reg = 0x20010800;
+
+	/* only in case of p9 */
+	if (proc_gen != proc_gen_p9)
+		return 0;
+
+	/*
+	 * Due to a hardware issue where core responding to scom was delayed
+	 * due to thread reconfiguration, leaves the scom logic in a state
+	 * where the subsequent scom to that core can get errors. This is
+	 * affected for Core PC scom registers in the range of
+	 * 20010A80-20010ABF.
+	 *
+	 * The solution is if a xscom timeout occurs to one of Core PC scom
+	 * registers in the range of 20010A80-20010ABF, a clearing scom
+	 * write is done to 0x20010800 with data of '0x00000000' which will
+	 * also get a timeout but clears the scom logic errors. After the
+	 * clearing write is done the original scom operation can be retried.
+	 *
+	 * The scom timeout is reported as status 0x4 (Invalid address)
+	 * in HMER[21-23].
+	 */
+
+	base_xscom_addr = pcb_addr & XSCOM_CLEAR_RANGE_MASK;
+	if (!((base_xscom_addr >= XSCOM_CLEAR_RANGE_START) &&
+	      (base_xscom_addr <= XSCOM_CLEAR_RANGE_END)))
+		return 0;
+
+	/*
+	 * Reset the XSCOM or next scom operation will fail.
+	 * We also need a small delay before we go ahead with clearing write.
+	 * We have observed that without a delay the clearing write has reported
+	 * a wrong status.
+	 */
+	xscom_reset(gcid, true);
+
+	/* Clear errors in HMER */
+	mtspr(SPR_HMER, HMER_CLR_MASK);
+
+	/* Write 0 to clear the xscom logic errors on target chip */
+	out_be64(xscom_addr(gcid, xscom_clear_reg), 0);
+	hmer = xscom_wait_done();
+
+	/*
+	 * Above clearing xscom write will timeout and error out with
+	 * invalid access as there is no register at that address. This
+	 * xscom operation just helps to clear the xscom logic error.
+	 *
+	 * On failure, reset the XSCOM or we'll hang on the next access
+	 */
+	if (hmer & SPR_HMER_XSCOM_FAIL)
+		xscom_reset(gcid, true);
+
+	return 1;
+}
+
 static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_addr,
-			      bool is_write, int64_t retries)
+			      bool is_write, int64_t retries,
+			      int64_t *xscom_clear_retries)
 {
 	unsigned int stat = GETFIELD(SPR_HMER_XSCOM_STATUS, hmer);
 	int64_t rc = OPAL_HARDWARE;
@@ -193,6 +254,15 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
 		break;
 	case 4: /* Invalid address / address error */
 		rc = OPAL_XSCOM_ADDR_ERROR;
+		if (xscom_clear_error(gcid, pcb_addr)) {
+			/* return busy if retries still pending. */
+			if ((*xscom_clear_retries)--)
+				return OPAL_XSCOM_BUSY;
+
+			prlog(PR_DEBUG, "XSCOM: error recovery failed for "
+				"gcid=0x%x pcb_addr=0x%x\n", gcid, pcb_addr);
+
+		}
 		break;
 	case 5: /* Clock error */
 		rc = OPAL_XSCOM_CLOCK_ERROR;
@@ -255,6 +325,7 @@ static int __xscom_read(uint32_t gcid, uint32_t pcb_addr, uint64_t *val)
 {
 	uint64_t hmer;
 	int64_t ret, retries;
+	int64_t xscom_clear_retries = XSCOM_CLEAR_MAX_RETRIES;
 
 	if (!xscom_gcid_ok(gcid)) {
 		prerror("%s: invalid XSCOM gcid 0x%x\n", __func__, gcid);
@@ -278,7 +349,8 @@ static int __xscom_read(uint32_t gcid, uint32_t pcb_addr, uint64_t *val)
 			return OPAL_SUCCESS;
 
 		/* Handle error and possibly eventually retry */
-		ret = xscom_handle_error(hmer, gcid, pcb_addr, false, retries);
+		ret = xscom_handle_error(hmer, gcid, pcb_addr, false, retries,
+					 &xscom_clear_retries);
 		if (ret != OPAL_BUSY)
 			break;
 	}
@@ -305,6 +377,7 @@ static int __xscom_write(uint32_t gcid, uint32_t pcb_addr, uint64_t val)
 {
 	uint64_t hmer;
 	int64_t ret, retries = 0;
+	int64_t xscom_clear_retries = XSCOM_CLEAR_MAX_RETRIES;
 
 	if (!xscom_gcid_ok(gcid)) {
 		prerror("%s: invalid XSCOM gcid 0x%x\n", __func__, gcid);
@@ -328,7 +401,8 @@ static int __xscom_write(uint32_t gcid, uint32_t pcb_addr, uint64_t val)
 			return OPAL_SUCCESS;
 
 		/* Handle error and possibly eventually retry */
-		ret = xscom_handle_error(hmer, gcid, pcb_addr, true, retries);
+		ret = xscom_handle_error(hmer, gcid, pcb_addr, true, retries,
+					 &xscom_clear_retries);
 		if (ret != OPAL_BUSY)
 			break;
 	}
diff --git a/include/xscom.h b/include/xscom.h
index 5a5d0b9..9853224 100644
--- a/include/xscom.h
+++ b/include/xscom.h
@@ -206,6 +206,14 @@
 /* Max number of retries when XSCOM remains busy */
 #define XSCOM_BUSY_MAX_RETRIES		3000
 
+/* Max number of retries for xscom clearing recovery. */
+#define XSCOM_CLEAR_MAX_RETRIES		10
+
+/* xscom clear address range/mask */
+#define XSCOM_CLEAR_RANGE_START		0x20010A00
+#define XSCOM_CLEAR_RANGE_END		0x20010ABF
+#define XSCOM_CLEAR_RANGE_MASK		0x200FFBFF
+
 /* Retry count after which to reset XSCOM, if still busy */
 #define XSCOM_BUSY_RESET_THRESHOLD	1000
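
The address match in xscom_clear_error() and the retry countdown in
xscom_handle_error() can be tried out in isolation. What follows is a
minimal standalone sketch, not skiboot code. The four constants are
copied from the include/xscom.h hunk of this patch; the helper names
xscom_needs_clear(), clear_retry_allowed() and count_clear_retries()
are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the include/xscom.h hunk of this patch. */
#define XSCOM_CLEAR_MAX_RETRIES		10
#define XSCOM_CLEAR_RANGE_START		0x20010A00
#define XSCOM_CLEAR_RANGE_END		0x20010ABF
#define XSCOM_CLEAR_RANGE_MASK		0x200FFBFF

/*
 * Hypothetical helper: mirrors the range test in xscom_clear_error().
 * Address bits outside the mask are dropped before the compare.
 */
bool xscom_needs_clear(uint32_t pcb_addr)
{
	uint32_t base = pcb_addr & XSCOM_CLEAR_RANGE_MASK;

	return base >= XSCOM_CLEAR_RANGE_START &&
	       base <= XSCOM_CLEAR_RANGE_END;
}

/*
 * Hypothetical helper: mirrors "if ((*xscom_clear_retries)--)".
 * Post-decrement tests the old value, so a budget of N grants N retries.
 */
bool clear_retry_allowed(int64_t *budget)
{
	return (*budget)-- != 0;
}

/* Hypothetical helper: count the retries granted before the first refusal. */
int count_clear_retries(void)
{
	int64_t budget = XSCOM_CLEAR_MAX_RETRIES;
	int granted = 0;

	while (clear_retry_allowed(&budget))
		granted++;
	return granted;
}
```

With these constants, xscom_needs_clear(0x21010A85) is true because the
mask reduces the address to 0x20010A85, while 0x20010AC0 and the
clearing register 0x20010800 fall outside the range.
count_clear_retries() returns 10, matching XSCOM_CLEAR_MAX_RETRIES.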