From patchwork Mon Apr 16 17:33:56 2018
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898831
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:03:56 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
Message-Id: <152390003628.2566.8734510062345536451.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 09/15] opal/hmi: Fix soft lockups during TOD errors

From: Mahesh Salgaonkar

Some TOD errors do not affect the working of the TOD or the TB; both stay in a valid state. Hence there is no need to rendezvous for TOD errors that do not affect TB operation.

TOD errors that do affect TOD/TB report a global error on TFMR[44] along with bit 51, and they take the rendezvous path as expected. TOD errors that do not affect the TB, however, set only TFMR bit 51. TFMR bit 51 is cleared as soon as any single thread clears the TOD error, and once cleared, the change is reflected on all the cores of that chip. Any thread that reads TFMR after the error has been cleared sees TFMR bit 51 reset. As a result, threads that see TFMR[51]=1 fall through to the rendezvous path, while threads that see TFMR[51]=0 return without doing anything, so the threads in the rendezvous wait for siblings that never join it. This ends up causing soft lockups in the host kernel.

This patch fixes the issue by no longer treating the TOD interrupt (TFMR[51]) as a core-global error, which avoids the rendezvous path completely. Instead, threads that see TFMR[51]=1 now take a different path that performs only the TOD error recovery.
Signed-off-by: Mahesh Salgaonkar
---
 core/hmi.c        | 16 +++++++++++++++-
 hw/chiptod.c      | 14 ++++++++++++--
 include/chiptod.h |  1 +
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/core/hmi.c b/core/hmi.c
index eadb75be4..d9dd83c62 100644
--- a/core/hmi.c
+++ b/core/hmi.c
@@ -204,7 +204,7 @@
 	(SPR_TFMR_TBST_CORRUPT | SPR_TFMR_TB_MISSING_SYNC |	\
 	 SPR_TFMR_TB_MISSING_STEP | SPR_TFMR_FW_CONTROL_ERR |	\
 	 SPR_TFMR_TFMR_CORRUPT | SPR_TFMR_TB_RESIDUE_ERR |	\
-	 SPR_TFMR_HDEC_PARITY_ERROR | SPR_TFMR_CHIP_TOD_INTERRUPT)
+	 SPR_TFMR_HDEC_PARITY_ERROR)
 
 /* TFMR "thread" errors */
 #define SPR_TFMR_THREAD_ERRORS \
@@ -1101,6 +1101,20 @@ static int handle_tfac_errors(uint64_t hmer, struct OpalHMIEvent *hmi_evt,
 		recover2 = handle_all_core_tfac_error(tfmr, out_flags);
 		if (recover != 0)
 			recover = recover2;
+	} else if (tfmr & SPR_TFMR_CHIP_TOD_INTERRUPT) {
+		int recover2;
+
+		/*
+		 * There are some TOD errors which do not affect working of
+		 * TOD and TB. They stay in valid state. Hence we don't need
+		 * rendez vous.
+		 *
+		 * TOD errors that affects TOD/TB will report a global error
+		 * on TFMR alongwith bit 51, and they will go in rendez vous.
+		 */
+		recover2 = chiptod_recover_tod_errors();
+		if (recover != 0)
+			recover = recover2;
 	} else if (this_cpu()->tb_invalid) {
 		/* This shouldn't happen, TB is invalid and no global error
 		 * was reported. We just return for now assuming one will
diff --git a/hw/chiptod.c b/hw/chiptod.c
index f6ef9a469..33d553956 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -970,7 +970,7 @@ bool chiptod_wakeup_resync(void)
 	return false;
 }
 
-static int chiptod_recover_tod_errors(void)
+static int __chiptod_recover_tod_errors(void)
 {
 	uint64_t terr;
 	uint64_t treset = 0;
@@ -1026,6 +1026,16 @@ static int chiptod_recover_tod_errors(void)
 	return 1;
 }
 
+int chiptod_recover_tod_errors(void)
+{
+	int rc;
+
+	lock(&chiptod_lock);
+	rc = __chiptod_recover_tod_errors();
+	unlock(&chiptod_lock);
+	return rc;
+}
+
 static int32_t chiptod_get_active_master(void)
 {
 	if (current_topology < 0)
@@ -1550,7 +1560,7 @@ int chiptod_recover_tb_errors(bool *out_resynced)
 	 * Bit 33 of TOD error register indicates sync check error.
 	 */
 	if (tfmr & SPR_TFMR_CHIP_TOD_INTERRUPT)
-		rc = chiptod_recover_tod_errors();
+		rc = __chiptod_recover_tod_errors();
 
 	/* Check if TB is running. If not then we need to get it running. */
 	if (!(tfmr & SPR_TFMR_TB_VALID)) {
diff --git a/include/chiptod.h b/include/chiptod.h
index 667e6fd83..5860e34d2 100644
--- a/include/chiptod.h
+++ b/include/chiptod.h
@@ -38,5 +38,6 @@ extern void chiptod_reset_tb(void);
 extern bool chiptod_adjust_topology(enum chiptod_topology topo, bool enable);
 extern bool chiptod_capp_timebase_sync(unsigned int chip_id, uint32_t tfmr_addr,
 		uint32_t tb_addr, uint32_t offset);
+extern int chiptod_recover_tod_errors(void);
 
 #endif /* __CHIPTOD_H */