From patchwork Fri Jun 9 17:19:05 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vasant Hegde X-Patchwork-Id: 774079 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3wkpty0bRKz9sDG for ; Sat, 10 Jun 2017 03:19:38 +1000 (AEST) Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 3wkptx6DmDzDqLJ for ; Sat, 10 Jun 2017 03:19:37 +1000 (AEST) X-Original-To: skiboot@lists.ozlabs.org Delivered-To: skiboot@lists.ozlabs.org Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 3wkptp2WMvzDqLD for ; Sat, 10 Jun 2017 03:19:30 +1000 (AEST) Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.20/8.16.0.20) with SMTP id v59HJ2Gf082199 for ; Fri, 9 Jun 2017 13:19:28 -0400 Received: from e23smtp09.au.ibm.com (e23smtp09.au.ibm.com [202.81.31.142]) by mx0a-001b2d01.pphosted.com with ESMTP id 2ays2mcug2-1 (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT) for ; Fri, 09 Jun 2017 13:19:27 -0400 Received: from localhost by e23smtp09.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Sat, 10 Jun 2017 03:19:25 +1000 Received: from d23relay09.au.ibm.com (202.81.31.228) by e23smtp09.au.ibm.com (202.81.31.206) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Sat, 10 Jun 2017 03:19:22 +1000 Received: from d23av02.au.ibm.com (d23av02.au.ibm.com [9.190.235.138]) by d23relay09.au.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id v59HJMmt65274108 for ; Sat, 10 Jun 2017 03:19:22 +1000 Received: from d23av02.au.ibm.com (localhost [127.0.0.1]) by d23av02.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id v59HJDJ0003256 for ; Sat, 10 Jun 2017 03:19:14 +1000 Received: from hegdevasant.in.ibm.com ([9.85.156.113]) by d23av02.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVin) with ESMTP id v59HJ9B0003177; Sat, 10 Jun 2017 03:19:11 +1000 From: Vasant Hegde To: skiboot@lists.ozlabs.org Date: Fri, 9 Jun 2017 22:49:05 +0530 X-Mailer: git-send-email 2.9.3 X-TM-AS-MML: disable x-cbid: 17060917-0052-0000-0000-000002537490 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17060917-0053-0000-0000-000008318576 Message-Id: <20170609171905.6403-1-hegdevasant@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, , definitions=2017-06-09_07:, , signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=1 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1703280000 definitions=main-1706090302 Subject: [Skiboot] [PATCH v2] FSP/CONSOLE: Workaround for unresponsive ipmi daemon X-BeenThere: skiboot@lists.ozlabs.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Mailing list for skiboot development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" We use TCE mapped area to write data to console. Console header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer). Kernel makes opal_console_write() OPAL call to write data to console. OPAL write data to TCE mapped area and sends MBOX command to FSP. If our console becomes full and we have data to write to console, we keep on waiting until FSP reads data. In some corner cases, where FSP is active but not responding to console MBOX message (due to buggy IPMI) and we have heavy console write happening from kernel, then eventually our console buffer becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to kernel. Kernel will keep on retrying. This is creating kernel soft lockups. In some extreme case when every CPU is trying to write to console, user will not be able to ssh and thinks system is hang. If we reset FSP or restart IPMI daemon on FSP, system recovers and everything becomes normal. This patch adds workaround to above issue by returning OPAL_HARDWARE when cosole is full. Side effect of this patch is, we may endup dropping latest console data. But better to drop console data than system hang. Alternative approach is to drop old data from console buffer, make space for new data. But in normal condition only FSP can update 'next_out' pointer and if we touch that pointer, it may introduce some other race conditions. Hence we decided to just new console write request. Signed-off-by: Vasant Hegde Acked-by: Vaidyanathan Srinivasan --- @Vaidy, Stewart, As suggested, I've added error log message. As Vaidy suggested it may not be a good idea to reset FSP. Hence I'm not initiating Host initiated Reset. Also I've retained Vaidy's Ack from V1. -Vasant hw/fsp/fsp-console.c | 18 +++++++++++++++++- include/errorlog.h | 3 +++ 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/hw/fsp/fsp-console.c b/hw/fsp/fsp-console.c index fd67b20..2ba879b 100644 --- a/hw/fsp/fsp-console.c +++ b/hw/fsp/fsp-console.c @@ -26,6 +26,11 @@ #include #include #include +#include + +DEFINE_LOG_ENTRY(OPAL_RC_CONSOLE_HANG, OPAL_PLATFORM_ERR_EVT, OPAL_CONSOLE, + OPAL_PLATFORM_FIRMWARE, + OPAL_PREDICTIVE_ERR_GENERAL, OPAL_NA); struct fsp_serbuf_hdr { u16 partition_id; @@ -610,7 +615,18 @@ static int64_t fsp_console_write(int64_t term_number, int64_t *length, *length = written; unlock(&fsp_con_lock); - return written ? OPAL_SUCCESS : OPAL_BUSY_EVENT; + if (written) + return OPAL_SUCCESS; + + /* + * FSP is still active but not reading console data. Hence + * our console buffer became full. Most likely IPMI daemon + * on FSP is buggy. Lets log error and return OPAL_HARDWARE + * to payload (Linux). + */ + log_simple_error(&e_info(OPAL_RC_CONSOLE_HANG), "FSPCON: Console " + "buffer is full, dropping console data\n"); + return OPAL_HARDWARE; } static int64_t fsp_console_write_buffer_space(int64_t term_number, diff --git a/include/errorlog.h b/include/errorlog.h index e9d5ad8..285c185 100644 --- a/include/errorlog.h +++ b/include/errorlog.h @@ -332,6 +332,9 @@ enum opal_reasoncode { /* Platform error */ OPAL_RC_ABNORMAL_REBOOT = OPAL_SRC_COMPONENT_CEC | 0x10, + +/* FSP console */ + OPAL_RC_CONSOLE_HANG = OPAL_SRC_COMPONENT_CONSOLE | 0x10, }; #define DEFINE_LOG_ENTRY(reason, type, id, subsys, \