From patchwork Mon Apr 16 17:34:36 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mahesh J Salgaonkar
X-Patchwork-Id: 898843
From: Mahesh J Salgaonkar
To: skiboot list
Date: Mon, 16 Apr 2018 23:04:36 +0530
In-Reply-To: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
References: <152389987405.2566.355149283827806637.stgit@jupiter.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
Message-Id: <152390007634.2566.10635363133250653347.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH v2 15/15] opal/hmi: Add documentation for opal_handle_hmi2 call
List-Id: Mailing list for skiboot development

From: Mahesh Salgaonkar

Signed-off-by: Mahesh Salgaonkar
---
 doc/opal-api/opal-handle-hmi-98-166.rst | 126 +++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 doc/opal-api/opal-handle-hmi-98-166.rst

diff --git a/doc/opal-api/opal-handle-hmi-98-166.rst b/doc/opal-api/opal-handle-hmi-98-166.rst
new file mode 100644
index 000000000..05f707d71
--- /dev/null
+++ b/doc/opal-api/opal-handle-hmi-98-166.rst
@@ -0,0 +1,126 @@
+Hypervisor Maintenance Interrupt (HMI)
+======================================
+
+  A Hypervisor Maintenance Interrupt (HMI) usually reports errors related to
+  processor recovery/checkstop, NX/NPU checkstop and the Timer facility. The
+  hypervisor takes this opportunity to analyze and recover from some of these
+  errors, with assistance from the OPAL layer. After handling the HMI, the
+  OPAL layer sends a summary of the error report and the status of the
+  recovery action using an HMI event. See ``opal-messages.rst`` for the HMI
+  event structure, under the ``OPAL_MSG_HMI_EVT`` section.
+
+  An HMI is thread specific. The reason for an HMI is available in a per-thread
+  Hypervisor Maintenance Exception Register (HMER). The Hypervisor Maintenance
+  Exception Enable Register (HMEER) is per core. A bit in the HMER must be
+  enabled by the corresponding bit in the HMEER in order to cause an HMI.
+
+  Several interrupt reasons are routed in parallel to each of the
+  thread-specific copies. Each thread can only clear bits in its own HMER.
+  The OPAL handler on each thread clears the respective bit from its HMER
+  after handling the error.
+
+List of errors that cause HMI
+=============================
+
+  - CPU Errors
+
+    - Processor Core checkstop
+    - Processor retry recovery
+    - NX/NPU/CAPP checkstop
+
+  - Timer facility Errors
+
+    - ChipTOD Errors
+
+      - ChipTOD sync check and parity errors
+      - ChipTOD configuration register parity errors
+      - ChipTOD topology failover
+
+    - Timebase (TB) errors
+
+      - TB parity/residue error
+      - TFMR parity and firmware control error
+      - DEC/HDEC/PURR/SPURR parity errors
+
+HMI handling
+============
+
+  Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
+  The OPAL handler scans the Fault Isolation Registers (FIRs) of each
+  core/NX/NPU to detect the exact reason for the checkstop and reports it back
+  to the host along with the disposition.
+
+  Processor recovery is reported through HMER bits 2, 3 and 11. These are
+  informational messages only, and no extra recovery is required.
+
+  Timer facility errors are reported through HMER bit 4. These are all
+  recoverable errors. The exact reason for the error is stored in the
+  Timer Facility Management Register (TFMR). Some Timer facility errors
+  affect the TB and some affect the TOD. The TOD is per-chip
+  Time-Of-Day logic that holds the actual time value of the chip and
+  communicates with every TOD in the system to achieve a synchronized
+  timer value within the system. The TB is a per-core 64-bit register that
+  derives its value from the ChipTOD at startup and is then periodically
+  incremented by the STEP signal provided by the TOD. In a multi-socket system,
+  TODs are always configured as master and backup TOD under the primary and
+  secondary topology configurations respectively.
+
+  A TB error generates an HMI on all threads of the affected core. TB errors
+  other than DEC/HDEC/PURR/SPURR parity errors stop the TB, making it invalid.
+  As part of TB recovery, the OPAL HMI handler synchronizes with all threads,
+  clears the TB errors and then re-syncs the TB with the TOD value, putting it
+  back in the running state.
+
+  TOD errors generate an HMI on every core/thread of the affected chip. The
+  reason for a TOD error is stored in the TOD ERROR register (0x40030). As part
+  of the recovery, the OPAL HMI handler clears the TOD error and then requests a
+  new TOD value from another running ChipTOD in the system. Sometimes, if a
+  primary ChipTOD is in error, a TOD topology switch may be needed to recover
+  from the error. A TOD topology switch makes the backup the new active master.
+
+OPAL_HANDLE_HMI and OPAL_HANDLE_HMI2
+====================================
+::
+
+  #define OPAL_HANDLE_HMI 98
+  #define OPAL_HANDLE_HMI2 166
+
+``OPAL_HANDLE_HMI``
+
+``OPAL_HANDLE_HMI2``
+  When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must call
+  ``OPAL_HANDLE_HMI`` or ``OPAL_HANDLE_HMI2``. ``OPAL_HANDLE_HMI`` is the old
+  interface. ``OPAL_HANDLE_HMI2`` is a newly introduced OPAL call that returns
+  information directly to Linux: a 64-bit flag mask that currently
+  indicates which timer facilities were lost and whether an
+  event was generated.
+
+OPAL_HANDLE_HMI
+---------------
+Syntax: ::
+
+  int64_t opal_handle_hmi(void)
+
+OPAL_HANDLE_HMI2
+----------------
+Syntax: ::
+
+  int64_t opal_handle_hmi2(__be64 *out_flags)
+
+Parameters
+^^^^^^^^^^
+
+  ``__be64 *out_flags``
+
+  Returns the 64-bit flag mask that indicates which timer facilities
+  were lost and whether an event was generated.
+
+::
+
+  /* OPAL_HANDLE_HMI2 out_flags */
+  enum {
+          OPAL_HMI_FLAGS_TB_RESYNC = (1ull << 0), /* Timebase has been resynced */
+          OPAL_HMI_FLAGS_DEC_LOST  = (1ull << 1), /* DEC lost, needs to be reprogrammed */
+          OPAL_HMI_FLAGS_HDEC_LOST = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
+          OPAL_HMI_FLAGS_NEW_EVENT = (1ull << 63), /* An event has been created */
+  };
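
To illustrate how a caller might interpret the ``out_flags`` mask described above, here is a minimal sketch in C. It is not taken from skiboot or Linux: the names ``hmi_flag_summary``, ``be64_to_host`` and ``decode_hmi2_flags`` are hypothetical, and only the flag values come from the enum in this patch. The sketch assumes the mask is delivered big-endian, as the ``__be64`` type suggests, and converts it byte by byte so it does not depend on any platform byte-swap helper. How the OPAL call itself is issued is out of scope here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the OPAL_HANDLE_HMI2 out_flags enum in this patch. */
#define OPAL_HMI_FLAGS_TB_RESYNC (1ull << 0)  /* Timebase has been resynced */
#define OPAL_HMI_FLAGS_DEC_LOST  (1ull << 1)  /* DEC lost, needs to be reprogrammed */
#define OPAL_HMI_FLAGS_HDEC_LOST (1ull << 2)  /* HDEC lost, needs to be reprogrammed */
#define OPAL_HMI_FLAGS_NEW_EVENT (1ull << 63) /* An event has been created */

/* Hypothetical summary structure, for illustration only. */
struct hmi_flag_summary {
	bool tb_resynced;
	bool dec_lost;
	bool hdec_lost;
	bool new_event;
};

/*
 * Convert a 64-bit value stored in big-endian byte order to host order.
 * Reading the bytes individually works on both little- and big-endian hosts.
 */
static uint64_t be64_to_host(uint64_t be)
{
	unsigned char bytes[8];
	uint64_t value = 0;

	memcpy(bytes, &be, sizeof(bytes));
	for (size_t i = 0; i < sizeof(bytes); i++)
		value = (value << 8) | bytes[i];
	return value;
}

/* Decode the big-endian flag mask into individual booleans. */
static struct hmi_flag_summary decode_hmi2_flags(const uint64_t *out_flags_be)
{
	struct hmi_flag_summary summary;
	uint64_t flags;

	assert(out_flags_be != NULL);
	flags = be64_to_host(*out_flags_be);

	summary.tb_resynced = (flags & OPAL_HMI_FLAGS_TB_RESYNC) != 0;
	summary.dec_lost    = (flags & OPAL_HMI_FLAGS_DEC_LOST) != 0;
	summary.hdec_lost   = (flags & OPAL_HMI_FLAGS_HDEC_LOST) != 0;
	summary.new_event   = (flags & OPAL_HMI_FLAGS_NEW_EVENT) != 0;
	return summary;
}
```

A caller would pass the address of the value filled in by the OPAL call and then act on the result, for example by reprogramming DEC when ``dec_lost`` is set or by looking for an HMI event message when ``new_event`` is set. Testing each flag independently matters because a single call can report several conditions at once.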