[v2,15/15] opal/hmi: Add documentation for opal_handle_hmi2 call

Message ID 152390007634.2566.10635363133250653347.stgit@jupiter.in.ibm.com
State Accepted
Series
  • opal/hmi: Rework HMI handling.

Commit Message

Mahesh Jagannath Salgaonkar April 16, 2018, 5:34 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 doc/opal-api/opal-handle-hmi-98-166.rst |  126 +++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 doc/opal-api/opal-handle-hmi-98-166.rst

Patch

diff --git a/doc/opal-api/opal-handle-hmi-98-166.rst b/doc/opal-api/opal-handle-hmi-98-166.rst
new file mode 100644
index 000000000..05f707d71
--- /dev/null
+++ b/doc/opal-api/opal-handle-hmi-98-166.rst
@@ -0,0 +1,126 @@ 
+Hypervisor Maintenance Interrupt (HMI)
+======================================
+
+  A Hypervisor Maintenance Interrupt (HMI) usually reports errors related to
+  processor recovery/checkstop, NX/NPU checkstop and the Timer facility. The
+  hypervisor takes this opportunity to analyze and recover from some of these
+  errors, with assistance from the OPAL layer. After handling an HMI, OPAL
+  sends a summary of the error report and the status of the recovery action
+  in an HMI event. See ``opal-messages.rst`` for the HMI event structure,
+  under the ``OPAL_MSG_HMI_EVT`` section.
+
+  An HMI is thread specific. The reason for an HMI is recorded in the
+  per-thread Hypervisor Maintenance Exception Register (HMER). The Hypervisor
+  Maintenance Exception Enable Register (HMEER) is per core. A bit set in the
+  HMER causes an HMI only if the corresponding bit in the HMEER is enabled.
+
+  Several interrupt reasons are routed in parallel to each of the
+  thread-specific HMER copies. Each thread can only clear bits in its own
+  HMER. The OPAL handler on each thread clears the respective bit in its HMER
+  after handling the error.
+
+List of errors that cause HMI
+=============================
+
+  - CPU Errors
+
+   - Processor Core checkstop
+   - Processor retry recovery
+   - NX/NPU/CAPP checkstop
+
+  - Timer facility Errors
+
+   - ChipTOD Errors
+
+    - ChipTOD sync check and parity errors
+    - ChipTOD configuration register parity errors
+    - ChipTOD topology failover
+
+   - Timebase (TB) errors
+
+    - TB parity/residue error
+    - TFMR parity and firmware control error
+    - DEC/HDEC/PURR/SPURR parity errors
+
+HMI handling
+============
+
+   Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
+   The OPAL handler scans the Fault Isolation Register (FIR) of each
+   core/NX/NPU to detect the exact reason for the checkstop and reports it
+   back to the host along with the disposition.
+
+   Processor recovery is reported through HMER bits 2, 3 and 11. These are
+   informational messages only, and no extra recovery is required.
+
+   Timer facility errors are reported through HMER bit 4. These are all
+   recoverable errors. The exact reason for the error is stored in the
+   Timer Facility Management Register (TFMR). Some Timer facility errors
+   affect the TB and some affect the TOD. The TOD is per-chip
+   Time-Of-Day logic that holds the actual time value of the chip and
+   communicates with every other TOD in the system to keep a synchronized
+   timer value within the system. The TB is a per-core 64-bit register that
+   derives its value from the ChipTOD at startup and is then periodically
+   incremented by the STEP signal provided by the TOD. In a multi-socket
+   system, TODs are always configured as master/backup TOD under the
+   primary/secondary topology configuration respectively.
+
+   A TB error generates an HMI on all threads of the affected core. TB
+   errors, except DEC/HDEC/PURR/SPURR parity errors, cause the TB to stop
+   running, making it invalid. As part of TB recovery, the OPAL HMI handler
+   synchronizes with all threads, clears the TB errors and then re-syncs the
+   TB with the TOD value, putting it back in the running state.
+
+   A TOD error generates an HMI on every core/thread of the affected chip. The
+   reason for the TOD error is stored in the TOD ERROR register (0x40030). As
+   part of the recovery, the OPAL HMI handler clears the TOD error and then
+   requests a new TOD value from another running ChipTOD in the system. If
+   the primary ChipTOD is in error, a TOD topology switch may be needed to
+   recover. A TOD topology switch makes the backup the new active master.
+
+OPAL_HANDLE_HMI and OPAL_HANDLE_HMI2
+====================================
+::
+
+   #define OPAL_HANDLE_HMI	98
+   #define OPAL_HANDLE_HMI2	166
+
+``OPAL_HANDLE_HMI``
+
+``OPAL_HANDLE_HMI2``
+  When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must
+  call ``OPAL_HANDLE_HMI`` or ``OPAL_HANDLE_HMI2``. ``OPAL_HANDLE_HMI`` is
+  the old interface. ``OPAL_HANDLE_HMI2`` is a newly introduced OPAL call
+  that returns information directly to Linux: a 64-bit flag mask that
+  currently reports which timer facilities were lost and whether an
+  event was generated.
+
+OPAL_HANDLE_HMI
+---------------
+Syntax: ::
+
+  int64_t opal_handle_hmi(void)
+
+OPAL_HANDLE_HMI2
+----------------
+Syntax: ::
+
+  int64_t opal_handle_hmi2(__be64 *out_flags)
+
+parameters
+^^^^^^^^^^
+
+  ``__be64 *out_flags``
+
+  Returns the 64-bit flag mask that provides info about which timer facilities
+  were lost, and whether an event was generated.
+
+::
+
+   /* OPAL_HANDLE_HMI2 out_flags */
+   enum {
+        OPAL_HMI_FLAGS_TB_RESYNC        = (1ull << 0), /* Timebase has been resynced */
+        OPAL_HMI_FLAGS_DEC_LOST         = (1ull << 1), /* DEC lost, needs to be reprogrammed */
+        OPAL_HMI_FLAGS_HDEC_LOST        = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
+        OPAL_HMI_FLAGS_NEW_EVENT        = (1ull << 63), /* An event has been created */
+   };
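
As an illustration of how a caller might consume ``out_flags``, here is a
minimal C sketch. It sits after the end of the diff and is not part of the
patch. The flag values are copied from the documentation above; the names
`struct hmi2_flags`, `be64_to_host()` and `decode_hmi2_flags()` are
hypothetical helpers invented for this example and are not OPAL or Linux
APIs. The sketch assumes the caller has the eight raw bytes that OPAL wrote
through the `__be64 *out_flags` pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * out_flags bits, values copied from the documentation in this patch.
 * Macros are used here because an enumerator of 1ull << 63 does not fit
 * in an int, which plain C enums require before C23.
 */
#define OPAL_HMI_FLAGS_TB_RESYNC  (1ull << 0)  /* Timebase has been resynced */
#define OPAL_HMI_FLAGS_DEC_LOST   (1ull << 1)  /* DEC lost, reprogram it */
#define OPAL_HMI_FLAGS_HDEC_LOST  (1ull << 2)  /* HDEC lost, reprogram it */
#define OPAL_HMI_FLAGS_NEW_EVENT  (1ull << 63) /* An event has been created */

/* Hypothetical decoded view of out_flags (illustration only). */
struct hmi2_flags {
	bool tb_resynced;
	bool dec_lost;
	bool hdec_lost;
	bool new_event;
};

/*
 * Assemble a big-endian 64-bit value from memory into host byte order.
 * Shifting byte by byte gives the same result on any host endianness.
 */
static uint64_t be64_to_host(const unsigned char bytes[8])
{
	uint64_t value = 0;

	for (int i = 0; i < 8; i++)
		value = (value << 8) | bytes[i];
	return value;
}

/* Split a host-order flag mask into one boolean per documented bit. */
static struct hmi2_flags decode_hmi2_flags(uint64_t flags)
{
	struct hmi2_flags out = {
		.tb_resynced = (flags & OPAL_HMI_FLAGS_TB_RESYNC) != 0,
		.dec_lost    = (flags & OPAL_HMI_FLAGS_DEC_LOST) != 0,
		.hdec_lost   = (flags & OPAL_HMI_FLAGS_HDEC_LOST) != 0,
		.new_event   = (flags & OPAL_HMI_FLAGS_NEW_EVENT) != 0,
	};

	return out;
}
```

A caller would pass the bytes written by `opal_handle_hmi2()` through
`be64_to_host()` and then `decode_hmi2_flags()`. Going by the flag comments
above, it would reprogram DEC or HDEC when the matching `*_lost` field is
set, and expect an HMI event message when `new_event` is set.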