opal/hmi: Fix a TOD HMI failure during a race condition.

Message ID	147109387161.17720.8901461112236428439.stgit@jupiter.in.ibm.com
State	Accepted

Headers

From: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
To: Stewart Smith <stewart@linux.vnet.ibm.com>,
	skiboot list <skiboot@lists.ozlabs.org>
Date: Sat, 13 Aug 2016 18:41:11 +0530
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
Message-Id: <147109387161.17720.8901461112236428439.stgit@jupiter.in.ibm.com>
Subject: [Skiboot] [PATCH] opal/hmi: Fix a TOD HMI failure during a race
	condition.
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"
	<skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Commit Message

Mahesh J Salgaonkar Aug. 13, 2016, 1:11 p.m. UTC

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Another interrupt can wake a CPU at the 0x100 vector just when an HMI
for a TOD error is also pending. In this rare race, if the CPU has
woken from a power-saving mode with TB loss (tb_loss), it invokes an
OPAL call to resync the TB. Since the TOD is already in an error state,
the TB resync times out, leaving TFMR bit 18 set to '1'. (TFMR[18]=1
means the TB is prepared to receive a new value from the TOD. Once the
new value is received this bit is reset to '0'; otherwise the TB stays
in the waiting state.) When the HMI is then delivered, HMI handling may
find all TFMR errors already cleared, yet fail to restore the TB because
TFMR bit 18 is still set. This leads to an HMI recovery failure and a
kernel crash.

Fix this by also clearing TB errors when TFMR[18] is set to 1. This
ensures the TB is in a clean state before the TB restore process starts.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 hw/chiptod.c |    7 +++++++
 1 file changed, 7 insertions(+)
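
The race described in the commit message can be illustrated with a small
stand-alone model. This is only a sketch, not skiboot code: the toy_*
and needs_clear_* helpers are hypothetical, and
TFMR_TB_ERRORS_PLACEHOLDER is a stand-in for the real TB error bits, with
a position chosen purely for illustration. Only TFMR bit 18 and its
meaning are taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(n) (1ULL << (63 - (n)))

/* TFMR[18]: TB is waiting for a new value from the TOD (per the commit message). */
#define TFMR_MOVE_CHIP_TOD_TO_TB PPC_BIT(18)

/* Hypothetical stand-in for the TB error bits; the position is illustrative only. */
#define TFMR_TB_ERRORS_PLACEHOLDER PPC_BIT(44)

/*
 * Toy model of a TB resync: arm TFMR[18], then wait for the TOD value.
 * If the TOD is in error the value never arrives, the wait times out,
 * and TFMR[18] stays set.
 */
static bool toy_resync_tb(uint64_t *tfmr, bool tod_in_error)
{
	*tfmr |= TFMR_MOVE_CHIP_TOD_TO_TB;
	if (tod_in_error)
		return false;			/* timed out, bit 18 left set */
	*tfmr &= ~TFMR_MOVE_CHIP_TOD_TO_TB;	/* value received, bit 18 reset */
	return true;
}

/* Recovery check before the patch: only the error bits trigger a clear. */
static bool needs_clear_before_patch(uint64_t tfmr)
{
	return (tfmr & TFMR_TB_ERRORS_PLACEHOLDER) != 0;
}

/* Recovery check after the patch: a stuck TFMR[18] also triggers a clear. */
static bool needs_clear_after_patch(uint64_t tfmr)
{
	return (tfmr & (TFMR_TB_ERRORS_PLACEHOLDER |
			TFMR_MOVE_CHIP_TOD_TO_TB)) != 0;
}
```

In this model, calling toy_resync_tb() with tod_in_error set leaves only
bit 18 set in the TFMR value. The pre-patch check then skips the error
clear for that value, while the post-patch check performs it, which is
the behaviour change the patch is aiming for.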

Comments

Ananth N Mavinakayanahalli Aug. 16, 2016, 7:48 a.m. UTC | #1

On Sat, Aug 13, 2016 at 06:41:11PM +0530, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> There are chances where another interrupt can wake a CPU in 0x100
> vector just when HMI for TOD error is also pending. In such a rare race
> condition if CPU has woken up with tb_loss power saving mode, it will
> invoke opal call to resync the TB. Since TOD is already in error state,
> resync TB will timeout leaving TFMR bit 18 set to '1'. (TFMR[18]=1 means
> TB is prepared to receive new value from TOD. Once the new value is
> received this bit gets reset to '0', otherwise TB would stay in waiting
> state). When HMI is delivered, it may find all TFMR errors are already
> cleared but would fail to restore TB since TFMR bit 18 is already set.
> This leads to HMI recovery failure causing a kernel crash.
> 
> This patch fixes this by clearing of TB errors if TFMR[18] is set to 1.
> This makes sure that TB is in clean state before TB restore process starts.

Does this need to go into older firmware release updates if there are
any?

Ananth

Mahesh J Salgaonkar Aug. 16, 2016, 10:19 a.m. UTC | #2

On 08/16/2016 01:18 PM, Ananth N Mavinakayanahalli wrote:
> On Sat, Aug 13, 2016 at 06:41:11PM +0530, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> There are chances where another interrupt can wake a CPU in 0x100
>> vector just when HMI for TOD error is also pending. In such a rare race
>> condition if CPU has woken up with tb_loss power saving mode, it will
>> invoke opal call to resync the TB. Since TOD is already in error state,
>> resync TB will timeout leaving TFMR bit 18 set to '1'. (TFMR[18]=1 means
>> TB is prepared to receive new value from TOD. Once the new value is
>> received this bit gets reset to '0', otherwise TB would stay in waiting
>> state). When HMI is delivered, it may find all TFMR errors are already
>> cleared but would fail to restore TB since TFMR bit 18 is already set.
>> This leads to HMI recovery failure causing a kernel crash.
>>
>> This patch fixes this by clearing of TB errors if TFMR[18] is set to 1.
>> This makes sure that TB is in clean state before TB restore process starts.
> 
> Does this need to go into older firmware release updates if there are
> any?

Yes, it should go as updates for FW840 and above.

Thanks,
-Mahesh.

Stewart Smith Aug. 25, 2016, 9:03 a.m. UTC | #3

Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> There are chances where another interrupt can wake a CPU in 0x100
> vector just when HMI for TOD error is also pending. In such a rare race
> condition if CPU has woken up with tb_loss power saving mode, it will
> invoke opal call to resync the TB. Since TOD is already in error state,
> resync TB will timeout leaving TFMR bit 18 set to '1'. (TFMR[18]=1 means
> TB is prepared to receive new value from TOD. Once the new value is
> received this bit gets reset to '0', otherwise TB would stay in waiting
> state). When HMI is delivered, it may find all TFMR errors are already
> cleared but would fail to restore TB since TFMR bit 18 is already set.
> This leads to HMI recovery failure causing a kernel crash.
>
> This patch fixes this by clearing of TB errors if TFMR[18] is set to 1.
> This makes sure that TB is in clean state before TB restore process starts.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  hw/chiptod.c |    7 +++++++
>  1 file changed, 7 insertions(+)

Thanks,

merged to:
026b9a1  master
bb18811  skiboot-5.1.x
0abc875  skiboot-5.3.x

diff --git a/hw/chiptod.c b/hw/chiptod.c
index 58302fe..f647830 100644
--- a/hw/chiptod.c
+++ b/hw/chiptod.c
@@ -1498,11 +1498,18 @@  int chiptod_recover_tb_errors(void)
 	 * Check for TB errors.
 	 * On Sync check error, bit 44 of TFMR is set. Check for it and
 	 * clear it.
+	 *
+	 * In some rare situations we may have all TB errors already cleared,
+	 * but TB stuck in waiting for new value from TOD with TFMR bit 18
+	 * set to '1'. This uncertain state of TB would fail the process
+	 * of getting TB back into running state. Get TB in clean initial
+	 * state by clearing TB errors if TFMR[18] is set.
 	 */
 	if ((tfmr & SPR_TFMR_TB_MISSING_STEP) ||
 		(tfmr & SPR_TFMR_TB_RESIDUE_ERR) ||
 		(tfmr & SPR_TFMR_FW_CONTROL_ERR) ||
 		(tfmr & SPR_TFMR_TBST_CORRUPT) ||
+		(tfmr & SPR_TFMR_MOVE_CHIP_TOD_TO_TB) ||
 		(tfmr & SPR_TFMR_TB_MISSING_SYNC)) {
 		if (!tfmr_recover_tb_errors(tfmr)) {
 			rc = 0;
