slw: improve error message for SLW timer stuck

Message ID	1473299410-17557-1-git-send-email-stewart@linux.vnet.ibm.com
State	Superseded
Headers	show Return-Path: <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <skiboot@lists.ozlabs.org> from <stewart@linux.vnet.ibm.com>; Wed, 7 Sep 2016 19:50:26 -0600 Gateway: Authorized Use Only! Violators will be prosecuted; Wed, 7 Sep 2016 19:50:25 -0600 From: Stewart Smith <stewart@linux.vnet.ibm.com> To: skiboot@lists.ozlabs.org Date: Thu, 8 Sep 2016 11:50:10 +1000 Message-Id: <1473299410-17557-1-git-send-email-stewart@linux.vnet.ibm.com> Subject: [Skiboot] [PATCH] slw: improve error message for SLW timer stuck Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Skiboot" <skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Message ID

1473299410-17557-1-git-send-email-stewart@linux.vnet.ibm.com

State

Superseded

Headers

From: Stewart Smith <stewart@linux.vnet.ibm.com>
To: skiboot@lists.ozlabs.org
Date: Thu,  8 Sep 2016 11:50:10 +1000
Message-Id: <1473299410-17557-1-git-send-email-stewart@linux.vnet.ibm.com>
Subject: [Skiboot] [PATCH] slw: improve error message for SLW timer stuck
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Skiboot"
	<skiboot-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

We still register dump, but only to in memory console buffer by default. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> --- hw/slw.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)

Comments

Shilpasri G Bhat Sept. 8, 2016, 5:59 a.m. UTC | #1

Hi Stewart,

On 09/08/2016 07:20 AM, Stewart Smith wrote:
> +				 * operations (for example). It may also
> +				 * increase jitter, as more time is spent
> +				 * inside OPAL every time a timer is requested
> +				 * and we time out after 1ms.

This error sets 'slw_has_timer' to false, which disables queuing of new timers
and thus we wont be in the timeout loop after this error. So this error will not
bring in any new jitter.

Thanks and Regards.
Shilpa

Stewart Smith Sept. 9, 2016, 4:09 a.m. UTC | #2

Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> writes:
> Hi Stewart,
>
> On 09/08/2016 07:20 AM, Stewart Smith wrote:
>> +				 * operations (for example). It may also
>> +				 * increase jitter, as more time is spent
>> +				 * inside OPAL every time a timer is requested
>> +				 * and we time out after 1ms.
>
> This error sets 'slw_has_timer' to false, which disables queuing of new timers
> and thus we wont be in the timeout loop after this error. So this error will not
> bring in any new jitter.

Good point! Maybe I should have payed attention to the code a bit more
when I wrote the text. I blame early morning meetings and insufficient
coffee.

Fixed in V2.

diff --git a/hw/slw.c b/hw/slw.c
index 74b9cd5..640b287 100644
--- a/hw/slw.c
+++ b/hw/slw.c
@@ -1197,13 +1197,13 @@  static void slw_dump_timer_ffdc(void)
 	 * root-cause. OPAL/skiboot may be stuck on some operation that
 	 * requires SLW timer state machine (e.g. core powersaving)
 	 */
-	prlog(PR_ERR, "SLW: Register state:\n");
+	prlog(PR_DEBUG, "SLW: Register state:\n");
 
 	for (i = 0; i < ARRAY_SIZE(dump_regs); i++) {
 		uint32_t reg = dump_regs[i];
 		rc = xscom_read(slw_timer_chip, reg, &val);
 		if (rc) {
-			prlog(PR_ERR, "SLW: XSCOM error %lld reading"
+			prlog(PR_DEBUG, "SLW: XSCOM error %lld reading"
 			      " reg 0x%x\n", rc, reg);
 			break;
 		}
@@ -1250,7 +1250,26 @@  void slw_update_timer_expiry(uint64_t new_target)
 			if (!(gen & 1))
 				break;
 			if (tb_compare(now + msecs_to_tb(1), mftb()) == TB_ABEFOREB) {
-				prerror("SLW: Stuck with odd generation !\n");
+				/**
+				 * @fwts-label SLWTimerStuck
+				 * @fwts-advice The SLeep/Winkle Engine (SLW)
+				 * failed to increment the generation number
+				 * within our timeout period (it *should* have
+				 * done so within ~10us, not >1ms. OPAL uses
+				 * the SLW timer to schedule some operations,
+				 * but can fall back to the (much less frequent
+				 * OPAL poller, which although does not affect
+				 * functionality, runs *much* less frequently.
+				 * This could have the effect of slow I2C
+				 * operations (for example). It may also
+				 * increase jitter, as more time is spent
+				 * inside OPAL every time a timer is requested
+				 * and we time out after 1ms. This error may
+				 * also occur if the machine is connected
+				 * to via soft FSI.
+				 */
+				prerror("SLW: timer stuck, falling back to OPAL pollers. Possible increased jitter and slower I2C.\n");
+				prlog(PR_DEBUG, "SLW: Stuck with odd generation !\n");
 				slw_has_timer = false;
 				slw_dump_timer_ffdc();
 				return;

slw: improve error message for SLW timer stuck

Commit Message

Comments

Patch