diff mbox

[v2,08/10] powerpc/opal: Add opal_async_wait_response_interruptible() to opal-async

Message ID 20170710013106.27276-9-cyrilbur@gmail.com
State Superseded
Headers show

Commit Message

Cyril Bur July 10, 2017, 1:31 a.m. UTC
This patch adds an _interruptible version of opal_async_wait_response().
This is useful when a long running OPAL call is performed on behalf of a
userspace thread, for example, the opal_flash_{read,write,erase}
functions performed by the powernv-flash MTD driver.

It is foreseeable that these functions would take upwards of two minutes
causing the wait_event() to block long enough to cause hung task
warnings. Furthermore, wait_event_interruptible() is preferable as
otherwise there is no way for signals to stop the process which is going
to be confusing in userspace.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/include/asm/opal.h             |  2 +
 arch/powerpc/platforms/powernv/opal-async.c | 87 +++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 4 deletions(-)

Comments

David Laight July 10, 2017, 2:05 p.m. UTC | #1
From: Cyril Bur
> Sent: 10 July 2017 02:31
> This patch adds an _interruptible version of opal_async_wait_response().
> This is useful when a long running OPAL call is performed on behalf of a
> userspace thread, for example, the opal_flash_{read,write,erase}
> functions performed by the powernv-flash MTD driver.
> 
> It is foreseeable that these functions would take upwards of two minutes
> causing the wait_event() to block long enough to cause hung task
> warnings. Furthermore, wait_event_interruptible() is preferable as
> otherwise there is no way for signals to stop the process which is going
> to be confusing in userspace.

ISTM that if you are doing (something like) a flash full device erase
(that really can take minutes) it isn't actually an interruptible
operation - the flash chip will still be busy.
So allowing the user process be interrupted just leaves a big mess.

OTOH the 'hung task' warning isn't the only problem with uninterruptible
sleeps - the processes also count towards the 'load average'.
Some software believes the 'load average' is a meaningful value.

It would be more generally useful for tasks to be able to sleep
uninterruptibly without counting towards the 'load average' or triggering
the 'task stuck' warning.
(I've code somewhere that sleeps interruptibly unless there is a signal
pending when it sleeps uninterruptibly.)

WRT flash erases, 'whole device' erases aren't significantly quicker
than sector by sector erases.
The latter can be interrupted between sectors.
I'm not sure you'd want to do writes than lock down enough kernel
memory to take even a second to complete.

	David
Cyril Bur July 11, 2017, 12:48 a.m. UTC | #2
On Mon, 2017-07-10 at 14:05 +0000, David Laight wrote:
> From: Cyril Bur
> > Sent: 10 July 2017 02:31
> > This patch adds an _interruptible version of opal_async_wait_response().
> > This is useful when a long running OPAL call is performed on behalf of a
> > userspace thread, for example, the opal_flash_{read,write,erase}
> > functions performed by the powernv-flash MTD driver.
> > 
> > It is foreseeable that these functions would take upwards of two minutes
> > causing the wait_event() to block long enough to cause hung task
> > warnings. Furthermore, wait_event_interruptible() is preferable as
> > otherwise there is no way for signals to stop the process which is going
> > to be confusing in userspace.
> 
> ISTM that if you are doing (something like) a flash full device erase
> (that really can take minutes) it isn't actually an interruptible
> operation - the flash chip will still be busy.
> So allowing the user process be interrupted just leaves a big mess.
> 

Agreed.

> OTOH the 'hung task' warning isn't the only problem with uninterruptible
> sleeps - the processes also count towards the 'load average'.
> Some software believes the 'load average' is a meaningful value.
> 

Yes, and because the read write and erase ops are actually calls into
firmware which in some cases completely emulates flash we can mitigate
the mess of allowing the process to be interrupted that you mention
above.

> It would be more generally useful for tasks to be able to sleep
> uninterruptibly without counting towards the 'load average' or triggering
> the 'task stuck' warning.
> (I've code somewhere that sleeps interruptibly unless there is a signal
> pending when it sleeps uninterruptibly.)
> 

I'm not sure what you mean here, if I understand correctly, this is
what I'm doing. In the patch to the powernv_flash driver which uses
opal_async_wait_response_interruptible() I essentially do the
interruptible and if it breaks early the driver determines if it is
safe to return to userspace otherwise it sleeps uninterruptibly
(hopefully not for long). I was hesitant to put that logic here and
prefered to leave it up to the caller to make the decision as to what
to do.

> WRT flash erases, 'whole device' erases aren't significantly quicker
> than sector by sector erases.

Yes, I'm rewriting some of the userspace tools we have to do everything
sector by sector. MTD interfaces allow big reads/writes and erases so
we should still handle it as best we can.

> The latter can be interrupted between sectors.
> I'm not sure you'd want to do writes than lock down enough kernel
> memory to take even a second to complete.
> 

The MTD core actually chunks them up before passing them to the to
backing driver so I don't think holding too much memory is a problem.
Because the time spent in OPAL servicing flash calls is hard to
determine and might not even be that related to the size of the
operation, even smaller ops might still spend 'time' in OPAL.

Cyril

> 	David
>
diff mbox

Patch

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5553ad2f3e53..6e9e53d744f3 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -294,6 +294,8 @@  extern void opal_notifier_update_evt(uint64_t evt_mask, uint64_t evt_val);
 extern int opal_async_get_token_interruptible(void);
 extern int opal_async_release_token(int token);
 extern int opal_async_wait_response(uint64_t token, struct opal_msg *msg);
+extern int opal_async_wait_response_interruptible(uint64_t token,
+		struct opal_msg *msg);
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 
 struct rtc_time;
diff --git a/arch/powerpc/platforms/powernv/opal-async.c b/arch/powerpc/platforms/powernv/opal-async.c
index d692372a0363..f6b30cfceb8f 100644
--- a/arch/powerpc/platforms/powernv/opal-async.c
+++ b/arch/powerpc/platforms/powernv/opal-async.c
@@ -26,6 +26,8 @@ 
 enum opal_async_token_state {
 	ASYNC_TOKEN_FREE,
 	ASYNC_TOKEN_ALLOCATED,
+	ASYNC_TOKEN_DISPATCHED,
+	ASYNC_TOKEN_ABANDONED,
 	ASYNC_TOKEN_COMPLETED
 };
 
@@ -59,8 +61,10 @@  static int __opal_async_get_token(void)
 }
 
 /*
- * Note: If the returned token is used in an opal call and opal returns
- * OPAL_ASYNC_COMPLETION you MUST opal_async_wait_response() before
+ * Note: If the returned token is used in an opal call and opal
+ * returns OPAL_ASYNC_COMPLETION you MUST one of
+ * opal_async_wait_response() or
+ * opal_async_wait_response_interruptible() at least once before
  * calling another other opal_async_* function
  */
 int opal_async_get_token_interruptible(void)
@@ -97,6 +101,16 @@  static int __opal_async_release_token(int token)
 		opal_async_tokens[token].state = ASYNC_TOKEN_FREE;
 		rc = 0;
 		break;
+	/*
+	 * DISPATCHED and ABANDONED tokens must wait for OPAL to
+	 * respond.
+	 * Mark a DISPATCHED token as ABANDONED so that the response
+	 * response handling code knows no one cares and that it can
+	 * free it then.
+	 */
+	case ASYNC_TOKEN_DISPATCHED:
+		opal_async_tokens[token].state = ASYNC_TOKEN_ABANDONED;
+		/* Fall through */
 	default:
 		rc = 1;
 	}
@@ -129,7 +143,11 @@  int opal_async_wait_response(uint64_t token, struct opal_msg *msg)
 		return -EINVAL;
 	}
 
-	/* Wakeup the poller before we wait for events to speed things
+	/*
+	 * There is no need to mark the token as dispatched, wait_event()
+	 * will block until the token completes.
+	 *
+	 * Wakeup the poller before we wait for events to speed things
 	 * up on platforms or simulators where the interrupts aren't
 	 * functional.
 	 */
@@ -142,11 +160,66 @@  int opal_async_wait_response(uint64_t token, struct opal_msg *msg)
 }
 EXPORT_SYMBOL_GPL(opal_async_wait_response);
 
+int opal_async_wait_response_interruptible(uint64_t token, struct opal_msg *msg)
+{
+	unsigned long flags;
+	int ret;
+
+	if (token >= opal_max_async_tokens) {
+		pr_err("%s: Invalid token passed\n", __func__);
+		return -EINVAL;
+	}
+
+	if (!msg) {
+		pr_err("%s: Invalid message pointer passed\n", __func__);
+		return -EINVAL;
+	}
+
+	/*
+	 * The first time this gets called we mark the token as DISPATCHED
+	 * so that if wait_event_interruptible() returns not zero and the
+	 * caller frees the token, we know not to actually free the token
+	 * until the response comes.
+	 *
+	 * Only change if the token is ALLOCATED - it may have been
+	 * completed even before the caller gets around to calling this
+	 * the first time.
+	 *
+	 * There is also a dirty great comment at the token allocation
+	 * function that if the opal call returns OPAL_ASYNC_COMPLETION to
+	 * the caller then the caller *must* call this or the not
+	 * interruptible version before doing anything else with the
+	 * token.
+	 */
+	if (opal_async_tokens[token].state == ASYNC_TOKEN_ALLOCATED) {
+		spin_lock_irqsave(&opal_async_comp_lock, flags);
+		if (opal_async_tokens[token].state == ASYNC_TOKEN_ALLOCATED)
+			opal_async_tokens[token].state = ASYNC_TOKEN_DISPATCHED;
+		spin_unlock_irqrestore(&opal_async_comp_lock, flags);
+	}
+
+	/*
+	 * Wakeup the poller before we wait for events to speed things
+	 * up on platforms or simulators where the interrupts aren't
+	 * functional.
+	 */
+	opal_wake_poller();
+	ret = wait_event_interruptible(opal_async_wait,
+			opal_async_tokens[token].state ==
+			ASYNC_TOKEN_COMPLETED);
+	if (!ret)
+		memcpy(msg, &opal_async_tokens[token].response, sizeof(*msg));
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(opal_async_wait_response_interruptible);
+
 /* Called from interrupt context */
 static int opal_async_comp_event(struct notifier_block *nb,
 		unsigned long msg_type, void *msg)
 {
 	struct opal_msg *comp_msg = msg;
+	enum opal_async_token_state state;
 	unsigned long flags;
 	uint64_t token;
 
@@ -154,11 +227,17 @@  static int opal_async_comp_event(struct notifier_block *nb,
 		return 0;
 
 	token = be64_to_cpu(comp_msg->params[0]);
-	memcpy(&opal_async_tokens[token].response, comp_msg, sizeof(*comp_msg));
 	spin_lock_irqsave(&opal_async_comp_lock, flags);
+	state = opal_async_tokens[token].state;
 	opal_async_tokens[token].state = ASYNC_TOKEN_COMPLETED;
 	spin_unlock_irqrestore(&opal_async_comp_lock, flags);
 
+	if (state == ASYNC_TOKEN_ABANDONED) {
+		/* Free the token, no one else will */
+		opal_async_release_token(token);
+		return 0;
+	}
+	memcpy(&opal_async_tokens[token].response, comp_msg, sizeof(*comp_msg));
 	wake_up(&opal_async_wait);
 
 	return 0;