hw/prd: Hold FSP notifications while PRD is inactive

Message ID 20200320081849.6651-1-oohall@gmail.com
State Accepted
Headers show
Series hw/prd: Hold FSP notifications while PRD is inactive

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (e19dddc58280e6120459053dfcbf9c026b0ac4f9)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Oliver O'Halloran March 20, 2020, 8:18 a.m. UTC
On FSP systems we rely on a service on the FSP to send us a notification
when the OCCs become active. On systems with NVDIMMs this is especially
critical because the OCC is responsible for starting the NVDIMM save
procedure when power fails.

The message sent from the FSP isn't sent to OPAL itself; rather, it's
sent to the PRD service running on the host (via OPAL). If this service
is not running, OPAL will currently send an error response back to the
FSP and drop the message. This causes problems because the "OCCs active"
message is generally sent while OPAL is still booting the system, so
the PRD daemon never gets notified that the OCCs are active.

Once the OS is running we rely on PRD to report the protection status
of the NVDIMMs on the system. However, because it never receives the
notification from the FSP it will always report the DIMMs as
unprotected because it thinks the OCCs are inactive.

This patch fixes the issue by allowing a single message to be held in
OPAL while PRD is inactive. Once OPAL receives a notification that PRD
has started, we deliver the message.
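
Roughly, the flow looks like the sketch below. This is a simplified,
illustrative-only version (held_notify, notify_from_fsp and on_prd_init
are made-up names for the sketch); the actual hw/prd.c change in the
diff further down also copies the message into a pre-allocated buffer
and takes events_lock around the queueing:

	/* Sketch: park a single FSP notification until the PRD daemon is up. */
	static struct opal_prd_msg *held_notify;	/* the held message, if any */
	static bool prd_active;				/* set once PRD sends its init message */

	static int notify_from_fsp(struct opal_prd_msg *msg)
	{
		if (!prd_active) {
			/* PRD isn't running yet: hold the message and tell
			 * the FSP we're busy rather than dropping it. */
			held_notify = msg;
			return FSP_STATUS_BUSY;
		}
		return opal_queue_prd_msg(msg);
	}

	static void on_prd_init(void)
	{
		prd_active = true;
		/* Replay the single held notification, if there is one. */
		if (held_notify)
			opal_queue_prd_msg(held_notify);
	}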

It's worth pointing out that this is kind of janky and brittle and would
probably break horribly if FSP notify messages were multi-part since
we could end up in a situation where only a single part of a multi-part
message is queued, with the rest being dropped. However, the only user
of the FSP notification message appears to be the OCC, and the OCC team
says it's not a problem. I'll take their word for it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
---
 hw/prd.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Oliver O'Halloran March 30, 2020, 6:48 a.m. UTC | #1
On Fri, Mar 20, 2020 at 7:19 PM Oliver O'Halloran <oohall@gmail.com> wrote:
> [snip]

Merged as d703ad5b8ea93f2bcd98ca2642dcd3c66da82c91

Patch

diff --git a/hw/prd.c b/hw/prd.c
index b9d04c79d543..a9c3b34c27ce 100644
--- a/hw/prd.c
+++ b/hw/prd.c
@@ -374,7 +374,7 @@  int prd_hbrt_fsp_msg_notify(void *data, u32 dsize)
 	int size, fw_notify_size;
 	int rc = FSP_STATUS_GENERIC_ERROR;
 
-	if (!prd_enabled || !prd_active) {
+	if (!prd_enabled) {
 		prlog(PR_NOTICE, "PRD: %s: PRD daemon is not ready\n",
 		      __func__);
 		return rc;
@@ -415,6 +415,12 @@  int prd_hbrt_fsp_msg_notify(void *data, u32 dsize)
 	fw_notify->type = cpu_to_be64(PRD_FW_MSG_TYPE_HBRT_FSP);
 	memcpy(&(fw_notify->mbox_msg), data, dsize);
 
+	if (!prd_active) {
+		// save the message, we'll deliver it when prd starts
+		rc = FSP_STATUS_BUSY;
+		goto unlock_events;
+	}
+
 	rc = opal_queue_prd_msg(prd_msg_fsp_notify);
 	if (!rc)
 		prd_msg_inuse = true;
@@ -455,6 +461,11 @@  static int prd_msg_handle_init(struct opal_prd_msg *msg)
 	 * interrupts */
 	lock(&events_lock);
 	prd_active = true;
+
+	if (prd_msg_fsp_notify) {
+		if (!opal_queue_prd_msg(prd_msg_fsp_notify))
+			prd_msg_inuse = true;
+	}
 	if (!prd_msg_inuse)
 		send_next_pending_event();
 	unlock(&events_lock);