| Field | Value |
|---|---|
| Message ID | 20190516175816.10558-4-hegdevasant@linux.vnet.ibm.com |
| State | Superseded |
| Series | [v4,01/10] core/opal: Increase opal-msg-size size |

| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (c8b5e8a95caf029ffe73ea18769fdd7f2da48ab4) |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
| snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:

> Linux uses the opal_get_msg (OPAL_GET_MSG) API to get OPAL messages. This
> interface supports up to 8 params (64 bytes). We have a requirement to send
> bigger data to Linux. This patch enhances OPAL to send bigger data to Linux.
>
> - Linux will use the "opal-msg-size" device tree property to allocate memory
>   for OPAL messages (the previous patch increased "opal-msg-size" to 64K).
>
> - Replaced the `reserved` field in "struct opal_msg" with `size`, so that the
>   Linux-side opal_get_msg user can detect the actual data size.
>
> - If buffer size < actual message size, then opal_get_msg will copy partial
>   data and return OPAL_PARTIAL to Linux.

Looking through the Linux code, we should probably very carefully document
the expected behaviour for larger messages and the impact, specifically
around what happens with OPAL_PARTIAL, and ensure we don't just print error
messages out from the kernel forever.

Specifically, we seem to have two places where we consume messages in Linux.

In opal.c:

	static void opal_handle_message(void)
	{
		....
		ret = opal_get_msg(__pa(&msg), sizeof(msg));
		/* No opal message pending. */
		if (ret == OPAL_RESOURCE)
			return;

		/* check for errors. */
		if (ret) {
			pr_warn("%s: Failed to retrieve opal message, err=%lld\n",
				__func__, ret);
			return;
		}

So this gives me pause: if we have a large message needing to be retrieved,
existing kernels will never clear it, and will spew pr_warns from here to
eternity.

And in opal-hmi.c:

	if (unrecoverable) {
		/* Pull all HMI events from OPAL before we panic. */
		while (opal_get_msg(__pa(&msg), sizeof(msg)) == OPAL_SUCCESS) {
			.....
			print_hmi_event_info(hmi_evt);

So if we have a large event in the middle of a bunch of fatal HMI info,
we'll lose the HMI event info.

So I think we need to ensure two things:
1) we don't lose unrecoverable HMIs
2) existing kernels degrade gracefully.

We can probably solve 1 by something only casually ugly (remove every event
that isn't an unrecoverable HMI, or don't return OPAL_PARTIAL when there's an
unrecoverable HMI event present).

For 2, I'm trying to think of what's good to do. My ideas go down two paths:

- if a message has gotten OPAL_PARTIAL twice, drop it.
- drop the message after the first OPAL_PARTIAL if the opal_get_msg size
  parameter is sizeof(msg) rather than what we put in the device tree.

Either way, we should very clearly document this (and the reasoning behind
it) in the OPAL API docs.

I haven't looked at what FreeBSD does, and we should probably do that, if
only to get into the habit of doing so.

> - Add a new variable "extended" to the "opal_msg_entry" structure to keep
>   track of messages that have more than 64 bytes of data. We will allocate
>   separate memory for these messages, and once the kernel consumes a message
>   we will release that memory.
>
> Cc: Jeremy Kerr <jk@ozlabs.org>
> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> Acked-by: Jeremy Kerr <jk@ozlabs.org>
> ---
>  core/opal-msg.c                | 66 ++++++++++++++++++++++++++++++------------
>  core/test/run-msg.c            |  6 ++--
>  doc/opal-api/opal-messages.rst |  2 +-
>  include/opal-api.h             |  2 +-
>  4 files changed, 53 insertions(+), 23 deletions(-)
>
> diff --git a/core/opal-msg.c b/core/opal-msg.c
> index 907a9e0af..af1ec7d00 100644
> --- a/core/opal-msg.c
> +++ b/core/opal-msg.c
> @@ -25,6 +25,7 @@
>  struct opal_msg_entry {
>  	struct list_node link;
>  	void (*consumed)(void *data, int status);
> +	bool extended;
>  	void *data;
>  	struct opal_msg msg;
>  };
> @@ -39,37 +40,47 @@ int _opal_queue_msg(enum opal_msg_type msg_type, void *data,
>  		    size_t params_size, const void *params)
>  {
>  	struct opal_msg_entry *entry;
> +	uint64_t entry_size;
> +
> +	if ((params_size + OPAL_MSG_HDR_SIZE) > OPAL_MSG_SIZE) {
> +		prlog(PR_DEBUG, "param_size (0x%x) > opal_msg param size (0x%x)\n",
> +		      (u32)params_size, (u32)(OPAL_MSG_SIZE - OPAL_MSG_HDR_SIZE));
> +		return OPAL_PARAMETER;
> +	}
>
>  	lock(&opal_msg_lock);
>
> -	entry = list_pop(&msg_free_list, struct opal_msg_entry, link);
> -	if (!entry) {
> -		prerror("No available node in the free list, allocating\n");
> -		entry = zalloc(sizeof(struct opal_msg_entry));
> +	if (params_size > OPAL_MSG_FIXED_PARAMS_SIZE) {
> +		entry_size = sizeof(struct opal_msg_entry) + params_size;
> +		entry_size -= OPAL_MSG_FIXED_PARAMS_SIZE;
> +		entry = zalloc(entry_size);
> +		if (entry)
> +			entry->extended = true;
> +	} else {
> +		entry = list_pop(&msg_free_list, struct opal_msg_entry, link);
>  		if (!entry) {
> -			prerror("Allocation failed\n");
> -			unlock(&opal_msg_lock);
> -			return OPAL_RESOURCE;
> +			prerror("No available node in the free list, allocating\n");
> +			entry = zalloc(sizeof(struct opal_msg_entry));

I'm tempted to say we should switch to using the pool allocator for these
and hard failing when we run out. This could come in a separate patch
though. We've (more than once) gotten everything fairly wrong and eaten up
all of the skiboot heap with messages that aren't being consumed, causing
everything to explode even more horribly than it was already exploding.
On 05/21/2019 12:50 PM, Stewart Smith wrote:
> Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:
>
>> Linux uses the opal_get_msg (OPAL_GET_MSG) API to get OPAL messages. This
>> interface supports up to 8 params (64 bytes). We have a requirement to
>> send bigger data to Linux. This patch enhances OPAL to send bigger data
>> to Linux.
.../...
>
> Looking through the linux code, we should probably very carefully
> document the expected behaviour for larger messages and the impact.

Correct. It's already documented in OPAL:

    OPAL_PARTIAL
        If the pending opal message is greater than the supplied buffer.
        In this case the message is *DISCARDED* by OPAL.

...and this patch retains the above behaviour, i.e. we remove the entry
from the list and return OPAL_PARTIAL to the kernel.

> Specifically around what happens with OPAL_PARTIAL and ensuring we don't
> just print error messages out from the kernel forever.

It will print the error message once and move on, as OPAL removes the
entry from the list (same as the existing behaviour).

> Specifically, we seem to have two places where we consume messages in
> Linux:
>
> in opal.c:
>
> 	static void opal_handle_message(void)
> 	{
> 		....
> 		ret = opal_get_msg(__pa(&msg), sizeof(msg));
> 		/* No opal message pending. */
> 		if (ret == OPAL_RESOURCE)
> 			return;
>
> 		/* check for errors. */
> 		if (ret) {
> 			pr_warn("%s: Failed to retrieve opal message, err=%lld\n",
> 				__func__, ret);
> 			return;
> 		}
>
> So this gives me pause, as if we have a large message needing to be
> retrieved, existing kernels never will clear it, and spew pr_warns from
> here to eternity.
>
> and in opal-hmi.c:
>
> 	if (unrecoverable) {
> 		/* Pull all HMI events from OPAL before we panic. */
> 		while (opal_get_msg(__pa(&msg), sizeof(msg)) == OPAL_SUCCESS) {
> 			.....
> 			print_hmi_event_info(hmi_evt);
>
> So if we have a large event in the middle of a bunch of fatal HMI info,
> we'll lose the HMI event info.
>
> So I think we need to ensure two things:
> 1) we don't lose unrecoverable HMIs
> 2) existing kernels degrade gracefully.

I think both are taken care of.

> We can probably solve 1 by something only casually ugly (remove every
> event that isn't an unrecoverable HMI, or don't return OPAL_PARTIAL when
> there's an unrecoverable HMI event present).

This is not an issue because the existing interface handles HMI events
properly. And on old kernel / new OPAL, for bigger messages we will
return OPAL_PARTIAL *and* remove the message from the list. We will not
have an infinite loop, so we are good.

> For 2, I'm trying to think of what's good to do. My ideas go down two
> paths:
> - if a message has gotten OPAL_PARTIAL twice, drop it.
> - drop the message after the first OPAL_PARTIAL if the opal_get_msg size
>   parameter is sizeof(msg) rather than what we put in the device tree.

The only way for the kernel to know the maximum message size is through
the device tree. So it makes sense to drop the message if the kernel is
not passing a sufficient buffer (i.e. the old kernel / new OPAL
combination).

> Either way, we should very clearly document this (and the reasoning
> behind it) in the OPAL API docs.
>
> I haven't looked at what FreeBSD does, and we should probably do that,
> if only to get into the habit of doing so.
>
>> - Add a new variable "extended" to the "opal_msg_entry" structure to
>>   keep track of messages that have more than 64 bytes of data. We will
>>   allocate separate memory for these messages, and once the kernel
>>   consumes a message we will release that memory.
.../...
>
> I'm tempted to say we should switch to using the pool allocator for
> these and hard failing when we run out. This could come in a separate
> patch though.

Agreed. Makes sense to control the number of allocations. I will send a
separate patch for that.

-Vasant
Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:

> On 05/21/2019 12:50 PM, Stewart Smith wrote:
>> Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:
.../...
>> Looking through the linux code, we should probably very carefully
>> document the expected behaviour for larger messages and the impact.
>
> Correct. It's already documented in OPAL:
>
>     OPAL_PARTIAL
>         If the pending opal message is greater than the supplied buffer.
>         In this case the message is *DISCARDED* by OPAL.
>
> ...and this patch retains the above behaviour, i.e. we remove the entry
> from the list and return OPAL_PARTIAL to the kernel.

Ahh, excellent.

Oh look, some guy named Stewart wrote that nearly 3 years ago.

.../...
>> So I think we need to ensure two things:
>> 1) we don't lose unrecoverable HMIs
>> 2) existing kernels degrade gracefully.
>
> I think both are taken care of.

Has it been tested?

>> I'm tempted to say we should switch to using the pool allocator for
>> these and hard failing when we run out. This could come in a separate
>> patch though.
>
> Agreed. Makes sense to control the number of allocations. I will send a
> separate patch for that.

ok.
On 05/30/2019 09:10 AM, Stewart Smith wrote:
> Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:
>> On 05/21/2019 12:50 PM, Stewart Smith wrote:
>>> Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:
.../...
>>
>> OPAL_PARTIAL
>>     If the pending opal message is greater than the supplied buffer.
>>     In this case the message is *DISCARDED* by OPAL.
>>
>> ...and this patch retains the above behaviour, i.e. we remove the entry
>> from the list and return OPAL_PARTIAL to the kernel.
>
> Ahh, excellent.
>
> Oh look, some guy named Stewart wrote that nearly 3 years ago.

Yes :-)

.../...
>>> So I think we need to ensure two things:
>>> 1) we don't lose unrecoverable HMIs
>>> 2) existing kernels degrade gracefully.
>>
>> I think both are taken care of.
>
> Has it been tested?

Yes. I had a custom patch to inject a bigger message size, and tested with
both old and new kernels.

-Vasant
diff --git a/core/opal-msg.c b/core/opal-msg.c
index 907a9e0af..af1ec7d00 100644
--- a/core/opal-msg.c
+++ b/core/opal-msg.c
@@ -25,6 +25,7 @@
 struct opal_msg_entry {
 	struct list_node link;
 	void (*consumed)(void *data, int status);
+	bool extended;
 	void *data;
 	struct opal_msg msg;
 };
@@ -39,37 +40,47 @@ int _opal_queue_msg(enum opal_msg_type msg_type, void *data,
 		    size_t params_size, const void *params)
 {
 	struct opal_msg_entry *entry;
+	uint64_t entry_size;
+
+	if ((params_size + OPAL_MSG_HDR_SIZE) > OPAL_MSG_SIZE) {
+		prlog(PR_DEBUG, "param_size (0x%x) > opal_msg param size (0x%x)\n",
+		      (u32)params_size, (u32)(OPAL_MSG_SIZE - OPAL_MSG_HDR_SIZE));
+		return OPAL_PARAMETER;
+	}
 
 	lock(&opal_msg_lock);
 
-	entry = list_pop(&msg_free_list, struct opal_msg_entry, link);
-	if (!entry) {
-		prerror("No available node in the free list, allocating\n");
-		entry = zalloc(sizeof(struct opal_msg_entry));
+	if (params_size > OPAL_MSG_FIXED_PARAMS_SIZE) {
+		entry_size = sizeof(struct opal_msg_entry) + params_size;
+		entry_size -= OPAL_MSG_FIXED_PARAMS_SIZE;
+		entry = zalloc(entry_size);
+		if (entry)
+			entry->extended = true;
+	} else {
+		entry = list_pop(&msg_free_list, struct opal_msg_entry, link);
 		if (!entry) {
-			prerror("Allocation failed\n");
-			unlock(&opal_msg_lock);
-			return OPAL_RESOURCE;
+			prerror("No available node in the free list, allocating\n");
+			entry = zalloc(sizeof(struct opal_msg_entry));
 		}
 	}
 
+	if (!entry) {
+		prerror("Allocation failed\n");
+		unlock(&opal_msg_lock);
+		return OPAL_RESOURCE;
+	}
+
 	entry->consumed = consumed;
 	entry->data = data;
 	entry->msg.msg_type = cpu_to_be32(msg_type);
-
-	if (params_size > OPAL_MSG_FIXED_PARAMS_SIZE) {
-		prerror("Discarding extra parameters\n");
-		params_size = OPAL_MSG_FIXED_PARAMS_SIZE;
-	}
+	entry->msg.size = cpu_to_be32(params_size);
 	memcpy(entry->msg.params, params, params_size);
 
 	list_add_tail(&msg_pending_list, &entry->link);
 	opal_update_pending_evt(OPAL_EVENT_MSG_PENDING, OPAL_EVENT_MSG_PENDING);
-
 	unlock(&opal_msg_lock);
 
-	return 0;
+	return OPAL_SUCCESS;
 }
 
 static int64_t opal_get_msg(uint64_t *buffer, uint64_t size)
@@ -77,6 +88,8 @@ static int64_t opal_get_msg(uint64_t *buffer, uint64_t size)
 	struct opal_msg_entry *entry;
 	void (*callback)(void *data, int status);
 	void *data;
+	uint64_t msg_size;
+	int rc = OPAL_SUCCESS;
 
 	if (size < sizeof(struct opal_msg) || !buffer)
 		return OPAL_PARAMETER;
@@ -92,20 +105,37 @@ static int64_t opal_get_msg(uint64_t *buffer, uint64_t size)
 		return OPAL_RESOURCE;
 	}
 
-	memcpy(buffer, &entry->msg, sizeof(entry->msg));
+	msg_size = OPAL_MSG_HDR_SIZE + be32_to_cpu(entry->msg.size);
+	if (size < msg_size) {
+		/* Send partial data to Linux */
+		prlog(PR_NOTICE, "Sending partial data [msg_type : 0x%x, "
+		      "msg_size : 0x%x, buf_size : 0x%x]\n",
+		      be32_to_cpu(entry->msg.msg_type),
+		      (u32)msg_size, (u32)size);
+
+		entry->msg.size = cpu_to_be32(size - OPAL_MSG_HDR_SIZE);
+		msg_size = size;
+		rc = OPAL_PARTIAL;
+	}
+
+	memcpy((void *)buffer, (void *)&entry->msg, msg_size);
 	callback = entry->consumed;
 	data = entry->data;
 
-	list_add(&msg_free_list, &entry->link);
+	if (entry->extended)
+		free(entry);
+	else
+		list_add(&msg_free_list, &entry->link);
+
 	if (list_empty(&msg_pending_list))
 		opal_update_pending_evt(OPAL_EVENT_MSG_PENDING, 0);
 
 	unlock(&opal_msg_lock);
 
 	if (callback)
-		callback(data, OPAL_SUCCESS);
+		callback(data, rc);
 
-	return OPAL_SUCCESS;
+	return rc;
 }
 opal_call(OPAL_GET_MSG, opal_get_msg, 2);

diff --git a/core/test/run-msg.c b/core/test/run-msg.c
index 08e1a019b..f5f948ace 100644
--- a/core/test/run-msg.c
+++ b/core/test/run-msg.c
@@ -132,13 +132,13 @@ int main(void)
 	assert(r == 0);
 
 	assert(list_count(&msg_pending_list) == ++npending);
-	assert(list_count(&msg_free_list) == --nfree);
+	assert(list_count(&msg_free_list) == nfree);
 
 	r = opal_get_msg(m_ptr, sizeof(m));
-	assert(r == 0);
+	assert(r == OPAL_PARTIAL);
 
 	assert(list_count(&msg_pending_list) == --npending);
-	assert(list_count(&msg_free_list) == ++nfree);
+	assert(list_count(&msg_free_list) == nfree);
 
 	assert(m.params[0] == 0);
 	assert(m.params[1] == 1);

diff --git a/doc/opal-api/opal-messages.rst b/doc/opal-api/opal-messages.rst
index e4e813aad..25acaec2c 100644
--- a/doc/opal-api/opal-messages.rst
+++ b/doc/opal-api/opal-messages.rst
@@ -11,7 +11,7 @@ An opal_msg is: ::
 
    struct opal_msg {
 	__be32 msg_type;
-	__be32 reserved;
+	__be32 size;
 	__be64 params[8];
    };

diff --git a/include/opal-api.h b/include/opal-api.h
index e461c9d27..e15c5b89e 100644
--- a/include/opal-api.h
+++ b/include/opal-api.h
@@ -555,7 +555,7 @@ enum opal_msg_type {
 struct opal_msg {
 	__be32 msg_type;
-	__be32 reserved;
+	__be32 size;
 	__be64 params[8];
 };