Message ID | 20170609171905.6403-1-hegdevasant@linux.vnet.ibm.com |
---|---|
State | Accepted |
Vasant Hegde <hegdevasant@linux.vnet.ibm.com> writes:
> We use a TCE mapped area to write data to the console. The console
> header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
> the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out
> pointer).
>
> The kernel makes the opal_console_write() OPAL call to write data to
> the console. OPAL writes the data to the TCE mapped area and sends an
> MBOX command to FSP. If our console becomes full and we have data to
> write to the console, we keep on waiting until FSP reads the data.
>
> In some corner cases, where FSP is active but not responding to the
> console MBOX message (due to buggy IPMI) and we have heavy console
> writes happening from the kernel, our console buffer eventually
> becomes full. At this point OPAL starts returning OPAL_BUSY_EVENT to
> the kernel. The kernel will keep on retrying. This creates kernel soft
> lockups. In some extreme cases, when every CPU is trying to write to
> the console, the user will not be able to ssh in and will think the
> system has hung.
>
> If we reset FSP or restart the IPMI daemon on FSP, the system recovers
> and everything returns to normal.
>
> This patch adds a workaround for the above issue by returning
> OPAL_HARDWARE when the console is full. A side effect of this patch is
> that we may end up dropping the latest console data. But it is better
> to drop console data than to hang the system.
>
> An alternative approach is to drop old data from the console buffer to
> make space for new data. But in normal conditions only FSP can update
> the 'next_out' pointer, and if we touch that pointer, it may introduce
> other race conditions. Hence we decided to just drop the new console
> write request.
>
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> @Vaidy, Stewart,
> As suggested, I've added an error log message. As Vaidy suggested, it
> may not be a good idea to reset FSP. Hence I'm not initiating a Host
> Initiated Reset.
>
> Also I've retained Vaidy's Ack from V1.
> -Vasant

Okay... let's see how this goes from a practical sense (it's certainly
the simplest solution). It's managed to survive a bunch of
op-test-framework tests, which is more than can be said for some service
processors' console implementations.

Merged to master as of c8a7535f3539c79955645e6b3714b367a994b1e9
and 5.4.x as of 316f99bdb4e0911c2d3970a8ca23f30101dba57a
diff --git a/hw/fsp/fsp-console.c b/hw/fsp/fsp-console.c
index fd67b20..2ba879b 100644
--- a/hw/fsp/fsp-console.c
+++ b/hw/fsp/fsp-console.c
@@ -26,6 +26,11 @@
 #include <timebase.h>
 #include <device.h>
 #include <fsp-sysparam.h>
+#include <errorlog.h>
+
+DEFINE_LOG_ENTRY(OPAL_RC_CONSOLE_HANG, OPAL_PLATFORM_ERR_EVT, OPAL_CONSOLE,
+		 OPAL_PLATFORM_FIRMWARE,
+		 OPAL_PREDICTIVE_ERR_GENERAL, OPAL_NA);
 
 struct fsp_serbuf_hdr {
 	u16	partition_id;
@@ -610,7 +615,18 @@ static int64_t fsp_console_write(int64_t term_number, int64_t *length,
 	*length = written;
 	unlock(&fsp_con_lock);
 
-	return written ? OPAL_SUCCESS : OPAL_BUSY_EVENT;
+	if (written)
+		return OPAL_SUCCESS;
+
+	/*
+	 * FSP is still active but not reading console data. Hence
+	 * our console buffer became full. Most likely IPMI daemon
+	 * on FSP is buggy. Lets log error and return OPAL_HARDWARE
+	 * to payload (Linux).
+	 */
+	log_simple_error(&e_info(OPAL_RC_CONSOLE_HANG), "FSPCON: Console "
+			 "buffer is full, dropping console data\n");
+	return OPAL_HARDWARE;
 }
 
 static int64_t fsp_console_write_buffer_space(int64_t term_number,
diff --git a/include/errorlog.h b/include/errorlog.h
index e9d5ad8..285c185 100644
--- a/include/errorlog.h
+++ b/include/errorlog.h
@@ -332,6 +332,9 @@ enum opal_reasoncode {
 /* Platform error */
 	OPAL_RC_ABNORMAL_REBOOT	= OPAL_SRC_COMPONENT_CEC | 0x10,
+
+/* FSP console */
+	OPAL_RC_CONSOLE_HANG	= OPAL_SRC_COMPONENT_CONSOLE | 0x10,
 };
 
 #define DEFINE_LOG_ENTRY(reason, type, id, subsys, \
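
For readers unfamiliar with the buffer layout, the behaviour change in fsp_console_write() can be modelled with a small standalone sketch. This is not the skiboot code: every name prefixed with sketch_ or SKETCH_ is hypothetical, the status codes are placeholders rather than the real OPAL values, and locking, the TCE mapping and the MBOX notification are all left out. It only illustrates the idea from the commit message: the producer advances next_in, only the consumer advances next_out, and a write that cannot place a single byte reports a hard error instead of asking the caller to retry.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder status codes; the real ones are defined by the OPAL API. */
enum sketch_rc { SKETCH_SUCCESS = 0, SKETCH_HARDWARE = -1 };

#define SKETCH_BUF_SIZE 16u

/* Simplified stand-in for the shared console buffer and its header. */
struct sketch_serbuf {
	uint16_t next_in;	/* producer index, advanced by the firmware */
	uint16_t next_out;	/* consumer index, advanced only by the FSP */
	uint8_t data[SKETCH_BUF_SIZE];
};

/* Free bytes; one slot stays unused so "full" differs from "empty". */
static size_t sketch_space(const struct sketch_serbuf *sb)
{
	return (sb->next_out + SKETCH_BUF_SIZE - sb->next_in - 1) %
		SKETCH_BUF_SIZE;
}

/*
 * Copy as much of buf as fits and report the count through *length.
 * If nothing could be written, return SKETCH_HARDWARE so the caller
 * drops the data rather than spinning on a retry. next_out is never
 * touched here, matching the decision not to discard old data.
 */
static int sketch_console_write(struct sketch_serbuf *sb,
				const uint8_t *buf, size_t *length)
{
	size_t space = sketch_space(sb);
	size_t written = *length < space ? *length : space;
	size_t i;

	for (i = 0; i < written; i++) {
		sb->data[sb->next_in] = buf[i];
		sb->next_in = (uint16_t)((sb->next_in + 1) % SKETCH_BUF_SIZE);
	}

	*length = written;
	if (written)
		return SKETCH_SUCCESS;
	return SKETCH_HARDWARE;
}
```

With the old return value, a caller that loops while the call reports "busy" would spin for as long as the consumer stays silent; with the hard error it can give up on that write and move on, at the cost of losing the newest data.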