Message ID | 20200520115139.20734-1-hegdevasant@linux.vnet.ibm.com |
---|---|
State | Superseded |
Headers | show |
Series | [1/3] uart: Drop console write data if BMC becomes unresponsive | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (0f1937ef40fca0c3212a9dff1010b832a24fb063) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm: > If BMC becomes unresponsive (ex: during BMC reboot) during console write > then we may get stuck in uart_wait_tx_room(). This will result in kernel > lockups and in some cases host becomes unresponsive. > > This patch retries for `tx_room` for predefined number of time. If we don't > get space within that time it will return zero. This will result in > dropping console message. Is the number of times important, or should it be time based? > > Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> > --- > hw/lpc-uart.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c > index 979a617c3..ed56a4bdc 100644 > --- a/hw/lpc-uart.c > +++ b/hw/lpc-uart.c > @@ -63,6 +63,8 @@ static void *mmio_uart_base; > static int uart_console_policy = UART_CONSOLE_OPAL; > static int lpc_irq = -1; > > +#define UART_WRITE_MAX_RETRIES 0x100 > + > void uart_set_console_policy(int policy) > { > uart_console_policy = policy; > @@ -106,6 +108,8 @@ static void uart_check_tx_room(void) > > static void uart_wait_tx_room(void) > { > + int retry_cnt = 0; > + > while (!tx_room) { > uart_check_tx_room(); > if (!tx_room) { > @@ -113,6 +117,10 @@ static void uart_wait_tx_room(void) > do { > barrier(); > uart_check_tx_room(); > + > + retry_cnt++; > + if (retry_cnt > UART_WRITE_MAX_RETRIES) > + return; > } while (!tx_room); > smt_medium(); Must restore SMT priority to medium. Thanks, Nick
On 5/20/20 6:32 PM, Nicholas Piggin wrote: > Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm: >> If BMC becomes unresponsive (ex: during BMC reboot) during console write >> then we may get stuck in uart_wait_tx_room(). This will result in kernel >> lockups and in some cases host becomes unresponsive. >> >> This patch retries for `tx_room` for predefined number of time. If we don't >> get space within that time it will return zero. This will result in >> dropping console message. > > Is the number of times important, or should it be time based? It can be time base as well. May be I will introduce timeout option in v2. > >> >> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> >> --- >> hw/lpc-uart.c | 8 ++++++++ >> 1 file changed, 8 insertions(+) >> >> diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c >> index 979a617c3..ed56a4bdc 100644 >> --- a/hw/lpc-uart.c >> +++ b/hw/lpc-uart.c >> @@ -63,6 +63,8 @@ static void *mmio_uart_base; >> static int uart_console_policy = UART_CONSOLE_OPAL; >> static int lpc_irq = -1; >> >> +#define UART_WRITE_MAX_RETRIES 0x100 >> + >> void uart_set_console_policy(int policy) >> { >> uart_console_policy = policy; >> @@ -106,6 +108,8 @@ static void uart_check_tx_room(void) >> >> static void uart_wait_tx_room(void) >> { >> + int retry_cnt = 0; >> + >> while (!tx_room) { >> uart_check_tx_room(); >> if (!tx_room) { >> @@ -113,6 +117,10 @@ static void uart_wait_tx_room(void) >> do { >> barrier(); >> uart_check_tx_room(); >> + >> + retry_cnt++; >> + if (retry_cnt > UART_WRITE_MAX_RETRIES) >> + return; >> } while (!tx_room); >> smt_medium(); > > Must restore SMT priority to medium. Oh yes. I missed it. Will fix. Thanks! -Vasant
On Thu, May 21, 2020 at 1:29 AM Vasant Hegde <hegdevasant@linux.vnet.ibm.com> wrote: > > On 5/20/20 6:32 PM, Nicholas Piggin wrote: > > Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm: > >> If BMC becomes unresponsive (ex: during BMC reboot) during console write > >> then we may get stuck in uart_wait_tx_room(). This will result in kernel > >> lockups and in some cases host becomes unresponsive. > >> > >> This patch retries for `tx_room` for predefined number of time. If we don't > >> get space within that time it will return zero. This will result in > >> dropping console message. > > > > Is the number of times important, or should it be time based? > > It can be time base as well. May be I will introduce timeout option in v2. I think it needs to be time based. The CPU is relatively fast compared to the rate at which the UART sends out characters. The slowness of the LPC bus might help, but it seem possible that we could loop 256 times before space becomes available even if the UART was functoning normally.
diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c index 979a617c3..ed56a4bdc 100644 --- a/hw/lpc-uart.c +++ b/hw/lpc-uart.c @@ -63,6 +63,8 @@ static void *mmio_uart_base; static int uart_console_policy = UART_CONSOLE_OPAL; static int lpc_irq = -1; +#define UART_WRITE_MAX_RETRIES 0x100 + void uart_set_console_policy(int policy) { uart_console_policy = policy; @@ -106,6 +108,8 @@ static void uart_check_tx_room(void) static void uart_wait_tx_room(void) { + int retry_cnt = 0; + while (!tx_room) { uart_check_tx_room(); if (!tx_room) { @@ -113,6 +117,10 @@ static void uart_wait_tx_room(void) do { barrier(); uart_check_tx_room(); + + retry_cnt++; + if (retry_cnt > UART_WRITE_MAX_RETRIES) + return; } while (!tx_room); smt_medium(); }
If BMC becomes unresponsive (ex: during BMC reboot) during console write then we may get stuck in uart_wait_tx_room(). This will result in kernel lockups and in some cases host becomes unresponsive. This patch retries for `tx_room` for predefined number of time. If we don't get space within that time it will return zero. This will result in dropping console message. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> --- hw/lpc-uart.c | 8 ++++++++ 1 file changed, 8 insertions(+)