diff mbox series

[1/3] uart: Drop console write data if BMC becomes unresponsive

Message ID 20200520115139.20734-1-hegdevasant@linux.vnet.ibm.com
State Superseded
Headers show
Series [1/3] uart: Drop console write data if BMC becomes unresponsive | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (0f1937ef40fca0c3212a9dff1010b832a24fb063)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco success Signed-off-by present

Commit Message

Vasant Hegde May 20, 2020, 11:51 a.m. UTC
If BMC becomes unresponsive (ex: during BMC reboot) during console write
then we may get stuck in uart_wait_tx_room(). This will result in kernel
lockups and in some cases host becomes unresponsive.

This patch retries for `tx_room` for predefined number of time. If we don't
get space within that time it will return zero. This will result in
dropping console message.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
---
 hw/lpc-uart.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Nicholas Piggin May 20, 2020, 1:02 p.m. UTC | #1
Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm:
> If BMC becomes unresponsive (ex: during BMC reboot) during console write
> then we may get stuck in uart_wait_tx_room(). This will result in kernel
> lockups and in some cases host becomes unresponsive.
> 
> This patch retries for `tx_room` for predefined number of time. If we don't
> get space within that time it will return zero. This will result in
> dropping console message.

Is the number of times important, or should it be time based?

> 
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> ---
>  hw/lpc-uart.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c
> index 979a617c3..ed56a4bdc 100644
> --- a/hw/lpc-uart.c
> +++ b/hw/lpc-uart.c
> @@ -63,6 +63,8 @@ static void *mmio_uart_base;
>  static int uart_console_policy = UART_CONSOLE_OPAL;
>  static int lpc_irq = -1;
>  
> +#define UART_WRITE_MAX_RETRIES	0x100
> +
>  void uart_set_console_policy(int policy)
>  {
>  	uart_console_policy = policy;
> @@ -106,6 +108,8 @@ static void uart_check_tx_room(void)
>  
>  static void uart_wait_tx_room(void)
>  {
> +	int retry_cnt = 0;
> +
>  	while (!tx_room) {
>  		uart_check_tx_room();
>  		if (!tx_room) {
> @@ -113,6 +117,10 @@ static void uart_wait_tx_room(void)
>  			do {
>  				barrier();
>  				uart_check_tx_room();
> +
> +				retry_cnt++;
> +				if (retry_cnt > UART_WRITE_MAX_RETRIES)
> +					return;
>  			} while (!tx_room);
>  			smt_medium();

Must restore SMT priority to medium.

Thanks,
Nick
Vasant Hegde May 20, 2020, 3:28 p.m. UTC | #2
On 5/20/20 6:32 PM, Nicholas Piggin wrote:
> Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm:
>> If BMC becomes unresponsive (ex: during BMC reboot) during console write
>> then we may get stuck in uart_wait_tx_room(). This will result in kernel
>> lockups and in some cases host becomes unresponsive.
>>
>> This patch retries for `tx_room` for predefined number of time. If we don't
>> get space within that time it will return zero. This will result in
>> dropping console message.
> 
> Is the number of times important, or should it be time based?

It can be time base as well. May be I will introduce timeout option in v2.
> 
>>
>> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
>> ---
>>   hw/lpc-uart.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c
>> index 979a617c3..ed56a4bdc 100644
>> --- a/hw/lpc-uart.c
>> +++ b/hw/lpc-uart.c
>> @@ -63,6 +63,8 @@ static void *mmio_uart_base;
>>   static int uart_console_policy = UART_CONSOLE_OPAL;
>>   static int lpc_irq = -1;
>>   
>> +#define UART_WRITE_MAX_RETRIES	0x100
>> +
>>   void uart_set_console_policy(int policy)
>>   {
>>   	uart_console_policy = policy;
>> @@ -106,6 +108,8 @@ static void uart_check_tx_room(void)
>>   
>>   static void uart_wait_tx_room(void)
>>   {
>> +	int retry_cnt = 0;
>> +
>>   	while (!tx_room) {
>>   		uart_check_tx_room();
>>   		if (!tx_room) {
>> @@ -113,6 +117,10 @@ static void uart_wait_tx_room(void)
>>   			do {
>>   				barrier();
>>   				uart_check_tx_room();
>> +
>> +				retry_cnt++;
>> +				if (retry_cnt > UART_WRITE_MAX_RETRIES)
>> +					return;
>>   			} while (!tx_room);
>>   			smt_medium();
> 
> Must restore SMT priority to medium.

Oh yes. I missed it. Will fix.

Thanks!

-Vasant
Oliver O'Halloran May 21, 2020, 5:19 a.m. UTC | #3
On Thu, May 21, 2020 at 1:29 AM Vasant Hegde
<hegdevasant@linux.vnet.ibm.com> wrote:
>
> On 5/20/20 6:32 PM, Nicholas Piggin wrote:
> > Excerpts from Vasant Hegde's message of May 20, 2020 9:51 pm:
> >> If BMC becomes unresponsive (ex: during BMC reboot) during console write
> >> then we may get stuck in uart_wait_tx_room(). This will result in kernel
> >> lockups and in some cases host becomes unresponsive.
> >>
> >> This patch retries for `tx_room` for predefined number of time. If we don't
> >> get space within that time it will return zero. This will result in
> >> dropping console message.
> >
> > Is the number of times important, or should it be time based?
>
> It can be time base as well. May be I will introduce timeout option in v2.

I think it needs to be time based. The CPU is relatively fast compared
to the rate at which the UART sends out characters.  The slowness of
the LPC bus might help, but it seem possible that we could loop 256
times before space becomes available even if the UART was functoning
normally.
diff mbox series

Patch

diff --git a/hw/lpc-uart.c b/hw/lpc-uart.c
index 979a617c3..ed56a4bdc 100644
--- a/hw/lpc-uart.c
+++ b/hw/lpc-uart.c
@@ -63,6 +63,8 @@  static void *mmio_uart_base;
 static int uart_console_policy = UART_CONSOLE_OPAL;
 static int lpc_irq = -1;
 
+#define UART_WRITE_MAX_RETRIES	0x100
+
 void uart_set_console_policy(int policy)
 {
 	uart_console_policy = policy;
@@ -106,6 +108,8 @@  static void uart_check_tx_room(void)
 
 static void uart_wait_tx_room(void)
 {
+	int retry_cnt = 0;
+
 	while (!tx_room) {
 		uart_check_tx_room();
 		if (!tx_room) {
@@ -113,6 +117,10 @@  static void uart_wait_tx_room(void)
 			do {
 				barrier();
 				uart_check_tx_room();
+
+				retry_cnt++;
+				if (retry_cnt > UART_WRITE_MAX_RETRIES)
+					return;
 			} while (!tx_room);
 			smt_medium();
 		}