diff mbox

[v5] powerpc/pseries: Limit EPOW reset event warnings

Message ID 1447836157-19599-1-git-send-email-vipin@linux.vnet.ibm.com (mailing list archive)
State Superseded
Headers show

Commit Message

Vipin K Parashar Nov. 18, 2015, 8:42 a.m. UTC
Kernel prints respective warnings about various EPOW events for
user information/action after parsing EPOW interrupts. At times
below EPOW reset event warning is seen to be flooding kernel log
over a period of time.

May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared

These EPOW reset events are spurious in nature and are triggered by
firmware witout an actual EPOW event being reset. This patch avoids these
multiple EPOW reset warnings by using a counter variable. This variable
is incremented every time an EPOW event is reported. Upon receiving a EPOW
reset event the same variable is checked to filer out spurious events and
decremented accordingly.

This patch also improves log messages to better describe EPOW event being
reported. Merged adjacent log messages into single one to reduce number of
lines printed per event.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
---
v5 changes:
   - Used num_epow_events counter variable to count number of epow_events
   - Improved log messages to better describe epow event.
   - Merged adjacent warnings into single lines.

v4 changes:
   - Changed the approach to depth counter to match the EPOW events and
     EPOW reset.
   - Converted pr_err() ot pr_info() for non-critical errors.
   - Merged adjacent warnings into single line across the file.
   - Fixed grammar in the warnings to make is short.

v3 changes:
   - Limit warning printed by EPOW RESET event, by guarding it with bool flag.
     Instead of rate limiting all the EPOW events.

v2 changes:
   - Merged multiple adjacent pr_err/pr_emerg into single line to reduce multi-line
     warnings, based on Michael's comments.

 arch/powerpc/platforms/pseries/ras.c | 54 ++++++++++++++++++++----------------
 1 file changed, 30 insertions(+), 24 deletions(-)

Comments

Vasant Hegde Nov. 26, 2015, 9:20 a.m. UTC | #1
On 11/18/2015 02:12 PM, Vipin K Parashar wrote:
> Kernel prints respective warnings about various EPOW events for
> user information/action after parsing EPOW interrupts. At times
> below EPOW reset event warning is seen to be flooding kernel log
> over a period of time.
> 
> May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
> May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
> May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
> May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
> May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
> May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
> May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
> 

@Michael,
 I think above log is raising some concern. We have been asked by multiple
people on this. Hence I think we should avoid these duplicate messages.


> These EPOW reset events are spurious in nature and are triggered by
> firmware witout an actual EPOW event being reset. This patch avoids these

s/witout/without/

> multiple EPOW reset warnings by using a counter variable. This variable
> is incremented every time an EPOW event is reported. Upon receiving a EPOW
> reset event the same variable is checked to filer out spurious events and
> decremented accordingly.
> 
> This patch also improves log messages to better describe EPOW event being
> reported. Merged adjacent log messages into single one to reduce number of
> lines printed per event.
> 
> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
> Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
> ---
> v5 changes:
>    - Used num_epow_events counter variable to count number of epow_events
>    - Improved log messages to better describe epow event.
>    - Merged adjacent warnings into single lines.
> 
> v4 changes:
>    - Changed the approach to depth counter to match the EPOW events and
>      EPOW reset.
>    - Converted pr_err() ot pr_info() for non-critical errors.
>    - Merged adjacent warnings into single line across the file.
>    - Fixed grammar in the warnings to make is short.
> 
> v3 changes:
>    - Limit warning printed by EPOW RESET event, by guarding it with bool flag.
>      Instead of rate limiting all the EPOW events.
> 
> v2 changes:
>    - Merged multiple adjacent pr_err/pr_emerg into single line to reduce multi-line
>      warnings, based on Michael's comments.
> 
>  arch/powerpc/platforms/pseries/ras.c | 54 ++++++++++++++++++++----------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 3b6647e..bbe2856 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -40,6 +40,8 @@ static int ras_check_exception_token;
>  #define EPOW_SENSOR_TOKEN	9
>  #define EPOW_SENSOR_INDEX	0
>  
> +static int num_epow_events;
> +
>  static irqreturn_t ras_epow_interrupt(int irq, void *dev_id);
>  static irqreturn_t ras_error_interrupt(int irq, void *dev_id);
>  
> @@ -82,32 +84,30 @@ static void handle_system_shutdown(char event_modifier)
>  {
>  	switch (event_modifier) {
>  	case EPOW_SHUTDOWN_NORMAL:
> -		pr_emerg("Firmware initiated power off");
> +		pr_emerg("Power off requested\n");

Why are you changing this  message? These are FW initiated Power off and helps
us to identify who initiated shutdown request.

>  		orderly_poweroff(true);
>  		break;
>  
>  	case EPOW_SHUTDOWN_ON_UPS:
> -		pr_emerg("Loss of power reported by firmware, system is "
> -			"running on UPS/battery");
> -		pr_emerg("Check RTAS error log for details");
> +		pr_emerg("Loss of system power detected. System is running on"
> +			 " UPS/battery. Check RTAS error log for details\n");
>  		orderly_poweroff(true);
>  		break;
>  
>  	case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS:
> -		pr_emerg("Loss of system critical functions reported by "
> -			"firmware");
> -		pr_emerg("Check RTAS error log for details");
> +		pr_emerg("Loss of system critical functions detected. Check"
> +			 " RTAS error log for details\n");
>  		orderly_poweroff(true);
>  		break;
>  
>  	case EPOW_SHUTDOWN_AMBIENT_TEMPERATURE_TOO_HIGH:
> -		pr_emerg("Ambient temperature too high reported by firmware");
> -		pr_emerg("Check RTAS error log for details");
> +		pr_emerg("High ambient temperature detected. Check RTAS"
> +			 " error log for details\n");
>  		orderly_poweroff(true);
>  		break;
>  
>  	default:
> -		pr_err("Unknown power/cooling shutdown event (modifier %d)",
> +		pr_err("Unknown power/cooling shutdown event (modifier = %d)\n",
>  			event_modifier);
>  	}
>  }
> @@ -145,40 +145,47 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log)
>  
>  	switch (action_code) {
>  	case EPOW_RESET:
> -		pr_err("Non critical power or cooling issue cleared");
> +		if (num_epow_events) {
> +			pr_info("Non critical power/cooling issue cleared\n");
> +			num_epow_events--;
> +		}
>  		break;
>  
>  	case EPOW_WARN_COOLING:
> -		pr_err("Non critical cooling issue reported by firmware");
> -		pr_err("Check RTAS error log for details");
> +		pr_info("Non-critical cooling issue detected. Check RTAS error"
> +			" log for details\n");
> +		num_epow_events++;

So every switch-case you are modifying this variable. Not sure this is the best way.
How about adding if condition after switch to modify this variable ?

-Vasant
Vipin K Parashar Nov. 30, 2015, 12:01 p.m. UTC | #2
On Thursday 26 November 2015 02:50 PM, Vasant Hegde wrote:
> On 11/18/2015 02:12 PM, Vipin K Parashar wrote:
>> Kernel prints respective warnings about various EPOW events for
>> user information/action after parsing EPOW interrupts. At times
>> below EPOW reset event warning is seen to be flooding kernel log
>> over a period of time.
>>
>> May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
>> May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
>> May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
>> May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
>> May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
>> May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
>> May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
>>
> @Michael,
>   I think above log is raising some concern. We have been asked by multiple
> people on this. Hence I think we should avoid these duplicate messages.

Hi Michael,
          Please do let know if you have some suggestions with this patch.

>
>> These EPOW reset events are spurious in nature and are triggered by
>> firmware witout an actual EPOW event being reset. This patch avoids these
> s/witout/without/

sure, will edit.

>
>> multiple EPOW reset warnings by using a counter variable. This variable
>> is incremented every time an EPOW event is reported. Upon receiving a EPOW
>> reset event the same variable is checked to filer out spurious events and
>> decremented accordingly.
>>
>> This patch also improves log messages to better describe EPOW event being
>> reported. Merged adjacent log messages into single one to reduce number of
>> lines printed per event.
>>
>> Signed-off-by: Kamalesh Babulal<kamalesh@linux.vnet.ibm.com>
>> Signed-off-by: Vipin K Parashar<vipin@linux.vnet.ibm.com>
>> ---
>> v5 changes:
>>     - Used num_epow_events counter variable to count number of epow_events
>>     - Improved log messages to better describe epow event.
>>     - Merged adjacent warnings into single lines.
>>
>> v4 changes:
>>     - Changed the approach to depth counter to match the EPOW events and
>>       EPOW reset.
>>     - Converted pr_err() ot pr_info() for non-critical errors.
>>     - Merged adjacent warnings into single line across the file.
>>     - Fixed grammar in the warnings to make is short.
>>
>> v3 changes:
>>     - Limit warning printed by EPOW RESET event, by guarding it with bool flag.
>>       Instead of rate limiting all the EPOW events.
>>
>> v2 changes:
>>     - Merged multiple adjacent pr_err/pr_emerg into single line to reduce multi-line
>>       warnings, based on Michael's comments.
>>
>>   arch/powerpc/platforms/pseries/ras.c | 54 ++++++++++++++++++++----------------
>>   1 file changed, 30 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index 3b6647e..bbe2856 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -40,6 +40,8 @@ static int ras_check_exception_token;
>>   #define EPOW_SENSOR_TOKEN	9
>>   #define EPOW_SENSOR_INDEX	0
>>   
>> +static int num_epow_events;
>> +
>>   static irqreturn_t ras_epow_interrupt(int irq, void *dev_id);
>>   static irqreturn_t ras_error_interrupt(int irq, void *dev_id);
>>   
>> @@ -82,32 +84,30 @@ static void handle_system_shutdown(char event_modifier)
>>   {
>>   	switch (event_modifier) {
>>   	case EPOW_SHUTDOWN_NORMAL:
>> -		pr_emerg("Firmware initiated power off");
>> +		pr_emerg("Power off requested\n");
> Why are you changing this  message? These are FW initiated Power off and helps
> us to identify who initiated shutdown request.

EPOW_SHUTDOWN_NORMAL event maps to DPO event in harwdare, which is received
upon system admin requesting LPAR poweroff. I felt that using FW 
initiated poweroff
phrase doesn't convey that poweroff was requested by a user so changed it.
        Please do suggest if you have something better to convey message.

>>   		orderly_poweroff(true);
>>   		break;
>>   
>>   	case EPOW_SHUTDOWN_ON_UPS:
>> -		pr_emerg("Loss of power reported by firmware, system is "
>> -			"running on UPS/battery");
>> -		pr_emerg("Check RTAS error log for details");
>> +		pr_emerg("Loss of system power detected. System is running on"
>> +			 " UPS/battery. Check RTAS error log for details\n");
>>   		orderly_poweroff(true);
>>   		break;
>>   
>>   	case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS:
>> -		pr_emerg("Loss of system critical functions reported by "
>> -			"firmware");
>> -		pr_emerg("Check RTAS error log for details");
>> +		pr_emerg("Loss of system critical functions detected. Check"
>> +			 " RTAS error log for details\n");
>>   		orderly_poweroff(true);
>>   		break;
>>   
>>   	case EPOW_SHUTDOWN_AMBIENT_TEMPERATURE_TOO_HIGH:
>> -		pr_emerg("Ambient temperature too high reported by firmware");
>> -		pr_emerg("Check RTAS error log for details");
>> +		pr_emerg("High ambient temperature detected. Check RTAS"
>> +			 " error log for details\n");
>>   		orderly_poweroff(true);
>>   		break;
>>   
>>   	default:
>> -		pr_err("Unknown power/cooling shutdown event (modifier %d)",
>> +		pr_err("Unknown power/cooling shutdown event (modifier = %d)\n",
>>   			event_modifier);
>>   	}
>>   }
>> @@ -145,40 +145,47 @@ static void rtas_parse_epow_errlog(struct rtas_error_log *log)
>>   
>>   	switch (action_code) {
>>   	case EPOW_RESET:
>> -		pr_err("Non critical power or cooling issue cleared");
>> +		if (num_epow_events) {
>> +			pr_info("Non critical power/cooling issue cleared\n");
>> +			num_epow_events--;
>> +		}
>>   		break;
>>   
>>   	case EPOW_WARN_COOLING:
>> -		pr_err("Non critical cooling issue reported by firmware");
>> -		pr_err("Check RTAS error log for details");
>> +		pr_info("Non-critical cooling issue detected. Check RTAS error"
>> +			" log for details\n");
>> +		num_epow_events++;
> So every switch-case you are modifying this variable. Not sure this is the best way.
> How about adding if condition after switch to modify this variable ?

yeah i thought about it but dropped the idea as the variable isn't being 
incrementing
for all switch cases. Its only getting incremented for cases when we 
aren't doing system
poweroff. Though i can send out a patch with a single increment of the 
variable down
below the switch case as incrementing it for poweroff cases too would 
not no harm.

> -Vasant
>
Michael Ellerman Dec. 1, 2015, 3:46 a.m. UTC | #3
On Mon, 2015-11-30 at 17:31 +0530, Vipin K Parashar wrote:
> On Thursday 26 November 2015 02:50 PM, Vasant Hegde wrote:
> > On 11/18/2015 02:12 PM, Vipin K Parashar wrote:
> > > Kernel prints respective warnings about various EPOW events for
> > > user information/action after parsing EPOW interrupts. At times
> > > below EPOW reset event warning is seen to be flooding kernel log
> > > over a period of time.
> > > 
> > > May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
> > > May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
> > > May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
> > > 
> > @Michael,
> >   I think above log is raising some concern. We have been asked by multiple
> > people on this. Hence I think we should avoid these duplicate messages.
> 
> Hi Michael,
>           Please do let know if you have some suggestions with this patch.


It seems OK to me. I actually had it in my testing tree and was about to merge
it when Vasant replied with comments.

So if you two can agree on a final patch I'll merge it.

cheers
Vipin K Parashar Dec. 1, 2015, 9:11 a.m. UTC | #4
On Tuesday 01 December 2015 09:16 AM, Michael Ellerman wrote:
> On Mon, 2015-11-30 at 17:31 +0530, Vipin K Parashar wrote:
>> On Thursday 26 November 2015 02:50 PM, Vasant Hegde wrote:
>>> On 11/18/2015 02:12 PM, Vipin K Parashar wrote:
>>>> Kernel prints respective warnings about various EPOW events for
>>>> user information/action after parsing EPOW interrupts. At times
>>>> below EPOW reset event warning is seen to be flooding kernel log
>>>> over a period of time.
>>>>
>>>> May 25 03:46:34 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 03:46:52 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 03:53:48 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 03:55:46 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 03:56:34 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 03:59:04 alp kernel: Non critical power or cooling issue cleared
>>>> May 25 04:02:01 alp kernel: Non critical power or cooling issue cleared
>>>>
>>> @Michael,
>>>    I think above log is raising some concern. We have been asked by multiple
>>> people on this. Hence I think we should avoid these duplicate messages.
>> Hi Michael,
>>            Please do let know if you have some suggestions with this patch.
>
> It seems OK to me. I actually had it in my testing tree and was about to merge
> it when Vasant replied with comments.
>
> So if you two can agree on a final patch I'll merge it.

Thanks Michael, I will send out next revision for this addressing 
Vasant's comment.

>
> cheers
>
diff mbox

Patch

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 3b6647e..bbe2856 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -40,6 +40,8 @@  static int ras_check_exception_token;
 #define EPOW_SENSOR_TOKEN	9
 #define EPOW_SENSOR_INDEX	0
 
+static int num_epow_events;
+
 static irqreturn_t ras_epow_interrupt(int irq, void *dev_id);
 static irqreturn_t ras_error_interrupt(int irq, void *dev_id);
 
@@ -82,32 +84,30 @@  static void handle_system_shutdown(char event_modifier)
 {
 	switch (event_modifier) {
 	case EPOW_SHUTDOWN_NORMAL:
-		pr_emerg("Firmware initiated power off");
+		pr_emerg("Power off requested\n");
 		orderly_poweroff(true);
 		break;
 
 	case EPOW_SHUTDOWN_ON_UPS:
-		pr_emerg("Loss of power reported by firmware, system is "
-			"running on UPS/battery");
-		pr_emerg("Check RTAS error log for details");
+		pr_emerg("Loss of system power detected. System is running on"
+			 " UPS/battery. Check RTAS error log for details\n");
 		orderly_poweroff(true);
 		break;
 
 	case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS:
-		pr_emerg("Loss of system critical functions reported by "
-			"firmware");
-		pr_emerg("Check RTAS error log for details");
+		pr_emerg("Loss of system critical functions detected. Check"
+			 " RTAS error log for details\n");
 		orderly_poweroff(true);
 		break;
 
 	case EPOW_SHUTDOWN_AMBIENT_TEMPERATURE_TOO_HIGH:
-		pr_emerg("Ambient temperature too high reported by firmware");
-		pr_emerg("Check RTAS error log for details");
+		pr_emerg("High ambient temperature detected. Check RTAS"
+			 " error log for details\n");
 		orderly_poweroff(true);
 		break;
 
 	default:
-		pr_err("Unknown power/cooling shutdown event (modifier %d)",
+		pr_err("Unknown power/cooling shutdown event (modifier = %d)\n",
 			event_modifier);
 	}
 }
@@ -145,40 +145,47 @@  static void rtas_parse_epow_errlog(struct rtas_error_log *log)
 
 	switch (action_code) {
 	case EPOW_RESET:
-		pr_err("Non critical power or cooling issue cleared");
+		if (num_epow_events) {
+			pr_info("Non critical power/cooling issue cleared\n");
+			num_epow_events--;
+		}
 		break;
 
 	case EPOW_WARN_COOLING:
-		pr_err("Non critical cooling issue reported by firmware");
-		pr_err("Check RTAS error log for details");
+		pr_info("Non-critical cooling issue detected. Check RTAS error"
+			" log for details\n");
+		num_epow_events++;
 		break;
 
 	case EPOW_WARN_POWER:
-		pr_err("Non critical power issue reported by firmware");
-		pr_err("Check RTAS error log for details");
+		pr_info("Non-critical power issue detected. Check RTAS error"
+			" log for details\n");
+		num_epow_events++;
 		break;
 
 	case EPOW_SYSTEM_SHUTDOWN:
 		handle_system_shutdown(epow_log->event_modifier);
+		num_epow_events++;
 		break;
 
 	case EPOW_SYSTEM_HALT:
-		pr_emerg("Firmware initiated power off");
+		pr_emerg("Critical power/cooling issue detected. Check RTAS"
+			 " error log for details. Powering off.\n");
 		orderly_poweroff(true);
 		break;
 
 	case EPOW_MAIN_ENCLOSURE:
 	case EPOW_POWER_OFF:
-		pr_emerg("Critical power/cooling issue reported by firmware");
-		pr_emerg("Check RTAS error log for details");
-		pr_emerg("Immediate power off");
+		pr_emerg("System about to lose power. Check RTAS error log "
+			 " for details. Powering off immediately.\n");
 		emergency_sync();
 		kernel_power_off();
 		break;
 
 	default:
-		pr_err("Unknown power/cooling event (action code %d)",
+		pr_err("Unknown power/cooling event (action code  = %d)\n",
 			action_code);
+		num_epow_events++;
 	}
 }
 
@@ -249,13 +256,12 @@  static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 	log_error(ras_log_buf, ERR_TYPE_RTAS_LOG, fatal);
 
 	if (fatal) {
-		pr_emerg("Fatal hardware error reported by firmware");
-		pr_emerg("Check RTAS error log for details");
-		pr_emerg("Immediate power off");
+		pr_emerg("Fatal hardware error detected. Check RTAS error"
+			 " log for details. Powering off immediately\n");
 		emergency_sync();
 		kernel_power_off();
 	} else {
-		pr_err("Recoverable hardware error reported by firmware");
+		pr_err("Recoverable hardware error detected\n");
 	}
 
 	spin_unlock(&ras_log_buf_lock);