diff mbox

opal-prd: Direct systemd to always restart opal-prd

Message ID 148896572827.16968.5882246446209692412.stgit@thinktux.in.ibm.com
State Accepted
Headers show

Commit Message

Ananth N Mavinakayanahalli March 8, 2017, 9:35 a.m. UTC
Always restart the opal-prd daemon, irrespective of why it stopped.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
---

V2: Add Signed-off-by
---
 external/opal-prd/opal-prd.service |    1 +
 1 file changed, 1 insertion(+)

Comments

Vasant Hegde March 8, 2017, 3:26 p.m. UTC | #1
On 03/08/2017 03:05 PM, Ananth N Mavinakayanahalli wrote:
> Always restart the opal-prd daemon, irrespective of why it stopped.

[CCed Rafael who fixed service script earlier]


>
> Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>

Looks good to me .

Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

-Vasant


> ---
>
> V2: Add Signed-off-by
> ---
>   external/opal-prd/opal-prd.service |    1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service
> index 9956912..a53d8d9 100644
> --- a/external/opal-prd/opal-prd.service
> +++ b/external/opal-prd/opal-prd.service
> @@ -5,6 +5,7 @@ ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics
>
>   [Service]
>   ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0
> +Restart=always
>
>   [Install]
>   WantedBy=multi-user.target
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
>
Rafael Fonseca March 8, 2017, 3:40 p.m. UTC | #2
On 8 March 2017 at 16:26, Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
wrote:

> On 03/08/2017 03:05 PM, Ananth N Mavinakayanahalli wrote:
>
>> Always restart the opal-prd daemon, irrespective of why it stopped.
>>
>
> [CCed Rafael who fixed service script earlier]
>
>
>
>> Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
>>
>
> Looks good to me .
>

lgtm too.


Att.
--
Rafael Fonseca
Vaidyanathan Srinivasan March 8, 2017, 5:40 p.m. UTC | #3
* Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> [2017-03-08 15:05:28]:

> Always restart the opal-prd daemon, irrespective of why it stopped.
> 
> Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
> ---
> 
> V2: Add Signed-off-by
> ---
>  external/opal-prd/opal-prd.service |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service
> index 9956912..a53d8d9 100644
> --- a/external/opal-prd/opal-prd.service
> +++ b/external/opal-prd/opal-prd.service
> @@ -5,6 +5,7 @@ ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics
>  
>  [Service]
>  ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0
> +Restart=always

There should be a restart limit.  If we actually crash starting
opal-prd, then we will get stuck infinitely trying to start it.

This could happen if there are events pending and actually the event
action is crashing prd.  We will keep getting the same attention and
not make any progress.

Does systemd unit file allow a limited restart attempts?

--Vaidy
Vaibhav Jain March 9, 2017, 3:47 a.m. UTC | #4
Hi Vaidy,

Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:

> There should be a restart limit.  If we actually crash starting
> opal-prd, then we will get stuck infinitely trying to start it.
>
> This could happen if there are events pending and actually the event
> action is crashing prd.  We will keep getting the same attention and
> not make any progress.
>
> Does systemd unit file allow a limited restart attempts?

By default systemd will ratelimit the number of times a unit is
started via StartLimitBurst, StartLimitIntervalSec option which defaults
to 5-times/10-sec.

Also the default RestartSec interval is 100ms. So in case of an error
spike and default RestartSec; opal-prd may be restarted too quickly and
will get rate limited thereby missing reported prd errors.

So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that
opal-prd is restart after about 1 second interval without any rate
limiting.
Ananth N Mavinakayanahalli March 9, 2017, 5:03 a.m. UTC | #5
On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote:
> Hi Vaidy,
> 
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
> 
> > There should be a restart limit.  If we actually crash starting
> > opal-prd, then we will get stuck infinitely trying to start it.
> >
> > This could happen if there are events pending and actually the event
> > action is crashing prd.  We will keep getting the same attention and
> > not make any progress.
> >
> > Does systemd unit file allow a limited restart attempts?
> 
> By default systemd will ratelimit the number of times a unit is
> started via StartLimitBurst, StartLimitIntervalSec option which defaults
> to 5-times/10-sec.
> 
> Also the default RestartSec interval is 100ms. So in case of an error
> spike and default RestartSec; opal-prd may be restarted too quickly and
> will get rate limited thereby missing reported prd errors.
> 
> So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that
> opal-prd is restart after about 1 second interval without any rate
> limiting.

One second is too long a time to not run opal-prd. I would think
rate-limiting as it is, and getting to know systemd stopped trying to
restart it is a better option than to keep trying and failing.

Ananth
ppaidipe March 9, 2017, 5:16 a.m. UTC | #6
On 2017-03-09 10:33, Ananth N Mavinakayanahalli wrote:
> On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote:
>> Hi Vaidy,
>> 
>> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes:
>> 
>> > There should be a restart limit.  If we actually crash starting
>> > opal-prd, then we will get stuck infinitely trying to start it.
>> >
>> > This could happen if there are events pending and actually the event
>> > action is crashing prd.  We will keep getting the same attention and
>> > not make any progress.
>> >
>> > Does systemd unit file allow a limited restart attempts?
>> 
>> By default systemd will ratelimit the number of times a unit is
>> started via StartLimitBurst, StartLimitIntervalSec option which 
>> defaults
>> to 5-times/10-sec.
>> 
>> Also the default RestartSec interval is 100ms. So in case of an error
>> spike and default RestartSec; opal-prd may be restarted too quickly 
>> and
>> will get rate limited thereby missing reported prd errors.
>> 
>> So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that
>> opal-prd is restart after about 1 second interval without any rate
>> limiting.
> 
> One second is too long a time to not run opal-prd. I would think
> rate-limiting as it is, and getting to know systemd stopped trying to
> restart it is a better option than to keep trying and failing.
> 
> Ananth
> 


I have seen couple of times opal-prd getting crashed when it is trying 
to
process bad dimm's. During that state it is getting the below signal 11.

[   13.661462] opal-prd[1290]: unhandled signal 11 at 0000000000000000 
nip 00003fff8371fd70 lr 00003fff8371fa98 code 30001

Once it gets the signal 11, next further restarts of daemon are failed 
to start it.


service opal-prd status
● opal-prd.service - OPAL PRD daemon
    Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor 
preset: enabled)
    Active: failed (Result: signal) since Tue 2017-03-07 14:24:09 IST; 
45s ago
   Process: 1290 ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 
(code=killed, signal=SEGV)
  Main PID: 1290 (code=killed, signal=SEGV)

Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting 
bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapFuncs.C: <<dimmGetBadDqBitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapFuncs.C: >>dimmSetBadDqBitmap. centaur.mba     
k0:n0:s0:p01:c1    :0:0:1
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting 
bitmap
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp
Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: 
FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Setting 
bitmap
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Main 
process exited, code=killed, status=11/SEGV
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Unit 
entered failed state.
Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Failed with 
result 'signal'.

So rate limiting to 5 times(which is default) is a good idea.


Thanks
Pridhiviraj


> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
Stewart Smith March 24, 2017, 3:35 a.m. UTC | #7
Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> writes:
> Always restart the opal-prd daemon, irrespective of why it stopped.
>
> Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>

The consensus seems to be that this is okay, so merged to master as of 461d410dee0dc5e5deac948ebdcc4ec75eefe682
diff mbox

Patch

diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service
index 9956912..a53d8d9 100644
--- a/external/opal-prd/opal-prd.service
+++ b/external/opal-prd/opal-prd.service
@@ -5,6 +5,7 @@  ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics
 
 [Service]
 ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0
+Restart=always
 
 [Install]
 WantedBy=multi-user.target