Message ID | 148896572827.16968.5882246446209692412.stgit@thinktux.in.ibm.com |
---|---|
State | Accepted |
Headers | show |
On 03/08/2017 03:05 PM, Ananth N Mavinakayanahalli wrote: > Always restart the opal-prd daemon, irrespective of why it stopped. [CCed Rafael who fixed service script earlier] > > Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Looks good to me . Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> -Vasant > --- > > V2: Add Signed-off-by > --- > external/opal-prd/opal-prd.service | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service > index 9956912..a53d8d9 100644 > --- a/external/opal-prd/opal-prd.service > +++ b/external/opal-prd/opal-prd.service > @@ -5,6 +5,7 @@ ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics > > [Service] > ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 > +Restart=always > > [Install] > WantedBy=multi-user.target > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot >
On 8 March 2017 at 16:26, Vasant Hegde <hegdevasant@linux.vnet.ibm.com> wrote: > On 03/08/2017 03:05 PM, Ananth N Mavinakayanahalli wrote: > >> Always restart the opal-prd daemon, irrespective of why it stopped. >> > > [CCed Rafael who fixed service script earlier] > > > >> Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> >> > > Looks good to me . > lgtm too. Att. -- Rafael Fonseca
* Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> [2017-03-08 15:05:28]: > Always restart the opal-prd daemon, irrespective of why it stopped. > > Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> > --- > > V2: Add Signed-off-by > --- > external/opal-prd/opal-prd.service | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service > index 9956912..a53d8d9 100644 > --- a/external/opal-prd/opal-prd.service > +++ b/external/opal-prd/opal-prd.service > @@ -5,6 +5,7 @@ ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics > > [Service] > ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 > +Restart=always There should be a restart limit. If we actually crash starting opal-prd, then we will get stuck infinitely trying to start it. This could happen if there are events pending and actually the event action is crashing prd. We will keep getting the same attention and not make any progress. Does systemd unit file allow a limited restart attempts? --Vaidy
Hi Vaidy, Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes: > There should be a restart limit. If we actually crash starting > opal-prd, then we will get stuck infinitely trying to start it. > > This could happen if there are events pending and actually the event > action is crashing prd. We will keep getting the same attention and > not make any progress. > > Does systemd unit file allow a limited restart attempts? By default systemd will ratelimit the number of times a unit is started via StartLimitBurst, StartLimitIntervalSec option which defaults to 5-times/10-sec. Also the default RestartSec interval is 100ms. So in case of an error spike and default RestartSec; opal-prd may be restarted too quickly and will get rate limited thereby missing reported prd errors. So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that opal-prd is restart after about 1 second interval without any rate limiting.
On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote: > Hi Vaidy, > > Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes: > > > There should be a restart limit. If we actually crash starting > > opal-prd, then we will get stuck infinitely trying to start it. > > > > This could happen if there are events pending and actually the event > > action is crashing prd. We will keep getting the same attention and > > not make any progress. > > > > Does systemd unit file allow a limited restart attempts? > > By default systemd will ratelimit the number of times a unit is > started via StartLimitBurst, StartLimitIntervalSec option which defaults > to 5-times/10-sec. > > Also the default RestartSec interval is 100ms. So in case of an error > spike and default RestartSec; opal-prd may be restarted too quickly and > will get rate limited thereby missing reported prd errors. > > So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that > opal-prd is restart after about 1 second interval without any rate > limiting. One second is too long a time to not run opal-prd. I would think rate-limiting as it is, and getting to know systemd stopped trying to restart it is a better option than to keep trying and failing. Ananth
On 2017-03-09 10:33, Ananth N Mavinakayanahalli wrote: > On Thu, Mar 09, 2017 at 09:17:08AM +0530, Vaibhav Jain wrote: >> Hi Vaidy, >> >> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> writes: >> >> > There should be a restart limit. If we actually crash starting >> > opal-prd, then we will get stuck infinitely trying to start it. >> > >> > This could happen if there are events pending and actually the event >> > action is crashing prd. We will keep getting the same attention and >> > not make any progress. >> > >> > Does systemd unit file allow a limited restart attempts? >> >> By default systemd will ratelimit the number of times a unit is >> started via StartLimitBurst, StartLimitIntervalSec option which >> defaults >> to 5-times/10-sec. >> >> Also the default RestartSec interval is 100ms. So in case of an error >> spike and default RestartSec; opal-prd may be restarted too quickly >> and >> will get rate limited thereby missing reported prd errors. >> >> So I would suggest RestartSec=1 and StartLimitIntervalSec=5 so that >> opal-prd is restart after about 1 second interval without any rate >> limiting. > > One second is too long a time to not run opal-prd. I would think > rate-limiting as it is, and getting to know systemd stopped trying to > restart it is a better option than to keep trying and failing. > > Ananth > I have seen couple of times opal-prd getting crashed when it is trying to process bad dimm's. During that state it is getting the below signal 11. [ 13.661462] opal-prd[1290]: unhandled signal 11 at 0000000000000000 nip 00003fff8371fd70 lr 00003fff8371fa98 code 30001 Once it gets the signal 11, next further restarts of daemon are failed to start it. service opal-prd status ● opal-prd.service - OPAL PRD daemon Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled) Active: failed (Result: signal) since Tue 2017-03-07 14:24:09 IST; 45s ago Process: 1290 ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 (code=killed, signal=SEGV) Main PID: 1290 (code=killed, signal=SEGV) Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: <<dimmGetBadDqBitmap Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapFuncs.C: >>dimmSetBadDqBitmap. centaur.mba k0:n0:s0:p01:c1 :0:0:1 Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Getting bitmap Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: <<dimmBadDqBitmapAccessHwp Mar 07 14:24:09 ltc-test-hab02 opal-prd[1290]: HBRT: FAPI:dimmBadDqBitmapAccessHwp.C: >>dimmBadDqBitmapAccessHwp: Setting bitmap Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Main process exited, code=killed, status=11/SEGV Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Unit entered failed state. Mar 07 14:24:09 ltc-test-hab02 systemd[1]: opal-prd.service: Failed with result 'signal'. So rate limiting to 5 times(which is default) is a good idea. Thanks Pridhiviraj > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> writes: > Always restart the opal-prd daemon, irrespective of why it stopped. > > Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> The consensus seems to be that this is okay, so merged to master as of 461d410dee0dc5e5deac948ebdcc4ec75eefe682
diff --git a/external/opal-prd/opal-prd.service b/external/opal-prd/opal-prd.service index 9956912..a53d8d9 100644 --- a/external/opal-prd/opal-prd.service +++ b/external/opal-prd/opal-prd.service @@ -5,6 +5,7 @@ ConditionPathExists=/sys/firmware/devicetree/base/ibm,opal/diagnostics [Service] ExecStart=/usr/sbin/opal-prd --pnor /dev/mtd0 +Restart=always [Install] WantedBy=multi-user.target
Always restart the opal-prd daemon, irrespective of why it stopped. Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> --- V2: Add Signed-off-by --- external/opal-prd/opal-prd.service | 1 + 1 file changed, 1 insertion(+)