[RFC,iproute2-next] man: Add devlink health man page

Message ID 1536826696-9413-2-git-send-email-eranbe@mellanox.com
State RFC, archived
Delegated to: David Ahern

Commit Message

Eran Ben Elisha Sept. 13, 2018, 8:18 a.m. UTC
Add devlink-health man page. Devlink-health tool will control device
health attributes, sensors, actions and logging.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>

-------------------------------------------------------
Copy paste man output to here for easier review process of the RFC.

DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)

NAME
       devlink-health - devlink health configuration

SYNOPSIS
       devlink [ OPTIONS ] health  { COMMAND | help }

       OPTIONS := { -V[ersion] | -n[no-nice-names] }

       devlink health show [ DEV ] [ sensor NAME ]

       devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]

       devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }

       devlink health action reinit DEV name NAME

       devlink health help

DESCRIPTION
       The devlink-health tool lets the user configure how the driver handles unexpected status. It configures the sensors that can trigger health activity and, for each sensor, the follow-up actions to run, such as
       reset and dump of information. In addition, it sets the health activity termination action.

   devlink health show - Display devlink health sensors and actions attributes
       DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.

           Format is:
             BUS_NAME/BUS_ADDRESS

       sensor NAME - Specifies the devlink sensor to show.

   devlink health sensor set - sets devlink health sensor attributes
       DEV    Specifies the devlink device to show.

       name NAME
              Name of the sensor to set.

       action NAME { active | inactive }
                  Specify which actions to activate and which to deactivate once a sensor is triggered. Actions can be dump, reset, etc.

   devlink health action set - sets devlink action attributes
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

       period PERIOD
              The period on which we limit the amount of performed actions, measured in seconds.

       count COUNT
              The maximum amount of actions performed in a limit time frame.

       fail   { ignore | down }
                  Specify the behavior once count limit was reached.

                  ignore - Ignore errors without execution of any action.

                  down - Driver will remain in nonoperational state.

   devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
       DEV    Specifies the devlink device to set.

       name NAME
              Specifies the devlink action to set.

EXAMPLES
       devlink health show
           Shows the health state of all devlink devices on the system.

       devlink health show pci/0000:01:00.0
           Shows the health state of specified devlink device.

       devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
           Sets TX_COMP_ERROR sensor parameters for a specific device.

       devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
           Sets health attributes for reset action.

SEE ALSO
       devlink(8), devlink-port(8), devlink-sb(8), devlink-monitor(8), devlink-dev(8)

AUTHOR
       Eran Ben Elisha <eranbe@mellanox.com>

iproute2                                                                                                     15 Aug 2018                                                                                           DEVLINK-HEALTH(8)
---
 man/man8/devlink-health.8 | 171 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man/man8/devlink-health.8
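
The interaction of period, count and fail in the man page above can be read as a rate limit on each action. The following is only a sketch of one possible reading, not the driver or devlink implementation: the class name ActionLimiter, the sliding-window behaviour, and the assumption that "down" persists once reached are all hypothetical and are not stated in the RFC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionLimiter:
    """Hypothetical model of `period PERIOD count COUNT fail { ignore | down }`."""
    period: float          # window length in seconds
    count: int             # maximum number of executions per window
    fail: str = "ignore"   # behaviour once the limit is hit: "ignore" or "down"
    _stamps: List[float] = field(default_factory=list)
    down: bool = False     # assumption: "down" is sticky once entered

    def on_trigger(self, now: float) -> str:
        """Decide what happens when a sensor requests this action at time `now`."""
        if self.down:
            return "down"
        # Drop executions that are older than the window (assumed sliding window).
        self._stamps = [t for t in self._stamps if now - t < self.period]
        if len(self._stamps) < self.count:
            self._stamps.append(now)
            return "run"
        if self.fail == "down":
            self.down = True
            return "down"
        return "ignore"


# Mirrors the EXAMPLES entry: period 3600 count 5 fail ignore.
limiter = ActionLimiter(period=3600, count=5, fail="ignore")
results = [limiter.on_trigger(float(second)) for second in range(7)]
```

Under these assumptions the first five triggers run the action and the next two are ignored; a trigger one hour after the first execution is allowed to run again.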

Comments

Tobin C. Harding Sept. 13, 2018, 10:27 a.m. UTC | #1
On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> Add devlink-health man page. Devlink-health tool will control device
> health attributes, sensors, actions and logging.
> 
> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> 
> -------------------------------------------------------
> Copy paste man output to here for easier review process of the RFC.
> 
> DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
> 
> NAME
>        devlink-health - devlink health configuration
> 
> SYNOPSIS
>        devlink [ OPTIONS ] health  { COMMAND | help }
> 
>        OPTIONS := { -V[ersion] | -n[no-nice-names] }
> 
>        devlink health show [ DEV ] [ sensor NAME ]
> 
>        devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
> 
>        devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> 
>        devlink health action reinit DEV name NAME
> 
>        devlink health help
> 
> DESCRIPTION
>        devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
>        reset and dump of info. In addition, set the health activity termination action.
> 
>    devlink health show - Display devlink health sensors and actions attributes
>        DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
> 
>            Format is:
>              BUS_NAME/BUS_ADDRESS
> 
>        sensor NAME - Specifies the devlink sensor to show.
> 

Perhaps the commands should include the optional arguments so when
reading the description one doesn't have to scroll to the top of the
page all the time

e.g
     devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes

>    devlink health sensor set - sets devlink health sensor attributes
>        DEV    Specifies the devlink device to show.

	 		      	      	     	set

>        name NAME
>               Name of the sensor to set.
> 
>        action NAME { active | inactive }
>                   Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
> 
>    devlink health action set - sets devlink action attributes
>        DEV    Specifies the devlink device to set.
> 
>        name NAME
>               Specifies the devlink action to set.

This is a little unclear to me?

>        period PERIOD
>               The period on which we limit the amount of performed actions, measured in seconds.
> 
>        count COUNT
>               The maximum amount of actions performed in a limit time frame.

Perhaps		    	    	      	      
                The maximum number of actions performed in a limited time frame.

>        fail   { ignore | down }
>                   Specify the behavior once count limit was reached.
> 
>                   ignore - Ignore errors without execution of any action.
> 
>                   down - Driver will remain in nonoperational state.
> 
>    devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
>        DEV    Specifies the devlink device to set.
> 
>        name NAME
>               Specifies the devlink action to set.

Perhaps s/set/reinitialise/g for the above two descriptions.

Hope this helps,
Tobin.
Eran Ben Elisha Sept. 13, 2018, 11:58 a.m. UTC | #2
On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
>> Add devlink-health man page. Devlink-health tool will control device
>> health attributes, sensors, actions and logging.
>>
>> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
>>
>> -------------------------------------------------------
>> Copy paste man output to here for easier review process of the RFC.
>>
>> DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
>>
>> NAME
>>         devlink-health - devlink health configuration
>>
>> SYNOPSIS
>>         devlink [ OPTIONS ] health  { COMMAND | help }
>>
>>         OPTIONS := { -V[ersion] | -n[no-nice-names] }
>>
>>         devlink health show [ DEV ] [ sensor NAME ]
>>
>>         devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
>>
>>         devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
>>
>>         devlink health action reinit DEV name NAME
>>
>>         devlink health help
>>
>> DESCRIPTION
>>         devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
>>         reset and dump of info. In addition, set the health activity termination action.
>>
>>     devlink health show - Display devlink health sensors and actions attributes
>>         DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
>>
>>             Format is:
>>               BUS_NAME/BUS_ADDRESS
>>
>>         sensor NAME - Specifies the devlink sensor to show.
>>
> 
> Perhaps the commands should include the optional arguments so when
> reading the description one doesn't have to scroll to the top of the
> page all the time
> 
> e.g
>       devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes
> 

I followed the scheme presented in all other devlink man pages.
see devlink-region, devlink-port, etc.

 From my perspective, I am fine with adding it to devlink-health, need 
ack from the devlink maintainer to see if he likes it...

>>     devlink health sensor set - sets devlink health sensor attributes
>>         DEV    Specifies the devlink device to show.
> 
> 	 		      	      	     	set
> 
>>         name NAME
>>                Name of the sensor to set.
>>
>>         action NAME { active | inactive }
>>                    Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
>>
>>     devlink health action set - sets devlink action attributes
>>         DEV    Specifies the devlink device to set.
>>
>>         name NAME
>>                Specifies the devlink action to set.
> 
> This is a little unclear to me?

what is not clear? the term 'action' or the naming? can you elaborate?

> 
>>         period PERIOD
>>                The period on which we limit the amount of performed actions, measured in seconds.
>>
>>         count COUNT
>>                The maximum amount of actions performed in a limit time frame.
> 
> Perhaps		    	    	      	
>                  The maximum number of actions performed in a limited time frame.

ack

> 
>>         fail   { ignore | down }
>>                    Specify the behavior once count limit was reached.
>>
>>                    ignore - Ignore errors without execution of any action.
>>
>>                    down - Driver will remain in nonoperational state.
>>
>>     devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
>>         DEV    Specifies the devlink device to set.
>>
>>         name NAME
>>                Specifies the devlink action to set.
> 
> Perhaps s/set/reinitialise/g for the above two descriptions.

ack

> 
> Hope this helps,
> Tobin.

thanks
Andrew Lunn Sept. 13, 2018, 12:08 p.m. UTC | #3
>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>            Sets TX_COMP_ERROR sensor parameters for a specific device.

I hope the real sensors have more understandable names. If i remember
correctly, the same sort of comment was given for resource
management. It was pretty unclear what the resource names actually
mean. Is an average user going to have any idea how to actually use
these sensors and actions?

Can you give more examples of sensors. We should understand if there
are any overlaps with hwmon.

    Andrew
Eran Ben Elisha Sept. 13, 2018, 12:49 p.m. UTC | #4
On 9/13/2018 3:08 PM, Andrew Lunn wrote:
>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
> 
> I hope the real sensors have more understandable names. If i remember
> correctly, the same sort of comment was given for resource
> management. It was pretty unclear what the resource names actually
> mean. Is an average user going to have any idea how to actually use
> these sensors and actions?

well, hopefully. the whole point is to have it fully controlled by the 
user. However, names for the command should be short. I guess we shall 
have it documented (challenge is to fit to multi vendors).

> 
> Can you give more examples of sensors. We should understand if there
> are any overlaps with hwmon.

I restate here that we shall have SW sensors as well, and not only HW 
sensors.

This is what I had in mind:
1. command interface error
2. command interface timeout
3. stuck TX queue (like tx_timeout)
4. stuck TX completion queue (driver did not process packets in a 
reasonable time period)
5. stuck RX queue
6. RX completion error
7. TX completion error
8. HW / FW catastrophic error report
9. completion queue overrun

Eran

> 
>      Andrew
>
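
One way to address the documentation concern raised above is to keep the short sensor name and its description together in a single table. This is a sketch only: apart from TX_COMP_ERROR, which appears in the RFC example, every identifier below is a hypothetical name invented for illustration, and the descriptions are taken from the list in message #4.

```python
from enum import Enum


class Sensor(Enum):
    """Hypothetical short names for the nine proposed sensors."""
    CMD_IF_ERROR = "command interface error"
    CMD_IF_TIMEOUT = "command interface timeout"
    TX_QUEUE_STUCK = "stuck TX queue (like tx_timeout)"
    TX_CQ_STUCK = "stuck TX completion queue"
    RX_QUEUE_STUCK = "stuck RX queue"
    RX_COMP_ERROR = "RX completion error"
    TX_COMP_ERROR = "TX completion error"  # the only name used in the RFC
    FW_CATASTROPHIC = "HW / FW catastrophic error report"
    CQ_OVERRUN = "completion queue overrun"


def describe(name: str) -> str:
    """Return the human-readable description for a short sensor name."""
    return Sensor[name].value
```

A table like this could back both a help listing and the man page, so the short names used on the command line never drift from their documented meaning.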
Andrew Lunn Sept. 13, 2018, 1:24 p.m. UTC | #5
On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
> >>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>            Sets TX_COMP_ERROR sensor parameters for a specific device.
> >
> >I hope the real sensors have more understandable names. If i remember
> >correctly, the same sort of comment was given for resource
> >management. It was pretty unclear what the resource names actually
> >mean. Is an average user going to have any idea how to actually use
> >these sensors and actions?
> 
> well, hopefully. the whole point is to have it fully controlled by the user.
> However, names for the command should be short. I guess we shall have it
> documented (challenge is to fit to multi vendors).
> 
> >
> >Can you give more examples of sensors. We should understand if there
> >are any overlaps with hwmon.
> 
> I restate here that we shall have SW sensors as well, and not only HW
> sensors.
> 
> This is what I had in mind:
> 1. command interface error
> 2. command interface timeout
> 3. stuck TX queue (like tx_timeout)
> 4. stuck TX completion queue (driver did not process packets in a reasonable
> time period)
> 5. stuck RX queue
> 6. RX completion error
> 7. TX completion error
> 8. HW / FW catastrophic error report
> 9. completion queue overrun

Hi Eran

I'm having trouble differentiating between these SW sensors and bugs
which need fixing. What causes a command interface error? Sending it a
command it does not understand? A wrongly formatted command? A command
the version of the firmware does not support? These all sound just
like plain old bugs which need fixing, not something which needs a
framework to detect them and try to recover from them by resetting
something.

I would have expected that all the issues are about physical
properties. Something similar to SMART for hard disks. The power
supplies are starting to droop, suggesting it might die soon. The
tacho on the fan suggests the FAN is not rotating as fast as it
should, so the motor is going to die soon. An SFP is giving i2c
errors, suggesting it is not seated correctly. The card as a whole is
overheating, despite the fan working, suggesting the ambient
temperature is just too high.

	Andrew
Eran Ben Elisha Sept. 13, 2018, 2:30 p.m. UTC | #6
On 9/13/2018 4:24 PM, Andrew Lunn wrote:
> On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:
>>
>>
>> On 9/13/2018 3:08 PM, Andrew Lunn wrote:
>>>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
>>>
>>> I hope the real sensors have more understandable names. If i remember
>>> correctly, the same sort of comment was given for resource
>>> management. It was pretty unclear what the resource names actually
>>> mean. Is an average user going to have any idea how to actually use
>>> these sensors and actions?
>>
>> well, hopefully. the whole point is to have it fully controlled by the user.
>> However, names for the command should be short. I guess we shall have it
>> documented (challenge is to fit to multi vendors).
>>
>>>
>>> Can you give more examples of sensors. We should understand if there
>>> are any overlaps with hwmon.
>>
>> I restate here that we shall have SW sensors as well, and not only HW
>> sensors.
>>
>> This is what I had in mind:
>> 1. command interface error
>> 2. command interface timeout
>> 3. stuck TX queue (like tx_timeout)
>> 4. stuck TX completion queue (driver did not process packets in a reasonable
>> time period)
>> 5. stuck RX queue
>> 6. RX completion error
>> 7. TX completion error
>> 8. HW / FW catastrophic error report
>> 9. completion queue overrun
> 
> Hi Eran
> 
> I'm having trouble differentiating between these SW sensors and bugs
> which need fixing. What causes a command interface error? Sending it a
> command it does not understand? A wrongly formatted command? A command
> the version of the firmware does not support? These all sound just
> like plain old bugs which need fixing, not something which needs a
> framework to detect them and try to recover from them by resetting
> something.

Such issues do exist in production environment, and need to be handled 
even if root cause is a bug which will be fixed in latest release. My 
feature should help developers / administrator to control and recover 
their live systems, by auto correction and logging support.
Goal is:
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed 
debugging information.

> 
> I would have expected that all the issues are about physical
> properties. Something similar to SMART for hard disks. The power
> supplies are starting to droop, suggesting it might die soon. The
> tacho on the fan suggests the FAN is not rotating as fast as it
> should, so the motor is going to die soon. An SFP is giving i2c
> errors, suggesting it is not seated correctly. The card as a whole is
> overheating, despite the fan working, suggesting the ambient
> temperature is just too high.

AFAIU, the kind of sensors you suggest here require a manual fix: 
physically approaching the setup, replacing HW, installing a new fan, etc.
Monitoring such events is easy: the driver can just log events from HW to 
dmesg and end its handling there.
None of these is a real networking issue I would like to handle with 
devlink-health.

Eran

> 
> 	Andrew
>
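
The goals listed in message #6 (debug information first, then self healing) suggest a simple dispatch from a triggered sensor to its active actions. The sketch below is an illustration under assumptions: the function name actions_for, the config layout, and the rule that dump runs before reset are hypothetical and are not specified anywhere in the RFC.

```python
from typing import Dict, List

# Assumed ordering: gather debug data before attempting any recovery step.
ACTION_ORDER = ["dump", "reset"]


def actions_for(config: Dict[str, Dict[str, bool]], sensor: str) -> List[str]:
    """Return the active actions for a triggered sensor, dump before reset."""
    active = {name for name, enabled in config.get(sensor, {}).items() if enabled}
    return [name for name in ACTION_ORDER if name in active]


# Mirrors the RFC example: TX_COMP_ERROR with reset off and dump on.
config = {"TX_COMP_ERROR": {"reset": False, "dump": True}}
triggered = actions_for(config, "TX_COMP_ERROR")
```

With the RFC example configuration only the dump action is selected; a sensor with no configuration selects nothing.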
Andrew Lunn Sept. 13, 2018, 3:12 p.m. UTC | #7
> >>>>        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> >>>>            Sets TX_COMP_ERROR sensor parameters for a specific device.

> >>This is what I had in mind:
> >>1. command interface error
> >>2. command interface timeout
> >>3. stuck TX queue (like tx_timeout)
> >>4. stuck TX completion queue (driver did not process packets in a reasonable
> >>time period)
> >>5. stuck RX queue
> >>6. RX completion error
> >>7. TX completion error
> >>8. HW / FW catastrophic error report
> >>9. completion queue overrun

> Such issues do exist in production environment, and need to be handled even
> if root cause is a bug which will be fixed in latest release. My feature
> should help developers / administrator to control and recover their live
> systems, by auto correction and logging support.
> Goal is:
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
> debugging information.

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on

	Andrew
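
The command lines proposed above all share one shape, so they can be generated from a list of descriptive sensor names. This is a sketch for illustration only: the helper sensor_set_cmd is hypothetical, and both "devlink health" and "devlink bug" are syntax proposed in this thread, not commands assumed to exist in any released devlink.

```python
from typing import Dict, List


def sensor_set_cmd(obj: str, dev: str, sensor: str, actions: Dict[str, str]) -> str:
    """Build a proposed `devlink <obj> sensor set` command line as a string."""
    parts: List[str] = ["devlink", obj, "sensor", "set", dev, "name", sensor]
    for action, state in actions.items():
        parts += ["action", action, state]
    return " ".join(parts)


# Descriptive sensor names suggested in message #7.
SENSORS = [
    "command_interface_error",
    "command_interface_timeout",
    "transmit_completion_error",
    "completion_queue_overrun",
]

commands = [
    sensor_set_cmd("bug", "pci/0000:01:00.0", name, {"reset": "off", "dump": "on"})
    for name in SENSORS
]
```

Passing "health" instead of "bug" as the first argument yields the same lines under the name used by the RFC, which makes the two naming proposals easy to compare side by side.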
Tobin C. Harding Sept. 13, 2018, 10:06 p.m. UTC | #8
On Thu, Sep 13, 2018 at 02:58:52PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> > On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> > > Add devlink-health man page. Devlink-health tool will control device
> > > health attributes, sensors, actions and logging.
> > > 
> > > Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> > > 
> > > -------------------------------------------------------
> > > Copy paste man output to here for easier review process of the RFC.
> > > 
> > > DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
> > > 
> > > NAME
> > >         devlink-health - devlink health configuration
> > > 
> > > SYNOPSIS
> > >         devlink [ OPTIONS ] health  { COMMAND | help }
> > > 
> > >         OPTIONS := { -V[ersion] | -n[no-nice-names] }
> > > 
> > >         devlink health show [ DEV ] [ sensor NAME ]
> > > 
> > >         devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
> > > 
> > >         devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> > > 
> > >         devlink health action reinit DEV name NAME
> > > 
> > >         devlink health help
> > > 
> > > DESCRIPTION
> > >         devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
> > >         reset and dump of info. In addition, set the health activity termination action.
> > > 
> > >     devlink health show - Display devlink health sensors and actions attributes
> > >         DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
> > > 
> > >             Format is:
> > >               BUS_NAME/BUS_ADDRESS
> > > 
> > >         sensor NAME - Specifies the devlink sensor to show.
> > > 
> > 
> > Perhaps the commands should include the optional arguments so when
> > reading the description one doesn't have to scroll to the top of the
> > page all the time
> > 
> > e.g
> >       devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes
> > 
> 
> I followed the scheme presented in all other devlink man pages.
> see devlink-region, devlink-port, etc.

Oh ok, my mistake.  I'd stick with what you have then.  Thanks for
pointing this out.

> From my perspective, I am fine with adding it to devlink-health, need ack
> from the devlink maintainer to see if he likes it...
> 
> > >     devlink health sensor set - sets devlink health sensor attributes
> > >         DEV    Specifies the devlink device to show.
> > 
> > 	 		      	      	     	set
> > 
> > >         name NAME
> > >                Name of the sensor to set.
> > > 
> > >         action NAME { active | inactive }
> > >                    Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
> > > 
> > >     devlink health action set - sets devlink action attributes
> > >         DEV    Specifies the devlink device to set.
> > > 
> > >         name NAME
> > >                Specifies the devlink action to set.
> > 
> > This is a little unclear to me?
> 
> what is not clear? the term 'action' or the naming? can you elaborate?

It wasn't immediately clear what 'name' referred to.  But following on
from discussion above this may be because I have not read any of the
other devlink man pages.

thanks,
Tobin.
Eran Ben Elisha Sept. 16, 2018, 9:14 a.m. UTC | #9
On 9/13/2018 6:12 PM, Andrew Lunn wrote:
>>>>>>         devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
>>>>>>             Sets TX_COMP_ERROR sensor parameters for a specific device.
> 
>>>> This is what I had in mind:
>>>> 1. command interface error
>>>> 2. command interface timeout
>>>> 3. stuck TX queue (like tx_timeout)
>>>> 4. stuck TX completion queue (driver did not process packets in a reasonable
>>>> time period)
>>>> 5. stuck RX queue
>>>> 6. RX completion error
>>>> 7. TX completion error
>>>> 8. HW / FW catastrophic error report
>>>> 9. completion queue overrun
> 
>> Such issues do exist in production environment, and need to be handled even
>> if root cause is a bug which will be fixed in latest release. My feature
>> should help developers / administrator to control and recover their live
>> systems, by auto correction and logging support.
>> Goal is:
>> - Provide alert debug information
>> - Self healing
>> - If problem needs vendor support, provide a way to gather all needed
>> debugging information.
> 
> So maybe you have the wrong name for this. Health is nice in terms of
> Marketing, but we are actually talking about bug recovery.

The way I see it, this feature is responsible for the health of the 
system from the pci/xxxx perspective.
I thought about devlink-recover for example, but I really wouldn't like 
to limit the feature to be called after one of its actions. The same for 
devlink-bug, which highlights only part of the range of capabilities 
(sensor).

My work is currently focused on error reporting and recovery, but I 
wouldn't like to see the API limited for "bugs" only.

Eran

> 
> devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
> devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on
> 
> seems a lot more understandable than:
> 
> devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
> 
> 	Andrew
>
Patch

diff --git a/man/man8/devlink-health.8 b/man/man8/devlink-health.8
new file mode 100644
index 000000000000..ac28b020be0d
--- /dev/null
+++ b/man/man8/devlink-health.8
@@ -0,0 +1,171 @@ 
+.TH DEVLINK\-HEALTH 8 "15 Aug 2018" "iproute2" "Linux"
+.SH NAME
+devlink-health \- devlink health configuration
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B devlink
+.RI "[ " OPTIONS " ]"
+.BR health
+.RI  " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] |
+\fB\-n\fR[\fIno-nice-names\fR] }
+
+.ti -8
+.B devlink health show
+.RI "[ " DEV " ]"
+.RI "[ "
+.B sensor
+.IR NAME
+.RI "]"
+
+.ti -8
+.B devlink health sensor set
+.IR DEV
+.B name
+.IR NAME
+.RI "[ "
+.BR action
+.IR NAME
+.RB "{ " active " | " inactive " } ]"
+
+.ti -8
+.B devlink health action set
+.IR DEV
+.B name
+.IR NAME
+.BR period
+.IR PERIOD
+.BR count
+.IR COUNT
+.BR fail " { "
+.IR ignore
+.BR "| "
+.IR down
+.RI "}"
+
+.ti -8
+.B devlink health action reinit
+.IR DEV
+.B name
+.IR NAME
+
+.ti -8
+.B devlink health help
+
+.SH "DESCRIPTION"
+.B devlink-health
+tool lets the user configure how the driver handles unexpected status. It configures the sensors that can trigger health activity and, for each sensor, the follow-up actions to run, such as reset and dump of information. In addition, it sets the health activity termination action.
+
+.SS devlink health show - Display devlink health sensors and actions attributes
+.PP
+.B "DEV"
+- Specifies the devlink device to show.
+If this argument is omitted, all devices are listed.
+
+.in +4
+Format is:
+.in +2
+BUS_NAME/BUS_ADDRESS
+
+.PP
+.BR sensor
+.IR "NAME"
+- Specifies the devlink sensor to show.
+
+.SS devlink health sensor set - sets devlink health sensor attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to show.
+
+.TP
+.BI name " NAME"
+Name of the sensor to set.
+
+.TP
+.BR action
+.IR NAME
+.RB "{ " active " | " inactive " }"
+.in +4
+Specify which actions to activate and which to deactivate once a sensor is triggered. Actions can be dump, reset, etc.
+
+.SS devlink health action set - sets devlink action attributes
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.TP
+.BI period " PERIOD"
+The period on which we limit the amount of performed actions, measured in seconds.
+
+.TP
+.BI count " COUNT"
+The maximum amount of actions performed in a limit time frame.
+
+.TP
+.BR fail
+.RB "{ " ignore " | " down " }"
+.in +4
+Specify the behavior once count limit was reached.
+
+.I ignore
+- Ignore errors without execution of any action.
+
+.I down
+- Driver will remain in nonoperational state.
+
+.SS devlink health action reinit - reset devlink action attributes (period, count, fail, etc)
+
+.TP
+.B "DEV"
+Specifies the devlink device to set.
+
+.TP
+.BI name " NAME"
+Specifies the devlink action to set.
+
+.SH "EXAMPLES"
+.PP
+devlink health show
+.RS 4
+Shows the health state of all devlink devices on the system.
+.RE
+.PP
+devlink health show pci/0000:01:00.0
+.RS 4
+Shows the health state of specified devlink device.
+.RE
+.PP
+devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
+.RS 4
+Sets TX_COMP_ERROR sensor parameters for a specific device.
+.RE
+.PP
+devlink health action set pci/0000:01:00.0 name reset period 3600 count 5 fail ignore
+.RS 4
+Sets health attributes for reset action.
+.RE
+
+.SH SEE ALSO
+.BR devlink (8),
+.BR devlink-port (8),
+.BR devlink-sb (8),
+.BR devlink-monitor (8),
+.BR devlink-dev (8)
+.br
+
+.SH AUTHOR
+Eran Ben Elisha <eranbe@mellanox.com>