[linux,dev-4.13,1/2] hwmon (occ): Remove temperature fault attribute and VRM temp alarm

Message ID 1519944259-27898-2-git-send-email-eajames@linux.vnet.ibm.com
State New
Headers show
Series
  • hwmon: (occ): Port fixes to 4.13
Related show

Commit Message

Eddie James March 1, 2018, 10:44 p.m.
From: "Edward A. James" <eajames@us.ibm.com>

This removes the EAGAIN return for "sensor not ready" and the temperature
fault and vrm temp alarms attributes.

Apologies for pushing this up so early... Userspace applications aren't
ready for this change. The hwmon polling application cannot accept EGAIN
yet, and we can't be returning apparent errors if the sensor is
temporarily unavailable.

Brad Bishop <bradleyb@fuzziesquirrel.com> wrote:
> userspace retries for a configurable period on eio, etimedout,
> ebadmsg, eagain, and exnio.  It also exits cleanly on enoent.
>
> If any of these errnos don’t go away after the configured number of retry
> attempts, an error is logged.  Any other errnos and an error is logged
> immediately. If this list needs improvement I’d love to hear about it.
>
> ebadmsg and enxio are observed when i2c devices are unplugged with transfers
> in various stages of flight.  They occur just before the driver is unbound
> after the presence gpio toggle on these devices is noticed and processed
> (killing the hwmon userspace daemon cleanly).
>
> eio, etimedout seem to be bus type errors that appear somewhat infrequently,
> but occur nevertheless.
>
> we all know what eagain means...
>
> It isn’t so much that the hwmon userspace doesn’t support this change,
> its just the resulting behavior difference at the other end of the
> hwmon abi <-> dbus api translation is not optimal, right now.
>
> 1 - Without this change, the hwmon userspace reads a value of 0 out of
> the sysfs attribute and happily reports that as the temp at the DBus level.
>
> With this change, the hwmon userspace would read the attribute, get an
> error and retry for a bit.  After a number of retries, the error is
> logged and the sensor is marked faulted at a dbus level.
>
> The issue is the consumer of the sensor at the dbus level.  Today, the
> dbus consumer happens to translate the sensor value of zero into the
> desired behavior.  If we submit this change the sensor will be put
> into faulted state and the consuming application doesn’t have support
> for that yet.
>
> We have open issues to enhance the application to support faulted sensors,
> its just not implemented yet.

OpenBMC-Staging-Count: 1
Signed-off-by: Eddie James <eajames@us.ibm.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/hwmon/occ/common.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

Comments

Joel Stanley March 2, 2018, 12:39 a.m. | #1
On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> wrote:
> From: "Edward A. James" <eajames@us.ibm.com>
>
> This removes the EAGAIN return for "sensor not ready" and the temperature
> fault and vrm temp alarms attributes.
>
> Apologies for pushing this up so early... Userspace applications aren't
> ready for this change. The hwmon polling application cannot accept EGAIN
> yet, and we can't be returning apparent errors if the sensor is
> temporarily unavailable.

Okay. Brad's email is from October 5th, 2017. That's 148 days ago.

Any ETA on when the userspace plans to get some love?

Cheers,

Joel


>
> Brad Bishop <bradleyb@fuzziesquirrel.com> wrote:
>> userspace retries for a configurable period on eio, etimedout,
>> ebadmsg, eagain, and exnio.  It also exits cleanly on enoent.
>>
>> If any of these errnos don’t go away after the configured number of retry
>> attempts, an error is logged.  Any other errnos and an error is logged
>> immediately. If this list needs improvement I’d love to hear about it.
>>
>> ebadmsg and enxio are observed when i2c devices are unplugged with transfers
>> in various stages of flight.  They occur just before the driver is unbound
>> after the presence gpio toggle on these devices is noticed and processed
>> (killing the hwmon userspace daemon cleanly).
>>
>> eio, etimedout seem to be bus type errors that appear somewhat infrequently,
>> but occur nevertheless.
>>
>> we all know what eagain means...
>>
>> It isn’t so much that the hwmon userspace doesn’t support this change,
>> its just the resulting behavior difference at the other end of the
>> hwmon abi <-> dbus api translation is not optimal, right now.
>>
>> 1 - Without this change, the hwmon userspace reads a value of 0 out of
>> the sysfs attribute and happily reports that as the temp at the DBus level.
>>
>> With this change, the hwmon userspace would read the attribute, get an
>> error and retry for a bit.  After a number of retries, the error is
>> logged and the sensor is marked faulted at a dbus level.
>>
>> The issue is the consumer of the sensor at the dbus level.  Today, the
>> dbus consumer happens to translate the sensor value of zero into the
>> desired behavior.  If we submit this change the sensor will be put
>> into faulted state and the consuming application doesn’t have support
>> for that yet.
>>
>> We have open issues to enhance the application to support faulted sensors,
>> its just not implemented yet.
>
> OpenBMC-Staging-Count: 1
> Signed-off-by: Eddie James <eajames@us.ibm.com>
> Signed-off-by: Joel Stanley <joel@jms.id.au>
> ---
>  drivers/hwmon/occ/common.c | 33 +++------------------------------
>  1 file changed, 3 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
> index 7c97a4c..a5c4df0 100644
> --- a/drivers/hwmon/occ/common.c
> +++ b/drivers/hwmon/occ/common.c
> @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev,
>                 val = get_unaligned_be32(&temp->sensor_id);
>                 break;
>         case 1:
> -               val = temp->value;
> -               if (val == OCC_TEMP_SENSOR_FAULT)
> -                       return -EREMOTEIO;
> -
> -               if (temp->fru_type != OCC_FRU_TYPE_VRM) {
> -                       /* sensor not ready */
> -                       if (val == 0)
> -                               return -EAGAIN;
> -
> -                       val *= 1000;    /* millidegrees */
> -               }
> +               val = temp->value * 1000;
>                 break;
>         case 2:
>                 val = temp->fru_type;
>                 break;
> -       case 3:
> -               val = temp->value == OCC_TEMP_SENSOR_FAULT;
> -               break;
>         default:
>                 return -EINVAL;
>         }
> @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>                 num_attrs += (sensors->temp.num_sensors * 2);
>                 break;
>         case 2:
> -               num_attrs += (sensors->temp.num_sensors * 4);
> +               num_attrs += (sensors->temp.num_sensors * 3);
>                 show_temp = occ_show_temp_2;
>                 break;
>         default:
> @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>
>         for (i = 0; i < sensors->temp.num_sensors; ++i) {
>                 s = i + 1;
> -               temp = ((struct temp_sensor_2 *)sensors->temp.data) + i;
>
>                 snprintf(attr->name, sizeof(attr->name), "temp%d_label", s);
>                 attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>                                              0, i);
>                 attr++;
>
> -               if (sensors->temp.version > 1 &&
> -                   temp->fru_type == OCC_FRU_TYPE_VRM) {
> -                       snprintf(attr->name, sizeof(attr->name),
> -                                "temp%d_alarm", s);
> -               } else {
> -                       snprintf(attr->name, sizeof(attr->name),
> -                                "temp%d_input", s);
> -               }
> -
> +               snprintf(attr->name, sizeof(attr->name), "temp%d_input", s);
>                 attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>                                              1, i);
>                 attr++;
> @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>                                                      show_temp, NULL, 2, i);
>                         attr++;
>
> -                       snprintf(attr->name, sizeof(attr->name),
> -                                "temp%d_fault", s);
> -                       attr->sensor = OCC_INIT_ATTR(attr->name, 0444,
> -                                                    show_temp, NULL, 3, i);
> -                       attr++;
>                 }
>         }
>
> --
> 1.8.3.1
>
Matt Spinler March 8, 2018, 7:15 p.m. | #2
On 3/1/2018 6:39 PM, Joel Stanley wrote:
> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> wrote:
>> From: "Edward A. James" <eajames@us.ibm.com>
>>
>> This removes the EAGAIN return for "sensor not ready" and the temperature
>> fault and vrm temp alarms attributes.
>>
>> Apologies for pushing this up so early... Userspace applications aren't
>> ready for this change. The hwmon polling application cannot accept EGAIN
>> yet, and we can't be returning apparent errors if the sensor is
>> temporarily unavailable.
> 
> Okay. Brad's email is from October 5th, 2017. That's 148 days ago.
> 
> Any ETA on when the userspace plans to get some love?
> 

We have 4 issues in the backlog to address the OCC GPU sensors returning
either a 0 or 0xFF:
https://github.com/openbmc/openbmc/issues/2327
https://github.com/openbmc/openbmc/issues/2329
https://github.com/openbmc/openbmc/issues/2222
https://github.com/openbmc/openbmc/issues/2223

As fan control is already doing the right thing when it reads the values
as they are now, we deprioritized the work quite a bit so it still
isn't scheduled and may not be until later systems.


> Cheers,
> 
> Joel
> 
> 
>>
>> Brad Bishop <bradleyb@fuzziesquirrel.com> wrote:
>>> userspace retries for a configurable period on eio, etimedout,
>>> ebadmsg, eagain, and exnio.  It also exits cleanly on enoent.
>>>
>>> If any of these errnos don’t go away after the configured number of retry
>>> attempts, an error is logged.  Any other errnos and an error is logged
>>> immediately. If this list needs improvement I’d love to hear about it.
>>>
>>> ebadmsg and enxio are observed when i2c devices are unplugged with transfers
>>> in various stages of flight.  They occur just before the driver is unbound
>>> after the presence gpio toggle on these devices is noticed and processed
>>> (killing the hwmon userspace daemon cleanly).
>>>
>>> eio, etimedout seem to be bus type errors that appear somewhat infrequently,
>>> but occur nevertheless.
>>>
>>> we all know what eagain means...
>>>
>>> It isn’t so much that the hwmon userspace doesn’t support this change,
>>> its just the resulting behavior difference at the other end of the
>>> hwmon abi <-> dbus api translation is not optimal, right now.
>>>
>>> 1 - Without this change, the hwmon userspace reads a value of 0 out of
>>> the sysfs attribute and happily reports that as the temp at the DBus level.
>>>
>>> With this change, the hwmon userspace would read the attribute, get an
>>> error and retry for a bit.  After a number of retries, the error is
>>> logged and the sensor is marked faulted at a dbus level.
>>>
>>> The issue is the consumer of the sensor at the dbus level.  Today, the
>>> dbus consumer happens to translate the sensor value of zero into the
>>> desired behavior.  If we submit this change the sensor will be put
>>> into faulted state and the consuming application doesn’t have support
>>> for that yet.
>>>
>>> We have open issues to enhance the application to support faulted sensors,
>>> its just not implemented yet.
>>
>> OpenBMC-Staging-Count: 1
>> Signed-off-by: Eddie James <eajames@us.ibm.com>
>> Signed-off-by: Joel Stanley <joel@jms.id.au>
>> ---
>>   drivers/hwmon/occ/common.c | 33 +++------------------------------
>>   1 file changed, 3 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
>> index 7c97a4c..a5c4df0 100644
>> --- a/drivers/hwmon/occ/common.c
>> +++ b/drivers/hwmon/occ/common.c
>> @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev,
>>                  val = get_unaligned_be32(&temp->sensor_id);
>>                  break;
>>          case 1:
>> -               val = temp->value;
>> -               if (val == OCC_TEMP_SENSOR_FAULT)
>> -                       return -EREMOTEIO;
>> -
>> -               if (temp->fru_type != OCC_FRU_TYPE_VRM) {
>> -                       /* sensor not ready */
>> -                       if (val == 0)
>> -                               return -EAGAIN;
>> -
>> -                       val *= 1000;    /* millidegrees */
>> -               }
>> +               val = temp->value * 1000;
>>                  break;
>>          case 2:
>>                  val = temp->fru_type;
>>                  break;
>> -       case 3:
>> -               val = temp->value == OCC_TEMP_SENSOR_FAULT;
>> -               break;
>>          default:
>>                  return -EINVAL;
>>          }
>> @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>                  num_attrs += (sensors->temp.num_sensors * 2);
>>                  break;
>>          case 2:
>> -               num_attrs += (sensors->temp.num_sensors * 4);
>> +               num_attrs += (sensors->temp.num_sensors * 3);
>>                  show_temp = occ_show_temp_2;
>>                  break;
>>          default:
>> @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>
>>          for (i = 0; i < sensors->temp.num_sensors; ++i) {
>>                  s = i + 1;
>> -               temp = ((struct temp_sensor_2 *)sensors->temp.data) + i;
>>
>>                  snprintf(attr->name, sizeof(attr->name), "temp%d_label", s);
>>                  attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>>                                               0, i);
>>                  attr++;
>>
>> -               if (sensors->temp.version > 1 &&
>> -                   temp->fru_type == OCC_FRU_TYPE_VRM) {
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_alarm", s);
>> -               } else {
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_input", s);
>> -               }
>> -
>> +               snprintf(attr->name, sizeof(attr->name), "temp%d_input", s);
>>                  attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
>>                                               1, i);
>>                  attr++;
>> @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ)
>>                                                       show_temp, NULL, 2, i);
>>                          attr++;
>>
>> -                       snprintf(attr->name, sizeof(attr->name),
>> -                                "temp%d_fault", s);
>> -                       attr->sensor = OCC_INIT_ATTR(attr->name, 0444,
>> -                                                    show_temp, NULL, 3, i);
>> -                       attr++;
>>                  }
>>          }
>>
>> --
>> 1.8.3.1
>>
>
Joel Stanley March 13, 2018, 3:26 a.m. | #3
On Fri, Mar 9, 2018 at 5:45 AM, Matt Spinler
<mspinler@linux.vnet.ibm.com> wrote:
>
>
> On 3/1/2018 6:39 PM, Joel Stanley wrote:
>>
>> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com>
>> wrote:
>>>
>>> From: "Edward A. James" <eajames@us.ibm.com>
>>>
>>> This removes the EAGAIN return for "sensor not ready" and the temperature
>>> fault and vrm temp alarms attributes.
>>>
>>> Apologies for pushing this up so early... Userspace applications aren't
>>> ready for this change. The hwmon polling application cannot accept EGAIN
>>> yet, and we can't be returning apparent errors if the sensor is
>>> temporarily unavailable.
>>
>>
>> Okay. Brad's email is from October 5th, 2017. That's 148 days ago.
>>
>> Any ETA on when the userspace plans to get some love?
>>
>
> We have 4 issues in the backlog to address the OCC GPU sensors returning
> either a 0 or 0xFF:
> https://github.com/openbmc/openbmc/issues/2327
> https://github.com/openbmc/openbmc/issues/2329
> https://github.com/openbmc/openbmc/issues/2222
> https://github.com/openbmc/openbmc/issues/2223
>
> As fan control is already doing the right thing when it reads the values
> as they are now, we deprioritized the work quite a bit so it still
> isn't scheduled and may not be until later systems.

Okay. Lets get that work scheduled ASAP, as it's already been quite
some time since we needed that fixed. I think it's now holding up the
Witherspoon team from moving to the new kernel.

Do you have an ETA?

Cheers,

Joel
Matt Spinler March 13, 2018, 9:19 p.m. | #4
On 3/12/2018 10:26 PM, Joel Stanley wrote:
> On Fri, Mar 9, 2018 at 5:45 AM, Matt Spinler
> <mspinler@linux.vnet.ibm.com> wrote:
>>
>>
>> On 3/1/2018 6:39 PM, Joel Stanley wrote:
>>>
>>> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com>
>>> wrote:
>>>>
>>>> From: "Edward A. James" <eajames@us.ibm.com>
>>>>
>>>> This removes the EAGAIN return for "sensor not ready" and the temperature
>>>> fault and vrm temp alarms attributes.
>>>>
>>>> Apologies for pushing this up so early... Userspace applications aren't
>>>> ready for this change. The hwmon polling application cannot accept EGAIN
>>>> yet, and we can't be returning apparent errors if the sensor is
>>>> temporarily unavailable.
>>>
>>>
>>> Okay. Brad's email is from October 5th, 2017. That's 148 days ago.
>>>
>>> Any ETA on when the userspace plans to get some love?
>>>
>>
>> We have 4 issues in the backlog to address the OCC GPU sensors returning
>> either a 0 or 0xFF:
>> https://github.com/openbmc/openbmc/issues/2327
>> https://github.com/openbmc/openbmc/issues/2329
>> https://github.com/openbmc/openbmc/issues/2222
>> https://github.com/openbmc/openbmc/issues/2223
>>
>> As fan control is already doing the right thing when it reads the values
>> as they are now, we deprioritized the work quite a bit so it still
>> isn't scheduled and may not be until later systems.
> 
> Okay. Lets get that work scheduled ASAP, as it's already been quite
> some time since we needed that fixed. I think it's now holding up the
> Witherspoon team from moving to the new kernel.
> 
> Do you have an ETA?

We're going to try to start work on it next sprint, so done in about a 
month give or take.

> 
> Cheers,
> 
> Joel
>

Patch

diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c
index 7c97a4c..a5c4df0 100644
--- a/drivers/hwmon/occ/common.c
+++ b/drivers/hwmon/occ/common.c
@@ -304,24 +304,11 @@  static ssize_t occ_show_temp_2(struct device *dev,
 		val = get_unaligned_be32(&temp->sensor_id);
 		break;
 	case 1:
-		val = temp->value;
-		if (val == OCC_TEMP_SENSOR_FAULT)
-			return -EREMOTEIO;
-
-		if (temp->fru_type != OCC_FRU_TYPE_VRM) {
-			/* sensor not ready */
-			if (val == 0)
-				return -EAGAIN;
-
-			val *= 1000;	/* millidegrees */
-		}
+		val = temp->value * 1000;
 		break;
 	case 2:
 		val = temp->fru_type;
 		break;
-	case 3:
-		val = temp->value == OCC_TEMP_SENSOR_FAULT;
-		break;
 	default:
 		return -EINVAL;
 	}
@@ -790,7 +777,7 @@  static int occ_setup_sensor_attrs(struct occ *occ)
 		num_attrs += (sensors->temp.num_sensors * 2);
 		break;
 	case 2:
-		num_attrs += (sensors->temp.num_sensors * 4);
+		num_attrs += (sensors->temp.num_sensors * 3);
 		show_temp = occ_show_temp_2;
 		break;
 	default:
@@ -863,22 +850,13 @@  static int occ_setup_sensor_attrs(struct occ *occ)
 
 	for (i = 0; i < sensors->temp.num_sensors; ++i) {
 		s = i + 1;
-		temp = ((struct temp_sensor_2 *)sensors->temp.data) + i;
 
 		snprintf(attr->name, sizeof(attr->name), "temp%d_label", s);
 		attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
 					     0, i);
 		attr++;
 
-		if (sensors->temp.version > 1 &&
-		    temp->fru_type == OCC_FRU_TYPE_VRM) {
-			snprintf(attr->name, sizeof(attr->name),
-				 "temp%d_alarm", s);
-		} else {
-			snprintf(attr->name, sizeof(attr->name),
-				 "temp%d_input", s);
-		}
-
+		snprintf(attr->name, sizeof(attr->name), "temp%d_input", s);
 		attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL,
 					     1, i);
 		attr++;
@@ -890,11 +868,6 @@  static int occ_setup_sensor_attrs(struct occ *occ)
 						     show_temp, NULL, 2, i);
 			attr++;
 
-			snprintf(attr->name, sizeof(attr->name),
-				 "temp%d_fault", s);
-			attr->sensor = OCC_INIT_ATTR(attr->name, 0444,
-						     show_temp, NULL, 3, i);
-			attr++;
 		}
 	}