Message ID | 1519944259-27898-2-git-send-email-eajames@linux.vnet.ibm.com |
---|---|
State | Rejected, archived |
Headers | show |
Series | hwmon: (occ): Port fixes to 4.13 | expand |
On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> wrote: > From: "Edward A. James" <eajames@us.ibm.com> > > This removes the EAGAIN return for "sensor not ready" and the temperature > fault and vrm temp alarms attributes. > > Apologies for pushing this up so early... Userspace applications aren't > ready for this change. The hwmon polling application cannot accept EGAIN > yet, and we can't be returning apparent errors if the sensor is > temporarily unavailable. Okay. Brad's email is from October 5th, 2017. That's 148 days ago. Any ETA on when the userspace plans to get some love? Cheers, Joel > > Brad Bishop <bradleyb@fuzziesquirrel.com> wrote: >> userspace retries for a configurable period on eio, etimedout, >> ebadmsg, eagain, and exnio. It also exits cleanly on enoent. >> >> If any of these errnos don’t go away after the configured number of retry >> attempts, an error is logged. Any other errnos and an error is logged >> immediately. If this list needs improvement I’d love to hear about it. >> >> ebadmsg and enxio are observed when i2c devices are unplugged with transfers >> in various stages of flight. They occur just before the driver is unbound >> after the presence gpio toggle on these devices is noticed and processed >> (killing the hwmon userspace daemon cleanly). >> >> eio, etimedout seem to be bus type errors that appear somewhat infrequently, >> but occur nevertheless. >> >> we all know what eagain means... >> >> It isn’t so much that the hwmon userspace doesn’t support this change, >> its just the resulting behavior difference at the other end of the >> hwmon abi <-> dbus api translation is not optimal, right now. >> >> 1 - Without this change, the hwmon userspace reads a value of 0 out of >> the sysfs attribute and happily reports that as the temp at the DBus level. >> >> With this change, the hwmon userspace would read the attribute, get an >> error and retry for a bit. After a number of retries, the error is >> logged and the sensor is marked faulted at a dbus level. >> >> The issue is the consumer of the sensor at the dbus level. Today, the >> dbus consumer happens to translate the sensor value of zero into the >> desired behavior. If we submit this change the sensor will be put >> into faulted state and the consuming application doesn’t have support >> for that yet. >> >> We have open issues to enhance the application to support faulted sensors, >> its just not implemented yet. > > OpenBMC-Staging-Count: 1 > Signed-off-by: Eddie James <eajames@us.ibm.com> > Signed-off-by: Joel Stanley <joel@jms.id.au> > --- > drivers/hwmon/occ/common.c | 33 +++------------------------------ > 1 file changed, 3 insertions(+), 30 deletions(-) > > diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c > index 7c97a4c..a5c4df0 100644 > --- a/drivers/hwmon/occ/common.c > +++ b/drivers/hwmon/occ/common.c > @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev, > val = get_unaligned_be32(&temp->sensor_id); > break; > case 1: > - val = temp->value; > - if (val == OCC_TEMP_SENSOR_FAULT) > - return -EREMOTEIO; > - > - if (temp->fru_type != OCC_FRU_TYPE_VRM) { > - /* sensor not ready */ > - if (val == 0) > - return -EAGAIN; > - > - val *= 1000; /* millidegrees */ > - } > + val = temp->value * 1000; > break; > case 2: > val = temp->fru_type; > break; > - case 3: > - val = temp->value == OCC_TEMP_SENSOR_FAULT; > - break; > default: > return -EINVAL; > } > @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ) > num_attrs += (sensors->temp.num_sensors * 2); > break; > case 2: > - num_attrs += (sensors->temp.num_sensors * 4); > + num_attrs += (sensors->temp.num_sensors * 3); > show_temp = occ_show_temp_2; > break; > default: > @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ) > > for (i = 0; i < sensors->temp.num_sensors; ++i) { > s = i + 1; > - temp = ((struct temp_sensor_2 *)sensors->temp.data) + i; > > snprintf(attr->name, sizeof(attr->name), "temp%d_label", s); > attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, > 0, i); > attr++; > > - if (sensors->temp.version > 1 && > - temp->fru_type == OCC_FRU_TYPE_VRM) { > - snprintf(attr->name, sizeof(attr->name), > - "temp%d_alarm", s); > - } else { > - snprintf(attr->name, sizeof(attr->name), > - "temp%d_input", s); > - } > - > + snprintf(attr->name, sizeof(attr->name), "temp%d_input", s); > attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, > 1, i); > attr++; > @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ) > show_temp, NULL, 2, i); > attr++; > > - snprintf(attr->name, sizeof(attr->name), > - "temp%d_fault", s); > - attr->sensor = OCC_INIT_ATTR(attr->name, 0444, > - show_temp, NULL, 3, i); > - attr++; > } > } > > -- > 1.8.3.1 >
On 3/1/2018 6:39 PM, Joel Stanley wrote: > On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> wrote: >> From: "Edward A. James" <eajames@us.ibm.com> >> >> This removes the EAGAIN return for "sensor not ready" and the temperature >> fault and vrm temp alarms attributes. >> >> Apologies for pushing this up so early... Userspace applications aren't >> ready for this change. The hwmon polling application cannot accept EGAIN >> yet, and we can't be returning apparent errors if the sensor is >> temporarily unavailable. > > Okay. Brad's email is from October 5th, 2017. That's 148 days ago. > > Any ETA on when the userspace plans to get some love? > We have 4 issues in the backlog to address the OCC GPU sensors returning either a 0 or 0xFF: https://github.com/openbmc/openbmc/issues/2327 https://github.com/openbmc/openbmc/issues/2329 https://github.com/openbmc/openbmc/issues/2222 https://github.com/openbmc/openbmc/issues/2223 As fan control is already doing the right thing when it reads the values as they are now, we deprioritized the work quite a bit so it still isn't scheduled and may not be until later systems. > Cheers, > > Joel > > >> >> Brad Bishop <bradleyb@fuzziesquirrel.com> wrote: >>> userspace retries for a configurable period on eio, etimedout, >>> ebadmsg, eagain, and exnio. It also exits cleanly on enoent. >>> >>> If any of these errnos don’t go away after the configured number of retry >>> attempts, an error is logged. Any other errnos and an error is logged >>> immediately. If this list needs improvement I’d love to hear about it. >>> >>> ebadmsg and enxio are observed when i2c devices are unplugged with transfers >>> in various stages of flight. They occur just before the driver is unbound >>> after the presence gpio toggle on these devices is noticed and processed >>> (killing the hwmon userspace daemon cleanly). >>> >>> eio, etimedout seem to be bus type errors that appear somewhat infrequently, >>> but occur nevertheless. >>> >>> we all know what eagain means... >>> >>> It isn’t so much that the hwmon userspace doesn’t support this change, >>> its just the resulting behavior difference at the other end of the >>> hwmon abi <-> dbus api translation is not optimal, right now. >>> >>> 1 - Without this change, the hwmon userspace reads a value of 0 out of >>> the sysfs attribute and happily reports that as the temp at the DBus level. >>> >>> With this change, the hwmon userspace would read the attribute, get an >>> error and retry for a bit. After a number of retries, the error is >>> logged and the sensor is marked faulted at a dbus level. >>> >>> The issue is the consumer of the sensor at the dbus level. Today, the >>> dbus consumer happens to translate the sensor value of zero into the >>> desired behavior. If we submit this change the sensor will be put >>> into faulted state and the consuming application doesn’t have support >>> for that yet. >>> >>> We have open issues to enhance the application to support faulted sensors, >>> its just not implemented yet. >> >> OpenBMC-Staging-Count: 1 >> Signed-off-by: Eddie James <eajames@us.ibm.com> >> Signed-off-by: Joel Stanley <joel@jms.id.au> >> --- >> drivers/hwmon/occ/common.c | 33 +++------------------------------ >> 1 file changed, 3 insertions(+), 30 deletions(-) >> >> diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c >> index 7c97a4c..a5c4df0 100644 >> --- a/drivers/hwmon/occ/common.c >> +++ b/drivers/hwmon/occ/common.c >> @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev, >> val = get_unaligned_be32(&temp->sensor_id); >> break; >> case 1: >> - val = temp->value; >> - if (val == OCC_TEMP_SENSOR_FAULT) >> - return -EREMOTEIO; >> - >> - if (temp->fru_type != OCC_FRU_TYPE_VRM) { >> - /* sensor not ready */ >> - if (val == 0) >> - return -EAGAIN; >> - >> - val *= 1000; /* millidegrees */ >> - } >> + val = temp->value * 1000; >> break; >> case 2: >> val = temp->fru_type; >> break; >> - case 3: >> - val = temp->value == OCC_TEMP_SENSOR_FAULT; >> - break; >> default: >> return -EINVAL; >> } >> @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ) >> num_attrs += (sensors->temp.num_sensors * 2); >> break; >> case 2: >> - num_attrs += (sensors->temp.num_sensors * 4); >> + num_attrs += (sensors->temp.num_sensors * 3); >> show_temp = occ_show_temp_2; >> break; >> default: >> @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ) >> >> for (i = 0; i < sensors->temp.num_sensors; ++i) { >> s = i + 1; >> - temp = ((struct temp_sensor_2 *)sensors->temp.data) + i; >> >> snprintf(attr->name, sizeof(attr->name), "temp%d_label", s); >> attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, >> 0, i); >> attr++; >> >> - if (sensors->temp.version > 1 && >> - temp->fru_type == OCC_FRU_TYPE_VRM) { >> - snprintf(attr->name, sizeof(attr->name), >> - "temp%d_alarm", s); >> - } else { >> - snprintf(attr->name, sizeof(attr->name), >> - "temp%d_input", s); >> - } >> - >> + snprintf(attr->name, sizeof(attr->name), "temp%d_input", s); >> attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, >> 1, i); >> attr++; >> @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ) >> show_temp, NULL, 2, i); >> attr++; >> >> - snprintf(attr->name, sizeof(attr->name), >> - "temp%d_fault", s); >> - attr->sensor = OCC_INIT_ATTR(attr->name, 0444, >> - show_temp, NULL, 3, i); >> - attr++; >> } >> } >> >> -- >> 1.8.3.1 >> >
On Fri, Mar 9, 2018 at 5:45 AM, Matt Spinler <mspinler@linux.vnet.ibm.com> wrote: > > > On 3/1/2018 6:39 PM, Joel Stanley wrote: >> >> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> >> wrote: >>> >>> From: "Edward A. James" <eajames@us.ibm.com> >>> >>> This removes the EAGAIN return for "sensor not ready" and the temperature >>> fault and vrm temp alarms attributes. >>> >>> Apologies for pushing this up so early... Userspace applications aren't >>> ready for this change. The hwmon polling application cannot accept EGAIN >>> yet, and we can't be returning apparent errors if the sensor is >>> temporarily unavailable. >> >> >> Okay. Brad's email is from October 5th, 2017. That's 148 days ago. >> >> Any ETA on when the userspace plans to get some love? >> > > We have 4 issues in the backlog to address the OCC GPU sensors returning > either a 0 or 0xFF: > https://github.com/openbmc/openbmc/issues/2327 > https://github.com/openbmc/openbmc/issues/2329 > https://github.com/openbmc/openbmc/issues/2222 > https://github.com/openbmc/openbmc/issues/2223 > > As fan control is already doing the right thing when it reads the values > as they are now, we deprioritized the work quite a bit so it still > isn't scheduled and may not be until later systems. Okay. Lets get that work scheduled ASAP, as it's already been quite some time since we needed that fixed. I think it's now holding up the Witherspoon team from moving to the new kernel. Do you have an ETA? Cheers, Joel
On 3/12/2018 10:26 PM, Joel Stanley wrote: > On Fri, Mar 9, 2018 at 5:45 AM, Matt Spinler > <mspinler@linux.vnet.ibm.com> wrote: >> >> >> On 3/1/2018 6:39 PM, Joel Stanley wrote: >>> >>> On Fri, Mar 2, 2018 at 9:14 AM, Eddie James <eajames@linux.vnet.ibm.com> >>> wrote: >>>> >>>> From: "Edward A. James" <eajames@us.ibm.com> >>>> >>>> This removes the EAGAIN return for "sensor not ready" and the temperature >>>> fault and vrm temp alarms attributes. >>>> >>>> Apologies for pushing this up so early... Userspace applications aren't >>>> ready for this change. The hwmon polling application cannot accept EGAIN >>>> yet, and we can't be returning apparent errors if the sensor is >>>> temporarily unavailable. >>> >>> >>> Okay. Brad's email is from October 5th, 2017. That's 148 days ago. >>> >>> Any ETA on when the userspace plans to get some love? >>> >> >> We have 4 issues in the backlog to address the OCC GPU sensors returning >> either a 0 or 0xFF: >> https://github.com/openbmc/openbmc/issues/2327 >> https://github.com/openbmc/openbmc/issues/2329 >> https://github.com/openbmc/openbmc/issues/2222 >> https://github.com/openbmc/openbmc/issues/2223 >> >> As fan control is already doing the right thing when it reads the values >> as they are now, we deprioritized the work quite a bit so it still >> isn't scheduled and may not be until later systems. > > Okay. Lets get that work scheduled ASAP, as it's already been quite > some time since we needed that fixed. I think it's now holding up the > Witherspoon team from moving to the new kernel. > > Do you have an ETA? We're going to try to start work on it next sprint, so done in about a month give or take. > > Cheers, > > Joel >
diff --git a/drivers/hwmon/occ/common.c b/drivers/hwmon/occ/common.c index 7c97a4c..a5c4df0 100644 --- a/drivers/hwmon/occ/common.c +++ b/drivers/hwmon/occ/common.c @@ -304,24 +304,11 @@ static ssize_t occ_show_temp_2(struct device *dev, val = get_unaligned_be32(&temp->sensor_id); break; case 1: - val = temp->value; - if (val == OCC_TEMP_SENSOR_FAULT) - return -EREMOTEIO; - - if (temp->fru_type != OCC_FRU_TYPE_VRM) { - /* sensor not ready */ - if (val == 0) - return -EAGAIN; - - val *= 1000; /* millidegrees */ - } + val = temp->value * 1000; break; case 2: val = temp->fru_type; break; - case 3: - val = temp->value == OCC_TEMP_SENSOR_FAULT; - break; default: return -EINVAL; } @@ -790,7 +777,7 @@ static int occ_setup_sensor_attrs(struct occ *occ) num_attrs += (sensors->temp.num_sensors * 2); break; case 2: - num_attrs += (sensors->temp.num_sensors * 4); + num_attrs += (sensors->temp.num_sensors * 3); show_temp = occ_show_temp_2; break; default: @@ -863,22 +850,13 @@ static int occ_setup_sensor_attrs(struct occ *occ) for (i = 0; i < sensors->temp.num_sensors; ++i) { s = i + 1; - temp = ((struct temp_sensor_2 *)sensors->temp.data) + i; snprintf(attr->name, sizeof(attr->name), "temp%d_label", s); attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, 0, i); attr++; - if (sensors->temp.version > 1 && - temp->fru_type == OCC_FRU_TYPE_VRM) { - snprintf(attr->name, sizeof(attr->name), - "temp%d_alarm", s); - } else { - snprintf(attr->name, sizeof(attr->name), - "temp%d_input", s); - } - + snprintf(attr->name, sizeof(attr->name), "temp%d_input", s); attr->sensor = OCC_INIT_ATTR(attr->name, 0444, show_temp, NULL, 1, i); attr++; @@ -890,11 +868,6 @@ static int occ_setup_sensor_attrs(struct occ *occ) show_temp, NULL, 2, i); attr++; - snprintf(attr->name, sizeof(attr->name), - "temp%d_fault", s); - attr->sensor = OCC_INIT_ATTR(attr->name, 0444, - show_temp, NULL, 3, i); - attr++; } }