[v2] ipmi: looped device detection

Message ID 20180911225630.124502-1-venture@google.com
State New
Headers show
Series
  • [v2] ipmi: looped device detection
Related show

Commit Message

Patrick Venture Sept. 11, 2018, 10:56 p.m.
Try to get the device ID repeatedly during initialization before giving up.
The BMC isn't always responsive, and this allows it to be slightly flaky
during early boot.

Tested: Installed on a system with the BMC software disabled
such that it was non-responsive.  The driver correctly detected this
and gave up as expected.  Then I re-enabled the BMC software unloaded
and reloaded the driver and it was detected properly.

Signed-off-by: Patrick Venture <venture@google.com>
---
v2:
 - removed extra variable that was set but not used.
---
 drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Comments

Corey Minyard Sept. 12, 2018, 10:09 p.m. | #1
On 09/11/2018 05:56 PM, Patrick Venture wrote:
> Try to get the device ID repeatedly during initialization before giving up.
> The BMC isn't always responsive, and this allows it to be slightly flaky
> during early boot.
>
> Tested: Installed on a system with the BMC software disabled
> such that it was non-responsive.  The driver correctly detected this
> and gave up as expected.  Then I re-enabled the BMC software unloaded
> and reloaded the driver and it was detected properly.

The patch looks fine, but I wonder if this is something that is really 
valuable.
I have wondered about this before.

The question is: If the BMC is unavailable, what are the chances of it 
becoming
available by the time you do 5 attempts?  I would guess that is a pretty 
small
chance, which is why I haven't done this already.

You could have something that re-tested periodically, but there are so many
systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
try forever.  Also not really a good thing.

So I've left it to reload the driver or use the hotmod interface.

-corey

> Signed-off-by: Patrick Venture <venture@google.com>
> ---
> v2:
>   - removed extra variable that was set but not used.
> ---
>   drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
>   1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> index 90ec010bffbd..5fed96897fe8 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
>    * held, primarily to keep smi_num consistent, we only one to do these
>    * one at a time.
>    */
> +#define GET_DEVICE_ID_ATTEMPTS	5
>   static int try_smi_init(struct smi_info *new_smi)
>   {
>   	int rv = 0;
>   	int i;
>   	char *init_name = NULL;
> +	unsigned long sleep_rm;
>   
>   	pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
>   		ipmi_addr_src_to_str(new_smi->io.addr_source),
> @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
>   	 * Attempt a get device id command.  If it fails, we probably
>   	 * don't have a BMC here.
>   	 */
> -	rv = try_get_dev_id(new_smi);
> +	for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
> +		pr_info(PFX "Attempting to read BMC device ID\n");
> +		rv = try_get_dev_id(new_smi);
> +		/* If it succeeded, stop trying */
> +		if (!rv)
> +			break;
> +
> +		/* Sleep for ~0.25s before trying again instead of hammering
> +		 * the BMC.
> +		 */
> +		sleep_rm = msleep_interruptible(250);
> +		if (sleep_rm != 0) {
> +			pr_info(PFX "Find BMC interrupted\n");
> +			rv = -EINTR;
> +			goto out_err;
> +		}
> +	}
> +
> +	/* If we exited the loop above and rv is non-zero we ran out of tries.
> +	 */
>   	if (rv) {
>   		if (new_smi->io.addr_source)
>   			dev_err(new_smi->io.dev,
Patrick Venture Sept. 12, 2018, 10:54 p.m. | #2
On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
>
> On 09/11/2018 05:56 PM, Patrick Venture wrote:
> > Try to get the device ID repeatedly during initialization before giving up.
> > The BMC isn't always responsive, and this allows it to be slightly flaky
> > during early boot.
> >
> > Tested: Installed on a system with the BMC software disabled
> > such that it was non-responsive.  The driver correctly detected this
> > and gave up as expected.  Then I re-enabled the BMC software unloaded
> > and reloaded the driver and it was detected properly.
>
> The patch looks fine, but I wonder if this is something that is really
> valuable.
> I have wondered about this before.
>
> The question is: If the BMC is unavailable, what are the chances of it
> becoming
> available by the time you do 5 attempts?  I would guess that is a pretty
> small
> chance, which is why I haven't done this already.

This patch was actually critical for us to provide a reliable IPMI
interface.  The version of OpenBMC or the state of the BMC at the
point the kernel was loading was flaky, so following the example in
the BIOS source, we just re-try a few times.  We also can hold boot X
seconds until it's responding, but, this avoided some issues inherent
with that.

>
> You could have something that re-tested periodically, but there are so many
> systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
> try forever.  Also not really a good thing.

If we did a periodic check, it could check X times, but I felt going
for a simple solution was ideal -- and this idea was proved out on a
few platforms.  We have other drivers that are loaded by the kernel
(not at run-time) and they depend on IPMI, and without this patch they
would then have a non-trivial probability of failure.

>
> So I've left it to reload the driver or use the hotmod interface.
>
> -corey
>
> > Signed-off-by: Patrick Venture <venture@google.com>
> > ---
> > v2:
> >   - removed extra variable that was set but not used.
> > ---
> >   drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
> >   1 file changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > index 90ec010bffbd..5fed96897fe8 100644
> > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
> >    * held, primarily to keep smi_num consistent, we only one to do these
> >    * one at a time.
> >    */
> > +#define GET_DEVICE_ID_ATTEMPTS       5
> >   static int try_smi_init(struct smi_info *new_smi)
> >   {
> >       int rv = 0;
> >       int i;
> >       char *init_name = NULL;
> > +     unsigned long sleep_rm;
> >
> >       pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
> >               ipmi_addr_src_to_str(new_smi->io.addr_source),
> > @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
> >        * Attempt a get device id command.  If it fails, we probably
> >        * don't have a BMC here.
> >        */
> > -     rv = try_get_dev_id(new_smi);
> > +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
> > +             pr_info(PFX "Attempting to read BMC device ID\n");
> > +             rv = try_get_dev_id(new_smi);
> > +             /* If it succeeded, stop trying */
> > +             if (!rv)
> > +                     break;
> > +
> > +             /* Sleep for ~0.25s before trying again instead of hammering
> > +              * the BMC.
> > +              */
> > +             sleep_rm = msleep_interruptible(250);
> > +             if (sleep_rm != 0) {
> > +                     pr_info(PFX "Find BMC interrupted\n");
> > +                     rv = -EINTR;
> > +                     goto out_err;
> > +             }
> > +     }
> > +
> > +     /* If we exited the loop above and rv is non-zero we ran out of tries.
> > +      */
> >       if (rv) {
> >               if (new_smi->io.addr_source)
> >                       dev_err(new_smi->io.dev,
>
>
Patrick Venture Sept. 18, 2018, 6:42 p.m. | #3
On Wed, Sep 12, 2018 at 3:54 PM Patrick Venture <venture@google.com> wrote:
>
> On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
> >
> > On 09/11/2018 05:56 PM, Patrick Venture wrote:
> > > Try to get the device ID repeatedly during initialization before giving up.
> > > The BMC isn't always responsive, and this allows it to be slightly flaky
> > > during early boot.
> > >
> > > Tested: Installed on a system with the BMC software disabled
> > > such that it was non-responsive.  The driver correctly detected this
> > > and gave up as expected.  Then I re-enabled the BMC software unloaded
> > > and reloaded the driver and it was detected properly.
> >
> > The patch looks fine, but I wonder if this is something that is really
> > valuable.
> > I have wondered about this before.
> >
> > The question is: If the BMC is unavailable, what are the chances of it
> > becoming
> > available by the time you do 5 attempts?  I would guess that is a pretty
> > small
> > chance, which is why I haven't done this already.

Friendly ping.  I'd like to get a sense of whether you're likely to
accept this.  If not, it's fine, will close out patch in current
downstream rebase.

Thanks

>
> This patch was actually critical for us to provide a reliable IPMI
> interface.  The version of OpenBMC or the state of the BMC at the
> point the kernel was loading was flaky, so following the example in
> the BIOS source, we just re-try a few times.  We also can hold boot X
> seconds until it's responding, but, this avoided some issues inherent
> with that.
>
> >
> > You could have something that re-tested periodically, but there are so many
> > systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
> > try forever.  Also not really a good thing.
>
> If we did a periodic check, it could check X times, but I felt going
> for a simple solution was ideal -- and this idea was proved out on a
> few platforms.  We have other drivers that are loaded by the kernel
> (not at run-time) and they depend on IPMI, and without this patch they
> would then have a non-trivial probability of failure.
>
> >
> > So I've left it to reload the driver or use the hotmod interface.
> >
> > -corey
> >
> > > Signed-off-by: Patrick Venture <venture@google.com>
> > > ---
> > > v2:
> > >   - removed extra variable that was set but not used.
> > > ---
> > >   drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
> > >   1 file changed, 22 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> > > index 90ec010bffbd..5fed96897fe8 100644
> > > --- a/drivers/char/ipmi/ipmi_si_intf.c
> > > +++ b/drivers/char/ipmi/ipmi_si_intf.c
> > > @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
> > >    * held, primarily to keep smi_num consistent, we only one to do these
> > >    * one at a time.
> > >    */
> > > +#define GET_DEVICE_ID_ATTEMPTS       5
> > >   static int try_smi_init(struct smi_info *new_smi)
> > >   {
> > >       int rv = 0;
> > >       int i;
> > >       char *init_name = NULL;
> > > +     unsigned long sleep_rm;
> > >
> > >       pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
> > >               ipmi_addr_src_to_str(new_smi->io.addr_source),
> > > @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
> > >        * Attempt a get device id command.  If it fails, we probably
> > >        * don't have a BMC here.
> > >        */
> > > -     rv = try_get_dev_id(new_smi);
> > > +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
> > > +             pr_info(PFX "Attempting to read BMC device ID\n");
> > > +             rv = try_get_dev_id(new_smi);
> > > +             /* If it succeeded, stop trying */
> > > +             if (!rv)
> > > +                     break;
> > > +
> > > +             /* Sleep for ~0.25s before trying again instead of hammering
> > > +              * the BMC.
> > > +              */
> > > +             sleep_rm = msleep_interruptible(250);
> > > +             if (sleep_rm != 0) {
> > > +                     pr_info(PFX "Find BMC interrupted\n");
> > > +                     rv = -EINTR;
> > > +                     goto out_err;
> > > +             }
> > > +     }
> > > +
> > > +     /* If we exited the loop above and rv is non-zero we ran out of tries.
> > > +      */
> > >       if (rv) {
> > >               if (new_smi->io.addr_source)
> > >                       dev_err(new_smi->io.dev,
> >
> >
Corey Minyard Sept. 18, 2018, 9:37 p.m. | #4
On 09/18/2018 01:42 PM, Patrick Venture wrote:
> On Wed, Sep 12, 2018 at 3:54 PM Patrick Venture <venture@google.com> wrote:
>> On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
>>> On 09/11/2018 05:56 PM, Patrick Venture wrote:
>>>> Try to get the device ID repeatedly during initialization before giving up.
>>>> The BMC isn't always responsive, and this allows it to be slightly flaky
>>>> during early boot.
>>>>
>>>> Tested: Installed on a system with the BMC software disabled
>>>> such that it was non-responsive.  The driver correctly detected this
>>>> and gave up as expected.  Then I re-enabled the BMC software unloaded
>>>> and reloaded the driver and it was detected properly.
>>> The patch looks fine, but I wonder if this is something that is really
>>> valuable.
>>> I have wondered about this before.
>>>
>>> The question is: If the BMC is unavailable, what are the chances of it
>>> becoming
>>> available by the time you do 5 attempts?  I would guess that is a pretty
>>> small
>>> chance, which is why I haven't done this already.
> Friendly ping.  I'd like to get a sense of whether you're likely to
> accept this.  If not, it's fine, will close out patch in current
> downstream rebase.

I'm ok with doing this, but I lied about the patch being fine, there are 
some issue.
Well, I didn't lie, but I didn't look closely enough.

Can you use dev_xxx() instead of pr_xxx().  I know the driver isn't 
currently
consistent, but there are a number of patches I have pending to make it
better and it's a longer-term goal.

Can you make GET_DEVICE_ID_ATTEMPTS more specific, add IPMI_SI_ to
the beginning or something.

I am not sure that I'm ok with waiting up to 1.25 seconds in the init 
function.
As I mentioned before, a large number of systems have broken ACPI/SMBIOS
information, and for those it will add 1.25 seconds to the boot time of 
every
one of those systems.  That won't make me a popular guy :-).

This is a harder problem to figure out what to do.  To solve it properly 
would
mean having a timer or thread drive this, and unload the module later if
the process fails.

-corey

> Thanks
>
>> This patch was actually critical for us to provide a reliable IPMI
>> interface.  The version of OpenBMC or the state of the BMC at the
>> point the kernel was loading was flaky, so following the example in
>> the BIOS source, we just re-try a few times.  We also can hold boot X
>> seconds until it's responding, but, this avoided some issues inherent
>> with that.
>>
>>> You could have something that re-tested periodically, but there are so many
>>> systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
>>> try forever.  Also not really a good thing.
>> If we did a periodic check, it could check X times, but I felt going
>> for a simple solution was ideal -- and this idea was proved out on a
>> few platforms.  We have other drivers that are loaded by the kernel
>> (not at run-time) and they depend on IPMI, and without this patch they
>> would then have a non-trivial probability of failure.
>>
>>> So I've left it to reload the driver or use the hotmod interface.
>>>
>>> -corey
>>>
>>>> Signed-off-by: Patrick Venture <venture@google.com>
>>>> ---
>>>> v2:
>>>>    - removed extra variable that was set but not used.
>>>> ---
>>>>    drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
>>>>    1 file changed, 22 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
>>>> index 90ec010bffbd..5fed96897fe8 100644
>>>> --- a/drivers/char/ipmi/ipmi_si_intf.c
>>>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
>>>> @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
>>>>     * held, primarily to keep smi_num consistent, we only one to do these
>>>>     * one at a time.
>>>>     */
>>>> +#define GET_DEVICE_ID_ATTEMPTS       5
>>>>    static int try_smi_init(struct smi_info *new_smi)
>>>>    {
>>>>        int rv = 0;
>>>>        int i;
>>>>        char *init_name = NULL;
>>>> +     unsigned long sleep_rm;
>>>>
>>>>        pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
>>>>                ipmi_addr_src_to_str(new_smi->io.addr_source),
>>>> @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
>>>>         * Attempt a get device id command.  If it fails, we probably
>>>>         * don't have a BMC here.
>>>>         */
>>>> -     rv = try_get_dev_id(new_smi);
>>>> +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
>>>> +             pr_info(PFX "Attempting to read BMC device ID\n");
>>>> +             rv = try_get_dev_id(new_smi);
>>>> +             /* If it succeeded, stop trying */
>>>> +             if (!rv)
>>>> +                     break;
>>>> +
>>>> +             /* Sleep for ~0.25s before trying again instead of hammering
>>>> +              * the BMC.
>>>> +              */
>>>> +             sleep_rm = msleep_interruptible(250);
>>>> +             if (sleep_rm != 0) {
>>>> +                     pr_info(PFX "Find BMC interrupted\n");
>>>> +                     rv = -EINTR;
>>>> +                     goto out_err;
>>>> +             }
>>>> +     }
>>>> +
>>>> +     /* If we exited the loop above and rv is non-zero we ran out of tries.
>>>> +      */
>>>>        if (rv) {
>>>>                if (new_smi->io.addr_source)
>>>>                        dev_err(new_smi->io.dev,
>>>
Patrick Venture Sept. 19, 2018, 7:56 p.m. | #5
On Tue, Sep 18, 2018 at 2:37 PM Corey Minyard <tcminyard@gmail.com> wrote:
>
> On 09/18/2018 01:42 PM, Patrick Venture wrote:
> > On Wed, Sep 12, 2018 at 3:54 PM Patrick Venture <venture@google.com> wrote:
> >> On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
> >>> On 09/11/2018 05:56 PM, Patrick Venture wrote:
> >>>> Try to get the device ID repeatedly during initialization before giving up.
> >>>> The BMC isn't always responsive, and this allows it to be slightly flaky
> >>>> during early boot.
> >>>>
> >>>> Tested: Installed on a system with the BMC software disabled
> >>>> such that it was non-responsive.  The driver correctly detected this
> >>>> and gave up as expected.  Then I re-enabled the BMC software unloaded
> >>>> and reloaded the driver and it was detected properly.
> >>> The patch looks fine, but I wonder if this is something that is really
> >>> valuable.
> >>> I have wondered about this before.
> >>>
> >>> The question is: If the BMC is unavailable, what are the chances of it
> >>> becoming
> >>> available by the time you do 5 attempts?  I would guess that is a pretty
> >>> small
> >>> chance, which is why I haven't done this already.
> > Friendly ping.  I'd like to get a sense of whether you're likely to
> > accept this.  If not, it's fine, will close out patch in current
> > downstream rebase.
>
> I'm ok with doing this, but I lied about the patch being fine, there are
> some issue.
> Well, I didn't lie, but I didn't look closely enough.
>
> Can you use dev_xxx() instead of pr_xxx().  I know the driver isn't
> currently
> consistent, but there are a number of patches I have pending to make it
> better and it's a longer-term goal.

Ack.

>
> Can you make GET_DEVICE_ID_ATTEMPTS more specific, add IPMI_SI_ to
> the beginning or something.

Ack.

>
> I am not sure that I'm ok with waiting up to 1.25 seconds in the init
> function.
> As I mentioned before, a large number of systems have broken ACPI/SMBIOS
> information, and for those it will add 1.25 seconds to the boot time of
> every
> one of those systems.  That won't make me a popular guy :-).

Yeah, that's problematic for the systems that'll never get a valid
response.  I don't think it makes sense to gate the feature with a
configuration option, do you?

>
> This is a harder problem to figure out what to do.  To solve it properly
> would
> mean having a timer or thread drive this, and unload the module later if
> the process fails.
>
> -corey
>
> > Thanks
> >
> >> This patch was actually critical for us to provide a reliable IPMI
> >> interface.  The version of OpenBMC or the state of the BMC at the
> >> point the kernel was loading was flaky, so following the example in
> >> the BIOS source, we just re-try a few times.  We also can hold boot X
> >> seconds until it's responding, but, this avoided some issues inherent
> >> with that.
> >>
> >>> You could have something that re-tested periodically, but there are so many
> >>> systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
> >>> try forever.  Also not really a good thing.
> >> If we did a periodic check, it could check X times, but I felt going
> >> for a simple solution was ideal -- and this idea was proved out on a
> >> few platforms.  We have other drivers that are loaded by the kernel
> >> (not at run-time) and they depend on IPMI, and without this patch they
> >> would then have a non-trivial probability of failure.
> >>
> >>> So I've left it to reload the driver or use the hotmod interface.
> >>>
> >>> -corey
> >>>
> >>>> Signed-off-by: Patrick Venture <venture@google.com>
> >>>> ---
> >>>> v2:
> >>>>    - removed extra variable that was set but not used.
> >>>> ---
> >>>>    drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
> >>>>    1 file changed, 22 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> >>>> index 90ec010bffbd..5fed96897fe8 100644
> >>>> --- a/drivers/char/ipmi/ipmi_si_intf.c
> >>>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> >>>> @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
> >>>>     * held, primarily to keep smi_num consistent, we only one to do these
> >>>>     * one at a time.
> >>>>     */
> >>>> +#define GET_DEVICE_ID_ATTEMPTS       5
> >>>>    static int try_smi_init(struct smi_info *new_smi)
> >>>>    {
> >>>>        int rv = 0;
> >>>>        int i;
> >>>>        char *init_name = NULL;
> >>>> +     unsigned long sleep_rm;
> >>>>
> >>>>        pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
> >>>>                ipmi_addr_src_to_str(new_smi->io.addr_source),
> >>>> @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
> >>>>         * Attempt a get device id command.  If it fails, we probably
> >>>>         * don't have a BMC here.
> >>>>         */
> >>>> -     rv = try_get_dev_id(new_smi);
> >>>> +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
> >>>> +             pr_info(PFX "Attempting to read BMC device ID\n");
> >>>> +             rv = try_get_dev_id(new_smi);
> >>>> +             /* If it succeeded, stop trying */
> >>>> +             if (!rv)
> >>>> +                     break;
> >>>> +
> >>>> +             /* Sleep for ~0.25s before trying again instead of hammering
> >>>> +              * the BMC.
> >>>> +              */
> >>>> +             sleep_rm = msleep_interruptible(250);
> >>>> +             if (sleep_rm != 0) {
> >>>> +                     pr_info(PFX "Find BMC interrupted\n");
> >>>> +                     rv = -EINTR;
> >>>> +                     goto out_err;
> >>>> +             }
> >>>> +     }
> >>>> +
> >>>> +     /* If we exited the loop above and rv is non-zero we ran out of tries.
> >>>> +      */
> >>>>        if (rv) {
> >>>>                if (new_smi->io.addr_source)
> >>>>                        dev_err(new_smi->io.dev,
> >>>
>
Corey Minyard Sept. 19, 2018, 8:20 p.m. | #6
On 09/19/2018 02:56 PM, Patrick Venture wrote:
> On Tue, Sep 18, 2018 at 2:37 PM Corey Minyard <tcminyard@gmail.com> wrote:
>> On 09/18/2018 01:42 PM, Patrick Venture wrote:
>>> On Wed, Sep 12, 2018 at 3:54 PM Patrick Venture <venture@google.com> wrote:
>>>> On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
>>>>> On 09/11/2018 05:56 PM, Patrick Venture wrote:
>>>>>> Try to get the device ID repeatedly during initialization before giving up.
>>>>>> The BMC isn't always responsive, and this allows it to be slightly flaky
>>>>>> during early boot.
>>>>>>
>>>>>> Tested: Installed on a system with the BMC software disabled
>>>>>> such that it was non-responsive.  The driver correctly detected this
>>>>>> and gave up as expected.  Then I re-enabled the BMC software unloaded
>>>>>> and reloaded the driver and it was detected properly.
>>>>> The patch looks fine, but I wonder if this is something that is really
>>>>> valuable.
>>>>> I have wondered about this before.
>>>>>
>>>>> The question is: If the BMC is unavailable, what are the chances of it
>>>>> becoming
>>>>> available by the time you do 5 attempts?  I would guess that is a pretty
>>>>> small
>>>>> chance, which is why I haven't done this already.
>>> Friendly ping.  I'd like to get a sense of whether you're likely to
>>> accept this.  If not, it's fine, will close out patch in current
>>> downstream rebase.
>> I'm ok with doing this, but I lied about the patch being fine, there are
>> some issue.
>> Well, I didn't lie, but I didn't look closely enough.
>>
>> Can you use dev_xxx() instead of pr_xxx().  I know the driver isn't
>> currently
>> consistent, but there are a number of patches I have pending to make it
>> better and it's a longer-term goal.
> Ack.
>
>> Can you make GET_DEVICE_ID_ATTEMPTS more specific, add IPMI_SI_ to
>> the beginning or something.
> Ack.
>
>> I am not sure that I'm ok with waiting up to 1.25 seconds in the init
>> function.
>> As I mentioned before, a large number of systems have broken ACPI/SMBIOS
>> information, and for those it will add 1.25 seconds to the boot time of
>> every
>> one of those systems.  That won't make me a popular guy :-).
> Yeah, that's problematic for the systems that'll never get a valid
> response.  I don't think it makes sense to gate the feature with a
> configuration option, do you?

Yeah, a config option wouldn't really help.  You could do a module parm, 
but that's
really not very easy to use.

Pushing the detection off to a thread would solve that problem, but it 
creates its
own problems.  The driver may be in the process of detecting something, but
it may not be ready when the module load or kernel boot finishes.

I don't think this has been an issue in the past because generally the 
BMC is
integral to the boot of most systems.  So it's always operational by the 
time
the OS comes up.

So I'm not sure what to do.

-corey

>> This is a harder problem to figure out what to do.  To solve it properly
>> would
>> mean having a timer or thread drive this, and unload the module later if
>> the process fails.
>>
>> -corey
>>
>>> Thanks
>>>
>>>> This patch was actually critical for us to provide a reliable IPMI
>>>> interface.  The version of OpenBMC or the state of the BMC at the
>>>> point the kernel was loading was flaky, so following the example in
>>>> the BIOS source, we just re-try a few times.  We also can hold boot X
>>>> seconds until it's responding, but, this avoided some issues inherent
>>>> with that.
>>>>
>>>>> You could have something that re-tested periodically, but there are so many
>>>>> systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
>>>>> try forever.  Also not really a good thing.
>>>> If we did a periodic check, it could check X times, but I felt going
>>>> for a simple solution was ideal -- and this idea was proved out on a
>>>> few platforms.  We have other drivers that are loaded by the kernel
>>>> (not at run-time) and they depend on IPMI, and without this patch they
>>>> would then have a non-trivial probability of failure.
>>>>
>>>>> So I've left it to reload the driver or use the hotmod interface.
>>>>>
>>>>> -corey
>>>>>
>>>>>> Signed-off-by: Patrick Venture <venture@google.com>
>>>>>> ---
>>>>>> v2:
>>>>>>     - removed extra variable that was set but not used.
>>>>>> ---
>>>>>>     drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
>>>>>>     1 file changed, 22 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
>>>>>> index 90ec010bffbd..5fed96897fe8 100644
>>>>>> --- a/drivers/char/ipmi/ipmi_si_intf.c
>>>>>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
>>>>>> @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
>>>>>>      * held, primarily to keep smi_num consistent, we only one to do these
>>>>>>      * one at a time.
>>>>>>      */
>>>>>> +#define GET_DEVICE_ID_ATTEMPTS       5
>>>>>>     static int try_smi_init(struct smi_info *new_smi)
>>>>>>     {
>>>>>>         int rv = 0;
>>>>>>         int i;
>>>>>>         char *init_name = NULL;
>>>>>> +     unsigned long sleep_rm;
>>>>>>
>>>>>>         pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
>>>>>>                 ipmi_addr_src_to_str(new_smi->io.addr_source),
>>>>>> @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
>>>>>>          * Attempt a get device id command.  If it fails, we probably
>>>>>>          * don't have a BMC here.
>>>>>>          */
>>>>>> -     rv = try_get_dev_id(new_smi);
>>>>>> +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
>>>>>> +             pr_info(PFX "Attempting to read BMC device ID\n");
>>>>>> +             rv = try_get_dev_id(new_smi);
>>>>>> +             /* If it succeeded, stop trying */
>>>>>> +             if (!rv)
>>>>>> +                     break;
>>>>>> +
>>>>>> +             /* Sleep for ~0.25s before trying again instead of hammering
>>>>>> +              * the BMC.
>>>>>> +              */
>>>>>> +             sleep_rm = msleep_interruptible(250);
>>>>>> +             if (sleep_rm != 0) {
>>>>>> +                     pr_info(PFX "Find BMC interrupted\n");
>>>>>> +                     rv = -EINTR;
>>>>>> +                     goto out_err;
>>>>>> +             }
>>>>>> +     }
>>>>>> +
>>>>>> +     /* If we exited the loop above and rv is non-zero we ran out of tries.
>>>>>> +      */
>>>>>>         if (rv) {
>>>>>>                 if (new_smi->io.addr_source)
>>>>>>                         dev_err(new_smi->io.dev,
Patrick Venture Sept. 20, 2018, 6:08 p.m. | #7
On Wed, Sep 19, 2018 at 1:20 PM Corey Minyard <minyard@acm.org> wrote:
>
> On 09/19/2018 02:56 PM, Patrick Venture wrote:
> > On Tue, Sep 18, 2018 at 2:37 PM Corey Minyard <tcminyard@gmail.com> wrote:
> >> On 09/18/2018 01:42 PM, Patrick Venture wrote:
> >>> On Wed, Sep 12, 2018 at 3:54 PM Patrick Venture <venture@google.com> wrote:
> >>>> On Wed, Sep 12, 2018 at 3:10 PM Corey Minyard <minyard@acm.org> wrote:
> >>>>> On 09/11/2018 05:56 PM, Patrick Venture wrote:
> >>>>>> Try to get the device ID repeatedly during initialization before giving up.
> >>>>>> The BMC isn't always responsive, and this allows it to be slightly flaky
> >>>>>> during early boot.
> >>>>>>
> >>>>>> Tested: Installed on a system with the BMC software disabled
> >>>>>> such that it was non-responsive.  The driver correctly detected this
> >>>>>> and gave up as expected.  Then I re-enabled the BMC software unloaded
> >>>>>> and reloaded the driver and it was detected properly.
> >>>>> The patch looks fine, but I wonder if this is something that is really
> >>>>> valuable.
> >>>>> I have wondered about this before.
> >>>>>
> >>>>> The question is: If the BMC is unavailable, what are the chances of it
> >>>>> becoming
> >>>>> available by the time you do 5 attempts?  I would guess that is a pretty
> >>>>> small
> >>>>> chance, which is why I haven't done this already.
> >>> Friendly ping.  I'd like to get a sense of whether you're likely to
> >>> accept this.  If not, it's fine, will close out patch in current
> >>> downstream rebase.
> >> I'm ok with doing this, but I lied about the patch being fine, there are
> >> some issue.
> >> Well, I didn't lie, but I didn't look closely enough.
> >>
> >> Can you use dev_xxx() instead of pr_xxx().  I know the driver isn't
> >> currently
> >> consistent, but there are a number of patches I have pending to make it
> >> better and it's a longer-term goal.
> > Ack.
> >
> >> Can you make GET_DEVICE_ID_ATTEMPTS more specific, add IPMI_SI_ to
> >> the beginning or something.
> > Ack.
> >
> >> I am not sure that I'm ok with waiting up to 1.25 seconds in the init
> >> function.
> >> As I mentioned before, a large number of systems have broken ACPI/SMBIOS
> >> information, and for those it will add 1.25 seconds to the boot time of
> >> every
> >> one of those systems.  That won't make me a popular guy :-).
> > Yeah, that's problematic for the systems that'll never get a valid
> > response.  I don't think it makes sense to gate the feature with a
> > configuration option, do you?
>
> Yeah, a config option wouldn't really help.  You could do a module parm,
> but that's
> really not very easy to use.
>
> Pushing the detection off to a thread would solve that problem, but it
> creates its
> own problems.  The driver may be in the process of detecting something, but
> it may not be ready when the module load or kernel boot finishes.
>
> I don't think this has been an issue in the past because generally the
> BMC is
> integral to the boot of most systems.  So it's always operational by the
> time
> the OS comes up.
>
> So I'm not sure what to do.

Let's drop the patch.  I'll maintain it downstream.  Thanks for your
thoughts and time on this.

>
> -corey
>
> >> This is a harder problem to figure out what to do.  To solve it properly
> >> would
> >> mean having a timer or thread drive this, and unload the module later if
> >> the process fails.
> >>
> >> -corey
> >>
> >>> Thanks
> >>>
> >>>> This patch was actually critical for us to provide a reliable IPMI
> >>>> interface.  The version of OpenBMC or the state of the BMC at the
> >>>> point the kernel was loading was flaky, so following the example in
> >>>> the BIOS source, we just re-try a few times.  We also can hold boot X
> >>>> seconds until it's responding, but, this avoided some issues inherent
> >>>> with that.
> >>>>
> >>>>> You could have something that re-tested periodically, but there are so many
> >>>>> systems with IPMI specified in ACPI or SMBIOS that is wrong, and it would
> >>>>> try forever.  Also not really a good thing.
> >>>> If we did a periodic check, it could check X times, but I felt going
> >>>> for a simple solution was ideal -- and this idea was proved out on a
> >>>> few platforms.  We have other drivers that are loaded by the kernel
> >>>> (not at run-time) and they depend on IPMI, and without this patch they
> >>>> would then have a non-trivial probability of failure.
> >>>>
> >>>>> So I've left it to reload the driver or use the hotmod interface.
> >>>>>
> >>>>> -corey
> >>>>>
> >>>>>> Signed-off-by: Patrick Venture <venture@google.com>
> >>>>>> ---
> >>>>>> v2:
> >>>>>>     - removed extra variable that was set but not used.
> >>>>>> ---
> >>>>>>     drivers/char/ipmi/ipmi_si_intf.c | 23 ++++++++++++++++++++++-
> >>>>>>     1 file changed, 22 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> >>>>>> index 90ec010bffbd..5fed96897fe8 100644
> >>>>>> --- a/drivers/char/ipmi/ipmi_si_intf.c
> >>>>>> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> >>>>>> @@ -1918,11 +1918,13 @@ int ipmi_si_add_smi(struct si_sm_io *io)
> >>>>>>      * held, primarily to keep smi_num consistent, we only one to do these
> >>>>>>      * one at a time.
> >>>>>>      */
> >>>>>> +#define GET_DEVICE_ID_ATTEMPTS       5
> >>>>>>     static int try_smi_init(struct smi_info *new_smi)
> >>>>>>     {
> >>>>>>         int rv = 0;
> >>>>>>         int i;
> >>>>>>         char *init_name = NULL;
> >>>>>> +     unsigned long sleep_rm;
> >>>>>>
> >>>>>>         pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
> >>>>>>                 ipmi_addr_src_to_str(new_smi->io.addr_source),
> >>>>>> @@ -2003,7 +2005,26 @@ static int try_smi_init(struct smi_info *new_smi)
> >>>>>>          * Attempt a get device id command.  If it fails, we probably
> >>>>>>          * don't have a BMC here.
> >>>>>>          */
> >>>>>> -     rv = try_get_dev_id(new_smi);
> >>>>>> +     for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
> >>>>>> +             pr_info(PFX "Attempting to read BMC device ID\n");
> >>>>>> +             rv = try_get_dev_id(new_smi);
> >>>>>> +             /* If it succeeded, stop trying */
> >>>>>> +             if (!rv)
> >>>>>> +                     break;
> >>>>>> +
> >>>>>> +             /* Sleep for ~0.25s before trying again instead of hammering
> >>>>>> +              * the BMC.
> >>>>>> +              */
> >>>>>> +             sleep_rm = msleep_interruptible(250);
> >>>>>> +             if (sleep_rm != 0) {
> >>>>>> +                     pr_info(PFX "Find BMC interrupted\n");
> >>>>>> +                     rv = -EINTR;
> >>>>>> +                     goto out_err;
> >>>>>> +             }
> >>>>>> +     }
> >>>>>> +
> >>>>>> +     /* If we exited the loop above and rv is non-zero we ran out of tries.
> >>>>>> +      */
> >>>>>>         if (rv) {
> >>>>>>                 if (new_smi->io.addr_source)
> >>>>>>                         dev_err(new_smi->io.dev,
>
>

Patch

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 90ec010bffbd..5fed96897fe8 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -1918,11 +1918,13 @@  int ipmi_si_add_smi(struct si_sm_io *io)
  * held, primarily to keep smi_num consistent, we only one to do these
  * one at a time.
  */
+#define GET_DEVICE_ID_ATTEMPTS	5
 static int try_smi_init(struct smi_info *new_smi)
 {
 	int rv = 0;
 	int i;
 	char *init_name = NULL;
+	unsigned long sleep_rm;
 
 	pr_info(PFX "Trying %s-specified %s state machine at %s address 0x%lx, slave address 0x%x, irq %d\n",
 		ipmi_addr_src_to_str(new_smi->io.addr_source),
@@ -2003,7 +2005,26 @@  static int try_smi_init(struct smi_info *new_smi)
 	 * Attempt a get device id command.  If it fails, we probably
 	 * don't have a BMC here.
 	 */
-	rv = try_get_dev_id(new_smi);
+	for (i = 0; i < GET_DEVICE_ID_ATTEMPTS; i++) {
+		pr_info(PFX "Attempting to read BMC device ID\n");
+		rv = try_get_dev_id(new_smi);
+		/* If it succeeded, stop trying */
+		if (!rv)
+			break;
+
+		/* Sleep for ~0.25s before trying again instead of hammering
+		 * the BMC.
+		 */
+		sleep_rm = msleep_interruptible(250);
+		if (sleep_rm != 0) {
+			pr_info(PFX "Find BMC interrupted\n");
+			rv = -EINTR;
+			goto out_err;
+		}
+	}
+
+	/* If we exited the loop above and rv is non-zero we ran out of tries.
+	 */
 	if (rv) {
 		if (new_smi->io.addr_source)
 			dev_err(new_smi->io.dev,