diff mbox

[v1,i2c/for-next] i1c: i801: recover from hardware PEC errors

Message ID 1428970319-32544-1-git-send-email-ellen@cumulusnetworks.com
State Superseded
Headers show

Commit Message

Ellen Wang April 14, 2015, 12:11 a.m. UTC
On a CRC error while using hardware-supported PEC, an additional
error bit is set in the auxiliary status register.  If this bit
isn't cleared, all subsequent operations will fail, essentially
hanging the controller.

The fix is simple: clear the bit at the same time we clear
the error bits in the main status register.

Signed-off-by: Ellen Wang <ellen@cumulusnetworks.com>
---
 drivers/i2c/busses/i2c-i801.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Jean Delvare April 15, 2015, 12:08 p.m. UTC | #1
Hi Ellen,

On Mon, 13 Apr 2015 17:11:59 -0700, Ellen Wang wrote:
> On a CRC error while using hardware-supported PEC, an additional
> error bit is set in the auxiliary status register.  If this bit
> isn't cleared, all subsequent operations will fail, essentially
> hanging the controller.
> 
> The fix is simple: clear the bit at the same time we clear
> the error bits in the main status register.
> 
> Signed-off-by: Ellen Wang <ellen@cumulusnetworks.com>
> ---
>  drivers/i2c/busses/i2c-i801.c |   21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)

Thanks for the patch. I noticed the issue over a year ago and wrote a
patch on my own, but then couldn't find the time to test it and finally
forgot about it :( so I never posted it. You can read it here for
reference:
http://jdelvare.nerim.net/devel/misc/i2c-i801-07-check-pec-status.patch

I am currently working on other matters but I'll look into this ASAP. A
quick comparison between our patches suggests that yours only clears
the PEC status bit while mine also reports the error properly to the
caller, so mine might be a better working base. Maybe you could review
and/or test my patch as a replacement of yours?

Thanks,
Ellen Wang April 15, 2015, 8:46 p.m. UTC | #2
Thanks for the quick reply.  My answers are below:

On 4/15/2015 5:08 AM, Jean Delvare wrote:
> Hi Ellen,
>
> On Mon, 13 Apr 2015 17:11:59 -0700, Ellen Wang wrote:
>> On a CRC error while using hardware-supported PEC, an additional
>> error bit is set in the auxiliary status register.  If this bit
>> isn't cleared, all subsequent operations will fail, essentially
>> hanging the controller.
>>
>> The fix is simple: clear the bit at the same time we clear
>> the error bits in the main status register.
>>
>> Signed-off-by: Ellen Wang <ellen@cumulusnetworks.com>
>> ---
>>   drivers/i2c/busses/i2c-i801.c |   21 +++++++++++++++++++++
>>   1 file changed, 21 insertions(+)
>
> Thanks for the patch. I noticed the issue over a year ago and wrote a
> patch on my own, but then couldn't find the time to test it and finally
> forgot about it :( so I never posted it. You can read it here for
> reference:
> http://jdelvare.nerim.net/devel/misc/i2c-i801-07-check-pec-status.patch
>
> I am currently working on other matters but I'll look into this ASAP. A
> quick comparison between our patches suggests that yours only clears
> the PEC status bit while mine also reports the error properly to the
> caller, so mine might be a better working base. Maybe you could review
> and/or test my patch as a replacement of yours?

My version does actually report the error, because the main status 
register error bit is also set in this case, and that gets detected and 
reported.  It just doesn't doesn't return a specific errno for PEC error.

The main difference between our versions is that you check and clear the 
CRC error bit in i801_check_pre() and i801_checkpost(), while I do it in 
i801_isr().  i801_isr() is also where it checks and clears the main 
status register for errors, so it seems natural to do it also for the 
auxiliary status.

Probably what we should do is combine the two versions, especially if we 
want to return a PEC-specific errno.  I think to do that properly in the 
current structure of the code, we should save the auxiliary status in 
priv->auxsts in i801_isr(), the same way it saves priv->status.  This 
ways, isr() can clear the error in the hardware and check_post() can 
still see it and process accordingly.  What do you think?  (I actually 
thought of this, but opted for a minimal fix.)  This is more 
complicated, and functions like i801_wait_intr() return the status and 
it can't return two values easily.

If we decide to go with your version, I can certainly test it.  We have 
a machine with the right hardware combination, and it generates PEC 
errors deterministically.

By the way, I noticed a small bug in i801_transaction(). At line 391, it 
has "status = -ETIMEOUT", but "status" at that point should be the 
register value not -errno.  It probably should be "return -ETIMEOUT", 
but I'm not sure whether hardware cleanup is necessary at that point.

> Thanks,

Thank you!
--
To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jean Delvare April 17, 2015, 7:17 a.m. UTC | #3
Hi Ellen,

On Wed, 15 Apr 2015 13:46:17 -0700, Ellen Wang wrote:
> On 4/15/2015 5:08 AM, Jean Delvare wrote:
> > On Mon, 13 Apr 2015 17:11:59 -0700, Ellen Wang wrote:
> >> On a CRC error while using hardware-supported PEC, an additional
> >> error bit is set in the auxiliary status register.  If this bit
> >> isn't cleared, all subsequent operations will fail, essentially
> >> hanging the controller.
> >>
> >> The fix is simple: clear the bit at the same time we clear
> >> the error bits in the main status register.
> >>
> >> Signed-off-by: Ellen Wang <ellen@cumulusnetworks.com>
> >> ---
> >>   drivers/i2c/busses/i2c-i801.c |   21 +++++++++++++++++++++
> >>   1 file changed, 21 insertions(+)
> >
> > Thanks for the patch. I noticed the issue over a year ago and wrote a
> > patch on my own, but then couldn't find the time to test it and finally
> > forgot about it :( so I never posted it. You can read it here for
> > reference:
> > http://jdelvare.nerim.net/devel/misc/i2c-i801-07-check-pec-status.patch
> >
> > I am currently working on other matters but I'll look into this ASAP. A
> > quick comparison between our patches suggests that yours only clears
> > the PEC status bit while mine also reports the error properly to the
> > caller, so mine might be a better working base. Maybe you could review
> > and/or test my patch as a replacement of yours?
> 
> My version does actually report the error, because the main status 
> register error bit is also set in this case, and that gets detected and 
> reported.  It just doesn't doesn't return a specific errno for PEC error.

Correct, I did not remember this, sorry.

> The main difference between our versions is that you check and clear the 
> CRC error bit in i801_check_pre() and i801_checkpost(), while I do it in 
> i801_isr().  i801_isr() is also where it checks and clears the main 
> status register for errors, so it seems natural to do it also for the 
> auxiliary status.

The check in i801_check_pre() is in case the CRC error bit is set at
the time the driver is loaded. If it is, and we don't clear it, the
driver won't work (for PEC transactions or even for all transactions,
I'm not sure.) So I think it is needed either way.

At the end of the transaction, the main status register is cleared in
both i801_isr() and i801_check_post(). The latter is needed in case the
driver operates in polling (non-interrupt) mode. That was the only mode
available originally and my patch is so old that it is entirely
possible that I wrote it before interrupt support was added to the
driver. In fact that might even be the reason why I never submitted it,
it may not support interrupt mode properly.

So it is possible that the aux status register must be cleared in both
i801_isr() and i801_check_post() to be on the safe side.

> Probably what we should do is combine the two versions, especially if we 
> want to return a PEC-specific errno.  I think to do that properly in the 
> current structure of the code, we should save the auxiliary status in 
> priv->auxsts in i801_isr(), the same way it saves priv->status.  This 
> ways, isr() can clear the error in the hardware and check_post() can 
> still see it and process accordingly.  What do you think?  (I actually 
> thought of this, but opted for a minimal fix.)  This is more 

Sounds like a good idea, but I still believe that i801_check_post()
will also have to clear the CRC error bit, for the non-interrupt case.

> complicated, and functions like i801_wait_intr() return the status and 
> it can't return two values easily.

I don't think we actually need to deal with the aux status register
value in i801_wait_intr(). According to my notes, whenever the CRC
error bit is set, SMBHSTSTS_DEV_ERR is also set in the main status
register. So it's enough to check for that bit in i801_wait_intr(). If
you need to pass the aux register status to i801_check_post(), in the
non-interrupt case you could simply do:

	return i801_check_post(priv, status, inb_p(SMBAUXSTS(priv)));

> If we decide to go with your version, I can certainly test it.  We have 
> a machine with the right hardware combination, and it generates PEC 
> errors deterministically.

Well as I wrote above, it is possible that my patch doesn't work at all
in the interrupt case, or that it works but is racy. So most probably
we have to do as you said and combine both patches into one.

> By the way, I noticed a small bug in i801_transaction(). At line 391, it 
> has "status = -ETIMEOUT", but "status" at that point should be the 
> register value not -errno.  It probably should be "return -ETIMEOUT", 
> but I'm not sure whether hardware cleanup is necessary at that point.

I can't see any bug there. status can be either the (positive) register
value or a negative error code. It is passed to i801_check_post() which
can deal with both.
Ellen Wang April 20, 2015, 10:52 p.m. UTC | #4
On 4/17/2015 12:17 AM, Jean Delvare wrote:
> Hi Ellen,
>
> On Wed, 15 Apr 2015 13:46:17 -0700, Ellen Wang wrote:
>> On 4/15/2015 5:08 AM, Jean Delvare wrote:
>>> On Mon, 13 Apr 2015 17:11:59 -0700, Ellen Wang wrote:
>>>> On a CRC error while using hardware-supported PEC, an additional
>>>> error bit is set in the auxiliary status register.  If this bit
>>>> isn't cleared, all subsequent operations will fail, essentially
>>>> hanging the controller.
>>>>
>>>> The fix is simple: clear the bit at the same time we clear
>>>> the error bits in the main status register.
>>>>
>>>> Signed-off-by: Ellen Wang <ellen@cumulusnetworks.com>
>>>> ---
>>>>    drivers/i2c/busses/i2c-i801.c |   21 +++++++++++++++++++++
>>>>    1 file changed, 21 insertions(+)
>>>
>>> Thanks for the patch. I noticed the issue over a year ago and wrote a
>>> patch on my own, but then couldn't find the time to test it and finally
>>> forgot about it :( so I never posted it. You can read it here for
>>> reference:
>>> http://jdelvare.nerim.net/devel/misc/i2c-i801-07-check-pec-status.patch
>>>
>>> I am currently working on other matters but I'll look into this ASAP. A
>>> quick comparison between our patches suggests that yours only clears
>>> the PEC status bit while mine also reports the error properly to the
>>> caller, so mine might be a better working base. Maybe you could review
>>> and/or test my patch as a replacement of yours?
>>
>> My version does actually report the error, because the main status
>> register error bit is also set in this case, and that gets detected and
>> reported.  It just doesn't doesn't return a specific errno for PEC error.
>
> Correct, I did not remember this, sorry.
>
>> The main difference between our versions is that you check and clear the
>> CRC error bit in i801_check_pre() and i801_checkpost(), while I do it in
>> i801_isr().  i801_isr() is also where it checks and clears the main
>> status register for errors, so it seems natural to do it also for the
>> auxiliary status.
>
> The check in i801_check_pre() is in case the CRC error bit is set at
> the time the driver is loaded. If it is, and we don't clear it, the
> driver won't work (for PEC transactions or even for all transactions,
> I'm not sure.) So I think it is needed either way.

Right.  The driver doesn't work for any type of transaction if the CRC 
error bit is set.

> At the end of the transaction, the main status register is cleared in
> both i801_isr() and i801_check_post(). The latter is needed in case the
> driver operates in polling (non-interrupt) mode. That was the only mode
> available originally and my patch is so old that it is entirely
> possible that I wrote it before interrupt support was added to the
> driver. In fact that might even be the reason why I never submitted it,
> it may not support interrupt mode properly.
>
> So it is possible that the aux status register must be cleared in both
> i801_isr() and i801_check_post() to be on the safe side.
>
>> Probably what we should do is combine the two versions, especially if we
>> want to return a PEC-specific errno.  I think to do that properly in the
>> current structure of the code, we should save the auxiliary status in
>> priv->auxsts in i801_isr(), the same way it saves priv->status.  This
>> ways, isr() can clear the error in the hardware and check_post() can
>> still see it and process accordingly.  What do you think?  (I actually
>> thought of this, but opted for a minimal fix.)  This is more
>
> Sounds like a good idea, but I still believe that i801_check_post()
> will also have to clear the CRC error bit, for the non-interrupt case.

Yes.  My fix doesn't work in the polling case.

>> complicated, and functions like i801_wait_intr() return the status and
>> it can't return two values easily.
>
> I don't think we actually need to deal with the aux status register
> value in i801_wait_intr(). According to my notes, whenever the CRC
> error bit is set, SMBHSTSTS_DEV_ERR is also set in the main status
> register. So it's enough to check for that bit in i801_wait_intr(). If
> you need to pass the aux register status to i801_check_post(), in the
> non-interrupt case you could simply do:
>
> 	return i801_check_post(priv, status, inb_p(SMBAUXSTS(priv)));
>
>> If we decide to go with your version, I can certainly test it.  We have
>> a machine with the right hardware combination, and it generates PEC
>> errors deterministically.
>
> Well as I wrote above, it is possible that my patch doesn't work at all
> in the interrupt case, or that it works but is racy. So most probably
> we have to do as you said and combine both patches into one.

I will do that.

I see that I didn't understand that priv->status is only used in the 
interrupt case.

This leads to a related question:  If the driver is serialized, then the 
(status = priv->status) inside wait_event_timeout() isn't strictly 
necessary, correct?  It can just be priv->status.  I'm just want to 
double check that there's no race and I can add a priv->auxsts and not 
have to stick something like this inside the wait_event_timeout():

    (auxsts = priv->auxsts, status = priv->status)

>> By the way, I noticed a small bug in i801_transaction(). At line 391, it
>> has "status = -ETIMEOUT", but "status" at that point should be the
>> register value not -errno.  It probably should be "return -ETIMEOUT",
>> but I'm not sure whether hardware cleanup is necessary at that point.
>
> I can't see any bug there. status can be either the (positive) register
> value or a negative error code. It is passed to i801_check_post() which
> can deal with both.

You're right.  I didn't see that case in check_post().

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jean Delvare April 24, 2015, 10:08 a.m. UTC | #5
Hi Ellen,

On Mon, 20 Apr 2015 15:52:47 -0700, Ellen Wang wrote:
> This leads to a related question:  If the driver is serialized, then the 
> (status = priv->status) inside wait_event_timeout() isn't strictly 
> necessary, correct?  It can just be priv->status.  I'm just want to 
> double check that there's no race and I can add a priv->auxsts and not 
> have to stick something like this inside the wait_event_timeout():
> 
>     (auxsts = priv->auxsts, status = priv->status)

To be honest I'm not 100% certain. I think this was only a minor
optimization, but I may be wrong. Adding Daniel to Cc, he added that
code to the i2c-i801 driver. Daniel, any comment on this?

Equally mysterious to me is:

		priv->status |= status;

in i801_isr() where it would seem that a simple "=" would suffice.
Daniel Kurtz April 24, 2015, 11 a.m. UTC | #6
Hi guys,

On Fri, Apr 24, 2015 at 6:08 PM, Jean Delvare <jdelvare@suse.de> wrote:
> Hi Ellen,
>
> On Mon, 20 Apr 2015 15:52:47 -0700, Ellen Wang wrote:
>> This leads to a related question:  If the driver is serialized, then the
>> (status = priv->status) inside wait_event_timeout() isn't strictly
>> necessary, correct?  It can just be priv->status.  I'm just want to
>> double check that there's no race and I can add a priv->auxsts and not
>> have to stick something like this inside the wait_event_timeout():
>>
>>     (auxsts = priv->auxsts, status = priv->status)
>
> To be honest I'm not 100% certain. I think this was only a minor
> optimization, but I may be wrong. Adding Daniel to Cc, he added that
> code to the i2c-i801 driver. Daniel, any comment on this?
>
> Equally mysterious to me is:
>
>                 priv->status |= status;
>
> in i801_isr() where it would seem that a simple "=" would suffice.

Sorry, it has been a long time since i looked at i801 code,
I think you guys are correct.  You can probably just do:

            wait_event(priv->waitq, priv->status);
            status = priv->status;
            priv->status = 0;
            return i801_check_post(priv, status);

And, I agree with Jean in i801_isr(),
   priv->status |= status;

Could probably just be:
   priv->status = status;

-Dan

>
> --
> Jean Delvare
> SUSE L3 Support
--
To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ellen Wang April 25, 2015, 7:51 p.m. UTC | #7
Thanks!  Given that the driver is fully serialized, there really 
shouldn't be any subtleties there.

I'll update the patch after some more testing.

On 04/24/2015 04:00 AM, Daniel Kurtz wrote:
> Hi guys,
>
> On Fri, Apr 24, 2015 at 6:08 PM, Jean Delvare <jdelvare@suse.de> wrote:
>> Hi Ellen,
>>
>> On Mon, 20 Apr 2015 15:52:47 -0700, Ellen Wang wrote:
>>> This leads to a related question:  If the driver is serialized, then the
>>> (status = priv->status) inside wait_event_timeout() isn't strictly
>>> necessary, correct?  It can just be priv->status.  I'm just want to
>>> double check that there's no race and I can add a priv->auxsts and not
>>> have to stick something like this inside the wait_event_timeout():
>>>
>>>      (auxsts = priv->auxsts, status = priv->status)
>>
>> To be honest I'm not 100% certain. I think this was only a minor
>> optimization, but I may be wrong. Adding Daniel to Cc, he added that
>> code to the i2c-i801 driver. Daniel, any comment on this?
>>
>> Equally mysterious to me is:
>>
>>                  priv->status |= status;
>>
>> in i801_isr() where it would seem that a simple "=" would suffice.
>
> Sorry, it has been a long time since i looked at i801 code,
> I think you guys are correct.  You can probably just do:
>
>              wait_event(priv->waitq, priv->status);
>              status = priv->status;
>              priv->status = 0;
>              return i801_check_post(priv, status);
>
> And, I agree with Jean in i801_isr(),
>     priv->status |= status;
>
> Could probably just be:
>     priv->status = status;
>
> -Dan
>
>>
>> --
>> Jean Delvare
>> SUSE L3 Support
--
To unsubscribe from this list: send the line "unsubscribe linux-i2c" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/i2c/busses/i2c-i801.c b/drivers/i2c/busses/i2c-i801.c
index 5ecbb3f..7f3331e 100644
--- a/drivers/i2c/busses/i2c-i801.c
+++ b/drivers/i2c/busses/i2c-i801.c
@@ -129,6 +129,10 @@ 
 #define SMBAUXCTL_CRC		1
 #define SMBAUXCTL_E32B		2
 
+/* Auxiliary status register bits, ICH4+ only */
+#define SMBAUXSTS_CRCE		1
+#define SMBAUXSTS_STCO		2
+
 /* Other settings */
 #define MAX_RETRIES		400
 
@@ -164,6 +168,8 @@ 
 #define STATUS_FLAGS		(SMBHSTSTS_BYTE_DONE | SMBHSTSTS_INTR | \
 				 STATUS_ERROR_FLAGS)
 
+#define AUXSTS_ERROR_FLAGS	SMBAUXSTS_CRCE
+
 /* Older devices have their ID defined in <linux/pci_ids.h> */
 #define PCI_DEVICE_ID_INTEL_BAYTRAIL_SMBUS		0x0f12
 #define PCI_DEVICE_ID_INTEL_BRASWELL_SMBUS		0x2292
@@ -509,6 +515,21 @@  static irqreturn_t i801_isr(int irq, void *dev_id)
 		i801_isr_byte_done(priv);
 
 	/*
+	 * Clear error condition in aux status that may have come from
+	 * a PEC error.  The main status register also should be showing
+	 * an error.  That should be sufficient for reporting purposes.
+	 */
+	if (priv->features & FEATURE_SMBUS_PEC) {
+		u8 auxsts = inb_p(SMBAUXSTS(priv));
+
+		if (auxsts & AUXSTS_ERROR_FLAGS) {
+			dev_dbg(&priv->pci_dev->dev, "irq: auxsts = %02x\n",
+				auxsts);
+			outb_p(auxsts & AUXSTS_ERROR_FLAGS, SMBAUXSTS(priv));
+		}
+	}
+
+	/*
 	 * Clear irq sources and report transaction result.
 	 * ->status must be cleared before the next transaction is started.
 	 */