diff mbox series

Fix Bug for cadence i2c interrupt overrunning buffer

Message ID 20170916030252.10680-1-amworsley@gmail.com
State Superseded
Headers show
Series Fix Bug for cadence i2c interrupt overrunning buffer | expand

Commit Message

Andrew Worsley Sept. 16, 2017, 3:02 a.m. UTC
The i2c interrupt handler was not checking against the length of
supplied buffer so if the hardware supplied more data than requested in
the i2c transfer it happily copies it right past the end of the buffer!

Signed-off-by: Andrew Worsley <amworsley@gmail.com>
---

  This bug reliably caused a stack corruption when the hardware provided more data than
asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
the data right past the end of the buffer and in my case across the stack. :-(

  With this patch the transfer now terminates with an -EIO but with out voilating the
buffer boundaries. I would prefer to just make the transfer succeed while not retrieving
additional bytes but I wasn't able to find an easy way to do this. As this is definitely a
fault that causes a kernel oops I just wanted to get a fix out there for others to avoid
the problem as it has taken me a few weeks to chase down this oops. If someone has a
better or nicer fix I would appreciate it or a pointer to how to do this.

    Thanks

    Andrew
 drivers/i2c/busses/i2c-cadence.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Wolfram Sang Oct. 5, 2017, 12:01 p.m. UTC | #1
On Sat, Sep 16, 2017 at 01:02:52PM +1000, Andrew Worsley wrote:
> The i2c interrupt handler was not checking against the length of
> supplied buffer so if the hardware supplied more data than requested in
> the i2c transfer it happily copies it right past the end of the buffer!
> 
> Signed-off-by: Andrew Worsley <amworsley@gmail.com>

Thank you for the bug report. I don't know the hardware in detail, but I
am a bit confused...

> ---
> 
>   This bug reliably caused a stack corruption when the hardware provided more data than
> asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
> 0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
> the data right past the end of the buffer and in my case across the stack. :-(

... because you as the master should have control over how many bytes
you want to have. If you are done, send NAK+STOP and the message is
over. Is it really the device (which one is it BTW?)? Or is the FIFO
handling of the master driver buggy?

Michal, Sören?

> 
>   With this patch the transfer now terminates with an -EIO but with out voilating the
> buffer boundaries. I would prefer to just make the transfer succeed while not retrieving
> additional bytes but I wasn't able to find an easy way to do this. As this is definitely a
> fault that causes a kernel oops I just wanted to get a fix out there for others to avoid
> the problem as it has taken me a few weeks to chase down this oops. If someone has a
> better or nicer fix I would appreciate it or a pointer to how to do this.
> 
>     Thanks
> 
>     Andrew
>  drivers/i2c/busses/i2c-cadence.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/i2c/busses/i2c-cadence.c b/drivers/i2c/busses/i2c-cadence.c
> index b13605718291..c1f61ca6843e 100644
> --- a/drivers/i2c/busses/i2c-cadence.c
> +++ b/drivers/i2c/busses/i2c-cadence.c
> @@ -234,6 +234,7 @@ static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
>  	if (id->p_recv_buf &&
>  	    ((isr_status & CDNS_I2C_IXR_COMP) ||
>  	     (isr_status & CDNS_I2C_IXR_DATA))) {
> +		unsigned char *p = id->p_recv_buf;
>  		/* Read data if receive data valid is set */
>  		while (cdns_i2c_readreg(CDNS_I2C_SR_OFFSET) &
>  		       CDNS_I2C_SR_RXDV) {
> @@ -246,6 +247,12 @@ static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
>  			    !id->bus_hold_flag)
>  				cdns_i2c_clear_bus_hold(id);
>  
> +			if (id->recv_count == 0) {
> +				pr_notice("%s: i2c receive buffer filled : %u aborting transfer %p - %p\n",
> +					__func__, (id->p_recv_buf - p));
> +				break;
> +			}
> +
>  			*(id->p_recv_buf)++ =
>  				cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
>  			id->recv_count--;
> -- 
> 2.11.0
>
Andrew Worsley Oct. 7, 2017, 5:32 a.m. UTC | #2
> Thank you for the bug report. I don't know the hardware in detail, but I
> am a bit confused...
>
>> ---
>>   This bug reliably caused a stack corruption when the hardware provided more data than
>> asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
>> 0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
>> the data right past the end of the buffer and in my case across the stack. :-(
>
> ... because you as the master should have control over how many bytes
> you want to have. If you are done, send NAK+STOP and the message is
> over. Is it really the device (which one is it BTW?)? Or is the FIFO
> handling of the master driver buggy?
>
> Michal, Sören?

Just a quick comment - as I don't have time at the moment to look
deeply into the driver at the moment.

I was just looking at Ch 20 of the TRM on the i2c under Read Transfer
and it says step 4 write the number of requested bytes to the Transfer
Size Register.

But in the code it talks about a work around for large receive operation.

Looking at the change log: commit  9fae82e1acda

Explains for transfers > 252 bytes that

     "the transfer size register has to be maintained at a value >= 1

So my guess is the code is trying to keep this transfer size non-zero
and so just keeps pulling more bytes our of the slave i2c device. I
find it is hard to follow the complex code though.

My buggy condition was a transfer of 128 bytes so it is not hit this
condition except in that the slave supplies way more bytes than was in
the original i2c_transfer request.

Andrew
Wolfram Sang Oct. 13, 2017, 6:46 p.m. UTC | #3
On Sat, Oct 07, 2017 at 04:32:29PM +1100, Andrew Worsley wrote:
> > Thank you for the bug report. I don't know the hardware in detail, but I
> > am a bit confused...
> >
> >> ---
> >>   This bug reliably caused a stack corruption when the hardware provided more data than
> >> asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
> >> 0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
> >> the data right past the end of the buffer and in my case across the stack. :-(
> >
> > ... because you as the master should have control over how many bytes
> > you want to have. If you are done, send NAK+STOP and the message is
> > over. Is it really the device (which one is it BTW?)? Or is the FIFO
> > handling of the master driver buggy?
> >
> > Michal, Sören?
> 
> Just a quick comment - as I don't have time at the moment to look
> deeply into the driver at the moment.
> 
> I was just looking at Ch 20 of the TRM on the i2c under Read Transfer
> and it says step 4 write the number of requested bytes to the Transfer
> Size Register.
> 
> But in the code it talks about a work around for large receive operation.
> 
> Looking at the change log: commit  9fae82e1acda
> 
> Explains for transfers > 252 bytes that
> 
>      "the transfer size register has to be maintained at a value >= 1
> 
> So my guess is the code is trying to keep this transfer size non-zero
> and so just keeps pulling more bytes our of the slave i2c device. I
> find it is hard to follow the complex code though.
> 
> My buggy condition was a transfer of 128 bytes so it is not hit this
> condition except in that the slave supplies way more bytes than was in
> the original i2c_transfer request.

So, let's add Harini Katakam to CC who implemented this. But I'd be
happy for anyone from Xilinx to respond...

> 
> Andrew
Michal Simek Oct. 16, 2017, 5:49 a.m. UTC | #4
Hi Wolfram,

On 13.10.2017 20:46, Wolfram Sang wrote:
> On Sat, Oct 07, 2017 at 04:32:29PM +1100, Andrew Worsley wrote:
>>> Thank you for the bug report. I don't know the hardware in detail, but I
>>> am a bit confused...
>>>
>>>> ---
>>>>   This bug reliably caused a stack corruption when the hardware provided more data than
>>>> asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
>>>> 0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
>>>> the data right past the end of the buffer and in my case across the stack. :-(
>>>
>>> ... because you as the master should have control over how many bytes
>>> you want to have. If you are done, send NAK+STOP and the message is
>>> over. Is it really the device (which one is it BTW?)? Or is the FIFO
>>> handling of the master driver buggy?
>>>
>>> Michal, Sören?
>>
>> Just a quick comment - as I don't have time at the moment to look
>> deeply into the driver at the moment.
>>
>> I was just looking at Ch 20 of the TRM on the i2c under Read Transfer
>> and it says step 4 write the number of requested bytes to the Transfer
>> Size Register.
>>
>> But in the code it talks about a work around for large receive operation.
>>
>> Looking at the change log: commit  9fae82e1acda
>>
>> Explains for transfers > 252 bytes that
>>
>>      "the transfer size register has to be maintained at a value >= 1
>>
>> So my guess is the code is trying to keep this transfer size non-zero
>> and so just keeps pulling more bytes our of the slave i2c device. I
>> find it is hard to follow the complex code though.
>>
>> My buggy condition was a transfer of 128 bytes so it is not hit this
>> condition except in that the slave supplies way more bytes than was in
>> the original i2c_transfer request.
> 
> So, let's add Harini Katakam to CC who implemented this. But I'd be
> happy for anyone from Xilinx to respond...

It is still in my inbox to look at more details or ask my colleague to
look at it. I simply didn't have a time to do it.

Thanks,
Michal
Harini Katakam Oct. 16, 2017, 5:55 a.m. UTC | #5
Hi Andrew,

On Sat, Oct 14, 2017 at 12:16 AM, Wolfram Sang <wsa@the-dreams.de> wrote:
> On Sat, Oct 07, 2017 at 04:32:29PM +1100, Andrew Worsley wrote:
>> > Thank you for the bug report. I don't know the hardware in detail, but I
>> > am a bit confused...
>> >
>> >> ---
>> >>   This bug reliably caused a stack corruption when the hardware provided more data than
>> >> asked for in the i2c request. The hardware (a tpm) very occasionally appends a burst of
>> >> 0xff to the end of the data requested and the i2c interrupt handler mindlessly copies all
>> >> the data right past the end of the buffer and in my case across the stack. :-(
>> >
>> > ... because you as the master should have control over how many bytes
>> > you want to have. If you are done, send NAK+STOP and the message is
>> > over. Is it really the device (which one is it BTW?)? Or is the FIFO
>> > handling of the master driver buggy?
>> >
>> > Michal, Sören?
>>
>> Just a quick comment - as I don't have time at the moment to look
>> deeply into the driver at the moment.
>>
>> I was just looking at Ch 20 of the TRM on the i2c under Read Transfer
>> and it says step 4 write the number of requested bytes to the Transfer
>> Size Register.
>>
>> But in the code it talks about a work around for large receive operation.
>>
>> Looking at the change log: commit  9fae82e1acda
>>
>> Explains for transfers > 252 bytes that
>>
>>      "the transfer size register has to be maintained at a value >= 1
>>
>> So my guess is the code is trying to keep this transfer size non-zero
>> and so just keeps pulling more bytes our of the slave i2c device. I
>> find it is hard to follow the complex code though.
>>
>> My buggy condition was a transfer of 128 bytes so it is not hit this
>> condition except in that the slave supplies way more bytes than was in
>> the original i2c_transfer request.
>
> So, let's add Harini Katakam to CC who implemented this. But I'd be
> happy for anyone from Xilinx to respond...

The workaround you are referring to is implemented under
"if (cdns_is_holdquirk(id, hold_quirk)) {"
and should not apply to your case of 128 byte transfer at all.

The specific place you have problem is under the loop
where RX data valid is being checked and ideally, that
buffer should never be written with more data than requested -
because no. of bytes requested is written to transfer size register,
the same is the amount the controller (master) fetches and
RX data valid remains set only for that duration.

Are you using Zynq?
I'm not working on this anymore but we could take a look at
your error if we are able to reproduce it.
Could you please provide more details of your application and
setup?

Regards,
Harini
Andrew Worsley Oct. 16, 2017, 9:18 a.m. UTC | #6
Thanks I think I know understand how this transfer register is set up.
I will add some diagnostic printk's to track down exactly how it is
failing and get back to you.

 Yes it is zynq. The test case is large (128 byte) transfers from a
tpm to fill an rng driver. I just pull large amounts of data via dd.

It may depend on the i2c slave being willing to just respond to
requests that read arbitrary amounts of data. In this case the TPM
appeared to generate FF when requested for more data than was in the
TPM_RANDOM request.



On 16 October 2017 at 16:55, Harini Katakam
<harinikatakamlinux@gmail.com> wrote:
> Hi Andrew,
>
> On Sat, Oct 14, 2017 at 12:16 AM, Wolfram Sang <wsa@the-dreams.de> wrote:
.....
>>
>> So, let's add Harini Katakam to CC who implemented this. But I'd be
>> happy for anyone from Xilinx to respond...
>
> The workaround you are referring to is implemented under
> "if (cdns_is_holdquirk(id, hold_quirk)) {"
> and should not apply to your case of 128 byte transfer at all.

Ah I can see the logic now. If the transfer is small (< 250 or so
bytes) it will just write it into the transfer register. Yes my
transfers are only 128 + some TPM related headers ( not big I expect).
....


>
> The specific place you have problem is under the loop
> where RX data valid is being checked and ideally, that
> buffer should never be written with more data than requested -
> because no. of bytes requested is written to transfer size register,
> the same is the amount the controller (master) fetches and
> RX data valid remains set only for that duration.

If the transfer register was set with the
>
> Are you using Zynq?
> I'm not working on this anymore but we could take a look at
> your error if we are able to reproduce it.
> Could you please provide more details of your application and
> setup?
>
> Regards,
> Harini

Yes - I will get back to you with more details and further debugging
around exactly how it is failing.

Assuming it is not obvious how to fix it - perhaps I can try setting
the transfer register to zero when it hits the end of the transfer?
(if recv_count == 0).

Thanks very much Harini  - that was just the pointer I needed to
follow the code.

Andrew
Harini Katakam Oct. 16, 2017, 10:38 a.m. UTC | #7
Hi Andrew,

On Mon, Oct 16, 2017 at 2:48 PM, Andrew Worsley <amworsley@gmail.com> wrote:
> Thanks I think I know understand how this transfer register is set up.
> I will add some diagnostic printk's to track down exactly how it is
> failing and get back to you.
>
>  Yes it is zynq. The test case is large (128 byte) transfers from a
> tpm to fill an rng driver. I just pull large amounts of data via dd.
>
> It may depend on the i2c slave being willing to just respond to
> requests that read arbitrary amounts of data. In this case the TPM
> appeared to generate FF when requested for more data than was in the
> TPM_RANDOM request.
>
>
>
> On 16 October 2017 at 16:55, Harini Katakam
> <harinikatakamlinux@gmail.com> wrote:
>> Hi Andrew,
>>
>> On Sat, Oct 14, 2017 at 12:16 AM, Wolfram Sang <wsa@the-dreams.de> wrote:
> .....
>>>
>>> So, let's add Harini Katakam to CC who implemented this. But I'd be
>>> happy for anyone from Xilinx to respond...
>>
>> The workaround you are referring to is implemented under
>> "if (cdns_is_holdquirk(id, hold_quirk)) {"
>> and should not apply to your case of 128 byte transfer at all.
>
> Ah I can see the logic now. If the transfer is small (< 250 or so
> bytes) it will just write it into the transfer register. Yes my
> transfers are only 128 + some TPM related headers ( not big I expect).
> ....
>
>
>>
>> The specific place you have problem is under the loop
>> where RX data valid is being checked and ideally, that
>> buffer should never be written with more data than requested -
>> because no. of bytes requested is written to transfer size register,
>> the same is the amount the controller (master) fetches and
>> RX data valid remains set only for that duration.
>
> If the transfer register was set with the

Yes, this is assuming the transfer size register was not set inittially
or later with more byted than required.

>>
>> Are you using Zynq?
>> I'm not working on this anymore but we could take a look at
>> your error if we are able to reproduce it.
>> Could you please provide more details of your application and
>> setup?
>>
>> Regards,
>> Harini
>
> Yes - I will get back to you with more details and further debugging
> around exactly how it is failing.
>
> Assuming it is not obvious how to fix it - perhaps I can try setting
> the transfer register to zero when it hits the end of the transfer?
> (if recv_count == 0).

Yes, that can be done as a workaround but ideally, the transfer
size register should have decremented to zero by this point.

>
> Thanks very much Harini  - that was just the pointer I needed to
> follow the code.

Thanks for the information - please let me know what you find.

Regards,
Harini
Andrew Worsley Oct. 18, 2017, 9:41 a.m. UTC | #8
On 16 October 2017 at 21:38, Harini Katakam
<harinikatakamlinux@gmail.com> wrote:
> Hi Andrew,

Hi Harini, I did manage to reproduce the issue and got some more info
on  the bug some more
- but I don't  really understanding the cause.

> On Mon, Oct 16, 2017 at 2:48 PM, Andrew Worsley <amworsley@gmail.com> wrote:
>> Thanks I think I know understand how this transfer register is set up.
>> I will add some diagnostic printk's to track down exactly how it is
>> failing and get back to you.


It appears to be working fine and then just fails with  the transfer
size register value set to 242.
Always the value 240 or  242 in the transfer register.

>>
>>  Yes it is zynq. The test case is large (128 byte) transfers from a
>> tpm to fill an rng driver. I just pull large amounts of data via dd.
>>

I have a lot of data so I will summarise it and I can post more
details on what people want to know.

I added printk's on each i2c operation and each write to the transfer
size register.
When the overrun is hit it prints out most of the registers, about
half the time the overflow bit is set.

Here is my last output which has the most diagnostics:
...
[   15.607821] zs2 kernel: cdns_i2c_process_msg: Rd len=142 #8
[   15.607826] zs2 kernel: cdns_i2c_mrecv:cdns_i2c_writereg(142,
CDNS_I2C_XFER_SIZE_OFFSET)
 #6
[   15.611343] zs2 kernel: cdns_i2c_process_msg: Wr len=14 #8
[   15.611777] zs2 kernel: cdns_i2c_process_msg: Rd len=35 #8
[   15.611781] zs2 kernel: cdns_i2c_mrecv:cdns_i2c_writereg(35,
CDNS_I2C_XFER_SIZE_OFFSET)
#6
[   15.611836] zs2 kernel: cdns_i2c_master_reset:cdns_i2c_writereg(0,
CDNS_I2C_XFER_SIZE_OF
FSET) #4
[   15.612847] zs2 kernel: cdns_i2c_process_msg: Rd len=35 #8
[   15.612850] zs2 kernel: cdns_i2c_mrecv:cdns_i2c_writereg(35,
CDNS_I2C_XFER_SIZE_OFFSET)
#6
[   15.612911] zs2 kernel: cdns_i2c_master_reset:cdns_i2c_writereg(0,
CDNS_I2C_XFER_SIZE_OF
FSET) #4
[   15.613924] zs2 kernel: cdns_i2c_process_msg: Rd len=35 #8
[   15.613928] zs2 kernel: cdns_i2c_mrecv:cdns_i2c_writereg(35,
CDNS_I2C_XFER_SIZE_OFFSET)
#6
[   15.614836] zs2 kernel: tpm_transmit: after req_complete_val status=0x1
[   15.614846] zs2 kernel: cdns_i2c_process_msg: Rd len=142 #8
[   15.614851] zs2 kernel: cdns_i2c_mrecv:cdns_i2c_writereg(142,
CDNS_I2C_XFER_SIZE_OFFSET)
 #6
[   15.626087] zs2 kernel: cdns_i2c_isr: buffer overrunning !!!-
transfer-size=242 ctrl=0xc0f status=0x120 int-status=0x0
[   15.626091] zs2 kernel: cdns_i2c_isr:Aborting transfer...
....

I tried to handle situation by setting the transfer register to zero
and breaking out of the loop which is copying bytes out of the FIFO.
At no time are any of the transfers bigger than 142 bytes or so as
shown above. So the driver  always sets the transfer register to the
full size.

The only writes to the transfer register are 0,  1, 35 and 142. I
never see an overly large write to the transfer register.

>>> Could you please provide more details of your application and
>>> setup?
>>>
>>> Regards,
>>> Harini
>>

I originally discovered the flaw with two of the last i2c-cadence
driver fixes not present

  Initialize-configuration-before-prob
  fix-ctrl-addr-reg-write-order

 and a fair amount of debug in the TPM driver - then the bug hit
nearly every run.
With these two i2c fixes in it occured only once per 4-5 runs.

Perhaps this information gives some one a path to investigate further?
I can't really see how this transfer size register can get confused -
especially always with 240 or 242 = 255-13
Perhaps the transfer register rolls over from zero or something?
Or Perhaps I am just looking for magic numbers?


>>
>> Assuming it is not obvious how to fix it - perhaps I can try setting
>> the transfer register to zero when it hits the end of the transfer?
>> (if recv_count == 0).
>
> Yes, that can be done as a workaround but ideally, the transfer
> size register should have decremented to zero by this point.

Setting the transfer size to zero and then doing a break on an error
is ok but causes the transfer to fail still.
Now thinking about it perhaps I should have done a continue - so the
loop would terminate more naturally.

>>
>> Thanks very much Harini  - that was just the pointer I needed to
>> follow the code.
>
> Thanks for the information - please let me know what you find.
>
> Regards,
>

Let me know if you have any other suggestions for me to try. my tpm
driver changes are not in main line so it's hard for anyone else to do
exactly that test case.
I can't get back to this till next week so I won't be able to try this
till then.
My thoughts are to remove the TPM debugging code to confirm that no
other changes are needed to reproduce this issue.
But I can't see how anything can overwrite the transfer register.

> Harini

I attach a patch of the changes to the i2c driver showing my debug in
this driver.
As I said I am happy to send more verbose bug output - I have about 36
odd runs with 9 occurences of the bug.
Once it happen twice in the one run

I include the crashing line from the runs where the bug occured (I
added more diagnostics as the runs progressed):
% grep overrunning i2c-tpm-run*
i2c-tpm-run12-bad-rev-ecbd28b9.txt:[    8.543441] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=242
i2c-tpm-run13-bad-rev-ecbd28b9.txt:[   16.193018] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=242
i2c-tpm-run18-bad-rev-ecbd28b9.txt:[   26.065783] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=240 ctrl=0xc0f
status=0xa0 int-status=0x0
i2c-tpm-run18-bad-rev-ecbd28b9.txt:[   30.272113] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=240 ctrl=0xc0f
status=0xa0 int-status=0x0
i2c-tpm-run21-bad-rev-ecbd28b9.txt:[   20.162880] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=240 ctrl=0xc0f
status=0xa0=>Overflow<= int-status=0x0
i2c-tpm-run24-bad-rev-ecbd28b9.txt:[   15.191316] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=242 ctrl=0xc0f
status=0x120 int-status=0x0
i2c-tpm-run26-bad-rev-ecbd28b9.txt:[   18.370014] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=240 ctrl=0xc0f
status=0xa0=>Overflow<= int-status=0x0
i2c-tpm-run30-bad-rev-ecbd28b9.txt:[   20.177092] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=242 ctrl=0xc0f
status=0x120 int-status=0x0
i2c-tpm-run36-bad-rev-ecbd28b9.txt:[   15.626087] zs2 kernel:
cdns_i2c_isr: buffer overrunning !!!- transfer-size=242 ctrl=0xc0f
status=0x120 int-status=0x0

I attach the diff of the changes to the i2c-cadence driver file

Suggestions appreciated.

Andrew
diff --git a/drivers/i2c/busses/i2c-cadence.c b/drivers/i2c/busses/i2c-cadence.c
index 75d80161931f..a6103c303b08 100644
--- a/drivers/i2c/busses/i2c-cadence.c
+++ b/drivers/i2c/busses/i2c-cadence.c
@@ -246,6 +246,20 @@ static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
 			    !id->bus_hold_flag)
 				cdns_i2c_clear_bus_hold(id);
 
+            if (id->recv_count == 0)
+            {
+                unsigned int st = cdns_i2c_readreg(CDNS_I2C_SR_OFFSET);
+                pr_notice("%s: buffer overrunning !!!- transfer-size=%d ctrl=%#x status=%#x%s int-status=%#x\n",
+                    __func__,
+                    cdns_i2c_readreg(CDNS_I2C_XFER_SIZE_OFFSET),
+                    cdns_i2c_readreg(CDNS_I2C_CR_OFFSET),
+                    st, st & BIT(7) ? "=>Overflow<=" : "",
+                    cdns_i2c_readreg(CDNS_I2C_ISR_OFFSET)
+                    );
+                pr_notice("%s:Aborting transfer...\n", __func__);
+                cdns_i2c_writereg(0, CDNS_I2C_XFER_SIZE_OFFSET);
+                break;
+            }
 			*(id->p_recv_buf)++ =
 				cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
 			id->recv_count--;
@@ -274,11 +288,15 @@ static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
 			 */
 			if (((int)(id->recv_count) - CDNS_I2C_FIFO_DEPTH) >
 			    CDNS_I2C_TRANSFER_SIZE) {
+                pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #1\n",
+                    __func__, CDNS_I2C_TRANSFER_SIZE);
 				cdns_i2c_writereg(CDNS_I2C_TRANSFER_SIZE,
 						  CDNS_I2C_XFER_SIZE_OFFSET);
 				id->curr_recv_count = CDNS_I2C_TRANSFER_SIZE +
 						      CDNS_I2C_FIFO_DEPTH;
 			} else {
+                pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #2\n",
+                    __func__, id->recv_count - CDNS_I2C_FIFO_DEPTH);
 				cdns_i2c_writereg(id->recv_count -
 						  CDNS_I2C_FIFO_DEPTH,
 						  CDNS_I2C_XFER_SIZE_OFFSET);
@@ -292,10 +310,14 @@ static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
 						CDNS_I2C_ADDR_OFFSET);
 
 			if (id->recv_count > CDNS_I2C_TRANSFER_SIZE) {
+                pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #3\n",
+                    __func__, CDNS_I2C_TRANSFER_SIZE);
 				cdns_i2c_writereg(CDNS_I2C_TRANSFER_SIZE,
 						CDNS_I2C_XFER_SIZE_OFFSET);
 				id->curr_recv_count = CDNS_I2C_TRANSFER_SIZE;
 			} else {
+                pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #4\n",
+                    __func__, id->recv_count);
 				cdns_i2c_writereg(id->recv_count,
 						CDNS_I2C_XFER_SIZE_OFFSET);
 				id->curr_recv_count = id->recv_count;
@@ -398,10 +420,14 @@ static void cdns_i2c_mrecv(struct cdns_i2c *id)
 	 * it is more. Enable the interrupts.
 	 */
 	if (id->recv_count > CDNS_I2C_TRANSFER_SIZE) {
+        pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #5\n",
+            __func__, CDNS_I2C_XFER_SIZE_OFFSET);
 		cdns_i2c_writereg(CDNS_I2C_TRANSFER_SIZE,
 				  CDNS_I2C_XFER_SIZE_OFFSET);
 		id->curr_recv_count = CDNS_I2C_TRANSFER_SIZE;
 	} else {
+        pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #6\n",
+            __func__, id->recv_count);
 		cdns_i2c_writereg(id->recv_count, CDNS_I2C_XFER_SIZE_OFFSET);
 	}
 
@@ -499,6 +525,8 @@ static void cdns_i2c_master_reset(struct i2c_adapter *adap)
 	regval |= CDNS_I2C_CR_CLR_FIFO;
 	cdns_i2c_writereg(regval, CDNS_I2C_CR_OFFSET);
 	/* Update the transfercount register to zero */
+    pr_info("%s:cdns_i2c_writereg(%d, CDNS_I2C_XFER_SIZE_OFFSET) #4\n",
+        __func__, 0);
 	cdns_i2c_writereg(0, CDNS_I2C_XFER_SIZE_OFFSET);
 	/* Clear the interupt status register */
 	regval = cdns_i2c_readreg(CDNS_I2C_ISR_OFFSET);
@@ -530,6 +558,7 @@ static int cdns_i2c_process_msg(struct cdns_i2c *id, struct i2c_msg *msg,
 					CDNS_I2C_CR_OFFSET);
 	}
 
+    pr_info("%s: %s len=%d #8\n", __func__, msg->flags & I2C_M_RD ? "Rd" : "Wr", msg->len);
 	/* Check for the R/W flag on each msg */
 	if (msg->flags & I2C_M_RD)
 		cdns_i2c_mrecv(id);
@@ -629,6 +658,7 @@ static int cdns_i2c_master_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 				goto out;
 			}
 			ret = -EIO;
+			pr_info("%s: id->err_status=%#x\n", __func__, id->err_status);
 			goto out;
 		}
 	}
Harini Katakam Oct. 20, 2017, 5:04 a.m. UTC | #9
Hi Andrew,

On Wed, Oct 18, 2017 at 3:11 PM, Andrew Worsley <amworsley@gmail.com> wrote:
<snip>
> Perhaps this information gives some one a path to investigate further?
> I can't really see how this transfer size register can get confused -
> especially always with 240 or 242 = 255-13
> Perhaps the transfer register rolls over from zero or something?
> Or Perhaps I am just looking for magic numbers?

Thanks for the debug Andrew..
We do have one known issue for Zynq:
https://www.xilinx.com/support/answers/61664.html
The patch for >252 byte transfers was a workaround to this.
In short, the I2C master generates invalid clocks and
starts reading extra bytes (0xFFs) when the transfer size register
reaches zero and the hold bit is set (as is the case with >252 transfers).
The transfer size register does roll over to 255 in this case.
But in case of 242 bytes or all the other values (13, 142) written
to transfer size register, I don't see this errata coming into picture.

<snip>

> I attach a patch of the changes to the i2c driver showing my debug in
> this driver.
> As I said I am happy to send more verbose bug output - I have about 36
> odd runs with 9 occurences of the bug.
> Once it happen twice in the one run
>
> I include the crashing line from the runs where the bug occured (I
> added more diagnostics as the runs progressed):

I'll check with out HW team as well and check your flow to
see if we can reproduce it; will also check if the existing errata
has larger scope.
I'll let you know if we require further assistance with debug
next week.

Regards,
Harini
Wolfram Sang Dec. 7, 2017, 10:49 a.m. UTC | #10
> > I attach a patch of the changes to the i2c driver showing my debug in
> > this driver.
> > As I said I am happy to send more verbose bug output - I have about 36
> > odd runs with 9 occurences of the bug.
> > Once it happen twice in the one run
> >
> > I include the crashing line from the runs where the bug occured (I
> > added more diagnostics as the runs progressed):
> 
> I'll check with out HW team as well and check your flow to
> see if we can reproduce it; will also check if the existing errata
> has larger scope.
> I'll let you know if we require further assistance with debug
> next week.

Any news here?
Harini Katakam Dec. 7, 2017, 1:30 p.m. UTC | #11
Hi Wolfram and Andrew,

On Thu, Dec 7, 2017 at 4:19 PM, Wolfram Sang <wsa@the-dreams.de> wrote:
>
>> > I attach a patch of the changes to the i2c driver showing my debug in
>> > this driver.
>> > As I said I am happy to send more verbose bug output - I have about 36
>> > odd runs with 9 occurences of the bug.
>> > Once it happen twice in the one run
>> >
>> > I include the crashing line from the runs where the bug occured (I
>> > added more diagnostics as the runs progressed):
>>
>> I'll check with out HW team as well and check your flow to
>> see if we can reproduce it; will also check if the existing errata
>> has larger scope.
>> I'll let you know if we require further assistance with debug
>> next week.
>
> Any news here?

Sorry for the delay.
I've checked the known errata and confirmed that it wont
affect this use case. And we haven't been able to hit this
issue on our HW (with eeprom slave devices).
I will add my colleague looking into this further to debug.

Regards,
Harini
Andrew Worsley Jan. 18, 2018, 2:20 a.m. UTC | #12
On 8 December 2017 at 09:15, Andrew Worsley <amworsley@gmail.com> wrote:
> Thanks. I regularly see the warning my patch prints out  on boot up as then
> there is a big pull of random data from the TPM.

   Just an update on this bug - I did a i2c capture with pulseview on
a boot that included this issue.

> On 8 Dec 2017 12:30 am, "Harini Katakam" <harinikatakamlinux@gmail.com>
> wrote:
>>
>> Hi Wolfram and Andrew,
>>
>> On Thu, Dec 7, 2017 at 4:19 PM, Wolfram Sang <wsa@the-dreams.de> wrote:
>> >
>> >> > I attach a patch of the changes to the i2c driver showing my debug in
>> >> > this driver.
>> >> > As I said I am happy to send more verbose bug output - I have about
>> >> > 36
>> >> > odd runs with 9 occurences of the bug.
>> >> > Once it happen twice in the one run
>> >> >
>> >> > I include the crashing line from the runs where the bug occured (I
>> >> > added more diagnostics as the runs progressed):
>> >>
>> >> I'll check with out HW team as well and check your flow to
>> >> see if we can reproduce it; will also check if the existing errata
>> >> has larger scope.
>> >> I'll let you know if we require further assistance with debug
>> >> next week.
>> >
>> > Any news here?
>>
>> Sorry for the delay.
>> I've checked the known errata and confirmed that it wont
>> affect this use case. And we haven't been able to hit this
>> issue on our HW (with eeprom slave devices).
>> I will add my colleague looking into this further to debug.
>>
>> Regards,
>> Harini

I suspect a well behaved i2c slave may never show the issue. But the
i2c cadence doesn't handle bad slaves safely
so I suggest that length check is needed to avoid buffer kernel
crashes due to mis-behaving i2c hardware.

I got a board that this issue was *very* bad to the point that the
TPM_RANDOM requests repeatedly returned the same data and so were
failing the sanity tests. So I captured the i2c transactions using
pulseview and very bad behaviour from the TPM. The problem disappears
at 100KHz and very reliably occurs at 400k on this individual board so
I believe this TPM is failing to keep up with the i2c clock and
screwying everything up.

What I am concerned about is that the i2c-cadence driver is almost
always carrying on and returning data with out errors.

In particular I see repeated starts and long periods where the data
line is held low and NO Bus Arbitration or Time out errors. In
particular I can get repeated TPM_RANDOM requests to the TPM returning
the exact same data until another i2c operation to a different chip
(RTC) which works fine at this 400kHz.

So to summarize most (all?) other boards work at this 400kHz speed so
I definitely have a marginal TPM chip on this board. But why doesn't
the i2c-cadence HW pick up these faults? The i2c transactions decoded
by pulseview show a
a real mess but i2c-cadence is bravely pulling data out of it.

I will attempt to append a few images of i2c traces - to illustrate
the mess that happens but that the i2c cadence hardware is very rarely
reporting any problems - but returns invalid data. In about 15+ boots
I got one Bus arbitration.

After looking at the i2c traces it is not surprising that the cadence
i2c hardware can become confused and return way more
data than was requested. As the buffer overrun can cause a crash in a
different routine (after the i2c driver has finished) I think it is
prudent to keep some sort of check on the data size returned.

Andrew

read-overview image is the large TPM read I suspect where things go
wrong showing lots of i2c bursts on one transactions with the
clock+data held low between them and actually ends with the data held
low. It is not fixed until *much* later way past the end of that
image.

The read-start image is the very start of this long transaction
showing the read to TPM at address 0x29 - but the pulseview decode is
correctly reporting repeated starts although it looks like the master
isn't detecting it - there are many many occurrences of these sorts of
problems
Shubhrajyoti Datta Dec. 17, 2019, 7:38 a.m. UTC | #13
Hi Andrew,
Apologies for the late reply.

On Thu, Jan 18, 2018 at 7:53 AM Andrew Worsley <amworsley@gmail.com> wrote:
>
> On 8 December 2017 at 09:15, Andrew Worsley <amworsley@gmail.com> wrote:
> > Thanks. I regularly see the warning my patch prints out  on boot up as then
> > there is a big pull of random data from the TPM.
>
>    Just an update on this bug - I did a i2c capture with pulseview on
> a boot that included this issue.
>
> > On 8 Dec 2017 12:30 am, "Harini Katakam" <harinikatakamlinux@gmail.com>
> > wrote:
> >>
> >> Hi Wolfram and Andrew,
> >>
> >> On Thu, Dec 7, 2017 at 4:19 PM, Wolfram Sang <wsa@the-dreams.de> wrote:
> >> >
> >> >> > I attach a patch of the changes to the i2c driver showing my debug in
> >> >> > this driver.
> >> >> > As I said I am happy to send more verbose bug output - I have about
> >> >> > 36
> >> >> > odd runs with 9 occurences of the bug.
> >> >> > Once it happen twice in the one run
> >> >> >
> >> >> > I include the crashing line from the runs where the bug occured (I
> >> >> > added more diagnostics as the runs progressed):
> >> >>
> >> >> I'll check with out HW team as well and check your flow to
> >> >> see if we can reproduce it; will also check if the existing errata
> >> >> has larger scope.
> >> >> I'll let you know if we require further assistance with debug
> >> >> next week.
> >> >
> >> > Any news here?
> >>
> >> Sorry for the delay.
> >> I've checked the known errata and confirmed that it wont
> >> affect this use case. And we haven't been able to hit this
> >> issue on our HW (with eeprom slave devices).
> >> I will add my colleague looking into this further to debug.
> >>
> >> Regards,
> >> Harini
>
> I suspect a well behaved i2c slave may never show the issue. But the
> i2c cadence doesn't handle bad slaves safely
> so I suggest that length check is needed to avoid buffer kernel
> crashes due to mis-behaving i2c hardware.
>
> I got a board that this issue was *very* bad to the point that the
> TPM_RANDOM requests repeatedly returned the same data and so were
> failing the sanity tests. So I captured the i2c transactions using
> pulseview and very bad behaviour from the TPM. The problem disappears
> at 100KHz and very reliably occurs at 400k on this individual board so
> I believe this TPM is failing to keep up with the i2c clock and
> screwying everything up.
>
> What I am concerned about is that the i2c-cadence driver is almost
> always carrying on and returning data with out errors.
>
I agree we should attempt to fix this could  you resend the patch
removing the debug.

thanks and Regards,
Shubhrajyoti
diff mbox series

Patch

diff --git a/drivers/i2c/busses/i2c-cadence.c b/drivers/i2c/busses/i2c-cadence.c
index b13605718291..c1f61ca6843e 100644
--- a/drivers/i2c/busses/i2c-cadence.c
+++ b/drivers/i2c/busses/i2c-cadence.c
@@ -234,6 +234,7 @@  static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
 	if (id->p_recv_buf &&
 	    ((isr_status & CDNS_I2C_IXR_COMP) ||
 	     (isr_status & CDNS_I2C_IXR_DATA))) {
+		unsigned char *p = id->p_recv_buf;
 		/* Read data if receive data valid is set */
 		while (cdns_i2c_readreg(CDNS_I2C_SR_OFFSET) &
 		       CDNS_I2C_SR_RXDV) {
@@ -246,6 +247,12 @@  static irqreturn_t cdns_i2c_isr(int irq, void *ptr)
 			    !id->bus_hold_flag)
 				cdns_i2c_clear_bus_hold(id);
 
+			if (id->recv_count == 0) {
+				pr_notice("%s: i2c receive buffer filled : %u aborting transfer %p - %p\n",
+					__func__, (id->p_recv_buf - p));
+				break;
+			}
+
 			*(id->p_recv_buf)++ =
 				cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
 			id->recv_count--;