diff mbox series

i2c: Fix i2c request hang during opal init if timers are not checked

Message ID 20181129181717.19559-1-fbarrat@linux.ibm.com
State Accepted
Headers show
Series i2c: Fix i2c request hang during opal init if timers are not checked | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success master/apply_patch Successfully applied
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master

Commit Message

Frederic Barrat Nov. 29, 2018, 6:17 p.m. UTC
If an i2c request cannot go through the first time, because the bus is
found in error and need a reset or it's locked by the OCC for example,
the underlying i2c implementation is using timers to manage the
request. However during opal init, opal pollers may not be called, it
depends in the context in which the i2c request is made. If the
pollers are not called, the timers are not checked and we can end up
with an i2c request which will not move foward and skiboot hangs.

Fix it by explicitly checking the timers if we are waiting for an i2c
request to complete and it seems to be taking a while.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 core/i2c.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Andrew Donnellan Nov. 30, 2018, 6:04 a.m. UTC | #1
On 30/11/18 5:17 am, Frederic Barrat wrote:
> If an i2c request cannot go through the first time, because the bus is
> found in error and need a reset or it's locked by the OCC for example,
> the underlying i2c implementation is using timers to manage the
> request. However during opal init, opal pollers may not be called, it
> depends in the context in which the i2c request is made. If the
> pollers are not called, the timers are not checked and we can end up
> with an i2c request which will not move foward and skiboot hangs.
> 
> Fix it by explicitly checking the timers if we are waiting for an i2c
> request to complete and it seems to be taking a while.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>

This seems like a reasonable solution.

Do we want this to go to stable?

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
>   core/i2c.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/core/i2c.c b/core/i2c.c
> index 26926fc2..3c600024 100644
> --- a/core/i2c.c
> +++ b/core/i2c.c
> @@ -21,6 +21,7 @@
>   #include <opal-msg.h>
>   #include <timebase.h>
>   #include <processor.h>
> +#include <timer.h>
>   
>   static LIST_HEAD(i2c_bus_list);
>   
> @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>   	struct i2c_bus *bus;
>   	uint64_t time_to_wait = 0;
>   	struct i2c_sync_userdata ud;
> +	uint64_t timer_period = msecs_to_tb(5), timer_count;
>   
>   	bus = i2c_find_bus_by_id(bus_id);
>   	if (!bus) {
> @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>   
>   	for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) {
>   		waited = 0;
> +		timer_count = 0;
>   
>   		i2c_queue_req(req);
>   
> @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>   				time_to_wait = REQ_COMPLETE_POLLING;
>   			time_wait(time_to_wait);
>   			waited += time_to_wait;
> +			timer_count += time_to_wait;
> +			if (timer_count > timer_period) {
> +				/*
> +				 * The above request may be relying on
> +				 * timers to complete, yet there may
> +				 * not be called, especially during
> +				 * opal init. We could be looping here
> +				 * forever. So explicitly check the
> +				 * timers once in a while
> +				 */
> +				check_timers(false);
> +				timer_count = 0;
> +			}
>   		} while (!ud.done);
>   
>   		lwsync();
>
Vasant Hegde Dec. 10, 2018, 8:46 a.m. UTC | #2
On 11/29/2018 11:47 PM, Frederic Barrat wrote:
> If an i2c request cannot go through the first time, because the bus is
> found in error and need a reset or it's locked by the OCC for example,
> the underlying i2c implementation is using timers to manage the
> request. However during opal init, opal pollers may not be called, it
> depends in the context in which the i2c request is made. If the
> pollers are not called, the timers are not checked and we can end up
> with an i2c request which will not move foward and skiboot hangs.
> 
> Fix it by explicitly checking the timers if we are waiting for an i2c
> request to complete and it seems to be taking a while.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>

I hit similar issue today with upstream master code. and this patch fixes the issue.

Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

-Vasant
Stewart Smith Dec. 11, 2018, 3:41 a.m. UTC | #3
Frederic Barrat <fbarrat@linux.ibm.com> writes:
> If an i2c request cannot go through the first time, because the bus is
> found in error and need a reset or it's locked by the OCC for example,
> the underlying i2c implementation is using timers to manage the
> request. However during opal init, opal pollers may not be called, it
> depends in the context in which the i2c request is made. If the
> pollers are not called, the timers are not checked and we can end up
> with an i2c request which will not move foward and skiboot hangs.
>
> Fix it by explicitly checking the timers if we are waiting for an i2c
> request to complete and it seems to be taking a while.

Okay, I see what happens here: we're doing the i2c request in the main
thread and doing time_wait() with a value that'll never run pollers and
thus never check_timers() either.

Looking at the code here, is there any reason we shouldn't just call
check_timers(false) straight after the time_wait() ?

>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
>  core/i2c.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/core/i2c.c b/core/i2c.c
> index 26926fc2..3c600024 100644
> --- a/core/i2c.c
> +++ b/core/i2c.c
> @@ -21,6 +21,7 @@
>  #include <opal-msg.h>
>  #include <timebase.h>
>  #include <processor.h>
> +#include <timer.h>
>  
>  static LIST_HEAD(i2c_bus_list);
>  
> @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>  	struct i2c_bus *bus;
>  	uint64_t time_to_wait = 0;
>  	struct i2c_sync_userdata ud;
> +	uint64_t timer_period = msecs_to_tb(5), timer_count;
>  
>  	bus = i2c_find_bus_by_id(bus_id);
>  	if (!bus) {
> @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>  
>  	for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) {
>  		waited = 0;
> +		timer_count = 0;
>  
>  		i2c_queue_req(req);
>  
> @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>  				time_to_wait = REQ_COMPLETE_POLLING;
>  			time_wait(time_to_wait);
>  			waited += time_to_wait;
> +			timer_count += time_to_wait;
> +			if (timer_count > timer_period) {
> +				/*
> +				 * The above request may be relying on
> +				 * timers to complete, yet there may
> +				 * not be called, especially during
> +				 * opal init. We could be looping here
> +				 * forever. So explicitly check the
> +				 * timers once in a while
> +				 */
> +				check_timers(false);
> +				timer_count = 0;
> +			}
>  		} while (!ud.done);
>  
>  		lwsync();
> -- 
> 2.19.1
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
>
Frederic Barrat Dec. 11, 2018, 8:17 a.m. UTC | #4
Le 11/12/2018 à 04:41, Stewart Smith a écrit :
> Frederic Barrat <fbarrat@linux.ibm.com> writes:
>> If an i2c request cannot go through the first time, because the bus is
>> found in error and need a reset or it's locked by the OCC for example,
>> the underlying i2c implementation is using timers to manage the
>> request. However during opal init, opal pollers may not be called, it
>> depends in the context in which the i2c request is made. If the
>> pollers are not called, the timers are not checked and we can end up
>> with an i2c request which will not move foward and skiboot hangs.
>>
>> Fix it by explicitly checking the timers if we are waiting for an i2c
>> request to complete and it seems to be taking a while.
> 
> Okay, I see what happens here: we're doing the i2c request in the main
> thread and doing time_wait() with a value that'll never run pollers and
> thus never check_timers() either.
> 
> Looking at the code here, is there any reason we shouldn't just call
> check_timers(false) straight after the time_wait() ?


The only reason I was reluctant to do so is that the default 
time_to_wait we get from i2c_run_req() is pretty short, so I didn't want 
to degrade a common path for the benefit of a supposedly rarer error 
path (bus in use by OCC, bus found in error state, ...).

Also I didn't call opal_run_pollers() for a similar reason, as the call 
is more expensive and I got scared by how we treat it in 
time_wait_poll(), seemingly avoiding calling it too frequently.
I basically did the minimum required to get the i2c request unstuck, so 
that the main thread can get back to do real work during opal init.

   Fred




> 
>>
>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>> ---
>>   core/i2c.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/core/i2c.c b/core/i2c.c
>> index 26926fc2..3c600024 100644
>> --- a/core/i2c.c
>> +++ b/core/i2c.c
>> @@ -21,6 +21,7 @@
>>   #include <opal-msg.h>
>>   #include <timebase.h>
>>   #include <processor.h>
>> +#include <timer.h>
>>   
>>   static LIST_HEAD(i2c_bus_list);
>>   
>> @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>>   	struct i2c_bus *bus;
>>   	uint64_t time_to_wait = 0;
>>   	struct i2c_sync_userdata ud;
>> +	uint64_t timer_period = msecs_to_tb(5), timer_count;
>>   
>>   	bus = i2c_find_bus_by_id(bus_id);
>>   	if (!bus) {
>> @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>>   
>>   	for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) {
>>   		waited = 0;
>> +		timer_count = 0;
>>   
>>   		i2c_queue_req(req);
>>   
>> @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write,
>>   				time_to_wait = REQ_COMPLETE_POLLING;
>>   			time_wait(time_to_wait);
>>   			waited += time_to_wait;
>> +			timer_count += time_to_wait;
>> +			if (timer_count > timer_period) {
>> +				/*
>> +				 * The above request may be relying on
>> +				 * timers to complete, yet there may
>> +				 * not be called, especially during
>> +				 * opal init. We could be looping here
>> +				 * forever. So explicitly check the
>> +				 * timers once in a while
>> +				 */
>> +				check_timers(false);
>> +				timer_count = 0;
>> +			}
>>   		} while (!ud.done);
>>   
>>   		lwsync();
>> -- 
>> 2.19.1
>>
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot
>>
>
Stewart Smith Dec. 12, 2018, 11:21 p.m. UTC | #5
Frederic Barrat <fbarrat@linux.ibm.com> writes:
> Le 11/12/2018 à 04:41, Stewart Smith a écrit :
>> Frederic Barrat <fbarrat@linux.ibm.com> writes:
>>> If an i2c request cannot go through the first time, because the bus is
>>> found in error and need a reset or it's locked by the OCC for example,
>>> the underlying i2c implementation is using timers to manage the
>>> request. However during opal init, opal pollers may not be called, it
>>> depends in the context in which the i2c request is made. If the
>>> pollers are not called, the timers are not checked and we can end up
>>> with an i2c request which will not move foward and skiboot hangs.
>>>
>>> Fix it by explicitly checking the timers if we are waiting for an i2c
>>> request to complete and it seems to be taking a while.
>> 
>> Okay, I see what happens here: we're doing the i2c request in the main
>> thread and doing time_wait() with a value that'll never run pollers and
>> thus never check_timers() either.
>> 
>> Looking at the code here, is there any reason we shouldn't just call
>> check_timers(false) straight after the time_wait() ?
>
>
> The only reason I was reluctant to do so is that the default 
> time_to_wait we get from i2c_run_req() is pretty short, so I didn't want 
> to degrade a common path for the benefit of a supposedly rarer error 
> path (bus in use by OCC, bus found in error state, ...).
>
> Also I didn't call opal_run_pollers() for a similar reason, as the call 
> is more expensive and I got scared by how we treat it in 
> time_wait_poll(), seemingly avoiding calling it too frequently.
> I basically did the minimum required to get the i2c request unstuck, so 
> that the main thread can get back to do real work during opal init.

That's reasonable, I was lazy and didn't attempt to do the math/measure
it.

We should probably have some kind of spin_on_condition() routine that
will do a bit better rather than open coding this every single
time.
Stewart Smith Dec. 12, 2018, 11:22 p.m. UTC | #6
Frederic Barrat <fbarrat@linux.ibm.com> writes:
> If an i2c request cannot go through the first time, because the bus is
> found in error and need a reset or it's locked by the OCC for example,
> the underlying i2c implementation is using timers to manage the
> request. However during opal init, opal pollers may not be called, it
> depends in the context in which the i2c request is made. If the
> pollers are not called, the timers are not checked and we can end up
> with an i2c request which will not move foward and skiboot hangs.
>
> Fix it by explicitly checking the timers if we are waiting for an i2c
> request to complete and it seems to be taking a while.
>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
>  core/i2c.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)

Merged to master as of 1bd34e4c60c69faeb825ba5f64658941a1422403
diff mbox series

Patch

diff --git a/core/i2c.c b/core/i2c.c
index 26926fc2..3c600024 100644
--- a/core/i2c.c
+++ b/core/i2c.c
@@ -21,6 +21,7 @@ 
 #include <opal-msg.h>
 #include <timebase.h>
 #include <processor.h>
+#include <timer.h>
 
 static LIST_HEAD(i2c_bus_list);
 
@@ -177,6 +178,7 @@  int i2c_request_send(int bus_id, int dev_addr, int read_write,
 	struct i2c_bus *bus;
 	uint64_t time_to_wait = 0;
 	struct i2c_sync_userdata ud;
+	uint64_t timer_period = msecs_to_tb(5), timer_count;
 
 	bus = i2c_find_bus_by_id(bus_id);
 	if (!bus) {
@@ -215,6 +217,7 @@  int i2c_request_send(int bus_id, int dev_addr, int read_write,
 
 	for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) {
 		waited = 0;
+		timer_count = 0;
 
 		i2c_queue_req(req);
 
@@ -224,6 +227,19 @@  int i2c_request_send(int bus_id, int dev_addr, int read_write,
 				time_to_wait = REQ_COMPLETE_POLLING;
 			time_wait(time_to_wait);
 			waited += time_to_wait;
+			timer_count += time_to_wait;
+			if (timer_count > timer_period) {
+				/*
+				 * The above request may be relying on
+				 * timers to complete, yet there may
+				 * not be called, especially during
+				 * opal init. We could be looping here
+				 * forever. So explicitly check the
+				 * timers once in a while
+				 */
+				check_timers(false);
+				timer_count = 0;
+			}
 		} while (!ud.done);
 
 		lwsync();