Message ID | 20181129181717.19559-1-fbarrat@linux.ibm.com |
---|---|
State | Accepted |
Headers | show |
Series | i2c: Fix i2c request hang during opal init if timers are not checked | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | master/apply_patch Successfully applied |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
On 30/11/18 5:17 am, Frederic Barrat wrote: > If an i2c request cannot go through the first time, because the bus is > found in error and need a reset or it's locked by the OCC for example, > the underlying i2c implementation is using timers to manage the > request. However during opal init, opal pollers may not be called, it > depends in the context in which the i2c request is made. If the > pollers are not called, the timers are not checked and we can end up > with an i2c request which will not move foward and skiboot hangs. > > Fix it by explicitly checking the timers if we are waiting for an i2c > request to complete and it seems to be taking a while. > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> This seems like a reasonable solution. Do we want this to go to stable? Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > --- > core/i2c.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/core/i2c.c b/core/i2c.c > index 26926fc2..3c600024 100644 > --- a/core/i2c.c > +++ b/core/i2c.c > @@ -21,6 +21,7 @@ > #include <opal-msg.h> > #include <timebase.h> > #include <processor.h> > +#include <timer.h> > > static LIST_HEAD(i2c_bus_list); > > @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > struct i2c_bus *bus; > uint64_t time_to_wait = 0; > struct i2c_sync_userdata ud; > + uint64_t timer_period = msecs_to_tb(5), timer_count; > > bus = i2c_find_bus_by_id(bus_id); > if (!bus) { > @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > > for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) { > waited = 0; > + timer_count = 0; > > i2c_queue_req(req); > > @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > time_to_wait = REQ_COMPLETE_POLLING; > time_wait(time_to_wait); > waited += time_to_wait; > + timer_count += time_to_wait; > + if (timer_count > timer_period) { > + /* > + * The above request may be relying on > + * timers to complete, yet there may > + * not be called, especially during > + * opal init. We could be looping here > + * forever. So explicitly check the > + * timers once in a while > + */ > + check_timers(false); > + timer_count = 0; > + } > } while (!ud.done); > > lwsync(); >
On 11/29/2018 11:47 PM, Frederic Barrat wrote: > If an i2c request cannot go through the first time, because the bus is > found in error and need a reset or it's locked by the OCC for example, > the underlying i2c implementation is using timers to manage the > request. However during opal init, opal pollers may not be called, it > depends in the context in which the i2c request is made. If the > pollers are not called, the timers are not checked and we can end up > with an i2c request which will not move foward and skiboot hangs. > > Fix it by explicitly checking the timers if we are waiting for an i2c > request to complete and it seems to be taking a while. > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> I hit similar issue today with upstream master code. and this patch fixes the issue. Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> -Vasant
Frederic Barrat <fbarrat@linux.ibm.com> writes: > If an i2c request cannot go through the first time, because the bus is > found in error and need a reset or it's locked by the OCC for example, > the underlying i2c implementation is using timers to manage the > request. However during opal init, opal pollers may not be called, it > depends in the context in which the i2c request is made. If the > pollers are not called, the timers are not checked and we can end up > with an i2c request which will not move foward and skiboot hangs. > > Fix it by explicitly checking the timers if we are waiting for an i2c > request to complete and it seems to be taking a while. Okay, I see what happens here: we're doing the i2c request in the main thread and doing time_wait() with a value that'll never run pollers and thus never check_timers() either. Looking at the code here, is there any reason we shouldn't just call check_timers(false) straight after the time_wait() ? > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > core/i2c.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > > diff --git a/core/i2c.c b/core/i2c.c > index 26926fc2..3c600024 100644 > --- a/core/i2c.c > +++ b/core/i2c.c > @@ -21,6 +21,7 @@ > #include <opal-msg.h> > #include <timebase.h> > #include <processor.h> > +#include <timer.h> > > static LIST_HEAD(i2c_bus_list); > > @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > struct i2c_bus *bus; > uint64_t time_to_wait = 0; > struct i2c_sync_userdata ud; > + uint64_t timer_period = msecs_to_tb(5), timer_count; > > bus = i2c_find_bus_by_id(bus_id); > if (!bus) { > @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > > for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) { > waited = 0; > + timer_count = 0; > > i2c_queue_req(req); > > @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, > time_to_wait = REQ_COMPLETE_POLLING; > time_wait(time_to_wait); > waited += time_to_wait; > + timer_count += time_to_wait; > + if (timer_count > timer_period) { > + /* > + * The above request may be relying on > + * timers to complete, yet there may > + * not be called, especially during > + * opal init. We could be looping here > + * forever. So explicitly check the > + * timers once in a while > + */ > + check_timers(false); > + timer_count = 0; > + } > } while (!ud.done); > > lwsync(); > -- > 2.19.1 > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot >
Le 11/12/2018 à 04:41, Stewart Smith a écrit : > Frederic Barrat <fbarrat@linux.ibm.com> writes: >> If an i2c request cannot go through the first time, because the bus is >> found in error and need a reset or it's locked by the OCC for example, >> the underlying i2c implementation is using timers to manage the >> request. However during opal init, opal pollers may not be called, it >> depends in the context in which the i2c request is made. If the >> pollers are not called, the timers are not checked and we can end up >> with an i2c request which will not move foward and skiboot hangs. >> >> Fix it by explicitly checking the timers if we are waiting for an i2c >> request to complete and it seems to be taking a while. > > Okay, I see what happens here: we're doing the i2c request in the main > thread and doing time_wait() with a value that'll never run pollers and > thus never check_timers() either. > > Looking at the code here, is there any reason we shouldn't just call > check_timers(false) straight after the time_wait() ? The only reason I was reluctant to do so is that the default time_to_wait we get from i2c_run_req() is pretty short, so I didn't want to degrade a common path for the benefit of a supposedly rarer error path (bus in use by OCC, bus found in error state, ...). Also I didn't call opal_run_pollers() for a similar reason, as the call is more expensive and I got scared by how we treat it in time_wait_poll(), seemingly avoiding calling it too frequently. I basically did the minimum required to get the i2c request unstuck, so that the main thread can get back to do real work during opal init. Fred > >> >> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> >> --- >> core/i2c.c | 16 ++++++++++++++++ >> 1 file changed, 16 insertions(+) >> >> diff --git a/core/i2c.c b/core/i2c.c >> index 26926fc2..3c600024 100644 >> --- a/core/i2c.c >> +++ b/core/i2c.c >> @@ -21,6 +21,7 @@ >> #include <opal-msg.h> >> #include <timebase.h> >> #include <processor.h> >> +#include <timer.h> >> >> static LIST_HEAD(i2c_bus_list); >> >> @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, >> struct i2c_bus *bus; >> uint64_t time_to_wait = 0; >> struct i2c_sync_userdata ud; >> + uint64_t timer_period = msecs_to_tb(5), timer_count; >> >> bus = i2c_find_bus_by_id(bus_id); >> if (!bus) { >> @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, >> >> for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) { >> waited = 0; >> + timer_count = 0; >> >> i2c_queue_req(req); >> >> @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, >> time_to_wait = REQ_COMPLETE_POLLING; >> time_wait(time_to_wait); >> waited += time_to_wait; >> + timer_count += time_to_wait; >> + if (timer_count > timer_period) { >> + /* >> + * The above request may be relying on >> + * timers to complete, yet there may >> + * not be called, especially during >> + * opal init. We could be looping here >> + * forever. So explicitly check the >> + * timers once in a while >> + */ >> + check_timers(false); >> + timer_count = 0; >> + } >> } while (!ud.done); >> >> lwsync(); >> -- >> 2.19.1 >> >> _______________________________________________ >> Skiboot mailing list >> Skiboot@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot >> >
Frederic Barrat <fbarrat@linux.ibm.com> writes: > Le 11/12/2018 à 04:41, Stewart Smith a écrit : >> Frederic Barrat <fbarrat@linux.ibm.com> writes: >>> If an i2c request cannot go through the first time, because the bus is >>> found in error and need a reset or it's locked by the OCC for example, >>> the underlying i2c implementation is using timers to manage the >>> request. However during opal init, opal pollers may not be called, it >>> depends in the context in which the i2c request is made. If the >>> pollers are not called, the timers are not checked and we can end up >>> with an i2c request which will not move foward and skiboot hangs. >>> >>> Fix it by explicitly checking the timers if we are waiting for an i2c >>> request to complete and it seems to be taking a while. >> >> Okay, I see what happens here: we're doing the i2c request in the main >> thread and doing time_wait() with a value that'll never run pollers and >> thus never check_timers() either. >> >> Looking at the code here, is there any reason we shouldn't just call >> check_timers(false) straight after the time_wait() ? > > > The only reason I was reluctant to do so is that the default > time_to_wait we get from i2c_run_req() is pretty short, so I didn't want > to degrade a common path for the benefit of a supposedly rarer error > path (bus in use by OCC, bus found in error state, ...). > > Also I didn't call opal_run_pollers() for a similar reason, as the call > is more expensive and I got scared by how we treat it in > time_wait_poll(), seemingly avoiding calling it too frequently. > I basically did the minimum required to get the i2c request unstuck, so > that the main thread can get back to do real work during opal init. That's reasonable, I was lazy and didn't attempt to do the math/measure it. We should probably have some kind of spin_on_condition() routine that will do a bit better rather than open coding this every single time.
Frederic Barrat <fbarrat@linux.ibm.com> writes: > If an i2c request cannot go through the first time, because the bus is > found in error and need a reset or it's locked by the OCC for example, > the underlying i2c implementation is using timers to manage the > request. However during opal init, opal pollers may not be called, it > depends in the context in which the i2c request is made. If the > pollers are not called, the timers are not checked and we can end up > with an i2c request which will not move foward and skiboot hangs. > > Fix it by explicitly checking the timers if we are waiting for an i2c > request to complete and it seems to be taking a while. > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > core/i2c.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) Merged to master as of 1bd34e4c60c69faeb825ba5f64658941a1422403
diff --git a/core/i2c.c b/core/i2c.c index 26926fc2..3c600024 100644 --- a/core/i2c.c +++ b/core/i2c.c @@ -21,6 +21,7 @@ #include <opal-msg.h> #include <timebase.h> #include <processor.h> +#include <timer.h> static LIST_HEAD(i2c_bus_list); @@ -177,6 +178,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, struct i2c_bus *bus; uint64_t time_to_wait = 0; struct i2c_sync_userdata ud; + uint64_t timer_period = msecs_to_tb(5), timer_count; bus = i2c_find_bus_by_id(bus_id); if (!bus) { @@ -215,6 +217,7 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, for (retries = 0; retries <= MAX_NACK_RETRIES; retries++) { waited = 0; + timer_count = 0; i2c_queue_req(req); @@ -224,6 +227,19 @@ int i2c_request_send(int bus_id, int dev_addr, int read_write, time_to_wait = REQ_COMPLETE_POLLING; time_wait(time_to_wait); waited += time_to_wait; + timer_count += time_to_wait; + if (timer_count > timer_period) { + /* + * The above request may be relying on + * timers to complete, yet there may + * not be called, especially during + * opal init. We could be looping here + * forever. So explicitly check the + * timers once in a while + */ + check_timers(false); + timer_count = 0; + } } while (!ud.done); lwsync();
If an i2c request cannot go through the first time, because the bus is found in error and need a reset or it's locked by the OCC for example, the underlying i2c implementation is using timers to manage the request. However during opal init, opal pollers may not be called, it depends in the context in which the i2c request is made. If the pollers are not called, the timers are not checked and we can end up with an i2c request which will not move foward and skiboot hangs. Fix it by explicitly checking the timers if we are waiting for an i2c request to complete and it seems to be taking a while. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> --- core/i2c.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)