[RFC] iopoll: allow for poll_timeout to back-off

Submitted by Nicholas Mc Guire on Feb. 13, 2017, 4:52 p.m.

Details

Message ID 1487004724-2806-1-git-send-email-der.herr@hofr.at
State Not Applicable
Headers show

Commit Message

Nicholas Mc Guire Feb. 13, 2017, 4:52 p.m.
Provide an exponential back-off after initial busy-loop
to prvent extremly long busy-looping respectively arming
of many highresolution timers.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
---

During a review of hieghrestimer users I have been looking at the users
of read*_poll_timemout that have a tight loop declared (i.e. passing in 
sleep_us as 0) most of them have quite large timeouts defined. At least
in some cases it is documented to be a tight loop intentionally as the 
expectation is that it will normally succeed e.g.  commit ae02ab00aa3c 
("mtd: nand: jz4780: driver for NAND devices on JZ4780 SoCs")

<snip drivers/mtd/nand/jz4780_bch.c>
        /*
         * While we could use interrupts here and sleep until the operation
         * completes, the controller works fairly quickly (usually a few
         * microseconds) and so the overhead of sleeping until we get an
         * interrupt quite noticeably decreases performance.
         */
        ret = readl_poll_timeout(bch->base + BCH_BHINT, reg,
                                 (reg & irq) == irq, 0, BCH_TIMEOUT_US);
<snip>

While this seems justified here to have a tight loop the problem is that 
BCH_TIMEOUT_US is 100000 - so in the case of failure this would busyloop
for 100ms and in some cases tightloops of 5 seconds can be found (e.g.
drivers/mtd/spi-nor/intel-spi.c:intel_spi_wait_hw_busy())

There currently are (if my coccinelle script is correct) 7 files with 
read*_poll_timeout where sleep_us is set to 0 (busy-loop)
  timeout_us                value    file
  INTEL_SPI_TIMEOUT * 1000  5000000  intel-spi.c
  FMC_WAIT_TIMEOUT          1000000  hisi-sfc.c
  BCH_TIMEOUT_US             100000  z4780_bch.c
  numeric const.              10000  clkgen-pll.c
  numeric const.              10000  clk-stm32f4.c
  numeric const.               1000  tango_nand.c
  numeric const.               1000  mxsfb_crtc.c
  numeric const.                100  clk.c

One case in /drivers/mmc/host/sdhci-cadence.c has a timeout_us of 1 and
sleep_us of 0 which might actually be a bug as it means that the poll loop
would do a single read OP in many cases (this is non-atomic context) so it
would have to be robust for a single read, thus the poll_timeout might not
be suitable - not sure though - but thats a different problem.

If we now look at those cases where sleep_us is using a min < the recommended
10 us: that can be found in 68 cases (as of 4.10-rc6) and that is probably
also not that sensible given that the timeout_us are in the same ranges as
the cases above (minimum is 20 max is 5000000). So in the error case again
that would result in thousands of high-resolution timers being initialized
- presumably that is not that reasonable.

The problem thus is not the success case but the failure case. A possible 
mitigation of this would be to have something like an exponential back-off
built into the non-atomic poll_timeout function:

#define readx_poll_timeout(op, addr, val, cond, sleep_us, timeout_us)   \
({ \
        ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
        might_sleep(); \
	unsigned long min = 0; \
	unsigned long max = sleep_us | 1; \
        for (;;) { \
                (val) = op(addr); \
                if (cond) \
                        break; \
                if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
                        (val) = op(addr); \
                        break; \
                } \
                if (min >= 10) \
                        usleep_range(min, max); \
                max <<= 1; \
		min = max >> 2; \
        } \
        (cond) ? 0 : -ETIMEDOUT; \
})

 which results in min,max values for the initial iterations of:
      min  max
      0    1
      0    2
      1    4
      2    8
      4    16
      8    32
      16   64
      32   128
      ...

 for sleep_us being passed in as 0 and would thus effectively be a 
busy-loop for the first 6 iterations and then switch to sleeping.
The above code proposal would busy-loop at most 6 times and the
busy-wait would reduce if sleep_us > 0 is passed in:

      sleep_us  busy-loops
      0-1       6
      2-3       4
      4-9       3
      10-19     2
      20-39     1
      40+       0

which should be a resonable behavior for the current use cases and eliminate
side effects of very long busy-wait-loops in the failure cases as well.

Pleas let me know if this sounds reasonable or what might have been 
overlooked here. The only downside located is that some of the constant
folding that would be possible now would no longer be doable (e.g. 
(sleep_us >> 2) + 1 in case of passing in a compile-time constant.
Also I guess readx_poll_timeout() could (should?) be converted to an
inline function.

I did compile test this but actually the key issue is to get feedback on the
concept rather than if the patch is usable in the below form.

Patch was compile tested with: x86_64_defconfig

Patch is against 4.10-rc7 (localversion-next is next-20170213)

 include/linux/iopoll.h | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Masahiro Yamada Feb. 23, 2017, 8:29 a.m.
Hi Nicholas,


2017-02-14 1:52 GMT+09:00 Nicholas Mc Guire <der.herr@hofr.at>:

>
> One case in /drivers/mmc/host/sdhci-cadence.c has a timeout_us of 1 and
> sleep_us of 0 which might actually be a bug as it means that the poll loop
> would do a single read OP in many cases (this is non-atomic context) so it
> would have to be robust for a single read, thus the poll_timeout might not
> be suitable - not sure though - but thats a different problem.

Sorry, I could not understand why you thought sdhci-cadence.c seemed a bug.

The SDHCI_CDNS_HRS06_TUNE_UP is auto-cleared by the hardware.
This will normally be done really soon (within 1us),
but it may not be cleared if the hardware is in trouble.
It is the intention of the code:

    readl_poll_timeout(reg, tmp, !(tmp & SDHCI_CDNS_HRS06_TUNE_UP),
                       0, 1);




>  which results in min,max values for the initial iterations of:
>       min  max
>       0    1
>       0    2
>       1    4
>       2    8
>       4    16
>       8    32
>       16   64
>       32   128
>       ...


I notice you were sending this to me, but
please notice I am not responsible for this file
(include/linux/iopoll.h) in any way.

Please take my comments for grain of salt:

With this patch, the sleep range is doubled in each iteration.
Let's say this routine is called with delay_us=1, timeout_us=50,
then it slept 16 us, then 32 us.
If it sleeps 64 us in the next iteration,
it ends up with sleeping 112 us (=16 + 32 + 64) in total
where we know waiting 50 us is enough.
So, the sleep range granularity may get bigger than
users' intention.

Probably, waiting too long is not a problem in most cases.
If so, what is the meaning of the argument "sleep_us" after all?
Nicholas Mc Guire Feb. 23, 2017, 10:06 a.m.
On Thu, Feb 23, 2017 at 05:29:03PM +0900, Masahiro Yamada wrote:
> Hi Nicholas,
> 
> 
> 2017-02-14 1:52 GMT+09:00 Nicholas Mc Guire <der.herr@hofr.at>:
> 
> >
> > One case in /drivers/mmc/host/sdhci-cadence.c has a timeout_us of 1 and
> > sleep_us of 0 which might actually be a bug as it means that the poll loop
> > would do a single read OP in many cases (this is non-atomic context) so it
> > would have to be robust for a single read, thus the poll_timeout might not
> > be suitable - not sure though - but thats a different problem.
> 
> Sorry, I could not understand why you thought sdhci-cadence.c seemed a bug.
> 
> The SDHCI_CDNS_HRS06_TUNE_UP is auto-cleared by the hardware.
> This will normally be done really soon (within 1us),
> but it may not be cleared if the hardware is in trouble.
> It is the intention of the code:
> 
>     readl_poll_timeout(reg, tmp, !(tmp & SDHCI_CDNS_HRS06_TUNE_UP),
>                        0, 1);
>

Im not sure my selfe but as this would need to be robust for a single read try
to be reliable I did not see the point in using a pool loop of 1 microsecond
on x86 I tried this and timeout_us of 1 will resulted in a single
read in more than 10% on the idle system and in almost 100% doing
a single read on loaded systems. So if the poll loop was introduced with
the assumption that just reading once imediately could miss the ritical
event with resonable probability then timeout_us==1 does not change that.

 
> 
> 
> 
> >  which results in min,max values for the initial iterations of:
> >       min  max
> >       0    1
> >       0    2
> >       1    4
> >       2    8
> >       4    16
> >       8    32
> >       16   64
> >       32   128
> >       ...
> 
> 
> I notice you were sending this to me, but
> please notice I am not responsible for this file
> (include/linux/iopoll.h) in any way.

I included you as your it showed up with:
scripts/get_maintainer.pl -f include/linux/iopoll.h

> 
> Please take my comments for grain of salt:
> 
> With this patch, the sleep range is doubled in each iteration.
> Let's say this routine is called with delay_us=1, timeout_us=50,
> then it slept 16 us, then 32 us.
> If it sleeps 64 us in the next iteration,
> it ends up with sleeping 112 us (=16 + 32 + 64) in total
> where we know waiting 50 us is enough.

that is right - but the assumption of poll_timeout is that the
timeout case would actually be rare and if you look at the actual
timings that one has of poll_timout routines (that are based on
usleep_range()) on loaded systems they overrun by 100s of microseconds
very frequnently. 

But you are right that in the 50us case this is not ideal - the 
focus I had was on the very long timeouts that in some cases were set
to up to 5 seconds with tight-loops (or close to timght-loops)

> So, the sleep range granularity may get bigger than
> users' intention.
> 
> Probably, waiting too long is not a problem in most cases.
> If so, what is the meaning of the argument "sleep_us" after all?

The problem really is that on idle ssytems the sleep_us argument
works ok and the overruns are not that dramatic - but on loaded
systems the sleep_us routlinely overruns significantly

usleep_range() 5000 samples - idle system
 100,200         200,200         190,200
 Min.   :188481  Min.   :201917  Min.   :197793
 1st Qu.:207062  1st Qu.:207057  1st Qu.:207051
 Median :207139  Median :207133  Median :207133
 Mean   :207254  Mean   :207233  Mean   :207244
 3rd Qu.:207341  erd Qu.:207262  3rd Qu.:207610
 Max.   :225340  Max.   :214222  Max.   :214885

CONFIG_PREEMPT_VOLUNTARY=y
usleep_range() 5000 samples - load ~ 8
 100,200         190,200          200,200
 Min.   : 107812 Min.   :  203307 Min.   :  203432
 1st Qu.: 558221 1st Qu.:  557490 1st Qu.:  510356
 Median :1123425 Median : 1121939 Median : 1123316
 Mean   :1103718 Mean   : 1100965 Mean   : 1100542
 3rd Qu.:1541986 3rd Qu.: 1531478 3rd Qu.: 1517414
 Max.   :8979183 Max.   :13765789 Max.   :12476136

CONFIG_PREEMPT=y
usleep_range() 5000 samples - load ~ 8
 100,200          190,200          200,200
 Min.   :  115321 Min.   :  203963 Min.   :  203864
 1st Qu.:  510296 1st Qu.:  451479 1st Qu.:  548131
 Median : 1148660 Median : 1062576 Median : 1145228
 Mean   : 1193449 Mean   : 1079379 Mean   : 1154728
 3rd Qu.: 1601552 3rd Qu.: 1378622 3rd Qu.: 1570742
 Max.   :12936192 Max.   :12346313 Max.   :13858732

so really small sleep_us make no sense I think - setting it to 0
tight-loop might be justified for small timeout_us (say 10us)
but long busy-wait loops are bad (and probably technically not
that sensible ither). If a busy-wait loop does not get the data/state
it wants within a few loops then busy-waiting is nonsentical and
that is why the intent of the exponential back-off solution does
a few tight-loops and then switches to sleeping delays.

It might be necessary to set it to a more fine grain stepping than
the brute-force *2 - but the principle I think would be better
than what is being done now.

and thanks for your comments !

thx!
hofrat

Patch hide | download patch | download mbox

diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
index d29e1e2..788c6b1 100644
--- a/include/linux/iopoll.h
+++ b/include/linux/iopoll.h
@@ -40,10 +40,12 @@ 
  * When available, you'll probably want to use one of the specialized
  * macros defined below rather than this macro directly.
  */
-#define readx_poll_timeout(op, addr, val, cond, sleep_us, timeout_us)	\
+#define readx_poll_timeout(op, addr, val, cond, sleep_us, timeout_us)   \
 ({ \
 	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
-	might_sleep_if(sleep_us); \
+	might_sleep(); \
+	unsigned long min = 0; \
+	unsigned long max = sleep_us | 1; \
 	for (;;) { \
 		(val) = op(addr); \
 		if (cond) \
@@ -52,8 +54,10 @@ 
 			(val) = op(addr); \
 			break; \
 		} \
-		if (sleep_us) \
-			usleep_range((sleep_us >> 2) + 1, sleep_us); \
+		if (min >= 10) \
+			usleep_range(min, max); \
+		max <<= 1; \
+		min = max >> 2; \
 	} \
 	(cond) ? 0 : -ETIMEDOUT; \
 })