spin loop primitives for busy waiting

Submitted by Nicholas Piggin on May 11, 2017, 4:57 p.m.

Details

Message ID 20170511165727.17847-1-npiggin@gmail.com
State Not Applicable
Headers show

Commit Message

Nicholas Piggin May 11, 2017, 4:57 p.m.
Current busy-wait loops are implemented by repeatedly calling cpu_relax()
to give an arch option for a low-latency option to improve power and/or
SMT resource contention.

This poses some difficulties for powerpc, which has SMT priority setting
instructions (priorities determine how ifetch cycles are apportioned).
powerpc's cpu_relax() is implemented by setting a low priority then
setting normal priority. This has several problems:

 - Changing thread priority can have some execution cost and potential
   impact to other threads in the core. It's inefficient to execute them
   every time around a busy-wait loop.

 - Depending on implementation details, a `low ; medium` sequence may
   not have much if any affect. Some software with similar pattern
   actually inserts a lot of nops between, in order to cause a few fetch
   cycles with the low priority.

 - The busy-wait loop runs with regular priority. This might only be a few
   fetch cycles, but if there are several threads running such loops, they
   could cause a noticable impact on a non-idle thread.

Implement spin_begin, spin_end primitives that can be used around busy
wait loops, which default to no-ops. And spin_cpu_relax which defaults to
cpu_relax.

This will allow architectures to hook the entry and exit of busy-wait
loops, and will allow powerpc to set low SMT priority at entry, and
normal priority at exit.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

Hi Linus,

Since last discussion of this, I changed the interface to match what you
suggested (e.g., just start/end to be called as a pair from anywhere
in control flow).

If you find this acceptable, I'd like to start wiring in the powerpc
and adding the annotations to some important core spin loops (there's
not too many really). I'm hoping if you take this patch during this
merge window, I'll be able to start sending small patches to maintainers
for the next window. Unless you have a better suggestion for how to
deal with this.

Thanks,
Nick

 include/linux/processor.h | 62 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 include/linux/processor.h

Comments

Linus Torvalds May 11, 2017, 6:47 p.m.
On Thu, May 11, 2017 at 9:57 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> If you find this acceptable, I'd like to start wiring in the powerpc
> and adding the annotations to some important core spin loops (there's
> not too many really). I'm hoping if you take this patch during this
> merge window, I'll be able to start sending small patches to maintainers
> for the next window. Unless you have a better suggestion for how to
> deal with this.

This looks fine to me as an interface, and yes, I guess I can just
take this scaffolding as-is.

The one question I have is about "spin_on_cond()": since you
explicitly document that the "no spinning" case is expected to be the
default, I really think that the default implementation should be
along the lines if

  #define spin_on_cond(cond) do { \
    if (unlikely(!(cond))) { \
        spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \
    } \
  } while (0)

which will actually result in better code generation even if
spin_begin/end() are no-ops, and just generally optimizes for the
right behavior (ie do the spinning out-of-line, since by definition it
cannot be performance-critical after the first iteration).

So it's better even when you don't really have the begin/end overhead,
but that version of "spin_on_cond()" is then likely to work _much_
better if you actually do have it, when your version would seem to be
entirely unacceptable.

Hmm?

In fact, do you even want to make that "spin_on_cond()" version
conditional? I don't see how an architecture can really improve on it.
If an architecture wants to use things like "wait on this particular
variable", then that generic "cond" version is too generic an
interface anyway.

                  Linus
Nicholas Piggin May 11, 2017, 7:26 p.m.
On Thu, 11 May 2017 11:47:47 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, May 11, 2017 at 9:57 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > If you find this acceptable, I'd like to start wiring in the powerpc
> > and adding the annotations to some important core spin loops (there's
> > not too many really). I'm hoping if you take this patch during this
> > merge window, I'll be able to start sending small patches to maintainers
> > for the next window. Unless you have a better suggestion for how to
> > deal with this.  
> 
> This looks fine to me as an interface, and yes, I guess I can just
> take this scaffolding as-is.
> 
> The one question I have is about "spin_on_cond()": since you
> explicitly document that the "no spinning" case is expected to be the
> default, I really think that the default implementation should be
> along the lines if
> 
>   #define spin_on_cond(cond) do { \
>     if (unlikely(!(cond))) { \
>         spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \
>     } \
>   } while (0)
> 
> which will actually result in better code generation even if
> spin_begin/end() are no-ops, and just generally optimizes for the
> right behavior (ie do the spinning out-of-line, since by definition it
> cannot be performance-critical after the first iteration).
> 
> So it's better even when you don't really have the begin/end overhead,
> but that version of "spin_on_cond()" is then likely to work _much_
> better if you actually do have it, when your version would seem to be
> entirely unacceptable.
> 
> Hmm?

I agree completely. But might it depend on architecture and so
just be something they could do themselves?

> In fact, do you even want to make that "spin_on_cond()" version
> conditional? I don't see how an architecture can really improve on it.
> If an architecture wants to use things like "wait on this particular
> variable", then that generic "cond" version is too generic an
> interface anyway.

So what I want is to be able to have architectures control the entire
loop, and you can do some interesting things with goto labels and
coding branches how you want them.

That was my motivation for doing the previous weird spin_do {} spin_while()
interface. But since you didn't like it, and I realized that kind of
control probably only matters for a very small subset of spin loops
(e.g., bit spinlock/seqlock, some idle loops in the scheduler, etc. that
do just tend to spin on a quite simple condition).

For example, if you sent the branch prediction to statically predict
the loop exits, then instead of taking a branch miss the first thing
you do when your lock gets freed, you will be doing useful instructions.

I haven't verified and done enough analysis yet to know if that's going
to be worthwhile yet, but I thought for some of those simple spins, it's
a nicer interface than the begin/relax/end.

Thanks,
Nick
David Laight May 12, 2017, 12:58 p.m.
From: Linus Torvalds

> Sent: 11 May 2017 19:48

...
> The one question I have is about "spin_on_cond()": since you

> explicitly document that the "no spinning" case is expected to be the

> default, I really think that the default implementation should be

> along the lines if

> 

>   #define spin_on_cond(cond) do { \

>     if (unlikely(!(cond))) { \

>         spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \

>     } \

>   } while (0)

> 

> which will actually result in better code generation even if

> spin_begin/end() are no-ops, and just generally optimizes for the

> right behavior (ie do the spinning out-of-line, since by definition it

> cannot be performance-critical after the first iteration).


At least some versions of gcc convert while (cond) do {body}
into if (cond) do {body} while (cond) even when 'cond'
is a non-trivial expression and 'body' is trivial.
The code-bloat is silly.
No point enforcing the 'optimisation' here.

	David
Linus Torvalds May 12, 2017, 3:56 p.m.
On Fri, May 12, 2017 at 5:58 AM, David Laight <David.Laight@aculab.com> wrote:
>
> At least some versions of gcc convert while (cond) do {body}
> into if (cond) do {body} while (cond) even when 'cond'
> is a non-trivial expression and 'body' is trivial.

Afaik pretty much all versions of gcc do that, unless you use -Os
(which we have effectively stopped doing because it causes so many
other problems in stupid instruction choices etc).

> The code-bloat is silly.
> No point enforcing the 'optimisation' here.

Actually, for places where we expect cold code and the loop to be
often run zero times, it actually makes sense. When we can make the
initial "if()" be unlikely, and gcc can then generate the unlikely
code all out of line, then the while -> if+do-while construction makes
sense.

It's the "normal" while loops that it's sad how big code gcc
generates, considering that most of our loop counts are vert small
(but generally nonzero).

               Linus
Nicholas Piggin May 12, 2017, 3:56 p.m.
On Fri, 12 May 2017 12:58:12 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Linus Torvalds
> > Sent: 11 May 2017 19:48  
> ...
> > The one question I have is about "spin_on_cond()": since you
> > explicitly document that the "no spinning" case is expected to be the
> > default, I really think that the default implementation should be
> > along the lines if
> > 
> >   #define spin_on_cond(cond) do { \
> >     if (unlikely(!(cond))) { \
> >         spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \
> >     } \
> >   } while (0)
> > 
> > which will actually result in better code generation even if
> > spin_begin/end() are no-ops, and just generally optimizes for the
> > right behavior (ie do the spinning out-of-line, since by definition it
> > cannot be performance-critical after the first iteration).  
> 
> At least some versions of gcc convert while (cond) do {body}
> into if (cond) do {body} while (cond) even when 'cond'
> is a non-trivial expression and 'body' is trivial.
> The code-bloat is silly.
> No point enforcing the 'optimisation' here.

The point is for something like this:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
        unsigned ret;

repeat:
        ret = READ_ONCE(s->sequence);
        if (unlikely(ret & 1)) {
                cpu_relax();
                goto repeat;
        }
        return ret;
}

to be coded as:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
        unsigned ret;

	spin_on_cond( !((ret = READ_ONCE(s->sequence)) & 1) );

        return ret;
}

That's about as complex as you'd want to go with this, but I think
it's a reasonable case.

Now for x86, you would want these to fall out to the same code
generated. For powerpc, you do not want those spin_begin(); spin_end();

You are right there's a bit of code bloat there. It gets moved out
of line, but gcc still isn't all that smart about it though, and
it doesn't fold the tests back nicely if I go with Linus's suggestion,
so it doesn't work so well as generic implementation.

For powerpc we have to live with it I think.

Thanks,
Nick

Patch hide | download patch | download mbox

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..0a058aaa9bab
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,62 @@ 
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should and avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive including nested uses
+ * of these primitives. It should not lock or take any other resource.
+ * Violations of these guidelies will not cause a bug, but may cause sub
+ * optimal performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the loop body.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+/*
+ * spin_on_cond can be used to wait for a condition to become true. It
+ * may be expected that the first iteration will true in the common case
+ * (no spinning).
+ */
+#ifndef spin_on_cond
+#define spin_on_cond(cond)					\
+do {								\
+	spin_begin();						\
+	while (cond)						\
+		spin_cpu_relax();				\
+	spin_end();						\
+} while (0)
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */