
[RFC] v5 expedited "big hammer" RCU grace periods

Message ID 20090517191141.GA25915@linux.vnet.ibm.com
State RFC, archived
Delegated to: David Miller

Commit Message

Paul E. McKenney May 17, 2009, 7:11 p.m. UTC
Fifth cut of "big hammer" expedited RCU grace periods.  This uses per-CPU
kthreads that are scheduled in parallel by a call to smp_call_function()
by yet another kthread.  The synchronize_sched(), synchronize_rcu(),
and synchronize_rcu_bh() primitives wake this kthread up and then wait for
it to force the grace period.

As before, this does nothing to expedite callbacks already registered
with call_rcu() or call_rcu_bh(), but there is no need to.  On preemptable
RCU, which has more complex grace-period detection, the expedited
primitives simply map to synchronize_rcu() and a new synchronize_rcu_bh();
this can be fixed later.

Passes light rcutorture testing.  Grace periods take about 45 microseconds
on an 8-CPU Power machine, which I believe is good enough from a
performance viewpoint.  Scalability may eventually need to be addressed
in the smp_call_function() primitive and perhaps also in the scan through
the CPUs that determines when all have completed.  Although this is most
definitely not ready for inclusion, it seems to be a promising approach.
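
For orientation, here is a minimal sketch of the caller-side wake-and-wait
flow just described.  The "need" side names match snippets quoted later in
this thread; the "done" side names are illustrative guesses, and the forcing
kthread itself (krcu_sched_expedited()) is quoted further down:

	#include <linux/mutex.h>
	#include <linux/wait.h>

	static DEFINE_MUTEX(rcu_sched_expedited_mutex);
	static DECLARE_WAIT_QUEUE_HEAD(need_sched_expedited_wq);
	static DECLARE_WAIT_QUEUE_HEAD(sched_expedited_done_wq);
	static int need_sched_expedited;
	static int sched_expedited_done;

	/* Caller side: kick the forcing kthread, then wait for it to report
	 * that every CPU has passed through a quiescent state. */
	void synchronize_sched_expedited(void)
	{
		mutex_lock(&rcu_sched_expedited_mutex);	/* one request at a time */
		sched_expedited_done = 0;
		need_sched_expedited = 1;
		wake_up(&need_sched_expedited_wq);	/* wake krcu_sched_expedited() */
		wait_event(sched_expedited_done_wq, sched_expedited_done);
		mutex_unlock(&rcu_sched_expedited_mutex);
	}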

Shortcomings:

o	CPU hotplug testing would probably cause it to die horribly.

o	It is not clear to me why I retained the single non-per-CPU
	kthread; it can probably be dispensed with.

o	The per-CPU kthreads do not boost themselves to real-time
	priority, and thus could be blocked by real-time processes.
	Use of real-time priority might also speed things up a bit.

o	Contains random debug statements.

o	Does not address preemptable RCU.

Changes since v4:

o	Use per-CPU kthreads to force the quiescent states in parallel.

Changes since v3:

o	Use a kthread that schedules itself on each CPU in turn to
	force a grace period.  The synchronize_rcu() primitive
	wakes up the kthread in order to avoid messing with affinity
	masks on user tasks.

o	Tried a number of additional variations on the v3 approach, none
	of which helped much.

Changes since v2:

o	Use reschedule IPIs rather than a softirq.

Changes since v1:

o	Added rcutorture support, and added exports required by
	rcutorture.

o	Added comment stating that smp_call_function() implies a
	memory barrier, suggested by Mathieu.

o	Added #include for delay.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcuclassic.h |   16 ++
 include/linux/rcupdate.h   |   25 ++--
 include/linux/rcupreempt.h |   10 +
 include/linux/rcutree.h    |   13 +-
 kernel/rcupdate.c          |  273 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcupreempt.c        |    1 
 kernel/rcutorture.c        |  200 +++++++++++++++++---------------
 7 files changed, 432 insertions(+), 106 deletions(-)


Comments

Evgeniy Polyakov May 17, 2009, 8:02 p.m. UTC | #1
Hi.

On Sun, May 17, 2009 at 12:11:41PM -0700, Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> Fifth cut of "big hammer" expedited RCU grace periods.  This uses per-CPU
> kthreads that are scheduled in parallel by a call to smp_call_function()
> by yet another kthread.  The synchronize_sched(), synchronize_rcu(),
> and synchronize_bh() primitives wake this kthread up and then wait for
> it to force the grace period.

I'm curious: doesn't the fact that a registered 'barrier' callback is
invoked mean that the grace period has completed?  I.e., why bother with
rescheduling, waiting for the thread to complete, and so on, when all we
care about is that the 'barrier' callback is invoked, and thus that all
previous ones have completed?
Or is it done just for simplicity, since the rescheduling machinery
already manages the RCU bits correctly, and you do not want to put it
directly into a 'barrier' callback?
Paul E. McKenney May 17, 2009, 10:08 p.m. UTC | #2
On Mon, May 18, 2009 at 12:02:23AM +0400, Evgeniy Polyakov wrote:
> Hi.

Hello, Evgeniy!

> On Sun, May 17, 2009 at 12:11:41PM -0700, Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > Fifth cut of "big hammer" expedited RCU grace periods.  This uses per-CPU
> > kthreads that are scheduled in parallel by a call to smp_call_function()
> > by yet another kthread.  The synchronize_sched(), synchronize_rcu(),
> > and synchronize_bh() primitives wake this kthread up and then wait for
> > it to force the grace period.
> 
> I'm curious: doesn't the fact that a registered 'barrier' callback is
> invoked mean that the grace period has completed?  I.e., why bother with
> rescheduling, waiting for the thread to complete, and so on, when all we
> care about is that the 'barrier' callback is invoked, and thus that all
> previous ones have completed?
> Or is it done just for simplicity, since the rescheduling machinery
> already manages the RCU bits correctly, and you do not want to put it
> directly into a 'barrier' callback?

It is a short-term expedient course of action.  Longer term, I will drop
rcuclassic in favor of rcutree, and then merge rcupreempt into rcutree.
I will then add machinery to rcutree to handle expedited grace periods
(somewhat) more naturally.  Trying to expedite three very different RCU
implementations seems a bit silly, hence the current off-on-the-side
approach.

But even then I will avoid relying on a "barrier" callback, or, indeed,
any sort of callback, because we don't want expedited grace periods to
have to wait on invocation of earlier RCU callbacks.  There will thus
not be a call_rcu_expedited(), at least not unless someone comes up with
a -really- compelling reason why.

But the exercise of going through several possible implementations was
quite useful, as I learned a number of things that will improve the
eventual rcutree implementation.  Like the fact that expedited grace
periods don't want to be waiting on invocation of prior callbacks.  ;-)

And rcutiny is, as always, a special case.  Here is the implementation
of synchronize_rcu_expedited() in rcutiny:

	void synchronize_rcu_expedited(void)
	{
	}

Or even:

	#define synchronize_rcu_expedited synchronize_rcu

;-)

							Thanx, Paul
Lai Jiangshan May 18, 2009, 6:59 a.m. UTC | #3
Paul E. McKenney wrote:
> +void sched_expedited_wake(void *unused)
> +{
> +	mutex_lock(&__get_cpu_var(sched_expedited_done_mutex));
> +	if (__get_cpu_var(sched_expedited_done_qs) ==
> +	    SCHED_EXPEDITED_QS_DONE_QS) {
> +		__get_cpu_var(sched_expedited_done_qs) =
> +			SCHED_EXPEDITED_QS_NEED_QS;
> +		wake_up(&__get_cpu_var(sched_expedited_qs_wq));
> +	}
> +	mutex_unlock(&__get_cpu_var(sched_expedited_done_mutex));
> +}

[...]

> +		get_online_cpus();
> +		preempt_disable();
> +		mycpu = smp_processor_id();
> +		smp_call_function(sched_expedited_wake, NULL, 1);

sched_expedited_wake() calls mutex_lock() which may sleep?

And I think you have re-implemented workqueues here.

Thanks, Lai.

Ingo Molnar May 18, 2009, 7:56 a.m. UTC | #4
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> +void sched_expedited_wake(void *unused)
> +{
> +	mutex_lock(&__get_cpu_var(sched_expedited_done_mutex));
> +	if (__get_cpu_var(sched_expedited_done_qs) ==
> +	    SCHED_EXPEDITED_QS_DONE_QS) {
> +		__get_cpu_var(sched_expedited_done_qs) =
> +			SCHED_EXPEDITED_QS_NEED_QS;
> +		wake_up(&__get_cpu_var(sched_expedited_qs_wq));
> +	}
> +	mutex_unlock(&__get_cpu_var(sched_expedited_done_mutex));
> +}

( hm, IPI handlers are supposed to be atomic. )

> +/*
> + * Kernel thread that processes synchronize_sched_expedited() requests.
> + * This is implemented as a separate kernel thread to avoid the need
> + * to mess with other tasks' cpumasks.
> + */
> +static int krcu_sched_expedited(void *arg)
> +{
> +	int cpu;
> +	int mycpu;
> +	int nwait;
> +
> +	do {
> +		wait_event_interruptible(need_sched_expedited_wq,
> +					 need_sched_expedited);
> +		smp_mb(); /* In case we didn't sleep. */
> +		if (!need_sched_expedited)
> +			continue;
> +		need_sched_expedited = 0;
> +		get_online_cpus();
> +		preempt_disable();
> +		mycpu = smp_processor_id();
> +		smp_call_function(sched_expedited_wake, NULL, 1);
> +		preempt_enable();

i might be missing something fundamental here, but why not just have 
per CPU helper threads, all on the same waitqueue, and wake them up 
via a single wake_up() call? That would remove the SMP cross call 
(wakeups do immediate cross-calls already).

Even more - we already have a per-CPU, high RT priority helper 
thread that could be reused: the per CPU migration threads. Couldn't 
we queue these requests to them? RCU is arguably closely related to 
scheduling so there's no layering violation IMO.

There's already a struct migration_req machinery that performs 
something quite similar. (do work on behalf of another task, on a 
specific CPU, and then signal completion)

Also, per CPU workqueues have similar features as well.

	Ingo
Paul E. McKenney May 18, 2009, 2:40 p.m. UTC | #5
On Mon, May 18, 2009 at 02:59:52PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > +void sched_expedited_wake(void *unused)
> > +{
> > +	mutex_lock(&__get_cpu_var(sched_expedited_done_mutex));
> > +	if (__get_cpu_var(sched_expedited_done_qs) ==
> > +	    SCHED_EXPEDITED_QS_DONE_QS) {
> > +		__get_cpu_var(sched_expedited_done_qs) =
> > +			SCHED_EXPEDITED_QS_NEED_QS;
> > +		wake_up(&__get_cpu_var(sched_expedited_qs_wq));
> > +	}
> > +	mutex_unlock(&__get_cpu_var(sched_expedited_done_mutex));
> > +}
> 
> [...]
> 
> > +		get_online_cpus();
> > +		preempt_disable();
> > +		mycpu = smp_processor_id();
> > +		smp_call_function(sched_expedited_wake, NULL, 1);
> 
> sched_expedited_wake() calls mutex_lock() which may sleep?

Good eyes!  Fixing this and the failure to release this lock in
krcu_sched_expedited_percpu() allows it to survive 10 hours of
rcutorture running in parallel with onlining/offlining random CPUs.
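
For concreteness, one IPI-safe shape for the handler is sketched below,
assuming the per-CPU mutex is simply replaced by a per-CPU spinlock (an
illustration only, not necessarily the actual fix); both spin_lock_irqsave()
and wake_up() are safe in hardirq context:

	#include <linux/percpu.h>
	#include <linux/spinlock.h>
	#include <linux/wait.h>

	static DEFINE_PER_CPU(spinlock_t, sched_expedited_done_lock) =
		__SPIN_LOCK_UNLOCKED(sched_expedited_done_lock);

	void sched_expedited_wake(void *unused)
	{
		unsigned long flags;

		/* Runs in hardirq (IPI) context, so no sleeping locks here. */
		spin_lock_irqsave(&__get_cpu_var(sched_expedited_done_lock), flags);
		if (__get_cpu_var(sched_expedited_done_qs) ==
		    SCHED_EXPEDITED_QS_DONE_QS) {
			__get_cpu_var(sched_expedited_done_qs) =
				SCHED_EXPEDITED_QS_NEED_QS;
			wake_up(&__get_cpu_var(sched_expedited_qs_wq));
		}
		spin_unlock_irqrestore(&__get_cpu_var(sched_expedited_done_lock),
				       flags);
	}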

> And I think you have re-implemented workqueues here.

Hmmm...  I suppose I could use schedule_work(), though I am concerned
about interference from other work.  But I will give this some thought.

							Thanx, Paul
Paul E. McKenney May 18, 2009, 3:14 p.m. UTC | #6
On Mon, May 18, 2009 at 09:56:30AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > +void sched_expedited_wake(void *unused)
> > +{
> > +	mutex_lock(&__get_cpu_var(sched_expedited_done_mutex));
> > +	if (__get_cpu_var(sched_expedited_done_qs) ==
> > +	    SCHED_EXPEDITED_QS_DONE_QS) {
> > +		__get_cpu_var(sched_expedited_done_qs) =
> > +			SCHED_EXPEDITED_QS_NEED_QS;
> > +		wake_up(&__get_cpu_var(sched_expedited_qs_wq));
> > +	}
> > +	mutex_unlock(&__get_cpu_var(sched_expedited_done_mutex));
> > +}
> 
> ( hm, IPI handlers are supposed to be atomic. )

<red face>

> > +/*
> > + * Kernel thread that processes synchronize_sched_expedited() requests.
> > + * This is implemented as a separate kernel thread to avoid the need
> > + * to mess with other tasks' cpumasks.
> > + */
> > +static int krcu_sched_expedited(void *arg)
> > +{
> > +	int cpu;
> > +	int mycpu;
> > +	int nwait;
> > +
> > +	do {
> > +		wait_event_interruptible(need_sched_expedited_wq,
> > +					 need_sched_expedited);
> > +		smp_mb(); /* In case we didn't sleep. */
> > +		if (!need_sched_expedited)
> > +			continue;
> > +		need_sched_expedited = 0;
> > +		get_online_cpus();
> > +		preempt_disable();
> > +		mycpu = smp_processor_id();
> > +		smp_call_function(sched_expedited_wake, NULL, 1);
> > +		preempt_enable();
> 
> i might be missing something fundamental here, but why not just have 
> per CPU helper threads, all on the same waitqueue, and wake them up 
> via a single wake_up() call? That would remove the SMP cross call 
> (wakeups do immediate cross-calls already).

My concern with this is that the cache misses accessing all the processes
on this single waitqueue would be serialized, slowing things down.
In contrast, the bitmask that smp_call_function() traverses delivers on
the order of a thousand CPUs' worth of bits per cache miss.  I will give
it a try, though.

> Even more - we already have a per-CPU, high RT priority helper 
> thread that could be reused: the per CPU migration threads. Couldnt 
> we queue these requests to them? RCU is arguably closely related to 
> scheduling so there's no layering violation IMO.
> 
> There's already a struct migration_req machinery that performs 
> something quite similar. (do work on behalf of another task, on a 
> specific CPU, and then signal completion)
> 
> Also, per CPU workqueues have similar features as well.

Good points!!!

I will post a working patch using my current approach, then try out some
of these approaches.

							Thanx, Paul
Ingo Molnar May 18, 2009, 3:42 p.m. UTC | #7
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > i might be missing something fundamental here, but why not just 
> > have per CPU helper threads, all on the same waitqueue, and wake 
> > them up via a single wake_up() call? That would remove the SMP 
> > cross call (wakeups do immediate cross-calls already).
> 
> My concern with this is that the cache misses accessing all the 
> processes on this single waitqueue would be serialized, slowing 
> things down. In contrast, the bitmask that smp_call_function() 
> traverses delivers on the order of a thousand CPUs' worth of bits 
> per cache miss.  I will give it a try, though.

At least if you go via the migration threads, you can queue up 
requests to them locally. But there's going to be cachemisses 
_anyway_, since you have to access them all from a single CPU, and 
then they have to fetch details about what to do, and then have to 
notify the originator about completion.

	Ingo
Paul E. McKenney May 18, 2009, 4:02 p.m. UTC | #8
On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > i might be missing something fundamental here, but why not just 
> > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > them up via a single wake_up() call? That would remove the SMP 
> > > cross call (wakeups do immediate cross-calls already).
> > 
> > My concern with this is that the cache misses accessing all the 
> > processes on this single waitqueue would be serialized, slowing 
> > things down. In contrast, the bitmask that smp_call_function() 
> > traverses delivers on the order of a thousand CPUs' worth of bits 
> > per cache miss.  I will give it a try, though.
> 
> At least if you go via the migration threads, you can queue up 
> requests to them locally. But there's going to be cachemisses 
> _anyway_, since you have to access them all from a single CPU, and 
> then they have to fetch details about what to do, and then have to 
> notify the originator about completion.

Ah, so you are suggesting that I use smp_call_function() to run code on
each CPU that wakes up that CPU's migration thread?  I will take a look
at this.

							Thanx, Paul
Ingo Molnar May 19, 2009, 8:58 a.m. UTC | #9
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > > i might be missing something fundamental here, but why not just 
> > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > them up via a single wake_up() call? That would remove the SMP 
> > > > cross call (wakeups do immediate cross-calls already).
> > > 
> > > My concern with this is that the cache misses accessing all the 
> > > processes on this single waitqueue would be serialized, slowing 
> > > things down. In contrast, the bitmask that smp_call_function() 
> > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > per cache miss.  I will give it a try, though.
> > 
> > At least if you go via the migration threads, you can queue up 
> > requests to them locally. But there's going to be cachemisses 
> > _anyway_, since you have to access them all from a single CPU, 
> > and then they have to fetch details about what to do, and then 
> > have to notify the originator about completion.
> 
> Ah, so you are suggesting that I use smp_call_function() to run 
> code on each CPU that wakes up that CPU's migration thread?  I 
> will take a look at this.

My suggestion was to queue up a dummy 'struct migration_req' up with 
it (change migration_req::task == NULL to mean 'nothing') and simply 
wake it up using wake_up_process().

That will force a quiescent state, without the need for any extra 
information, right?

This is what the scheduler code does, roughly:

                wake_up_process(rq->migration_thread);
                wait_for_completion(&req.done);

and this will always have to perform well. The 'req' could be put 
into PER_CPU, and a loop could be done like this:

	for_each_online_cpu(cpu)
                wake_up_process(cpu_rq(cpu)->migration_thread);

	for_each_online_cpu(cpu)
                wait_for_completion(&per_cpu(req, cpu).done);

hm?
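
Fleshing that loop out a little -- purely a sketch, assuming it lives in
kernel/sched.c (so that struct migration_req, cpu_rq() and the runqueue
internals are visible) and that migration_req::task == NULL is taught to
mean "no migration, just complete":

	static DEFINE_PER_CPU(struct migration_req, rcu_expedited_req);

	void synchronize_sched_expedited(void)
	{
		int cpu;
		struct migration_req *req;
		struct rq *rq;

		get_online_cpus();		/* keep the online set stable */
		for_each_online_cpu(cpu) {
			req = &per_cpu(rcu_expedited_req, cpu);
			rq = cpu_rq(cpu);
			req->task = NULL;	/* dummy request: nothing to migrate */
			init_completion(&req->done);
			spin_lock_irq(&rq->lock);
			list_add(&req->list, &rq->migration_queue);
			spin_unlock_irq(&rq->lock);
			wake_up_process(rq->migration_thread);
		}
		for_each_online_cpu(cpu)
			wait_for_completion(&per_cpu(rcu_expedited_req, cpu).done);
		put_online_cpus();
	}

migration_thread() would then need a small check to skip __migrate_task()
and just complete(&req->done) when req->task is NULL.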

	Ingo
Paul E. McKenney May 19, 2009, 12:33 p.m. UTC | #10
On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > > i might be missing something fundamental here, but why not just 
> > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > cross call (wakeups do immediate cross-calls already).
> > > > 
> > > > My concern with this is that the cache misses accessing all the 
> > > > processes on this single waitqueue would be serialized, slowing 
> > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > per cache miss.  I will give it a try, though.
> > > 
> > > At least if you go via the migration threads, you can queue up 
> > > requests to them locally. But there's going to be cachemisses 
> > > _anyway_, since you have to access them all from a single CPU, 
> > > and then they have to fetch details about what to do, and then 
> > > have to notify the originator about completion.
> > 
> > Ah, so you are suggesting that I use smp_call_function() to run 
> > code on each CPU that wakes up that CPU's migration thread?  I 
> > will take a look at this.
> 
> My suggestion was to queue up a dummy 'struct migration_req' up with 
> it (change migration_req::task == NULL to mean 'nothing') and simply 
> wake it up using wake_up_process().

OK.  I was thinking of just using wake_up_process() without the
migration_req structure, and unconditionally setting a per-CPU
variable from within migration_thread() just before the list_empty()
check.  In your approach we would need a NULL-pointer check just
before the call to __migrate_task().
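
A sketch of that variant (all names invented here for illustration, and
again assuming kernel/sched.c scope so that cpu_rq() is visible):

	static DEFINE_PER_CPU(int, rcu_expedited_qs_passed);

	/* Called unconditionally from migration_thread(), just before its
	 * list_empty() check; getting there implies this CPU has context
	 * switched, which is a sched-RCU quiescent state. */
	void rcu_expedited_note_qs(void)
	{
		__get_cpu_var(rcu_expedited_qs_passed) = 1;
	}

	static void rcu_expedited_wait_for_qs(void)
	{
		int cpu;

		get_online_cpus();
		for_each_online_cpu(cpu)
			per_cpu(rcu_expedited_qs_passed, cpu) = 0;
		smp_mb();	/* clear flags before waking the threads */
		for_each_online_cpu(cpu)
			wake_up_process(cpu_rq(cpu)->migration_thread);
		for_each_online_cpu(cpu)
			while (!per_cpu(rcu_expedited_qs_passed, cpu))
				cpu_relax();
		put_online_cpus();
	}

(The busy-wait is only to keep the sketch short; per-CPU completions, as in
the migration_req version, would be the better-behaved choice.)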

> That will force a quiescent state, without the need for any extra 
> information, right?

Yep!

> This is what the scheduler code does, roughly:
> 
>                 wake_up_process(rq->migration_thread);
>                 wait_for_completion(&req.done);
> 
> and this will always have to perform well. The 'req' could be put 
> into PER_CPU, and a loop could be done like this:
> 
> 	for_each_online_cpu(cpu)
>                 wake_up_process(cpu_rq(cpu)->migration_thread);
> 
> 	for_each_online_cpu(cpu)
>                 wait_for_completion(&per_cpu(req, cpu).done);
> 
> hm?

My concern is the linear slowdown for large systems, but this should be
OK for modest systems (a few 10s of CPUs).  However, I will try it out --
it does not need to be a long-term solution, after all.

							Thanx, Paul
Ingo Molnar May 19, 2009, 12:44 p.m. UTC | #11
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > > i might be missing something fundamental here, but why not just 
> > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > 
> > > > > My concern with this is that the cache misses accessing all the 
> > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > per cache miss.  I will give it a try, though.
> > > > 
> > > > At least if you go via the migration threads, you can queue up 
> > > > requests to them locally. But there's going to be cachemisses 
> > > > _anyway_, since you have to access them all from a single CPU, 
> > > > and then they have to fetch details about what to do, and then 
> > > > have to notify the originator about completion.
> > > 
> > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > will take a look at this.
> > 
> > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > wake it up using wake_up_process().
> 
> OK.  I was thinking of just using wake_up_process() without the
> migration_req structure, and unconditionally setting a per-CPU
> variable from within migration_thread() just before the list_empty()
> check.  In your approach we would need a NULL-pointer check just
> before the call to __migrate_task().
> 
> > That will force a quiescent state, without the need for any extra 
> > information, right?
> 
> Yep!
> 
> > This is what the scheduler code does, roughly:
> > 
> >                 wake_up_process(rq->migration_thread);
> >                 wait_for_completion(&req.done);
> > 
> > and this will always have to perform well. The 'req' could be put 
> > into PER_CPU, and a loop could be done like this:
> > 
> > 	for_each_online_cpu(cpu)
> >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > 
> > 	for_each_online_cpu(cpu)
> >                 wait_for_completion(&per_cpu(req, cpu).done);
> > 
> > hm?
> 
> My concern is the linear slowdown for large systems, but this 
> should be OK for modest systems (a few 10s of CPUs).  However, I 
> will try it out -- it does not need to be a long-term solution, 
> after all.

I think there is going to be a linear slowdown no matter what - 
because sending that many IPIs is going to be linear. (there are no 
'broadcast to all' IPIs anymore - on x86 we only have them if all 
physical APIC IDs are 7 or smaller.)

Also, no matter what scheme we use, the target CPU does have to be 
processed somehow and it does have to signal completion back somehow 
- which generates cachemisses.

I think what probably matters most is to go simple, and to use 
established kernel primitives - and the above is really typical 
pattern for things like TLB flushes to a process having a presence 
on every physical CPU. Those aspects will be kept reasonably fast 
and balanced on all hardware that matters. (and if not, people will 
notice any TLB flush/shootdown linear slowdowns and will address it)

I could be wrong though ... maybe someone can get some numbers from 
a really large system?

	Ingo
Paul E. McKenney May 19, 2009, 4:18 p.m. UTC | #12
On Tue, May 19, 2009 at 02:44:36PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > > i might be missing something fundamental here, but why not just 
> > > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > > 
> > > > > > My concern with this is that the cache misses accessing all the 
> > > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > > per cache miss.  I will give it a try, though.
> > > > > 
> > > > > At least if you go via the migration threads, you can queue up 
> > > > > requests to them locally. But there's going to be cachemisses 
> > > > > _anyway_, since you have to access them all from a single CPU, 
> > > > > and then they have to fetch details about what to do, and then 
> > > > > have to notify the originator about completion.
> > > > 
> > > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > > will take a look at this.
> > > 
> > > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > > wake it up using wake_up_process().
> > 
> > OK.  I was thinking of just using wake_up_process() without the
> > migration_req structure, and unconditionally setting a per-CPU
> > variable from within migration_thread() just before the list_empty()
> > check.  In your approach we would need a NULL-pointer check just
> > before the call to __migrate_task().
> > 
> > > That will force a quiescent state, without the need for any extra 
> > > information, right?
> > 
> > Yep!
> > 
> > > This is what the scheduler code does, roughly:
> > > 
> > >                 wake_up_process(rq->migration_thread);
> > >                 wait_for_completion(&req.done);
> > > 
> > > and this will always have to perform well. The 'req' could be put 
> > > into PER_CPU, and a loop could be done like this:
> > > 
> > > 	for_each_online_cpu(cpu)
> > >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > > 
> > > 	for_each_online_cpu(cpu)
> > >                 wait_for_completion(&per_cpu(req, cpu).done);
> > > 
> > > hm?
> > 
> > My concern is the linear slowdown for large systems, but this 
> > should be OK for modest systems (a few 10s of CPUs).  However, I 
> > will try it out -- it does not need to be a long-term solution, 
> > after all.
> 
> I think there is going to be a linear slowdown no matter what - 
> because sending that many IPIs is going to be linear. (there are no 
> 'broadcast to all' IPIs anymore - on x86 we only have them if all 
> physical APIC IDs are 7 or smaller.)

With the current code, agreed.  One could imagine making an IPI tree,
so that a given CPU IPIs (say) eight subordinates.  Making this work
nice with CPU hotplug would be entertaining, to say the least.

> Also, no matter what scheme we use, the target CPU does have to be 
> processed somehow and it does have to signal completion back somehow 
> - which generates cachemisses.

One could in theory use a combining tree, so that results filter up,
sort of like they do in rcutree.  But given that rcutree already has a
combining tree, I would like to do this part in rcutree.

> I think what probaby matters most is to go simple, and to use 
> established kernel primitives - and the above is really typical 
> pattern for things like TLB flushes to a process having a presence 
> on every physical CPU. Those aspects will be kept reasonably fast 
> and balanced on all hardware that matters. (and if not, people will 
> notice any TLB flush/shootdown linear slowdowns and will address it)
> 
> I could be wrong though ... maybe someone can get some numbers from 
> a really large system?

In theory, I have access to a 64-way system.  In practice, it is
extremely heavily booked.

I will try your straightforward approach.

							Thanx, Paul
Ingo Molnar May 20, 2009, 8:09 a.m. UTC | #13
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, May 19, 2009 at 02:44:36PM +0200, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > > > 
> > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > > i might be missing something fundamental here, but why not just 
> > > > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > > > 
> > > > > > > My concern with this is that the cache misses accessing all the 
> > > > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > > > per cache miss.  I will give it a try, though.
> > > > > > 
> > > > > > At least if you go via the migration threads, you can queue up 
> > > > > > requests to them locally. But there's going to be cachemisses 
> > > > > > _anyway_, since you have to access them all from a single CPU, 
> > > > > > and then they have to fetch details about what to do, and then 
> > > > > > have to notify the originator about completion.
> > > > > 
> > > > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > > > will take a look at this.
> > > > 
> > > > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > > > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > > > wake it up using wake_up_process().
> > > 
> > > OK.  I was thinking of just using wake_up_process() without the
> > > migration_req structure, and unconditionally setting a per-CPU
> > > variable from within migration_thread() just before the list_empty()
> > > check.  In your approach we would need a NULL-pointer check just
> > > before the call to __migrate_task().
> > > 
> > > > That will force a quiescent state, without the need for any extra 
> > > > information, right?
> > > 
> > > Yep!
> > > 
> > > > This is what the scheduler code does, roughly:
> > > > 
> > > >                 wake_up_process(rq->migration_thread);
> > > >                 wait_for_completion(&req.done);
> > > > 
> > > > and this will always have to perform well. The 'req' could be put 
> > > > into PER_CPU, and a loop could be done like this:
> > > > 
> > > > 	for_each_online_cpu(cpu)
> > > >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > > > 
> > > > 	for_each_online_cpu(cpu)
> > > >                 wait_for_completion(&per_cpu(req, cpu).done);
> > > > 
> > > > hm?
> > > 
> > > My concern is the linear slowdown for large systems, but this 
> > > should be OK for modest systems (a few 10s of CPUs).  However, I 
> > > will try it out -- it does not need to be a long-term solution, 
> > > after all.
> > 
> > I think there is going to be a linear slowdown no matter what - 
> > because sending that many IPIs is going to be linear. (there are 
> > no 'broadcast to all' IPIs anymore - on x86 we only have them if 
> > all physical APIC IDs are 7 or smaller.)
> 
> With the current code, agreed.  One could imagine making an IPI 
> tree, so that a given CPU IPIs (say) eight subordinates.  Making 
> this work nice with CPU hotplug would be entertaining, to say the 
> least.

Certainly! :-)

As a general note, unrelated to your patches: i think our 
CPU-hotplug related complexity seems to be a bit too much. This is 
really just a gut feeling - from having seen many patches that also 
have hotplug notifiers.

I'm wondering whether this is because it's structured in a 
suboptimal way, or because i'm (intuitively) under-estimating the 
complexity of what it takes to express what happens when a CPU is 
offlined and then onlined?

	Ingo
Paul E. McKenney May 20, 2009, 3:30 p.m. UTC | #14
On Wed, May 20, 2009 at 10:09:24AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, May 19, 2009 at 02:44:36PM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > > > > 
> > > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > 
> > > > > > > > > i might be missing something fundamental here, but why not just 
> > > > > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > > > > 
> > > > > > > > My concern with this is that the cache misses accessing all the 
> > > > > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > > > > per cache miss.  I will give it a try, though.
> > > > > > > 
> > > > > > > At least if you go via the migration threads, you can queue up 
> > > > > > > requests to them locally. But there's going to be cachemisses 
> > > > > > > _anyway_, since you have to access them all from a single CPU, 
> > > > > > > and then they have to fetch details about what to do, and then 
> > > > > > > have to notify the originator about completion.
> > > > > > 
> > > > > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > > > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > > > > will take a look at this.
> > > > > 
> > > > > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > > > > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > > > > wake it up using wake_up_process().
> > > > 
> > > > OK.  I was thinking of just using wake_up_process() without the
> > > > migration_req structure, and unconditionally setting a per-CPU
> > > > variable from within migration_thread() just before the list_empty()
> > > > check.  In your approach we would need a NULL-pointer check just
> > > > before the call to __migrate_task().
> > > > 
> > > > > That will force a quiescent state, without the need for any extra 
> > > > > information, right?
> > > > 
> > > > Yep!
> > > > 
> > > > > This is what the scheduler code does, roughly:
> > > > > 
> > > > >                 wake_up_process(rq->migration_thread);
> > > > >                 wait_for_completion(&req.done);
> > > > > 
> > > > > and this will always have to perform well. The 'req' could be put 
> > > > > into PER_CPU, and a loop could be done like this:
> > > > > 
> > > > > 	for_each_online_cpu(cpu)
> > > > >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > > > > 
> > > > > 	for_each_online_cpu(cpu)
> > > > >                 wait_for_completion(&per_cpu(req, cpu).done);
> > > > > 
> > > > > hm?
> > > > 
> > > > My concern is the linear slowdown for large systems, but this 
> > > > should be OK for modest systems (a few 10s of CPUs).  However, I 
> > > > will try it out -- it does not need to be a long-term solution, 
> > > > after all.
> > > 
> > > I think there is going to be a linear slowdown no matter what - 
> > > because sending that many IPIs is going to be linear. (there are 
> > > no 'broadcast to all' IPIs anymore - on x86 we only have them if 
> > > all physical APIC IDs are 7 or smaller.)
> > 
> > With the current code, agreed.  One could imagine making an IPI 
> > tree, so that a given CPU IPIs (say) eight subordinates.  Making 
> > this work nice with CPU hotplug would be entertaining, to say the 
> > least.
> 
> Certainly! :-)
> 
> As a general note, unrelated to your patches: i think our 
> CPU-hotplug related complexity seems to be a bit too much. This is 
> really just a gut feeling - from having seen many patches that also 
> have hotplug notifiers.
> 
> I'm wondering whether this is because it's structured in a 
> suboptimal way, or because i'm (intuitively) under-estimating the 
> complexity of what it takes to express what happens when a CPU is 
> offlined and then onlined?

I suppose that I could take this as a cue to reminisce about the old days
in a past life with a different implementation of CPU online/offline,
but life is just too short for that sort of thing.  Not that guys my
age let that stop them.  ;-)

And in that past life, exercising CPU online/offline usually exposed
painful bugs in new code, so I cannot claim that the old-life approach
to CPU hotplug was perfect.  Interestingly enough, running uniprocessor
also exposed painful bugs more often than not.  Of course, the only way
to run uniprocessor was to offline all but one of the CPUs, so you would
hit the online/offline bugs before hitting the uniprocessor-only bugs.

The thing that worries me most about CPU hotplug in Linux is that
there is no clear hierarchy of CPU function in the offline process,
given that the offlining process invokes notifiers in the same order
as does the onlining process.  Whether this is a real defect in the CPU
hotplug design or is instead simply a symptom of my not yet being fully
comfortable with the two-phase CPU-removal process is an interesting
question to which I do not have an answer.

Either way, the thought process is different.  In my old life, CPUs shed
roles in the opposite order that they acquired them.  This meant that a
given CPU was naturally guaranteed to be correctly taking interrupts for
the entire time that it was capable of running user-level processes.
Later in the offlining process, it would still take interrupts, but
would be unable to run user processes.  Still later, it would no longer
be taking interrupts, and would stop participating in RCU and in the
global TLB-flush algorithm.  There was no need to stop the whole machine
to make a given CPU go offline, in fact, most of the work was done by
the CPU in question.

In the case of RCU, this meant that there was no need for double-checking
for offlined CPUs, because CPUs could reliably indicate a quiescent
state on their way out.

On the other hand, there was no equivalent of dynticks in the old days.
And it is dynticks that is responsible for most of the complexity present
in force_quiescent_state(), not CPU hotplug.

So I cannot hold up RCU as something that would be greatly simplified
by changing the CPU hotplug design, much as I might like to.  ;-)

							Thanx, Paul
Ingo Molnar May 27, 2009, 10:57 p.m. UTC | #15
* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, May 20, 2009 at 10:09:24AM +0200, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Tue, May 19, 2009 at 02:44:36PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > > > > > 
> > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > > > > > 
> > > > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > > 
> > > > > > > > > > i might be missing something fundamental here, but why not just 
> > > > > > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > > > > > 
> > > > > > > > > My concern with this is that the cache misses accessing all the 
> > > > > > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > > > > > per cache miss.  I will give it a try, though.
> > > > > > > > 
> > > > > > > > At least if you go via the migration threads, you can queue up 
> > > > > > > > requests to them locally. But there's going to be cachemisses 
> > > > > > > > _anyway_, since you have to access them all from a single CPU, 
> > > > > > > > and then they have to fetch details about what to do, and then 
> > > > > > > > have to notify the originator about completion.
> > > > > > > 
> > > > > > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > > > > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > > > > > will take a look at this.
> > > > > > 
> > > > > > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > > > > > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > > > > > wake it up using wake_up_process().
> > > > > 
> > > > > OK.  I was thinking of just using wake_up_process() without the
> > > > > migration_req structure, and unconditionally setting a per-CPU
> > > > > variable from within migration_thread() just before the list_empty()
> > > > > check.  In your approach we would need a NULL-pointer check just
> > > > > before the call to __migrate_task().
> > > > > 
> > > > > > That will force a quiescent state, without the need for any extra 
> > > > > > information, right?
> > > > > 
> > > > > Yep!
> > > > > 
> > > > > > This is what the scheduler code does, roughly:
> > > > > > 
> > > > > >                 wake_up_process(rq->migration_thread);
> > > > > >                 wait_for_completion(&req.done);
> > > > > > 
> > > > > > and this will always have to perform well. The 'req' could be put 
> > > > > > into PER_CPU, and a loop could be done like this:
> > > > > > 
> > > > > > 	for_each_online_cpu(cpu)
> > > > > >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > > > > > 
> > > > > > 	for_each_online_cpu(cpu)
> > > > > >                 wait_for_completion(&per_cpu(req, cpu).done);
> > > > > > 
> > > > > > hm?
> > > > > 
> > > > > My concern is the linear slowdown for large systems, but this 
> > > > > should be OK for modest systems (a few 10s of CPUs).  However, I 
> > > > > will try it out -- it does not need to be a long-term solution, 
> > > > > after all.
> > > > 
> > > > I think there is going to be a linear slowdown no matter what - 
> > > > because sending that many IPIs is going to be linear. (there are 
> > > > no 'broadcast to all' IPIs anymore - on x86 we only have them if 
> > > > all physical APIC IDs are 7 or smaller.)
> > > 
> > > With the current code, agreed.  One could imagine making an IPI 
> > > tree, so that a given CPU IPIs (say) eight subordinates.  Making 
> > > this work nice with CPU hotplug would be entertaining, to say the 
> > > least.
> > 
> > Certainly! :-)
> > 
> > As a general note, unrelated to your patches: i think our 
> > CPU-hotplug related complexity seems to be a bit too much. This is 
> > really just a gut feeling - from having seen many patches that also 
> > have hotplug notifiers.
> > 
> > I'm wondering whether this is because it's structured in a 
> > suboptimal way, or because i'm (intuitively) under-estimating the 
> > complexity of what it takes to express what happens when a CPU is 
> > offlined and then onlined?
> 
> I suppose that I could take this as a cue to reminisce about the 
> old days in a past life with a different implementation of CPU 
> online/offline, but life is just too short for that sort of thing.  
> Not that guys my age let that stop them.  ;-)
> 
> And in that past life, exercising CPU online/offline usually 
> exposed painful bugs in new code, so I cannot claim that the 
> old-life approach to CPU hotplug was perfect.  Interestingly 
> enough, running uniprocessor also exposed painful bugs more often 
> than not.  Of course, the only way to run uniprocessor was to 
> offline all but one of the CPUs, so you would hit the 
> online/offline bugs before hitting the uniprocessor-only bugs.
> 
> The thing that worries me most about CPU hotplug in Linux is that 
> there is no clear hierarchy of CPU function in the offline 
> process, given that the offlining process invokes notifiers in the 
> same order as does the onlining process.  Whether this is a real 
> defect in the CPU hotplug design or is instead simply a symptom of 
> my not yet being fully comfortable with the two-phase CPU-removal 
> process is an interesting question to which I do not have an 
> answer.

I strongly believe it's the former.

> Either way, the thought process is different.  In my old life, 
> CPUs shed roles in the opposite order that they acquired them.  

Yeah, that looks a whole lot more logical to do.

> This meant that a given CPU was naturally guaranteed to be 
> correctly taking interrupts for the entire time that it was 
> capable of running user-level processes. Later in the offlining 
> process, it would still take interrupts, but would be unable to 
> run user processes.  Still later, it would no longer be taking 
> interrupts, and would stop participating in RCU and in the global 
> TLB-flush algorithm.  There was no need to stop the whole machine 
> to make a given CPU go offline, in fact, most of the work was done 
> by the CPU in question.
> 
> In the case of RCU, this meant that there was no need for 
> double-checking for offlined CPUs, because CPUs could reliably 
> indicate a quiescent state on their way out.
> 
> On the other hand, there was no equivalent of dynticks in the old 
> days. And it is dynticks that is responsible for most of the 
> complexity present in force_quiescent_state(), not CPU hotplug.
> 
> So I cannot hold up RCU as something that would be greatly 
> simplified by changing the CPU hotplug design, much as I might 
> like to.  ;-)

We could probably remove a fair bit of dynticks complexity by 
removing non-dynticks and removing non-hrtimer. People could still 
force a 'periodic' interrupting mode (if they want, or if their hw 
forces that), but that would be a plain periodic hrtimer firing off 
all the time.

	Ingo
Paul E. McKenney May 29, 2009, 1:22 a.m. UTC | #16
On Thu, May 28, 2009 at 12:57:05AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Wed, May 20, 2009 at 10:09:24AM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Tue, May 19, 2009 at 02:44:36PM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > On Tue, May 19, 2009 at 10:58:25AM +0200, Ingo Molnar wrote:
> > > > > > > 
> > > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > 
> > > > > > > > On Mon, May 18, 2009 at 05:42:41PM +0200, Ingo Molnar wrote:
> > > > > > > > > 
> > > > > > > > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > > > 
> > > > > > > > > > > i might be missing something fundamental here, but why not just 
> > > > > > > > > > > have per CPU helper threads, all on the same waitqueue, and wake 
> > > > > > > > > > > them up via a single wake_up() call? That would remove the SMP 
> > > > > > > > > > > cross call (wakeups do immediate cross-calls already).
> > > > > > > > > > 
> > > > > > > > > > My concern with this is that the cache misses accessing all the 
> > > > > > > > > > processes on this single waitqueue would be serialized, slowing 
> > > > > > > > > > things down. In contrast, the bitmask that smp_call_function() 
> > > > > > > > > > traverses delivers on the order of a thousand CPUs' worth of bits 
> > > > > > > > > > per cache miss.  I will give it a try, though.
> > > > > > > > > 
> > > > > > > > > At least if you go via the migration threads, you can queue up 
> > > > > > > > > requests to them locally. But there's going to be cachemisses 
> > > > > > > > > _anyway_, since you have to access them all from a single CPU, 
> > > > > > > > > and then they have to fetch details about what to do, and then 
> > > > > > > > > have to notify the originator about completion.
> > > > > > > > 
> > > > > > > > Ah, so you are suggesting that I use smp_call_function() to run 
> > > > > > > > code on each CPU that wakes up that CPU's migration thread?  I 
> > > > > > > > will take a look at this.
> > > > > > > 
> > > > > > > My suggestion was to queue up a dummy 'struct migration_req' up with 
> > > > > > > it (change migration_req::task == NULL to mean 'nothing') and simply 
> > > > > > > wake it up using wake_up_process().
> > > > > > 
> > > > > > OK.  I was thinking of just using wake_up_process() without the
> > > > > > migration_req structure, and unconditionally setting a per-CPU
> > > > > > variable from within migration_thread() just before the list_empty()
> > > > > > check.  In your approach we would need a NULL-pointer check just
> > > > > > before the call to __migrate_task().
> > > > > > 
> > > > > > > That will force a quiescent state, without the need for any extra 
> > > > > > > information, right?
> > > > > > 
> > > > > > Yep!
> > > > > > 
> > > > > > > This is what the scheduler code does, roughly:
> > > > > > > 
> > > > > > >                 wake_up_process(rq->migration_thread);
> > > > > > >                 wait_for_completion(&req.done);
> > > > > > > 
> > > > > > > and this will always have to perform well. The 'req' could be put 
> > > > > > > into PER_CPU, and a loop could be done like this:
> > > > > > > 
> > > > > > > 	for_each_online_cpu(cpu)
> > > > > > >                 wake_up_process(cpu_rq(cpu)->migration_thread);
> > > > > > > 
> > > > > > > 	for_each_online_cpu(cpu)
> > > > > > >                 wait_for_completion(&per_cpu(req, cpu).done);
> > > > > > > 
> > > > > > > hm?
> > > > > > 
> > > > > > My concern is the linear slowdown for large systems, but this 
> > > > > > should be OK for modest systems (a few 10s of CPUs).  However, I 
> > > > > > will try it out -- it does not need to be a long-term solution, 
> > > > > > after all.
> > > > > 
> > > > > I think there is going to be a linear slowdown no matter what - 
> > > > > because sending that many IPIs is going to be linear. (there are 
> > > > > no 'broadcast to all' IPIs anymore - on x86 we only have them if 
> > > > > all physical APIC IDs are 7 or smaller.)
> > > > 
> > > > With the current code, agreed.  One could imagine making an IPI 
> > > > tree, so that a given CPU IPIs (say) eight subordinates.  Making 
> > > > this work nice with CPU hotplug would be entertaining, to say the 
> > > > least.
> > > 
> > > Certainly! :-)
> > > 
> > > As a general note, unrelated to your patches: i think our 
> > > CPU-hotplug related complexity seems to be a bit too much. This is 
> > > really just a gut feeling - from having seen many patches that also 
> > > have hotplug notifiers.
> > > 
> > > I'm wondering whether this is because it's structured in a 
> > > suboptimal way, or because i'm (intuitively) under-estimating the 
> > > complexity of what it takes to express what happens when a CPU is 
> > > offlined and then onlined?
> > 
> > I suppose that I could take this as a cue to reminisce about the 
> > old days in a past life with a different implementation of CPU 
> > online/offline, but life is just too short for that sort of thing.  
> > Not that guys my age let that stop them.  ;-)
> > 
> > And in that past life, exercising CPU online/offline usually 
> > exposed painful bugs in new code, so I cannot claim that the 
> > old-life approach to CPU hotplug was perfect.  Interestingly 
> > enough, running uniprocessor also exposed painful bugs more often 
> > than not.  Of course, the only way to run uniprocessor was to 
> > offline all but one of the CPUs, so you would hit the 
> > online/offline bugs before hitting the uniprocessor-only bugs.
> > 
> > The thing that worries me most about CPU hotplug in Linux is that 
> > there is no clear hierarchy of CPU function in the offline 
> > process, given that the offlining process invokes notifiers in the 
> > same order as does the onlining process.  Whether this is a real 
> > defect in the CPU hotplug design or is instead simply a symptom of 
> > my not yet being fully comfortable with the two-phase CPU-removal 
> > process is an interesting question to which I do not have an 
> > answer.
> 
> I strongly believe it's the former.
> 
> > Either way, the thought process is different.  In my old life, 
> > CPUs shed roles in the opposite order that they acquired them.  
> 
> Yeah, that looks a whole lot more logical to do.

Hmmm...  Making the transition work nicely would require some thought.
It might be good to retain the two-phase nature, even when reversing
the order of offline notifications.  This would address one disadvantage
of the past-life version, which was unnecessary migration of processes
off of the CPU in question, only to find that a later notifier aborted
the offlining.

So only the first phase is permitted to abort the offlining of the CPU,
and this first phase must also set whatever state is necessary to prevent
some later operation from making it impossible to offline the CPU.
The second phase would unconditionally take the CPU out of service.
In theory, this approach would allow incremental conversion of the
notifiers, waiting to remove the stop_machine stuff until all notifiers
had been converted.

If this actually works out, the sequence of changes would be as follows:

1.	Reverse the order of the offline notifications, fixing any
	bugs induced/exposed by this change.

2.	Incrementally convert notifiers to the new mechanism.  This
	will require more thought.

3.	Get rid of the stop_machine and the CPU_DEAD once all are
	converted.

Or we might find that simply reversing the order (#1 above) suffices.
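
To make the intended split concrete, here is a minimal, hypothetical
sketch using today's notifier events, with CPU_DOWN_PREPARE standing in
for the abortable first phase and CPU_DEAD for the unconditional second
phase; the my_subsys_*() helpers are placeholders, not existing kernel
interfaces:

#include <linux/cpu.h>
#include <linux/notifier.h>

/* Hypothetical subsystem hooks, for illustration only. */
static int my_subsys_can_release_cpu(long cpu) { return 1; }
static void my_subsys_block_new_work_on(long cpu) { }
static void my_subsys_take_cpu_out(long cpu) { }

static int my_subsys_cpu_notify(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	long cpu = (long)hcpu;

	switch (action) {
	case CPU_DOWN_PREPARE:
	case CPU_DOWN_PREPARE_FROZEN:
		/* First phase: the only place allowed to veto the offline,
		 * and also the place to pin state so that nothing done
		 * later can make the offline impossible. */
		if (!my_subsys_can_release_cpu(cpu))
			return NOTIFY_BAD;
		my_subsys_block_new_work_on(cpu);
		break;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		/* Second phase: unconditional; too late to complain. */
		my_subsys_take_cpu_out(cpu);
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}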

> > This meant that a given CPU was naturally guaranteed to be 
> > correctly taking interrupts for the entire time that it was 
> > capable of running user-level processes. Later in the offlining 
> > process, it would still take interrupts, but would be unable to 
> > run user processes.  Still later, it would no longer be taking 
> > interrupts, and would stop participating in RCU and in the global 
> > TLB-flush algorithm.  There was no need to stop the whole machine 
> > to make a given CPU go offline, in fact, most of the work was done 
> > by the CPU in question.
> > 
> > In the case of RCU, this meant that there was no need for 
> > double-checking for offlined CPUs, because CPUs could reliably 
> > indicate a quiescent state on their way out.
> > 
> > On the other hand, there was no equivalent of dynticks in the old 
> > days. And it is dynticks that is responsible for most of the 
> > complexity present in force_quiescent_state(), not CPU hotplug.
> > 
> > So I cannot hold up RCU as something that would be greatly 
> > simplified by changing the CPU hotplug design, much as I might 
> > like to.  ;-)
> 
> We could probably remove a fair bit of dynticks complexity by 
> removing non-dynticks and removing non-hrtimer. People could still 
> force a 'periodic' interrupting mode (if they want, or if their hw 
> forces that), but that would be a plain periodic hrtimer firing off 
> all the time.

Hmmm...  That would not simplify RCU much, but on the other hand (1) the
rcutree.c dynticks approach is already quite a bit simpler than the
rcupreempt.c approach and (2) doing this could potentially simplify
other things.

							Thanx, Paul
Gautham R Shenoy May 29, 2009, 12:06 p.m. UTC | #17
On Thu, May 28, 2009 at 06:22:51PM -0700, Paul E. McKenney wrote:
> 
> Hmmm...  Making the transition work nicely would require some thought.
> It might be good to retain the two-phase nature, even when reversing
> the order of offline notifications.  This would address one disadvantage
> of the past-life version, which was unnecessary migration of processes
> off of the CPU in question, only to find that a later notifier aborted
> the offlining.

The notifiers handling CPU_DEAD cannot abort it from here since the
operation has already completed, whether they like it or not!

If there exist notifiers which try to abort it from here, it's a BUG, as
the code says:

        /* CPU is completely dead: tell everyone.  Too late to complain. */
        if (raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mod,
                                    hcpu) == NOTIFY_BAD)
                BUG();

Also, one can thus consider the CPU_DEAD and CPU_POST_DEAD parts to be
extensions of the second phase; we just do some additional cleanup once
the CPU has actually gone down.  Migration of processes (breaking their
affinity if required) is one of them.

But there are other things as well, such as rebuilding the sched-domains,
which has to be done after the CPU has gone down.  Currently this
operation accounts for the majority of the time taken to bring a CPU offline.
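
As a rough illustration of the kind of post-offline cleanup being
described, here is a hypothetical notifier fragment (the my_cache_*()
helpers are placeholders, not existing interfaces); the point is only
that CPU_DEAD/CPU_POST_DEAD work runs after the CPU is already gone and
therefore adds directly to offline latency:

#include <linux/cpu.h>
#include <linux/notifier.h>

/* Hypothetical cleanup hooks, for illustration only. */
static void my_cache_drain(long cpu) { }
static void my_cache_heavy_cleanup(long cpu) { }

static int my_cache_cpu_notify(struct notifier_block *nb,
			       unsigned long action, void *hcpu)
{
	long cpu = (long)hcpu;

	switch (action) {
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		/* CPU is already gone: reclaim whatever it left behind.
		 * This runs inside the hotplug operation, so it adds to
		 * the time taken to complete the offline. */
		my_cache_drain(cpu);
		break;
	case CPU_POST_DEAD:
		/* Heavier work can be pushed here; it runs after the
		 * cpu_hotplug lock has been dropped. */
		my_cache_heavy_cleanup(cpu);
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}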

>
> So only the first phase is permitted to abort the offlining of the CPU,
> and this first phase must also set whatever state is necessary to prevent
> some later operation from making it impossible to offline the CPU.
> The second phase would unconditionally take the CPU out of service.
> In theory, this approach would allow incremental conversion of the
> notifiers, waiting to remove the stop_machine stuff until all notifiers
> had been converted.
> If this actually works out, the sequence of changes would be as follows:
> 
> 1.	Reverse the order of the offline notifications, fixing any
> 	bugs induced/exposed by this change.
> 
> 2.	Incrementally convert notifiers to the new mechanism.  This
> 	will require more thought.
> 
> 3.	Get rid of the stop_machine and the CPU_DEAD once all are
> 	converted.

I agree with this sequence. It seems quite logical.

However, I am not yet sure if we can completely get rid of stop_machine
and CPU_DEAD in practice, unless we're okay with having a
time-consuming rollback operation.  Currently the rollback only consists of
rolling back the actions done during CPU_UP_PREPARE/CPU_DOWN_PREPARE.

And from the notifier profile (see attached file),
UP_PREPARE/DOWN_PREPARE seem to consume far less time
than the post-hotplug notifications.

> 
> Or we might find that simply reversing the order (#1 above) suffices.
> 
> > > This meant that a given CPU was naturally guaranteed to be 
> > > correctly taking interrupts for the entire time that it was 
> > > capable of running user-level processes. Later in the offlining 
> > > process, it would still take interrupts, but would be unable to 
> > > run user processes.  Still later, it would no longer be taking 
> > > interrupts, and would stop participating in RCU and in the global 
> > > TLB-flush algorithm.  There was no need to stop the whole machine 
> > > to make a given CPU go offline, in fact, most of the work was done 
> > > by the CPU in question.
> > > 
> > > In the case of RCU, this meant that there was no need for 
> > > double-checking for offlined CPUs, because CPUs could reliably 
> > > indicate a quiescent state on their way out.
> > > 
> > > On the other hand, there was no equivalent of dynticks in the old 
> > > days. And it is dynticks that is responsible for most of the 
> > > complexity present in force_quiescent_state(), not CPU hotplug.
> > > 
> > > So I cannot hold up RCU as something that would be greatly 
> > > simplified by changing the CPU hotplug design, much as I might 
> > > like to.  ;-)
> > 
> > We could probably remove a fair bit of dynticks complexity by 
> > removing non-dynticks and removing non-hrtimer. People could still 
> > force a 'periodic' interrupting mode (if they want, or if their hw 
> > forces that), but that would be a plain periodic hrtimer firing off 
> > all the time.
> 
> Hmmm...  That would not simplify RCU much, but on the other hand (1) the
> rcutree.c dynticks approach is already quite a bit simpler than the
> rcupreempt.c approach and (2) doing this could potentially simplify
> other things.
> 
> 							Thanx, Paul
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
Paul E. McKenney May 30, 2009, 4:56 a.m. UTC | #18
On Fri, May 29, 2009 at 05:36:37PM +0530, Gautham R Shenoy wrote:
> On Thu, May 28, 2009 at 06:22:51PM -0700, Paul E. McKenney wrote:
> > 
> > Hmmm...  Making the transition work nicely would require some thought.
> > It might be good to retain the two-phase nature, even when reversing
> > the order of offline notifications.  This would address one disadvantage
> > of the past-life version, which was unnecessary migration of processes
> > off of the CPU in question, only to find that a later notifier aborted
> > the offlining.
> 
> The notifiers handling CPU_DEAD cannot abort it from here since the
> operation has already completed, whether they like it or not!

Hello, Gautham,

We are talking past each other -- the past-life (not Linux) CPU-offlining
scheme had but one phase for offlining, which meant that if a very late
notifier-equivalent realized that the offlining could not proceed,
it would have mostly shut the CPU down, only to have to restart it.

For example, it might have needlessly migrated processes off of that
CPU.  This did not happen often, but it was a bit of a disadvantage.

							Thanx, Paul

> If there exist notifiers which try to abort it from here, it's a BUG, as
> the code says:
> 
>         /* CPU is completely dead: tell everyone.  Too late to complain.
> 	 * */
>          if (raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mod,
> 	                                     hcpu) == NOTIFY_BAD)
> 	                     BUG();
> 
> Also, one can thus consider the CPU_DEAD and the CPU_POST_DEAD parts to be
> the extensions of the second phase. Just that we do some
> additional cleanup once the CPU has actually gone down. migration of
> processes (while breaking their affinity if required) is one of them.
> 
> But there are other things as well, such as rebuilding the sched-domain
> which have to be done after the cpu has gone down. Currently this
> operation contributes to majority of time taken to bring a cpu-offline.
> 
> >
> > So only the first phase is permitted to abort the offlining of the CPU,
> > and this first phase must also set whatever state is necessary to prevent
> > some later operation from making it impossible to offline the CPU.
> > The second phase would unconditionally take the CPU out of service.
> > In theory, this approach would allow incremental conversion of the
> > notifiers, waiting to remove the stop_machine stuff until all notifiers
> > had been converted.
> > If this actually works out, the sequence of changes would be as follows:
> > 
> > 1.	Reverse the order of the offline notifications, fixing any
> > 	bugs induced/exposed by this change.
> > 
> > 2.	Incrementally convert notifiers to the new mechanism.  This
> > 	will require more thought.
> > 
> > 3.	Get rid of the stop_machine and the CPU_DEAD once all are
> > 	converted.
> 
> I agree with this sequence. It seems quite logical.
> 
> However, I am not yet sure if we can completely get rid of stop_machine
> and CPU_DEAD in practice, unless we're okay with having an
> time-consuming rollback operation. Currently the rollback only consists of
> rolling back the actions done during CPU_UP_PREPARE/CPU_DOWN_PREPARE.
> 
> And from the notifiers profile (see attached file),
> UP_PREPARE/DOWN_PREPARE seem to consume a lot lesser time
> when compared to the post-hotplug notifications.
> 
> > 
> > Or we might find that simply reversing the order (#1 above) suffices.
> > 
> > > > This meant that a given CPU was naturally guaranteed to be 
> > > > correctly taking interrupts for the entire time that it was 
> > > > capable of running user-level processes. Later in the offlining 
> > > > process, it would still take interrupts, but would be unable to 
> > > > run user processes.  Still later, it would no longer be taking 
> > > > interrupts, and would stop participating in RCU and in the global 
> > > > TLB-flush algorithm.  There was no need to stop the whole machine 
> > > > to make a given CPU go offline, in fact, most of the work was done 
> > > > by the CPU in question.
> > > > 
> > > > In the case of RCU, this meant that there was no need for 
> > > > double-checking for offlined CPUs, because CPUs could reliably 
> > > > indicate a quiescent state on their way out.
> > > > 
> > > > On the other hand, there was no equivalent of dynticks in the old 
> > > > days. And it is dynticks that is responsible for most of the 
> > > > complexity present in force_quiescent_state(), not CPU hotplug.
> > > > 
> > > > So I cannot hold up RCU as something that would be greatly 
> > > > simplified by changing the CPU hotplug design, much as I might 
> > > > like to.  ;-)
> > > 
> > > We could probably remove a fair bit of dynticks complexity by 
> > > removing non-dynticks and removing non-hrtimer. People could still 
> > > force a 'periodic' interrupting mode (if they want, or if their hw 
> > > forces that), but that would be a plain periodic hrtimer firing off 
> > > all the time.
> > 
> > Hmmm...  That would not simplify RCU much, but on the other hand (1) the
> > rcutree.c dynticks approach is already quite a bit simpler than the
> > rcupreempt.c approach and (2) doing this could potentially simplify
> > other things.
> > 
> > 							Thanx, Paul
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> -- 
> Thanks and Regards
> gautham

> =============================================================================
> statistics for CPU_DOWN_PREPARE
> =============================================================================
>       410 ns: buffer_cpu_notify             : CPU_DOWN_PREPARE
>       441 ns: radix_tree_callback           : CPU_DOWN_PREPARE
>       473 ns: relay_hotcpu_callback         : CPU_DOWN_PREPARE
>       486 ns: blk_cpu_notify                : CPU_DOWN_PREPARE
>       563 ns: cpu_callback                  : CPU_DOWN_PREPARE
>       579 ns: hotplug_hrtick                : CPU_DOWN_PREPARE
>       594 ns: cpu_callback                  : CPU_DOWN_PREPARE
>       605 ns: cpu_numa_callback             : CPU_DOWN_PREPARE
>       611 ns: hrtimer_cpu_notify            : CPU_DOWN_PREPARE
>       625 ns: flow_cache_cpu                : CPU_DOWN_PREPARE
>       625 ns: rcu_barrier_cpu_hotplug       : CPU_DOWN_PREPARE
>       639 ns: hotplug_cfd                   : CPU_DOWN_PREPARE
>       641 ns: pageset_cpuup_callback        : CPU_DOWN_PREPARE
>       656 ns: rb_cpu_notify                 : CPU_DOWN_PREPARE
>       670 ns: dev_cpu_callback              : CPU_DOWN_PREPARE
>       670 ns: topology_cpu_callback         : CPU_DOWN_PREPARE
>       672 ns: remote_softirq_cpu_notify     : CPU_DOWN_PREPARE
>       715 ns: ratelimit_handler             : CPU_DOWN_PREPARE
>       715 ns: rcu_cpu_notify                : CPU_DOWN_PREPARE
>       717 ns: timer_cpu_notify              : CPU_DOWN_PREPARE
>       730 ns: page_alloc_cpu_notify         : CPU_DOWN_PREPARE
>       746 ns: cpu_callback                  : CPU_DOWN_PREPARE
>       821 ns: cpuset_track_online_cpus      : CPU_DOWN_PREPARE
>       824 ns: slab_cpuup_callback           : CPU_DOWN_PREPARE
>       849 ns: sysfs_cpu_notify              : CPU_DOWN_PREPARE
>       884 ns: percpu_counter_hotcpu_callback: CPU_DOWN_PREPARE
>       961 ns: update_runtime                : CPU_DOWN_PREPARE
>      1323 ns: migration_call                : CPU_DOWN_PREPARE
>      1918 ns: vmstat_cpuup_callback         : CPU_DOWN_PREPARE
>      2072 ns: workqueue_cpu_callback        : CPU_DOWN_PREPARE
> =========================================================================
> Total time for CPU_DOWN_PREPARE = .023235000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_DYING
> =============================================================================
>       365 ns: remote_softirq_cpu_notify     : CPU_DYING
>       365 ns: topology_cpu_callback         : CPU_DYING
>       381 ns: blk_cpu_notify                : CPU_DYING
>       381 ns: cpu_callback                  : CPU_DYING
>       381 ns: relay_hotcpu_callback         : CPU_DYING
>       381 ns: update_runtime                : CPU_DYING
>       394 ns: dev_cpu_callback              : CPU_DYING
>       395 ns: hotplug_cfd                   : CPU_DYING
>       395 ns: vmstat_cpuup_callback         : CPU_DYING
>       397 ns: cpuset_track_online_cpus      : CPU_DYING
>       397 ns: flow_cache_cpu                : CPU_DYING
>       397 ns: pageset_cpuup_callback        : CPU_DYING
>       397 ns: rb_cpu_notify                 : CPU_DYING
>       398 ns: hotplug_hrtick                : CPU_DYING
>       410 ns: cpu_callback                  : CPU_DYING
>       410 ns: page_alloc_cpu_notify         : CPU_DYING
>       411 ns: rcu_cpu_notify                : CPU_DYING
>       412 ns: slab_cpuup_callback           : CPU_DYING
>       412 ns: sysfs_cpu_notify              : CPU_DYING
>       412 ns: timer_cpu_notify              : CPU_DYING
>       426 ns: buffer_cpu_notify             : CPU_DYING
>       426 ns: radix_tree_callback           : CPU_DYING
>       441 ns: cpu_callback                  : CPU_DYING
>       442 ns: cpu_numa_callback             : CPU_DYING
>       473 ns: ratelimit_handler             : CPU_DYING
>       531 ns: percpu_counter_hotcpu_callback: CPU_DYING
>       562 ns: workqueue_cpu_callback        : CPU_DYING
>       730 ns: rcu_barrier_cpu_hotplug       : CPU_DYING
>      1536 ns: migration_call                : CPU_DYING
>      1873 ns: hrtimer_cpu_notify            : CPU_DYING
> =========================================================================
> Total time for CPU_DYING = .015331000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_DOWN_CANCELED
> =============================================================================
> =========================================================================
> Total time for CPU_DOWN_CANCELED = 0 ms
> =========================================================================
> =============================================================================
> statistics for __stop_machine
> =============================================================================
>    357983 ns: __stop_machine                :
> =========================================================================
> Total time for __stop_machine = .357983000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_DEAD
> =============================================================================
>       350 ns: update_runtime                : CPU_DEAD
>       379 ns: hotplug_hrtick                : CPU_DEAD
>       381 ns: cpu_callback                  : CPU_DEAD
>       381 ns: rb_cpu_notify                 : CPU_DEAD
>       426 ns: hotplug_cfd                   : CPU_DEAD
>       426 ns: relay_hotcpu_callback         : CPU_DEAD
>       441 ns: rcu_barrier_cpu_hotplug       : CPU_DEAD
>       442 ns: remote_softirq_cpu_notify     : CPU_DEAD
>       609 ns: ratelimit_handler             : CPU_DEAD
>       625 ns: cpu_numa_callback             : CPU_DEAD
>       684 ns: dev_cpu_callback              : CPU_DEAD
>       686 ns: workqueue_cpu_callback        : CPU_DEAD
>       838 ns: rcu_cpu_notify                : CPU_DEAD
>       898 ns: pageset_cpuup_callback        : CPU_DEAD
>      1202 ns: vmstat_cpuup_callback         : CPU_DEAD
>      1295 ns: blk_cpu_notify                : CPU_DEAD
>      1554 ns: buffer_cpu_notify             : CPU_DEAD
>      2588 ns: hrtimer_cpu_notify            : CPU_DEAD
>      3274 ns: radix_tree_callback           : CPU_DEAD
>      5246 ns: timer_cpu_notify              : CPU_DEAD
>      8587 ns: flow_cache_cpu                : CPU_DEAD
>      8645 ns: topology_cpu_callback         : CPU_DEAD
>     12454 ns: cpu_callback                  : CPU_DEAD
>     12650 ns: cpu_callback                  : CPU_DEAD
>     45727 ns: percpu_counter_hotcpu_callback: CPU_DEAD
>     55242 ns: page_alloc_cpu_notify         : CPU_DEAD
>     56766 ns: sysfs_cpu_notify              : CPU_DEAD
>     58241 ns: slab_cpuup_callback           : CPU_DEAD
>     78250 ns: migration_call                : CPU_DEAD
>  10784759 ns: cpuset_track_online_cpus      : CPU_DEAD
> =========================================================================
> Total time for CPU_DEAD = 11.144046000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_POST_DEAD
> =============================================================================
>       350 ns: cpu_callback                  : CPU_POST_DEAD
>       365 ns: blk_cpu_notify                : CPU_POST_DEAD
>       365 ns: buffer_cpu_notify             : CPU_POST_DEAD
>       365 ns: cpu_numa_callback             : CPU_POST_DEAD
>       365 ns: dev_cpu_callback              : CPU_POST_DEAD
>       365 ns: flow_cache_cpu                : CPU_POST_DEAD
>       365 ns: hrtimer_cpu_notify            : CPU_POST_DEAD
>       365 ns: page_alloc_cpu_notify         : CPU_POST_DEAD
>       365 ns: rb_cpu_notify                 : CPU_POST_DEAD
>       365 ns: rcu_cpu_notify                : CPU_POST_DEAD
>       365 ns: timer_cpu_notify              : CPU_POST_DEAD
>       365 ns: update_runtime                : CPU_POST_DEAD
>       366 ns: cpu_callback                  : CPU_POST_DEAD
>       366 ns: hotplug_cfd                   : CPU_POST_DEAD
>       366 ns: pageset_cpuup_callback        : CPU_POST_DEAD
>       366 ns: radix_tree_callback           : CPU_POST_DEAD
>       367 ns: hotplug_hrtick                : CPU_POST_DEAD
>       367 ns: topology_cpu_callback         : CPU_POST_DEAD
>       367 ns: vmstat_cpuup_callback         : CPU_POST_DEAD
>       381 ns: cpu_callback                  : CPU_POST_DEAD
>       381 ns: cpuset_track_online_cpus      : CPU_POST_DEAD
>       381 ns: relay_hotcpu_callback         : CPU_POST_DEAD
>       381 ns: sysfs_cpu_notify              : CPU_POST_DEAD
>       383 ns: rcu_barrier_cpu_hotplug       : CPU_POST_DEAD
>       410 ns: remote_softirq_cpu_notify     : CPU_POST_DEAD
>       412 ns: slab_cpuup_callback           : CPU_POST_DEAD
>       442 ns: migration_call                : CPU_POST_DEAD
>       457 ns: percpu_counter_hotcpu_callback: CPU_POST_DEAD
>       502 ns: ratelimit_handler             : CPU_POST_DEAD
>     86200 ns: workqueue_cpu_callback        : CPU_POST_DEAD
> =========================================================================
> Total time for CPU_POST_DEAD = .097260000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_UP_PREPARE
> =============================================================================
>       336 ns: hotplug_hrtick                : CPU_UP_PREPARE
>       350 ns: cpu_callback                  : CPU_UP_PREPARE
>       365 ns: blk_cpu_notify                : CPU_UP_PREPARE
>       381 ns: vmstat_cpuup_callback         : CPU_UP_PREPARE
>       410 ns: buffer_cpu_notify             : CPU_UP_PREPARE
>       410 ns: radix_tree_callback           : CPU_UP_PREPARE
>       426 ns: dev_cpu_callback              : CPU_UP_PREPARE
>       426 ns: remote_softirq_cpu_notify     : CPU_UP_PREPARE
>       428 ns: cpuset_track_online_cpus      : CPU_UP_PREPARE
>       441 ns: sysfs_cpu_notify              : CPU_UP_PREPARE
>       471 ns: hotplug_cfd                   : CPU_UP_PREPARE
>       472 ns: rb_cpu_notify                 : CPU_UP_PREPARE
>       473 ns: flow_cache_cpu                : CPU_UP_PREPARE
>       486 ns: page_alloc_cpu_notify         : CPU_UP_PREPARE
>       488 ns: hrtimer_cpu_notify            : CPU_UP_PREPARE
>       488 ns: update_runtime                : CPU_UP_PREPARE
>       502 ns: rcu_barrier_cpu_hotplug       : CPU_UP_PREPARE
>       531 ns: percpu_counter_hotcpu_callback: CPU_UP_PREPARE
>       547 ns: ratelimit_handler             : CPU_UP_PREPARE
>       594 ns: relay_hotcpu_callback         : CPU_UP_PREPARE
>      1125 ns: rcu_cpu_notify                : CPU_UP_PREPARE
>      1309 ns: pageset_cpuup_callback        : CPU_UP_PREPARE
>      1947 ns: timer_cpu_notify              : CPU_UP_PREPARE
>      5389 ns: cpu_numa_callback             : CPU_UP_PREPARE
>      6379 ns: topology_cpu_callback         : CPU_UP_PREPARE
>      6436 ns: slab_cpuup_callback           : CPU_UP_PREPARE
>     19879 ns: cpu_callback                  : CPU_UP_PREPARE
>     20227 ns: cpu_callback                  : CPU_UP_PREPARE
>     33940 ns: migration_call                : CPU_UP_PREPARE
>    143731 ns: workqueue_cpu_callback        : CPU_UP_PREPARE
> =========================================================================
> Total time for CPU_UP_PREPARE = .249387000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_UP_CANCELED
> =============================================================================
> =========================================================================
> Total time for CPU_UP_CANCELED = 0 ms
> =========================================================================
> =============================================================================
> statistics for __cpu_up
> =============================================================================
> 205868908 ns: __cpu_up                      :
> =========================================================================
> Total time for __cpu_up = 205.868908000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_STARTING
> =============================================================================
>       350 ns: hotplug_cfd                   : CPU_STARTING
>       352 ns: cpu_callback                  : CPU_STARTING
>       352 ns: remote_softirq_cpu_notify     : CPU_STARTING
>       363 ns: vmstat_cpuup_callback         : CPU_STARTING
>       365 ns: cpu_callback                  : CPU_STARTING
>       365 ns: dev_cpu_callback              : CPU_STARTING
>       365 ns: hotplug_hrtick                : CPU_STARTING
>       365 ns: radix_tree_callback           : CPU_STARTING
>       365 ns: rb_cpu_notify                 : CPU_STARTING
>       368 ns: update_runtime                : CPU_STARTING
>       379 ns: cpu_callback                  : CPU_STARTING
>       379 ns: cpu_numa_callback             : CPU_STARTING
>       380 ns: rcu_barrier_cpu_hotplug       : CPU_STARTING
>       380 ns: relay_hotcpu_callback         : CPU_STARTING
>       381 ns: hrtimer_cpu_notify            : CPU_STARTING
>       381 ns: pageset_cpuup_callback        : CPU_STARTING
>       381 ns: slab_cpuup_callback           : CPU_STARTING
>       382 ns: flow_cache_cpu                : CPU_STARTING
>       394 ns: blk_cpu_notify                : CPU_STARTING
>       397 ns: buffer_cpu_notify             : CPU_STARTING
>       397 ns: percpu_counter_hotcpu_callback: CPU_STARTING
>       397 ns: sysfs_cpu_notify              : CPU_STARTING
>       397 ns: topology_cpu_callback         : CPU_STARTING
>       410 ns: rcu_cpu_notify                : CPU_STARTING
>       412 ns: page_alloc_cpu_notify         : CPU_STARTING
>       426 ns: cpuset_track_online_cpus      : CPU_STARTING
>       455 ns: ratelimit_handler             : CPU_STARTING
>       471 ns: timer_cpu_notify              : CPU_STARTING
>       516 ns: migration_call                : CPU_STARTING
>       549 ns: workqueue_cpu_callback        : CPU_STARTING
> =========================================================================
> Total time for CPU_STARTING = .011874000 ms
> =========================================================================
> =============================================================================
> statistics for CPU_ONLINE
> =============================================================================
>       365 ns: radix_tree_callback           : CPU_ONLINE
>       379 ns: hotplug_hrtick                : CPU_ONLINE
>       381 ns: hrtimer_cpu_notify            : CPU_ONLINE
>       381 ns: remote_softirq_cpu_notify     : CPU_ONLINE
>       410 ns: slab_cpuup_callback           : CPU_ONLINE
>       410 ns: timer_cpu_notify              : CPU_ONLINE
>       412 ns: blk_cpu_notify                : CPU_ONLINE
>       426 ns: dev_cpu_callback              : CPU_ONLINE
>       426 ns: flow_cache_cpu                : CPU_ONLINE
>       426 ns: topology_cpu_callback         : CPU_ONLINE
>       428 ns: rcu_barrier_cpu_hotplug       : CPU_ONLINE
>       428 ns: rcu_cpu_notify                : CPU_ONLINE
>       440 ns: buffer_cpu_notify             : CPU_ONLINE
>       455 ns: pageset_cpuup_callback        : CPU_ONLINE
>       457 ns: relay_hotcpu_callback         : CPU_ONLINE
>       473 ns: rb_cpu_notify                 : CPU_ONLINE
>       518 ns: update_runtime                : CPU_ONLINE
>       549 ns: cpu_numa_callback             : CPU_ONLINE
>       562 ns: ratelimit_handler             : CPU_ONLINE
>       595 ns: page_alloc_cpu_notify         : CPU_ONLINE
>       596 ns: hotplug_cfd                   : CPU_ONLINE
>       777 ns: percpu_counter_hotcpu_callback: CPU_ONLINE
>      1037 ns: cpu_callback                  : CPU_ONLINE
>      1280 ns: cpu_callback                  : CPU_ONLINE
>      1680 ns: cpu_callback                  : CPU_ONLINE
>      2043 ns: vmstat_cpuup_callback         : CPU_ONLINE
>      3422 ns: migration_call                : CPU_ONLINE
>     12344 ns: workqueue_cpu_callback        : CPU_ONLINE
>     52879 ns: sysfs_cpu_notify              : CPU_ONLINE
>  12287706 ns: cpuset_track_online_cpus      : CPU_ONLINE
> =========================================================================
> Total time for CPU_ONLINE = 12.372685000 ms
> =========================================================================

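Before the patch itself, a minimal sketch of the intended update-side
usage, assuming only the interfaces this RFC adds
(synchronize_rcu_expedited() mapping onto synchronize_sched_expedited()
for the non-preemptable RCU flavors); the foo list and its lock are
hypothetical:

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	struct list_head list;
	int data;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

/* Remove an element and free it once all pre-existing readers are done,
 * paying for an expedited grace period instead of queueing a callback. */
static void foo_del(struct foo *p)
{
	spin_lock(&foo_lock);
	list_del_rcu(&p->list);
	spin_unlock(&foo_lock);

	synchronize_rcu_expedited();	/* force the grace period now */
	kfree(p);
}

Callers that can tolerate normal grace-period latency should keep using
synchronize_rcu() or call_rcu(); as discussed above, the expedited path
sends IPIs to the other online CPUs, so it is not free.
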
diff mbox

Patch

diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index bfd92e1..ea1ceb2 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -158,14 +158,28 @@  extern struct lockdep_map rcu_lock_map;
 
 #define call_rcu_sched(head, func) call_rcu(head, func)
 
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
 extern void __rcu_init(void);
-#define rcu_init_sched()	do { } while (0)
 extern void rcu_check_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
 
 extern long rcu_batches_completed(void);
 extern long rcu_batches_completed_bh(void);
 
+static inline void rcu_init_sched(void)
+{
+	synchronize_sched_expedited_init();
+}
+
 #define rcu_enter_nohz()	do { } while (0)
 #define rcu_exit_nohz()		do { } while (0)
 
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 15fbb3c..60163d2 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -51,7 +51,19 @@  struct rcu_head {
 	void (*func)(struct rcu_head *head);
 };
 
-/* Internal to kernel, but needed by rcupreempt.h. */
+/* Exported common interfaces */
+extern void synchronize_rcu(void);
+extern void rcu_barrier(void);
+extern void rcu_barrier_bh(void);
+extern void rcu_barrier_sched(void);
+extern void synchronize_sched_expedited(void);
+extern int sched_expedited_torture_stats(char *page);
+
+/* Internal to kernel */
+extern void rcu_init(void);
+extern void rcu_scheduler_starting(void);
+extern void synchronize_sched_expedited_init(void);
+extern int rcu_needs_cpu(int cpu);
 extern int rcu_scheduler_active;
 
 #if defined(CONFIG_CLASSIC_RCU)
@@ -259,15 +271,4 @@  extern void call_rcu(struct rcu_head *head,
 extern void call_rcu_bh(struct rcu_head *head,
 			void (*func)(struct rcu_head *head));
 
-/* Exported common interfaces */
-extern void synchronize_rcu(void);
-extern void rcu_barrier(void);
-extern void rcu_barrier_bh(void);
-extern void rcu_barrier_sched(void);
-
-/* Internal to kernel */
-extern void rcu_init(void);
-extern void rcu_scheduler_starting(void);
-extern int rcu_needs_cpu(int cpu);
-
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h
index fce5227..78117ed 100644
--- a/include/linux/rcupreempt.h
+++ b/include/linux/rcupreempt.h
@@ -74,6 +74,16 @@  extern int rcu_needs_cpu(int cpu);
 
 extern void __synchronize_sched(void);
 
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
+{
+	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
+}
+
 extern void __rcu_init(void);
 extern void rcu_init_sched(void);
 extern void rcu_check_callbacks(int cpu, int user);
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 58b2aa5..7b533ec 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -279,8 +279,14 @@  static inline void __rcu_read_unlock_bh(void)
 
 #define call_rcu_sched(head, func) call_rcu(head, func)
 
-static inline void rcu_init_sched(void)
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
 {
+	synchronize_sched_expedited();
 }
 
 extern void __rcu_init(void);
@@ -290,6 +296,11 @@  extern void rcu_restart_cpu(int cpu);
 extern long rcu_batches_completed(void);
 extern long rcu_batches_completed_bh(void);
 
+static inline void rcu_init_sched(void)
+{
+	synchronize_sched_expedited_init();
+}
+
 #ifdef CONFIG_NO_HZ
 void rcu_enter_nohz(void);
 void rcu_exit_nohz(void);
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a967c9f..e80b75d 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -45,6 +45,8 @@ 
 #include <linux/mutex.h>
 #include <linux/module.h>
 #include <linux/kernel_stat.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
 
 enum rcu_barrier {
 	RCU_BARRIER_STD,
@@ -98,6 +100,30 @@  void synchronize_rcu(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+/**
+ * synchronize_rcu_bh - wait until an rcu_bh grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu_bh grace
+ * period has elapsed, in other words after all currently executing rcu_bh
+ * read-side critical sections have completed.  RCU read-side critical
+ * sections are delimited by rcu_read_lock_bh() and rcu_read_unlock_bh(),
+ * and may be nested.
+ */
+void synchronize_rcu_bh(void)
+{
+	struct rcu_synchronize rcu;
+
+	if (rcu_blocking_is_gp())
+		return;
+
+	init_completion(&rcu.completion);
+	/* Will wake me after RCU finished. */
+	call_rcu_bh(&rcu.head, wakeme_after_rcu);
+	/* Wait for it. */
+	wait_for_completion(&rcu.completion);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_bh);
+
 static void rcu_barrier_callback(struct rcu_head *notused)
 {
 	if (atomic_dec_and_test(&rcu_barrier_cpu_count))
@@ -129,6 +155,7 @@  static void rcu_barrier_func(void *type)
 static inline void wait_migrated_callbacks(void)
 {
 	wait_event(rcu_migrate_wq, !atomic_read(&rcu_migrate_type_count));
+	smp_mb(); /* In case we didn't sleep. */
 }
 
 /*
@@ -229,3 +256,249 @@  void rcu_scheduler_starting(void)
 	WARN_ON(nr_context_switches() > 0);
 	rcu_scheduler_active = 1;
 }
+
+
+#ifndef CONFIG_SMP
+
+void __init synchronize_sched_expedited_init(void)
+{
+}
+
+void synchronize_sched_expedited(void)
+{
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+int sched_expedited_torture_stats(char *page)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(sched_expedited_torture_stats);
+
+#else /* #ifndef CONFIG_SMP */
+
+static DEFINE_MUTEX(rcu_sched_expedited_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(need_sched_expedited_wq);
+static DECLARE_WAIT_QUEUE_HEAD(sched_expedited_done_wq);
+static int need_sched_expedited;
+static int sched_expedited_done;
+struct task_struct *krcu_sched_expedited_task;
+static DEFINE_PER_CPU(struct task_struct *, krcu_sched_expedited_task);
+static DEFINE_PER_CPU(wait_queue_head_t, sched_expedited_qs_wq);
+static DEFINE_PER_CPU(int, sched_expedited_done_qs);
+static DEFINE_PER_CPU(struct mutex, sched_expedited_done_mutex);
+
+#define SCHED_EXPEDITED_QS_DONE_QS	0
+#define SCHED_EXPEDITED_QS_NEED_QS	1
+#define SCHED_EXPEDITED_QS_STOP		2
+#define SCHED_EXPEDITED_QS_STOPPED	3
+
+int sched_expedited_torture_stats(char *page)
+{
+	int cnt = 0;
+#ifdef CONFIG_RCU_TRACE
+	int cpu;
+	
+	cnt += sprintf(&page[cnt],
+		       "nse = %d, sed = %d, QSneededFrom: ",
+		       need_sched_expedited, sched_expedited_done);
+	for_each_online_cpu(cpu) {
+		if (per_cpu(sched_expedited_done_qs, cpu))
+			cnt += sprintf(&page[cnt], " %d/%d",
+				       cpu,
+				       per_cpu(sched_expedited_done_qs, cpu));
+	}
+	cnt += sprintf(&page[cnt], "\n");
+#endif /* #ifdef CONFIG_RCU_TRACE */
+	return cnt;
+}
+EXPORT_SYMBOL_GPL(sched_expedited_torture_stats);
+
+/*
+ * Per-CPU kernel thread that constitutes a quiescent state when running.
+ */
+static int krcu_sched_expedited_percpu(void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct mutex *mp = &per_cpu(sched_expedited_done_mutex, cpu);
+	int *mydonqs = &per_cpu(sched_expedited_done_qs, cpu);
+	wait_queue_head_t *mywq = &per_cpu(sched_expedited_qs_wq, cpu);
+	/* @@@ struct sched_param param = { .sched_priority = 0 }; */
+
+	sched_setaffinity(0, &cpumask_of_cpu(cpu));
+	/* @@@ FIXME: need to handle sched_setaffinity() failure. */
+	/* set_freezable(); */
+	/* sched_setscheduler_nocheck(current, SCHED_FIFO, &param); */
+	for (;;) {
+		wait_event_interruptible(*mywq,
+					 *mydonqs != SCHED_EXPEDITED_QS_DONE_QS);
+		mutex_lock(mp);
+		if (*mydonqs == SCHED_EXPEDITED_QS_DONE_QS) {
+			mutex_unlock(mp);
+			continue;
+		}
+		if (*mydonqs == SCHED_EXPEDITED_QS_STOP) {
+			*mydonqs = SCHED_EXPEDITED_QS_STOPPED;
+			mutex_unlock(mp);
+			break;
+		}
+		*mydonqs = SCHED_EXPEDITED_QS_DONE_QS;
+		mutex_unlock(mp);
+	}
+	while (!kthread_should_stop())
+		schedule_timeout_uninterruptible(1);
+	return 0;
+}
+
+void sched_expedited_wake(void *unused)
+{
+	mutex_lock(&__get_cpu_var(sched_expedited_done_mutex));
+	if (__get_cpu_var(sched_expedited_done_qs) ==
+	    SCHED_EXPEDITED_QS_DONE_QS) {
+		__get_cpu_var(sched_expedited_done_qs) =
+			SCHED_EXPEDITED_QS_NEED_QS;
+		wake_up(&__get_cpu_var(sched_expedited_qs_wq));
+	}
+	mutex_unlock(&__get_cpu_var(sched_expedited_done_mutex));
+}
+
+/*
+ * Kernel thread that processes synchronize_sched_expedited() requests.
+ * This is implemented as a separate kernel thread to avoid the need
+ * to mess with other tasks' cpumasks.
+ */
+static int krcu_sched_expedited(void *arg)
+{
+	int cpu;
+	int mycpu;
+	int nwait;
+
+	do {
+		wait_event_interruptible(need_sched_expedited_wq,
+					 need_sched_expedited);
+		smp_mb(); /* In case we didn't sleep. */
+		if (!need_sched_expedited)
+			continue;
+		need_sched_expedited = 0;
+		get_online_cpus();
+		preempt_disable();
+		mycpu = smp_processor_id();
+		smp_call_function(sched_expedited_wake, NULL, 1);
+		preempt_enable();
+		nwait = 0;
+		for_each_online_cpu(cpu) {
+			if (cpu == mycpu)
+				continue;
+			while (per_cpu(sched_expedited_done_qs, cpu) ==
+			       SCHED_EXPEDITED_QS_NEED_QS) {
+				if (++nwait <= 10)
+					udelay(10);
+				else {
+					schedule_timeout_uninterruptible(1);
+/*&&&&*/if (nwait == HZ) {
+/*&&&&*/printk(KERN_ALERT "krcu_sched_expedited(): cpu=%d/%d, mycpu=%d\n", cpu, per_cpu(sched_expedited_done_qs, cpu), mycpu);
+/*&&&&*/}
+				}
+			}
+		}
+		put_online_cpus();
+		sched_expedited_done = 1;
+		smp_mb();  /* in case not yet asleep. */
+		wake_up(&sched_expedited_done_wq);
+	} while (!kthread_should_stop());
+	return 0;
+}
+
+static int __cpuinit
+synchronize_sched_expedited_notify(struct notifier_block *self,
+				   unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct task_struct **tsp = &per_cpu(krcu_sched_expedited_task, cpu);
+	struct mutex *mp = &per_cpu(sched_expedited_done_mutex, cpu);
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		if (*tsp == NULL) {
+			mutex_lock(mp);
+			per_cpu(sched_expedited_done_qs, cpu) =
+				SCHED_EXPEDITED_QS_DONE_QS;
+			mutex_unlock(mp);
+			init_waitqueue_head(&per_cpu(sched_expedited_qs_wq,
+						     cpu));
+			*tsp = kthread_run(krcu_sched_expedited_percpu,
+					   (void *)cpu,
+					   "krcu_sched_expedited");
+			WARN_ON(IS_ERR(*tsp));
+			if (IS_ERR(*tsp)) {
+				*tsp = NULL;
+				return NOTIFY_BAD;
+			}
+		}
+/*&&&&*/printk(KERN_ALERT "synchronize_sched_expedited_notify() onlining cpu %ld\n", cpu);
+		break;
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+		WARN_ON(*tsp == NULL);
+		if (*tsp) {
+			mutex_lock(mp);
+			while (per_cpu(sched_expedited_done_qs, cpu) !=
+			       SCHED_EXPEDITED_QS_STOPPED) {
+				per_cpu(sched_expedited_done_qs, cpu) =
+					SCHED_EXPEDITED_QS_STOP;
+				mutex_unlock(mp);
+				wake_up(&per_cpu(sched_expedited_qs_wq, cpu));
+				schedule_timeout_uninterruptible(1);
+				mutex_lock(mp);
+			}
+			mutex_unlock(mp);
+			kthread_stop(*tsp);
+			*tsp = NULL;
+		}
+/*&&&&*/printk(KERN_ALERT "synchronize_sched_expedited_notify() offlining cpu %ld\n", cpu);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+/*
+ * Late-boot initialization for synchronize_sched_expedited().
+ * The scheduler must be running before this can be called.
+ */
+void __init synchronize_sched_expedited_init(void)
+{
+	int cpu;
+
+/*&&&&*/printk(KERN_ALERT "Initializing synchronize_sched_expedited()\n");
+	for_each_possible_cpu(cpu)
+		mutex_init(&per_cpu(sched_expedited_done_mutex, cpu));
+	hotcpu_notifier(synchronize_sched_expedited_notify, 0);
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		synchronize_sched_expedited_notify(NULL, CPU_UP_PREPARE,
+						   (void *)cpu);
+	put_online_cpus();
+	krcu_sched_expedited_task = kthread_run(krcu_sched_expedited, NULL,
+					        "krcu_sched_expedited");
+	WARN_ON(IS_ERR(krcu_sched_expedited_task));
+}
+
+void synchronize_sched_expedited(void)
+{
+	/* If there is only one CPU, we are done. */
+	if (num_online_cpus() == 1)
+		return;
+
+	/* Multiple CPUs, make krcu_sched_expedited() sequence through them. */
+	mutex_lock(&rcu_sched_expedited_mutex);
+	need_sched_expedited = 1;
+	smp_mb(); /* in case kthread not yet sleeping. */
+	wake_up(&need_sched_expedited_wq);
+	wait_event(sched_expedited_done_wq, sched_expedited_done);
+	smp_mb(); /* in case we didn't actually sleep. */
+	sched_expedited_done = 0;
+	mutex_unlock(&rcu_sched_expedited_mutex);
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#endif /* #else #ifndef CONFIG_SMP */
diff --git a/kernel/rcupreempt.c b/kernel/rcupreempt.c
index ce97a4d..4485758 100644
--- a/kernel/rcupreempt.c
+++ b/kernel/rcupreempt.c
@@ -1507,6 +1507,7 @@  void __init rcu_init_sched(void)
 						  NULL,
 						  "rcu_sched_grace_period");
 	WARN_ON(IS_ERR(rcu_sched_grace_period_task));
+	synchronize_sched_expedited_init();
 }
 
 #ifdef CONFIG_RCU_TRACE
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 9b4a975..eebd4b8 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -257,14 +257,14 @@  struct rcu_torture_ops {
 	void (*init)(void);
 	void (*cleanup)(void);
 	int (*readlock)(void);
-	void (*readdelay)(struct rcu_random_state *rrsp);
+	void (*read_delay)(struct rcu_random_state *rrsp);
 	void (*readunlock)(int idx);
 	int (*completed)(void);
-	void (*deferredfree)(struct rcu_torture *p);
+	void (*deferred_free)(struct rcu_torture *p);
 	void (*sync)(void);
 	void (*cb_barrier)(void);
 	int (*stats)(char *page);
-	int irqcapable;
+	int irq_capable;
 	char *name;
 };
 static struct rcu_torture_ops *cur_ops = NULL;
@@ -320,7 +320,7 @@  rcu_torture_cb(struct rcu_head *p)
 		rp->rtort_mbtest = 0;
 		rcu_torture_free(rp);
 	} else
-		cur_ops->deferredfree(rp);
+		cur_ops->deferred_free(rp);
 }
 
 static void rcu_torture_deferred_free(struct rcu_torture *p)
@@ -329,18 +329,18 @@  static void rcu_torture_deferred_free(struct rcu_torture *p)
 }
 
 static struct rcu_torture_ops rcu_ops = {
-	.init = NULL,
-	.cleanup = NULL,
-	.readlock = rcu_torture_read_lock,
-	.readdelay = rcu_read_delay,
-	.readunlock = rcu_torture_read_unlock,
-	.completed = rcu_torture_completed,
-	.deferredfree = rcu_torture_deferred_free,
-	.sync = synchronize_rcu,
-	.cb_barrier = rcu_barrier,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu"
+	.init		= NULL,
+	.cleanup	= NULL,
+	.readlock	= rcu_torture_read_lock,
+	.read_delay	= rcu_read_delay,
+	.readunlock	= rcu_torture_read_unlock,
+	.completed	= rcu_torture_completed,
+	.deferred_free	= rcu_torture_deferred_free,
+	.sync		= synchronize_rcu,
+	.cb_barrier	= rcu_barrier,
+	.stats		= NULL,
+	.irq_capable 	= 1,
+	.name 		= "rcu"
 };
 
 static void rcu_sync_torture_deferred_free(struct rcu_torture *p)
@@ -370,18 +370,18 @@  static void rcu_sync_torture_init(void)
 }
 
 static struct rcu_torture_ops rcu_sync_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = rcu_torture_read_lock,
-	.readdelay = rcu_read_delay,
-	.readunlock = rcu_torture_read_unlock,
-	.completed = rcu_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = synchronize_rcu,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= rcu_torture_read_lock,
+	.read_delay	= rcu_read_delay,
+	.readunlock	= rcu_torture_read_unlock,
+	.completed	= rcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= synchronize_rcu,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_sync"
 };
 
 /*
@@ -432,33 +432,33 @@  static void rcu_bh_torture_synchronize(void)
 }
 
 static struct rcu_torture_ops rcu_bh_ops = {
-	.init = NULL,
-	.cleanup = NULL,
-	.readlock = rcu_bh_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = rcu_bh_torture_read_unlock,
-	.completed = rcu_bh_torture_completed,
-	.deferredfree = rcu_bh_torture_deferred_free,
-	.sync = rcu_bh_torture_synchronize,
-	.cb_barrier = rcu_barrier_bh,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_bh"
+	.init		= NULL,
+	.cleanup	= NULL,
+	.readlock	= rcu_bh_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_bh_torture_read_unlock,
+	.completed	= rcu_bh_torture_completed,
+	.deferred_free	= rcu_bh_torture_deferred_free,
+	.sync		= rcu_bh_torture_synchronize,
+	.cb_barrier	= rcu_barrier_bh,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_bh"
 };
 
 static struct rcu_torture_ops rcu_bh_sync_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = rcu_bh_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = rcu_bh_torture_read_unlock,
-	.completed = rcu_bh_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = rcu_bh_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_bh_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= rcu_bh_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_bh_torture_read_unlock,
+	.completed	= rcu_bh_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= rcu_bh_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_bh_sync"
 };
 
 /*
@@ -530,17 +530,17 @@  static int srcu_torture_stats(char *page)
 }
 
 static struct rcu_torture_ops srcu_ops = {
-	.init = srcu_torture_init,
-	.cleanup = srcu_torture_cleanup,
-	.readlock = srcu_torture_read_lock,
-	.readdelay = srcu_read_delay,
-	.readunlock = srcu_torture_read_unlock,
-	.completed = srcu_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = srcu_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = srcu_torture_stats,
-	.name = "srcu"
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu"
 };
 
 /*
@@ -574,32 +574,47 @@  static void sched_torture_synchronize(void)
 }
 
 static struct rcu_torture_ops sched_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = sched_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = sched_torture_read_unlock,
-	.completed = sched_torture_completed,
-	.deferredfree = rcu_sched_torture_deferred_free,
-	.sync = sched_torture_synchronize,
-	.cb_barrier = rcu_barrier_sched,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "sched"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sched_torture_deferred_free,
+	.sync		= sched_torture_synchronize,
+	.cb_barrier	= rcu_barrier_sched,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "sched"
 };
 
 static struct rcu_torture_ops sched_ops_sync = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = sched_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = sched_torture_read_unlock,
-	.completed = sched_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = sched_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.name = "sched_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= sched_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.name		= "sched_sync"
+};
+
+static struct rcu_torture_ops sched_expedited_ops = {
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= synchronize_sched_expedited,
+	.cb_barrier	= NULL,
+	.stats		= sched_expedited_torture_stats,
+	.irq_capable	= 1,
+	.name		= "sched_expedited"
 };
 
 /*
@@ -635,7 +650,7 @@  rcu_torture_writer(void *arg)
 				i = RCU_TORTURE_PIPE_LEN;
 			atomic_inc(&rcu_torture_wcount[i]);
 			old_rp->rtort_pipe_count++;
-			cur_ops->deferredfree(old_rp);
+			cur_ops->deferred_free(old_rp);
 		}
 		rcu_torture_current_version++;
 		oldbatch = cur_ops->completed();
@@ -700,7 +715,7 @@  static void rcu_torture_timer(unsigned long unused)
 	if (p->rtort_mbtest == 0)
 		atomic_inc(&n_rcu_torture_mberror);
 	spin_lock(&rand_lock);
-	cur_ops->readdelay(&rand);
+	cur_ops->read_delay(&rand);
 	n_rcu_torture_timers++;
 	spin_unlock(&rand_lock);
 	preempt_disable();
@@ -738,11 +753,11 @@  rcu_torture_reader(void *arg)
 
 	VERBOSE_PRINTK_STRING("rcu_torture_reader task started");
 	set_user_nice(current, 19);
-	if (irqreader && cur_ops->irqcapable)
+	if (irqreader && cur_ops->irq_capable)
 		setup_timer_on_stack(&t, rcu_torture_timer, 0);
 
 	do {
-		if (irqreader && cur_ops->irqcapable) {
+		if (irqreader && cur_ops->irq_capable) {
 			if (!timer_pending(&t))
 				mod_timer(&t, 1);
 		}
@@ -757,7 +772,7 @@  rcu_torture_reader(void *arg)
 		}
 		if (p->rtort_mbtest == 0)
 			atomic_inc(&n_rcu_torture_mberror);
-		cur_ops->readdelay(&rand);
+		cur_ops->read_delay(&rand);
 		preempt_disable();
 		pipe_count = p->rtort_pipe_count;
 		if (pipe_count > RCU_TORTURE_PIPE_LEN) {
@@ -778,7 +793,7 @@  rcu_torture_reader(void *arg)
 	} while (!kthread_should_stop() && fullstop == FULLSTOP_DONTSTOP);
 	VERBOSE_PRINTK_STRING("rcu_torture_reader task stopping");
 	rcutorture_shutdown_absorb("rcu_torture_reader");
-	if (irqreader && cur_ops->irqcapable)
+	if (irqreader && cur_ops->irq_capable)
 		del_timer_sync(&t);
 	while (!kthread_should_stop())
 		schedule_timeout_uninterruptible(1);
@@ -1078,6 +1093,7 @@  rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] =
 		{ &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+		  &sched_expedited_ops,
 		  &srcu_ops, &sched_ops, &sched_ops_sync, };
 
 	mutex_lock(&fullstop_mutex);