[RFC] v7 expedited "big hammer" RCU grace periods

Message ID 20090522190525.GA13286@linux.vnet.ibm.com
State RFC, archived
Delegated to: David Miller

Commit Message

Paul E. McKenney May 22, 2009, 7:05 p.m. UTC
Seventh cut of "big hammer" expedited RCU grace periods.  This leverages
the existing per-CPU migration kthreads, as suggested by Ingo.  These
are awakened in a loop, and waited for in a second loop.  Not fully
scalable, but removing the extra hop through smp_call_function()
reduces latency on systems with moderate numbers of CPUs.  The
synchronize_rcu_expedited() and synchronize_rcu_bh_expedited() primitives
invoke synchronize_sched_expedited(), except for CONFIG_PREEMPT_RCU,
where they instead invoke synchronize_rcu() and synchronize_rcu_bh(),
respectively.  This will be fixed in the future, after preemptable RCU
is folded into the rcutree implementation.

As before, this does nothing to expedite callbacks already registered
with call_rcu() or call_rcu_bh(), but there is no need to.
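
For illustration, here is a hedged caller-side sketch of where the new
primitive is intended to slot in (the struct, lock, and function names
below are invented for this example):

	struct foo {
		int a;
	};
	static struct foo *global_foo;
	static DEFINE_SPINLOCK(foo_lock);

	/*
	 * Update-side path that cannot tolerate the latency of a normal
	 * grace period: identical to the usual pattern, except that
	 * synchronize_rcu() is replaced by synchronize_rcu_expedited().
	 * Readers still use rcu_read_lock()/rcu_dereference() unchanged.
	 */
	static void update_foo(struct foo *new_fp)
	{
		struct foo *old_fp;

		spin_lock(&foo_lock);
		old_fp = global_foo;
		rcu_assign_pointer(global_foo, new_fp);
		spin_unlock(&foo_lock);

		/* Wait for pre-existing readers, but quickly. */
		synchronize_rcu_expedited();
		kfree(old_fp);
	}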

Passes 10 hours of rcutorture testing in parallel with a script that
randomly offlines and onlines CPUs.  Grace periods take about 28
microseconds on an 8-CPU Power machine, which I believe is good enough
from a performance viewpoint for the near future.

This is not yet for inclusion.  This does seem to me to be a good-enough
implementation, but more testing and review is required.

Shortcomings:

o	Does not address preemptable RCU.

o	Probably not helpful on systems with thousands of CPUs, but likely
	quite helpful even on systems with a few hundreds of CPUs.

Changes since v6:

o	Moved to using the migration threads, as suggested by Ingo.

Changes since v5:

o	Fixed several embarrassing locking bugs, including those
	noted by Ingo and Lai.

o	Added a missing set of braces.

o	Cut out the extra kthread, so that synchronize_sched_expedited()
	directly calls smp_call_function() and waits for the quiescent
	states.

o	Removed some debug code, but promoted one to production.

o	Fixed a compiler warning.

Changes since v4:

o	Use per-CPU kthreads to force the quiescent states in parallel.

Changes since v3:

o	Use a kthread that schedules itself on each CPU in turn to
	force a grace period.  The synchronize_rcu() primitive
	wakes up the kthread in order to avoid messing with affinity
	masks on user tasks.

o	Tried a number of additional variations on the v3 approach, none
	of which helped much.

Changes since v2:

o	Use reschedule IPIs rather than a softirq.

Changes since v1:

o	Added rcutorture support, and added exports required by
	rcutorture.

o	Added comment stating that smp_call_function() implies a
	memory barrier, suggested by Mathieu.

o	Added #include for delay.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcuclassic.h |   15 +++
 include/linux/rcupdate.h   |   24 ++---
 include/linux/rcupreempt.h |   10 ++
 include/linux/rcutree.h    |   12 ++
 kernel/rcupdate.c          |   27 ++++++
 kernel/rcutorture.c        |  200 ++++++++++++++++++++++++---------------------
 kernel/sched.c             |   52 +++++++++++
 7 files changed, 233 insertions(+), 107 deletions(-)


Comments

Lai Jiangshan May 25, 2009, 6:35 a.m. UTC | #1
Paul E. McKenney wrote:
> Seventh cut of "big hammer" expedited RCU grace periods.  This leverages
> the existing per-CPU migration kthreads, as suggested by Ingo.  These
> are awakened in a loop, and waited for in a second loop.  Not fully
> scalable, but removing the extra hop through smp_call_function
> reduces latency on systems with moderate numbers of CPUs.  The
> synchronize_rcu_expedited() and and synchronize_bh_expedited() primitives
> invoke synchronize_sched_expedited(), except for CONFIG_PREEMPT_RCU,
> where they instead invoke synchronize_rcu() and synchronize_rcu_bh(),
> respectively.  This will be fixed in the future, after preemptable RCU
> is folded into the rcutree implementation.
> 

I strongly need this guarantee:

preempt_disable() guarantees/implies rcu_read_lock().

And
local_irq_disable() guarantees/implies rcu_read_lock().
rcu_read_lock_bh() guarantees/implies rcu_read_lock().


It will simplify code.

And
a lot of current kernel code does not use rcu_read_lock()
when it has preempt_disable()-ed/local_irq_disable()-ed or
when it is in irq/softirq.

Without these guarantees, this code is broken.
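
For illustration, a minimal sketch of the pattern I mean (the names here
are invented; only the locking pattern matters):

	struct foo {
		int a;
	};
	static struct foo *global_foo;

	/*
	 * Reader that relies on preempt_disable() instead of an explicit
	 * rcu_read_lock().  This is only safe if the RCU implementation
	 * guarantees that disabling preemption implies an RCU read-side
	 * critical section, which is what I am asking for above.
	 */
	static int read_foo_preempt_off(void)
	{
		struct foo *p;
		int val = -1;

		preempt_disable();		/* no rcu_read_lock() here */
		p = rcu_dereference(global_foo);
		if (p)
			val = p->a;
		preempt_enable();
		return val;
	}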

> +
> +static DEFINE_PER_CPU(struct migration_req, rcu_migration_req);
> +static DEFINE_MUTEX(rcu_sched_expedited_mutex);
> +
> +/*
> + * Wait for an rcu-sched grace period to elapse, but use "big hammer"
> + * approach to force grace period to end quickly.  This consumes
> + * significant time on all CPUs, and is thus not recommended for
> + * any sort of common-case code.
> + */
> +void synchronize_sched_expedited(void)
> +{
> +	int cpu;
> +	unsigned long flags;
> +	struct rq *rq;
> +	struct migration_req *req;
> +
> +	mutex_lock(&rcu_sched_expedited_mutex);
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		rq = cpu_rq(cpu);
> +		req = &per_cpu(rcu_migration_req, cpu);
> +		init_completion(&req->done);
> +		req->task = NULL;
> +		req->dest_cpu = -1;
> +		spin_lock_irqsave(&rq->lock, flags);
> +		list_add(&req->list, &rq->migration_queue);
> +		spin_unlock_irqrestore(&rq->lock, flags);
> +		wake_up_process(rq->migration_thread);
> +	}
> +	for_each_online_cpu(cpu) {
> +		req = &per_cpu(rcu_migration_req, cpu);
> +		wait_for_completion(&req->done);
> +	}
> +	put_online_cpus();
> +	mutex_unlock(&rcu_sched_expedited_mutex);
> +}
> +EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
> +
> +#endif /* #else #ifndef CONFIG_SMP */
> 
> 

Very nice implementation!

Only one opinion:
get_online_cpus() is a large lock; many locks in the kernel are acquired
after cpu_hotplug.lock.

_cpu_down()
	cpu_hotplug_begin()
		mutex_lock(&cpu_hotplug.lock)
	__raw_notifier_call_chain(CPU_DOWN_PREPARE)
		Lock a-kernel-lock.

It means that when we hold a-kernel-lock, we cannot call
synchronize_sched_expedited(); get_online_cpus() narrows the contexts
in which synchronize_sched_expedited() can be used.

I think we can reuse req->dest_cpu and remove get_online_cpus().
(and use preempt_disable() and for_each_possible_cpu())

req->dest_cpu = -2 means @req is not queued
req->dest_cpu = -1 means @req is queued

a little like this code:

	mutex_lock(&rcu_sched_expedited_mutex);
	for_each_possible_cpu(cpu) {
		preempt_disable()
		if (cpu is not online)
			just set req->dest_cpu to -2;
		else
			init and queue req, and wake_up_process().
		preempt_enable()
	}
	for_each_possible_cpu(cpu) {
		if (req is queued)
			wait_for_completion().
	}
	mutex_unlock(&rcu_sched_expedited_mutex);
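
Spelled out against the code in your patch, my suggestion would look
something like the sketch below (untested; the -2/-1 encoding of
req->dest_cpu is just the convention described above):

	void synchronize_sched_expedited(void)
	{
		int cpu;
		unsigned long flags;
		struct rq *rq;
		struct migration_req *req;

		mutex_lock(&rcu_sched_expedited_mutex);
		for_each_possible_cpu(cpu) {
			req = &per_cpu(rcu_migration_req, cpu);
			preempt_disable();
			if (!cpu_online(cpu)) {
				req->dest_cpu = -2;	/* not queued */
			} else {
				rq = cpu_rq(cpu);
				init_completion(&req->done);
				req->task = NULL;
				req->dest_cpu = -1;	/* queued */
				spin_lock_irqsave(&rq->lock, flags);
				list_add(&req->list, &rq->migration_queue);
				spin_unlock_irqrestore(&rq->lock, flags);
				wake_up_process(rq->migration_thread);
			}
			preempt_enable();
		}
		for_each_possible_cpu(cpu) {
			req = &per_cpu(rcu_migration_req, cpu);
			if (req->dest_cpu == -1)	/* was queued */
				wait_for_completion(&req->done);
		}
		mutex_unlock(&rcu_sched_expedited_mutex);
	}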


The coupling between synchronize_sched_expedited() and migration_req
is greatly increased:

1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
   See migration_call::CPU_DEAD

2) migration_call() has the highest priority among cpu notifiers,
   so even if another cpu notifier calls synchronize_sched_expedited(),
   it will not cause DEADLOCK.

Lai.

Paul E. McKenney May 25, 2009, 4:44 p.m. UTC | #2
On Mon, May 25, 2009 at 02:35:15PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > Seventh cut of "big hammer" expedited RCU grace periods.  This leverages
> > the existing per-CPU migration kthreads, as suggested by Ingo.  These
> > are awakened in a loop, and waited for in a second loop.  Not fully
> > scalable, but removing the extra hop through smp_call_function
> > reduces latency on systems with moderate numbers of CPUs.  The
> > synchronize_rcu_expedited() and and synchronize_bh_expedited() primitives
> > invoke synchronize_sched_expedited(), except for CONFIG_PREEMPT_RCU,
> > where they instead invoke synchronize_rcu() and synchronize_rcu_bh(),
> > respectively.  This will be fixed in the future, after preemptable RCU
> > is folded into the rcutree implementation.
> > 
> 
> I'm strongly need this guarantee:
> 
> preempt_disable() guarantees/implies rcu_read_lock().
> 
> And
> local_irq_diable() guarantees/implies rcu_read_lock().
> rcu_read_lock_bh() guarantees/implies rcu_read_lock().
> 
> 
> It will simplifies codes.

Hmmm...  I did produce a patch that modified preemptable RCU
in this way quite some time back, but at that time the problem
was solved in a different way.  The approach was either to add
rcu_read_lock()/rcu_read_unlock() pairs as needed, or to switch to
call_rcu_sched() or synchronize_sched() for the update side.

But please note that this approach would not permit the current
implementation of synchronize_sched_expedited() to be used in the
preemptable-RCU case, because synchronize_sched_expedited() would
not wait for RCU read-side critical sections that had been preempted.
And the ability to preempt RCU read-side critical sections is absolutely
required for CONFIG_PREEMPT_RT versions of the Linux kernel.

> And
> A lot's of current kernel code does not use rcu_read_lock()
> when it has preempt_disable()-ed/local_irq_diable()-ed or
> when it is in irq/softirq.
> 
> Without these guarantees, these code is broken.

One way or another, such code does need to be fixed.  Either by
reintroducing the old patch, or by fixing the code itself.

> > +
> > +static DEFINE_PER_CPU(struct migration_req, rcu_migration_req);
> > +static DEFINE_MUTEX(rcu_sched_expedited_mutex);
> > +
> > +/*
> > + * Wait for an rcu-sched grace period to elapse, but use "big hammer"
> > + * approach to force grace period to end quickly.  This consumes
> > + * significant time on all CPUs, and is thus not recommended for
> > + * any sort of common-case code.
> > + */
> > +void synchronize_sched_expedited(void)
> > +{
> > +	int cpu;
> > +	unsigned long flags;
> > +	struct rq *rq;
> > +	struct migration_req *req;
> > +
> > +	mutex_lock(&rcu_sched_expedited_mutex);
> > +	get_online_cpus();
> > +	for_each_online_cpu(cpu) {
> > +		rq = cpu_rq(cpu);
> > +		req = &per_cpu(rcu_migration_req, cpu);
> > +		init_completion(&req->done);
> > +		req->task = NULL;
> > +		req->dest_cpu = -1;
> > +		spin_lock_irqsave(&rq->lock, flags);
> > +		list_add(&req->list, &rq->migration_queue);
> > +		spin_unlock_irqrestore(&rq->lock, flags);
> > +		wake_up_process(rq->migration_thread);
> > +	}
> > +	for_each_online_cpu(cpu) {
> > +		req = &per_cpu(rcu_migration_req, cpu);
> > +		wait_for_completion(&req->done);
> > +	}
> > +	put_online_cpus();
> > +	mutex_unlock(&rcu_sched_expedited_mutex);
> > +}
> > +EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
> > +
> > +#endif /* #else #ifndef CONFIG_SMP */
> > 
> > 
> 
> Very nice implement!

Glad you like it!!!  The idea is Ingo's.

Gah!!!  I should have included a Suggested-by on v7 of the patch!!!

> Only one opinion:
> get_online_cpus() is a large lock, a lot's of lock in kernel is required
> after cpu_hotplug.lock.
> 
> _cpu_down()
> 	cpu_hotplug_begin()
> 		mutex_lock(&cpu_hotplug.lock)
> 	__raw_notifier_call_chain(CPU_DOWN_PREPARE)
> 		Lock a-kernel-lock.
> 
> It means when we have held a-kernel-lock, we can not call
> synchronize_sched_expedited(). get_online_cpus() narrows
> synchronize_sched_expedited()'s usages.
> 
> I think we can reuse req->dest_cpu and remove get_online_cpus().
> (and use preempt_disable() and for_each_possible_cpu())
> 
> req->dest_cpu = -2 means @req is not queued
> req->dest_cpu = -1 means @req is queued
> 
> a little like this code:
> 
> 	mutex_lock(&rcu_sched_expedited_mutex);
> 	for_each_possible_cpu(cpu) {
> 		preempt_disable()
> 		if (cpu is not online)
> 			just set req->dest_cpu to -2;
> 		else
> 			init and queue req, and wake_up_process().
> 		preempt_enable()
> 	}
> 	for_each_possible_cpu(cpu) {
> 		if (req is queued)
> 			wait_for_completion().
> 	}
> 	mutex_unlock(&rcu_sched_expedited_mutex);

Good point -- I should at the very least add a comment to
synchronize_sched_expedited() stating that it cannot be called holding
any lock that is acquired in a CPU hotplug notifier.  If this restriction
causes any problems, then your approach seems like a promising fix.

> The coupling of synchronize_sched_expedited() and migration_req
> is largely increased:
> 
> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
>    See migration_call::CPU_DEAD

Good.  ;-)

> 2) migration_call() is the highest priority of cpu notifiers,
>    So even any other cpu notifier calls synchronize_sched_expedited(),
>    It'll not cause DEADLOCK.

You mean if using your preempt_disable() approach, right?  Unless I am
missing something, the current get_online_cpus() approach would deadlock
in this case.

							Thanx, Paul
Lai Jiangshan May 26, 2009, 1:03 a.m. UTC | #3
Paul E. McKenney wrote:
> 
> Good point -- I should at the very least add a comment to
> synchronize_sched_expedited() stating that it cannot be called holding
> any lock that is acquired in a CPU hotplug notifier.  If this restriction
> causes any problems, then your approach seems like a promising fix.


Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

> 
>> The coupling of synchronize_sched_expedited() and migration_req
>> is largely increased:
>>
>> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
>>    See migration_call::CPU_DEAD
> 
> Good.  ;-)
> 
>> 2) migration_call() is the highest priority of cpu notifiers,
>>    So even any other cpu notifier calls synchronize_sched_expedited(),
>>    It'll not cause DEADLOCK.
> 
> You mean if using your preempt_disable() approach, right?  Unless I am
> missing something, the current get_online_cpus() approach would deadlock
> in this case.
> 

Yes, I mean if using my preempt_disable() approach. The current
get_online_cpus() approach would NOT deadlock in this case either;
we can call get_online_cpus() in cpu notifiers.

> diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h
> index fce5227..78117ed 100644
> --- a/include/linux/rcupreempt.h
> +++ b/include/linux/rcupreempt.h
> @@ -74,6 +74,16 @@ extern int rcu_needs_cpu(int cpu);
>  
>  extern void __synchronize_sched(void);
>  
> +static inline void synchronize_rcu_expedited(void)
> +{
> +	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
> +}
> +
> +static inline void synchronize_rcu_bh_expedited(void)
> +{
> +	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
> +}
> +

Why not synchronize_rcu_bh() ?

In mainline, rcu_read_lock_bh() is not preemptable,
so I think synchronize_sched_expedited() is better.

Anyway, synchronize_rcu() is OK for me, because it is
"Placeholder for new rcupreempt implementation".

Lai

Paul E. McKenney May 26, 2009, 1:28 a.m. UTC | #4
On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > 
> > Good point -- I should at the very least add a comment to
> > synchronize_sched_expedited() stating that it cannot be called holding
> > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > causes any problems, then your approach seems like a promising fix.
> 
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thank you very much for your review and comments!!!

> >> The coupling of synchronize_sched_expedited() and migration_req
> >> is largely increased:
> >>
> >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> >>    See migration_call::CPU_DEAD
> > 
> > Good.  ;-)
> > 
> >> 2) migration_call() is the highest priority of cpu notifiers,
> >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> >>    It'll not cause DEADLOCK.
> > 
> > You mean if using your preempt_disable() approach, right?  Unless I am
> > missing something, the current get_online_cpus() approach would deadlock
> > in this case.
> 
> Yes, I mean if using my preempt_disable() approach. The current
> get_online_cpus() approach would NOT deadlock in this case also,
> we can require get_online_cpus() in cpu notifiers.

I have added the comment for the time being, but should people need to
use this in CPU-hotplug notifiers, then again your preempt_disable()
approach looks to be a promising fix.

> > diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h
> > index fce5227..78117ed 100644
> > --- a/include/linux/rcupreempt.h
> > +++ b/include/linux/rcupreempt.h
> > @@ -74,6 +74,16 @@ extern int rcu_needs_cpu(int cpu);
> >  
> >  extern void __synchronize_sched(void);
> >  
> > +static inline void synchronize_rcu_expedited(void)
> > +{
> > +	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
> > +}
> > +
> > +static inline void synchronize_rcu_bh_expedited(void)
> > +{
> > +	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
> > +}
> > +
> 
> Why not synchronize_rcu_bh() ?

Ummm...  Because I did a typo.  Fixed.

> In mainline, rcu_read_lock_bh() is not preemptable,
> So I think synchronize_sched_expedited() is better.
> 
> Anyway, synchronize_rcu() is OK for me, because it is
> "Placeholder for new rcupreempt implementation".

And I am worried that preemptable RCU's rcu_bh read sides might someday
become preemptable.  Seems a bit unlikely at this point, but why tempt
fate?

							Thanx, Paul
Paul E. McKenney May 26, 2009, 3:46 p.m. UTC | #5
On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > Paul E. McKenney wrote:
> > > 
> > > Good point -- I should at the very least add a comment to
> > > synchronize_sched_expedited() stating that it cannot be called holding
> > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > causes any problems, then your approach seems like a promising fix.
> > 
> > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Thank you very much for your review and comments!!!
> 
> > >> The coupling of synchronize_sched_expedited() and migration_req
> > >> is largely increased:
> > >>
> > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > >>    See migration_call::CPU_DEAD
> > > 
> > > Good.  ;-)
> > > 
> > >> 2) migration_call() is the highest priority of cpu notifiers,
> > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > >>    It'll not cause DEADLOCK.
> > > 
> > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > missing something, the current get_online_cpus() approach would deadlock
> > > in this case.
> > 
> > Yes, I mean if using my preempt_disable() approach. The current
> > get_online_cpus() approach would NOT deadlock in this case also,
> > we can require get_online_cpus() in cpu notifiers.
> 
> I have added the comment for the time being, but should people need to
> use this in CPU-hotplug notifiers, then again your preempt_disable()
> approach looks to be a promising fix.

I looked more closely at your preempt_disable() suggestion, which you
presented earlier as follows:

> I think we can reuse req->dest_cpu and remove get_online_cpus().
> (and use preempt_disable() and for_each_possible_cpu())
> 
> req->dest_cpu = -2 means @req is not queued
> req->dest_cpu = -1 means @req is queued
> 
> a little like this code:
> 
> 	mutex_lock(&rcu_sched_expedited_mutex);
> 	for_each_possible_cpu(cpu) {
> 		preempt_disable()
> 		if (cpu is not online)
> 			just set req->dest_cpu to -2;
> 		else
> 			init and queue req, and wake_up_process().
> 		preempt_enable()
> 	}
> 	for_each_possible_cpu(cpu) {
> 		if (req is queued)
> 			wait_for_completion().
> 	}
> 	mutex_unlock(&rcu_sched_expedited_mutex);

I am concerned about the following sequence of events:

o	synchronize_sched_expedited() disables preemption, thus blocking
	offlining operations.

o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
	and proceeds, and is now waiting for preemption to be enabled.

o	synchronize_sched_expedited() disables preemption, sees
	that CPU 0 is online, so initializes and queues a request,
	does a wake-up-process(), and finally does a preempt_enable().

o	CPU 0 is currently running a high-priority real-time process,
	so the wakeup does not immediately happen.

o	The offlining process completes, including the kthread_stop()
	to the migration task.

o	The migration task wakes up, sees kthread_should_stop(),
	and so exits without checking its queue.

o	synchronize_sched_expedited() waits forever for CPU 0 to respond.

I suppose that one way to handle this would be to check for the CPU
going offline before doing the wait_for_completion(), but I am concerned
about races affecting this check as well.

Or is there something in the CPU-offline process that makes the above
sequence of events impossible?

							Thanx, Paul
Mathieu Desnoyers May 26, 2009, 4:41 p.m. UTC | #6
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > Paul E. McKenney wrote:
> > > > 
> > > > Good point -- I should at the very least add a comment to
> > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > causes any problems, then your approach seems like a promising fix.
> > > 
> > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > 
> > Thank you very much for your review and comments!!!
> > 
> > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > >> is largely increased:
> > > >>
> > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > >>    See migration_call::CPU_DEAD
> > > > 
> > > > Good.  ;-)
> > > > 
> > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > >>    It'll not cause DEADLOCK.
> > > > 
> > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > missing something, the current get_online_cpus() approach would deadlock
> > > > in this case.
> > > 
> > > Yes, I mean if using my preempt_disable() approach. The current
> > > get_online_cpus() approach would NOT deadlock in this case also,
> > > we can require get_online_cpus() in cpu notifiers.
> > 
> > I have added the comment for the time being, but should people need to
> > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > approach looks to be a promising fix.
> 
> I looked more closely at your preempt_disable() suggestion, which you
> presented earlier as follows:
> 
> > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > (and use preempt_disable() and for_each_possible_cpu())
> > 
> > req->dest_cpu = -2 means @req is not queued
> > req->dest_cpu = -1 means @req is queued
> > 
> > a little like this code:
> > 
> > 	mutex_lock(&rcu_sched_expedited_mutex);
> > 	for_each_possible_cpu(cpu) {
> > 		preempt_disable()
> > 		if (cpu is not online)
> > 			just set req->dest_cpu to -2;
> > 		else
> > 			init and queue req, and wake_up_process().
> > 		preempt_enable()
> > 	}
> > 	for_each_possible_cpu(cpu) {
> > 		if (req is queued)
> > 			wait_for_completion().
> > 	}
> > 	mutex_unlock(&rcu_sched_expedited_mutex);
> 
> I am concerned about the following sequence of events:
> 
> o	synchronize_sched_expedited() disables preemption, thus blocking
> 	offlining operations.
> 
> o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> 	and proceeds, and is now waiting for preemption to be enabled.
> 
> o	synchronize_sched_expedited() disables preemption, sees
> 	that CPU 0 is online, so initializes and queues a request,
> 	does a wake-up-process(), and finally does a preempt_enable().
> 
> o	CPU 0 is currently running a high-priority real-time process,
> 	so the wakeup does not immediately happen.
> 
> o	The offlining process completes, including the kthread_stop()
> 	to the migration task.
> 
> o	The migration task wakes up, sees kthread_should_stop(),
> 	and so exits without checking its queue.
> 
> o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> 
> I suppose that one way to handle this would be to check for the CPU
> going offline before doing the wait_for_completion(), but I am concerned
> about races affecting this check as well.
> 
> Or is there something in the CPU-offline process that makes the above
> sequence of events impossible?
> 

I think you are right, there is a problem there. The simple fact that
this needs to disable preemption to protect against cpu hotplug seems a
bit strange. If I may, I would propose an alternate solution, which assumes
that threads pinned to a CPU are migrated to a different CPU when that CPU
goes offline (and will therefore execute anyway), and that a CPU brought
online after the first iteration over online cpus was already quiescent
(hopefully my assumptions are right). Preemption is left enabled during
the whole critical section.

It looks a lot like Lai's approach, except that I use a cpumask (I
thought it looked cleaner and typically involves fewer operations than
looping on each possible cpu). I also don't disable preemption and
assume that cpu hotplug can happen at any point during this critical
section.

Something along the lines of:

static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
const struct cpumask *const cpu_wait_expedited_mask =
			to_cpumask(cpu_wait_expedited_bits);

	mutex_lock(&rcu_sched_expedited_mutex);
	cpumask_clear(cpu_wait_expedited_mask);
	for_each_online_cpu(cpu) {
		init and queue cpu req, and wake_up_process().
		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
	}
	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
		wait_for_completion(cpu req);
	}
	mutex_unlock(&rcu_sched_expedited_mutex);

There is one concern with this approach: if a CPU is hotunplugged and
hotplugged during the critical section, I think the scheduler would
migrate the thread to a different CPU (upon hotunplug) and let the
thread run on this other CPU. If the target CPU is hotplugged again,
this would mean the thread would have run on a different CPU than the
target. I think we can argue that a CPU going offline and online again
will meet quiescent state requirements, so this should not be a problem.

Mathieu
Paul E. McKenney May 26, 2009, 6:13 p.m. UTC | #7
On Tue, May 26, 2009 at 12:41:29PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > > Paul E. McKenney wrote:
> > > > > 
> > > > > Good point -- I should at the very least add a comment to
> > > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > > causes any problems, then your approach seems like a promising fix.
> > > > 
> > > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > 
> > > Thank you very much for your review and comments!!!
> > > 
> > > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > > >> is largely increased:
> > > > >>
> > > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > > >>    See migration_call::CPU_DEAD
> > > > > 
> > > > > Good.  ;-)
> > > > > 
> > > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > > >>    It'll not cause DEADLOCK.
> > > > > 
> > > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > > missing something, the current get_online_cpus() approach would deadlock
> > > > > in this case.
> > > > 
> > > > Yes, I mean if using my preempt_disable() approach. The current
> > > > get_online_cpus() approach would NOT deadlock in this case also,
> > > > we can require get_online_cpus() in cpu notifiers.
> > > 
> > > I have added the comment for the time being, but should people need to
> > > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > > approach looks to be a promising fix.
> > 
> > I looked more closely at your preempt_disable() suggestion, which you
> > presented earlier as follows:
> > 
> > > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > > (and use preempt_disable() and for_each_possible_cpu())
> > > 
> > > req->dest_cpu = -2 means @req is not queued
> > > req->dest_cpu = -1 means @req is queued
> > > 
> > > a little like this code:
> > > 
> > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > 	for_each_possible_cpu(cpu) {
> > > 		preempt_disable()
> > > 		if (cpu is not online)
> > > 			just set req->dest_cpu to -2;
> > > 		else
> > > 			init and queue req, and wake_up_process().
> > > 		preempt_enable()
> > > 	}
> > > 	for_each_possible_cpu(cpu) {
> > > 		if (req is queued)
> > > 			wait_for_completion().
> > > 	}
> > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > 
> > I am concerned about the following sequence of events:
> > 
> > o	synchronize_sched_expedited() disables preemption, thus blocking
> > 	offlining operations.
> > 
> > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > 	and proceeds, and is now waiting for preemption to be enabled.
> > 
> > o	synchronize_sched_expedited() disables preemption, sees
> > 	that CPU 0 is online, so initializes and queues a request,
> > 	does a wake-up-process(), and finally does a preempt_enable().
> > 
> > o	CPU 0 is currently running a high-priority real-time process,
> > 	so the wakeup does not immediately happen.
> > 
> > o	The offlining process completes, including the kthread_stop()
> > 	to the migration task.
> > 
> > o	The migration task wakes up, sees kthread_should_stop(),
> > 	and so exits without checking its queue.
> > 
> > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > 
> > I suppose that one way to handle this would be to check for the CPU
> > going offline before doing the wait_for_completion(), but I am concerned
> > about races affecting this check as well.
> > 
> > Or is there something in the CPU-offline process that makes the above
> > sequence of events impossible?
> > 
> 
> I think you are right, there is a problem there. The simple fact that
> this needs to disable preemption to protect against cpu hotplug seems a
> bit strange. If I may propose an alternate solution, which assumes that
> threads pinned to a CPU are migrated to a different CPU when a CPU goes
> offline (and will therefore execute anyway), and that a CPU brought
> online after the first iteration on online cpus was already quiescent
> (hopefully my assumptions are right). Preemption is left enabled during
> all the critical section.
> 
> It looks a lot like Lai's approach, except that I use a cpumask (I
> thought it looked cleaner and typically involves less operations than
> looping on each possible cpu). I also don't disable preemption and
> assume that cpu hotplug can happen at any point during this critical
> section.
> 
> Something along the lines of :
> 
> static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
> const struct cpumask *const cpu_wait_expedited_mask =
> 			to_cpumask(cpu_wait_expedited_bits);
> 
> 	mutex_lock(&rcu_sched_expedited_mutex);
> 	cpumask_clear(cpu_wait_expedited_mask);
> 	for_each_online_cpu(cpu) {
> 		init and queue cpu req, and wake_up_process().
> 		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
> 	}
> 	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
> 		wait_for_completion(cpu req);
> 	}
> 	mutex_unlock(&rcu_sched_expedited_mutex);
> 
> There is one concern with this approach : if a CPU is hotunplugged and
> hotplugged during the critical section, I think the scheduler would
> migrate the thread to a different CPU (upon hotunplug) and let the
> thread run on this other CPU. If the target CPU is hotplugged again,
> this would mean the thread would have run on a different CPU than the
> target. I think we can argue that a CPU going offline and online again
> will meet quiescent state requirements, so this should not be a problem.

Having the task runnable on some other CPU is very scary to me.  If the
CPU comes back online, and synchronize_sched_expedited() manages to
run before the task gets migrated back onto that CPU, then the grace
period could be ended too soon.

All of this is intended to make synchronize_sched_expedited() be able to
run in a CPU hotplug notifier.  Do we have an example where someone
really wants to do this?  If not, I am really starting to like v7 of
the patch.  ;-)

If someone really does need to run synchronize_sched_expedited() from a
CPU hotplug notifier, perhaps a simpler approach is to have something
like a try_get_online_cpus(), and just invoke synchronize_sched() upon
failure:

	void synchronize_sched_expedited(void)
	{
		int cpu;
		unsigned long flags;
		struct rq *rq;
		struct migration_req *req;

		mutex_lock(&rcu_sched_expedited_mutex);
		if (!try_get_online_cpus()) {
			mutex_unlock(&rcu_sched_expedited_mutex);
			synchronize_sched();
			return;
		}

		/* rest of synchronize_sched_expedited()... */

But I would want to see a real need for this beforehand.
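
For concreteness, one possible sketch of such a try_get_online_cpus(),
assuming it would live in kernel/cpu.c next to the existing cpu_hotplug
structure (again, no such function exists today, so this is purely
illustrative):

	/*
	 * Hypothetical trylock variant of get_online_cpus(): returns 1 on
	 * success and 0 if the hotplug lock could not be taken without
	 * sleeping on it (i.e., a hotplug operation is in flight).
	 */
	int try_get_online_cpus(void)
	{
		if (cpu_hotplug.active_writer == current)
			return 1;
		if (!mutex_trylock(&cpu_hotplug.lock))
			return 0;
		cpu_hotplug.refcount++;
		mutex_unlock(&cpu_hotplug.lock);
		return 1;
	}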

							Thanx, Paul
Mathieu Desnoyers May 27, 2009, 1:47 a.m. UTC | #8
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, May 26, 2009 at 12:41:29PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > > > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > > > Paul E. McKenney wrote:
> > > > > > 
> > > > > > Good point -- I should at the very least add a comment to
> > > > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > > > causes any problems, then your approach seems like a promising fix.
> > > > > 
> > > > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > > 
> > > > Thank you very much for your review and comments!!!
> > > > 
> > > > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > > > >> is largely increased:
> > > > > >>
> > > > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > > > >>    See migration_call::CPU_DEAD
> > > > > > 
> > > > > > Good.  ;-)
> > > > > > 
> > > > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > > > >>    It'll not cause DEADLOCK.
> > > > > > 
> > > > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > > > missing something, the current get_online_cpus() approach would deadlock
> > > > > > in this case.
> > > > > 
> > > > > Yes, I mean if using my preempt_disable() approach. The current
> > > > > get_online_cpus() approach would NOT deadlock in this case also,
> > > > > we can require get_online_cpus() in cpu notifiers.
> > > > 
> > > > I have added the comment for the time being, but should people need to
> > > > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > > > approach looks to be a promising fix.
> > > 
> > > I looked more closely at your preempt_disable() suggestion, which you
> > > presented earlier as follows:
> > > 
> > > > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > > > (and use preempt_disable() and for_each_possible_cpu())
> > > > 
> > > > req->dest_cpu = -2 means @req is not queued
> > > > req->dest_cpu = -1 means @req is queued
> > > > 
> > > > a little like this code:
> > > > 
> > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > 	for_each_possible_cpu(cpu) {
> > > > 		preempt_disable()
> > > > 		if (cpu is not online)
> > > > 			just set req->dest_cpu to -2;
> > > > 		else
> > > > 			init and queue req, and wake_up_process().
> > > > 		preempt_enable()
> > > > 	}
> > > > 	for_each_possible_cpu(cpu) {
> > > > 		if (req is queued)
> > > > 			wait_for_completion().
> > > > 	}
> > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > 
> > > I am concerned about the following sequence of events:
> > > 
> > > o	synchronize_sched_expedited() disables preemption, thus blocking
> > > 	offlining operations.
> > > 
> > > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > > 	and proceeds, and is now waiting for preemption to be enabled.
> > > 
> > > o	synchronize_sched_expedited() disables preemption, sees
> > > 	that CPU 0 is online, so initializes and queues a request,
> > > 	does a wake-up-process(), and finally does a preempt_enable().
> > > 
> > > o	CPU 0 is currently running a high-priority real-time process,
> > > 	so the wakeup does not immediately happen.
> > > 
> > > o	The offlining process completes, including the kthread_stop()
> > > 	to the migration task.
> > > 
> > > o	The migration task wakes up, sees kthread_should_stop(),
> > > 	and so exits without checking its queue.
> > > 
> > > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > > 
> > > I suppose that one way to handle this would be to check for the CPU
> > > going offline before doing the wait_for_completion(), but I am concerned
> > > about races affecting this check as well.
> > > 
> > > Or is there something in the CPU-offline process that makes the above
> > > sequence of events impossible?
> > > 
> > 
> > I think you are right, there is a problem there. The simple fact that
> > this needs to disable preemption to protect against cpu hotplug seems a
> > bit strange. If I may propose an alternate solution, which assumes that
> > threads pinned to a CPU are migrated to a different CPU when a CPU goes
> > offline (and will therefore execute anyway), and that a CPU brought
> > online after the first iteration on online cpus was already quiescent
> > (hopefully my assumptions are right). Preemption is left enabled during
> > all the critical section.
> > 
> > It looks a lot like Lai's approach, except that I use a cpumask (I
> > thought it looked cleaner and typically involves less operations than
> > looping on each possible cpu). I also don't disable preemption and
> > assume that cpu hotplug can happen at any point during this critical
> > section.
> > 
> > Something along the lines of :
> > 
> > static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
> > const struct cpumask *const cpu_wait_expedited_mask =
> > 			to_cpumask(cpu_wait_expedited_bits);
> > 
> > 	mutex_lock(&rcu_sched_expedited_mutex);
> > 	cpumask_clear(cpu_wait_expedited_mask);
> > 	for_each_online_cpu(cpu) {
> > 		init and queue cpu req, and wake_up_process().
> > 		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
> > 	}
> > 	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
> > 		wait_for_completion(cpu req);
> > 	}
> > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > 
> > There is one concern with this approach : if a CPU is hotunplugged and
> > hotplugged during the critical section, I think the scheduler would
> > migrate the thread to a different CPU (upon hotunplug) and let the
> > thread run on this other CPU. If the target CPU is hotplugged again,
> > this would mean the thread would have run on a different CPU than the
> > target. I think we can argue that a CPU going offline and online again
> > will meet quiescent state requirements, so this should not be a problem.
> 
> Having the task runnable on some other CPU is very scary to me.  If the
> CPU comes back online, and synchronize_sched_expedited() manages to
> run before the task gets migrated back onto that CPU, then the grace
> period could be ended too soon.
> 

Well, the idea is that we want all in-flight preempt off sections (as
seen at the beginning of synchronize_sched_expedited()) to be over
before we consider the grace period ended, right?

Let's say we read the cpu online mask at a given time (potentially non
atomically, we don't really care).

If, at any point in time while we read the cpu online mask, a CPU
appears to be offline, this means that it cannot hold any in-flight
preempt off section.

Even if that specific CPU comes back online after this moment, and
starts scheduling threads again, these threads cannot ever possibly be
in-flight in the old grace period.

Therefore, my argument is that for rcu_sched (classic rcu), a CPU going
back online while we wait for quiescent state cannot possibly ever start
running a thread in the previous grace period.

My second argument is that if a CPU is hotunplugging while we wait for
QS, either:

- It lets the completion thread run before it goes offline. That's fine
- It goes offline and the completion thread is migrated to another CPU.
  This will just make synchronize_sched_expedited() wait for one more
  completion that will execute on the CPU the thread has migrated to.
  Again, we don't care.
- It goes offline/online/offline/online/... : We go back to my first
  argument, which states that if a CPU is out of the cpu online mask at
  any given time after we started the synchronize_sched_expedited()
  execution, it cannot possibly hold an in-flight preempt off section
  belonging to the old GP.

Or am I missing something?

Mathieu


> All of this is intended to make synchronize_sched_expedited() be able to
> run in a CPU hotplug notifier.  Do we have an example where someone
> really wants to do this?  If not, I am really starting to like v7 of
> the patch.  ;-)
> 
> If someone really does need to run synchronize_sched_expedited() from a
> CPU hotplug notifier, perhaps a simpler approach is to have something
> like a try_get_online_cpus(), and just invoke synchronize_sched() upon
> failure:
> 
> 	void synchronize_sched_expedited(void)
> 	{
> 		int cpu;
> 		unsigned long flags;
> 		struct rq *rq;
> 		struct migration_req *req;
> 
> 		mutex_lock(&rcu_sched_expedited_mutex);
> 		if (!try_get_online_cpus()) {
> 			synchronize_sched();
> 			return;
> 		}
> 
> 		/* rest of synchronize_sched_expedited()... */
> 
> But I would want to see a real need for this beforehand.
> 
> 							Thanx, Paul
Lai Jiangshan May 27, 2009, 1:57 a.m. UTC | #9
Paul E. McKenney wrote:
> 
> I am concerned about the following sequence of events:
> 
> o	synchronize_sched_expedited() disables preemption, thus blocking
> 	offlining operations.
> 
> o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> 	and proceeds, and is now waiting for preemption to be enabled.
> 
> o	synchronize_sched_expedited() disables preemption, sees
> 	that CPU 0 is online, so initializes and queues a request,
> 	does a wake-up-process(), and finally does a preempt_enable().
> 
> o	CPU 0 is currently running a high-priority real-time process,
> 	so the wakeup does not immediately happen.
> 
> o	The offlining process completes, including the kthread_stop()
> 	to the migration task.
> 
> o	The migration task wakes up, sees kthread_should_stop(),
> 	and so exits without checking its queue.
> 
> o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> 
> I suppose that one way to handle this would be to check for the CPU
> going offline before doing the wait_for_completion(), but I am concerned
> about races affecting this check as well.
> 
> Or is there something in the CPU-offline process that makes the above
> sequence of events impossible?
> 
> 							Thanx, Paul
> 
> 

I realized this; that is why I wrote this:
> 
> The coupling of synchronize_sched_expedited() and migration_req
> is largely increased:
> 
> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
>    See migration_call::CPU_DEAD

synchronize_sched_expedited() will not wait for CPU#0, because
migration_call()::case CPU_DEAD wakes up the requestors.

migration_call()
{
	...
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		...
		/*
		 * No need to migrate the tasks: it was best-effort if
		 * they didn't take sched_hotcpu_mutex. Just wake up
		 * the requestors.
		 */
		spin_lock_irq(&rq->lock);
		while (!list_empty(&rq->migration_queue)) {
			struct migration_req *req;

			req = list_entry(rq->migration_queue.next,
					 struct migration_req, list);
			list_del_init(&req->list);
			spin_unlock_irq(&rq->lock);
			complete(&req->done);
			spin_lock_irq(&rq->lock);
		}
		spin_unlock_irq(&rq->lock);
		...
	...
}

My approach depends on the requestors being woken up in any case.
migration_call() does that for us, but the coupling is greatly
increased.

Lai

Paul E. McKenney May 27, 2009, 4:27 a.m. UTC | #10
On Tue, May 26, 2009 at 09:47:26PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, May 26, 2009 at 12:41:29PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > > > > Paul E. McKenney wrote:
> > > > > > > 
> > > > > > > Good point -- I should at the very least add a comment to
> > > > > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > > > > causes any problems, then your approach seems like a promising fix.
> > > > > > 
> > > > > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > > > 
> > > > > Thank you very much for your review and comments!!!
> > > > > 
> > > > > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > > > > >> is largely increased:
> > > > > > >>
> > > > > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > > > > >>    See migration_call::CPU_DEAD
> > > > > > > 
> > > > > > > Good.  ;-)
> > > > > > > 
> > > > > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > > > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > > > > >>    It'll not cause DEADLOCK.
> > > > > > > 
> > > > > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > > > > missing something, the current get_online_cpus() approach would deadlock
> > > > > > > in this case.
> > > > > > 
> > > > > > Yes, I mean if using my preempt_disable() approach. The current
> > > > > > get_online_cpus() approach would NOT deadlock in this case also,
> > > > > > we can require get_online_cpus() in cpu notifiers.
> > > > > 
> > > > > I have added the comment for the time being, but should people need to
> > > > > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > > > > approach looks to be a promising fix.
> > > > 
> > > > I looked more closely at your preempt_disable() suggestion, which you
> > > > presented earlier as follows:
> > > > 
> > > > > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > > > > (and use preempt_disable() and for_each_possible_cpu())
> > > > > 
> > > > > req->dest_cpu = -2 means @req is not queued
> > > > > req->dest_cpu = -1 means @req is queued
> > > > > 
> > > > > a little like this code:
> > > > > 
> > > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > > 	for_each_possible_cpu(cpu) {
> > > > > 		preempt_disable()
> > > > > 		if (cpu is not online)
> > > > > 			just set req->dest_cpu to -2;
> > > > > 		else
> > > > > 			init and queue req, and wake_up_process().
> > > > > 		preempt_enable()
> > > > > 	}
> > > > > 	for_each_possible_cpu(cpu) {
> > > > > 		if (req is queued)
> > > > > 			wait_for_completion().
> > > > > 	}
> > > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > > 
> > > > I am concerned about the following sequence of events:
> > > > 
> > > > o	synchronize_sched_expedited() disables preemption, thus blocking
> > > > 	offlining operations.
> > > > 
> > > > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > > > 	and proceeds, and is now waiting for preemption to be enabled.
> > > > 
> > > > o	synchronize_sched_expedited() disables preemption, sees
> > > > 	that CPU 0 is online, so initializes and queues a request,
> > > > 	does a wake-up-process(), and finally does a preempt_enable().
> > > > 
> > > > o	CPU 0 is currently running a high-priority real-time process,
> > > > 	so the wakeup does not immediately happen.
> > > > 
> > > > o	The offlining process completes, including the kthread_stop()
> > > > 	to the migration task.
> > > > 
> > > > o	The migration task wakes up, sees kthread_should_stop(),
> > > > 	and so exits without checking its queue.
> > > > 
> > > > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > > > 
> > > > I suppose that one way to handle this would be to check for the CPU
> > > > going offline before doing the wait_for_completion(), but I am concerned
> > > > about races affecting this check as well.
> > > > 
> > > > Or is there something in the CPU-offline process that makes the above
> > > > sequence of events impossible?
> > > > 
> > > 
> > > I think you are right, there is a problem there. The simple fact that
> > > this needs to disable preemption to protect against cpu hotplug seems a
> > > bit strange. If I may propose an alternate solution, which assumes that
> > > threads pinned to a CPU are migrated to a different CPU when a CPU goes
> > > offline (and will therefore execute anyway), and that a CPU brought
> > > online after the first iteration on online cpus was already quiescent
> > > (hopefully my assumptions are right). Preemption is left enabled during
> > > all the critical section.
> > > 
> > > It looks a lot like Lai's approach, except that I use a cpumask (I
> > > thought it looked cleaner and typically involves less operations than
> > > looping on each possible cpu). I also don't disable preemption and
> > > assume that cpu hotplug can happen at any point during this critical
> > > section.
> > > 
> > > Something along the lines of :
> > > 
> > > static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
> > > const struct cpumask *const cpu_wait_expedited_mask =
> > > 			to_cpumask(cpu_wait_expedited_bits);
> > > 
> > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > 	cpumask_clear(cpu_wait_expedited_mask);
> > > 	for_each_online_cpu(cpu) {
> > > 		init and queue cpu req, and wake_up_process().
> > > 		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
> > > 	}
> > > 	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
> > > 		wait_for_completion(cpu req);
> > > 	}
> > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > 
> > > There is one concern with this approach : if a CPU is hotunplugged and
> > > hotplugged during the critical section, I think the scheduler would
> > > migrate the thread to a different CPU (upon hotunplug) and let the
> > > thread run on this other CPU. If the target CPU is hotplugged again,
> > > this would mean the thread would have run on a different CPU than the
> > > target. I think we can argue that a CPU going offline and online again
> > > will meet quiescent state requirements, so this should not be a problem.
> > 
> > Having the task runnable on some other CPU is very scary to me.  If the
> > CPU comes back online, and synchronize_sched_expedited() manages to
> > run before the task gets migrated back onto that CPU, then the grace
> > period could be ended too soon.
> > 
> 
> Well, the idea is that we want all in-flight preempt off sections (as
> seen at the beginning of synchronize_sched_expedited()) to be over
> before we consider the grace period as ended, right ?
> 
> Let's say we read the cpu online mask at a given time (potentially non
> atomically, we don't really care).
> 
> If, at any point in time while we read the cpu online mask, a CPU
> appears to be offline, this means that it cannot hold any in-flight
> preempt off section.
> 
> Even if that specific CPU comes back online after this moment, and
> starts scheduling threads again, these threads cannot ever possibly be
> in-flight in the old grace period.
> 
> Therefore, my argument is that for rcu_sched (classic rcu), a CPU going
> back online while we wait for quiescent state cannot possibly ever start
> running a thread in the previous grace period.
> 
> My second argument is that if a CPU is hotunplugging while we wait for
> QS, either :
> 
> - It lets the completion thread run before it goes offline. That's fine
> - It goes offline and the completion thread is migrated to another CPU.
>   This will just make synchronize_sched_expedited() wait for one more
>   completion that will execute on the CPU the thread has migrated to.
>   Again, we don't care.
> - It goes offline/online/offline/online/... : We go back to my first
>   argument, which states that if a CPU is out of the cpu online mask at
>   any given time after we started the synchronize_sched_expedited()
>   execution, it cannot possibly hold an in-flight preempt off section
>   belonging to the old GP.
> 
> Or am I missing something ?

I am worried (perhaps unnecessarily) about the CPU coming online,
its kthread still running on some other CPU, someone doing a
synchronize_sched_expedited(), which then might possibly complete before
the kthread migrates back where it belongs.  If the newly onlined CPU is
in an extended RCU read-side critical section, we might end the expedited
grace period too soon.

My turn.  Am I missing something?  ;-)

						Thanx, Paul

> Mathieu
> 
> 
> > All of this is intended to make synchronize_sched_expedited() be able to
> > run in a CPU hotplug notifier.  Do we have an example where someone
> > really wants to do this?  If not, I am really starting to like v7 of
> > the patch.  ;-)
> > 
> > If someone really does need to run synchronize_sched_expedited() from a
> > CPU hotplug notifier, perhaps a simpler approach is to have something
> > like a try_get_online_cpus(), and just invoke synchronize_sched() upon
> > failure:
> > 
> > 	void synchronize_sched_expedited(void)
> > 	{
> > 		int cpu;
> > 		unsigned long flags;
> > 		struct rq *rq;
> > 		struct migration_req *req;
> > 
> > 		mutex_lock(&rcu_sched_expedited_mutex);
> > 		if (!try_get_online_cpus()) {
> > 			synchronize_sched();
> > 			return;
> > 		}
> > 
> > 		/* rest of synchronize_sched_expedited()... */
> > 
> > But I would want to see a real need for this beforehand.
> > 
> > 							Thanx, Paul
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
Paul E. McKenney May 27, 2009, 4:30 a.m. UTC | #11
On Wed, May 27, 2009 at 09:57:19AM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > 
> > I am concerned about the following sequence of events:
> > 
> > o	synchronize_sched_expedited() disables preemption, thus blocking
> > 	offlining operations.
> > 
> > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > 	and proceeds, and is now waiting for preemption to be enabled.
> > 
> > o	synchronize_sched_expedited() disables preemption, sees
> > 	that CPU 0 is online, so initializes and queues a request,
> > 	does a wake-up-process(), and finally does a preempt_enable().
> > 
> > o	CPU 0 is currently running a high-priority real-time process,
> > 	so the wakeup does not immediately happen.
> > 
> > o	The offlining process completes, including the kthread_stop()
> > 	to the migration task.
> > 
> > o	The migration task wakes up, sees kthread_should_stop(),
> > 	and so exits without checking its queue.
> > 
> > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > 
> > I suppose that one way to handle this would be to check for the CPU
> > going offline before doing the wait_for_completion(), but I am concerned
> > about races affecting this check as well.
> > 
> > Or is there something in the CPU-offline process that makes the above
> > sequence of events impossible?
> > 
> > 							Thanx, Paul
> > 
> > 
> 
> I realized this; that is why I wrote the following:
> > 
> > The coupling of synchronize_sched_expedited() and migration_req
> > is largely increased:
> > 
> > 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> >    See migration_call::CPU_DEAD
> 
> synchronize_sched_expedited() will not wait for CPU#0, because
> migration_call()::case CPU_DEAD wakes up the requestors.
> 
> migration_call()
> {
> 	...
> 	case CPU_DEAD:
> 	case CPU_DEAD_FROZEN:
> 		...
> 		/*
> 		 * No need to migrate the tasks: it was best-effort if
> 		 * they didn't take sched_hotcpu_mutex. Just wake up
> 		 * the requestors.
> 		 */
> 		spin_lock_irq(&rq->lock);
> 		while (!list_empty(&rq->migration_queue)) {
> 			struct migration_req *req;
> 
> 			req = list_entry(rq->migration_queue.next,
> 					 struct migration_req, list);
> 			list_del_init(&req->list);
> 			spin_unlock_irq(&rq->lock);
> 			complete(&req->done);
> 			spin_lock_irq(&rq->lock);
> 		}
> 		spin_unlock_irq(&rq->lock);
> 		...
> 	...
> }
> 
> My approach depends on the requestors being woken up in any case.
> migration_call() does that for us, but the coupling is greatly
> increased.

OK, good point!  I do need to think about this.

In the meantime, where do you see a need to run
synchronize_sched_expedited() from within a hotplug CPU notifier?

						Thanx, Paul
Lai Jiangshan May 27, 2009, 5:37 a.m. UTC | #12
Paul E. McKenney wrote:
> OK, good point!  I do need to think about this.
> 
> In the meantime, where do you see a need to run
> synchronize_sched_expedited() from within a hotplug CPU notifier?
> 
> 						Thanx, Paul
> 

I don't worry about synchronize_sched_expedited() being called
from within a CPU hotplug notifier:

1st: synchronize_sched_expedited() is new; nobody calls it yet.
2nd: get_online_cpus() will not cause a DEADLOCK in a CPU notifier:
	when get_online_cpus() finds that it already owns cpu_hotplug.lock,
	it will not take it again (roughly as sketched below).
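
For reference, get_online_cpus() in kernels of this era looks roughly
like the following (from kernel/cpu.c, quoted from memory, so details
may differ):

	void get_online_cpus(void)
	{
		might_sleep();
		if (cpu_hotplug.active_writer == current)
			return;	/* notifier runs in the hotplug writer's context */
		mutex_lock(&cpu_hotplug.lock);
		cpu_hotplug.refcount++;
		mutex_unlock(&cpu_hotplug.lock);
	}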

What I worry about is a DEADLOCK like this (an ABBA DEADLOCK):
> get_online_cpus() is a large lock, a lot's of lock in kernel is required
> after cpu_hotplug.lock.
> 
> _cpu_down()
> 	cpu_hotplug_begin()
> 		mutex_lock(&cpu_hotplug.lock)
> 	__raw_notifier_call_chain(CPU_DOWN_PREPARE)
> 		Lock a-kernel-lock.
> 
> It means when we have held a-kernel-lock, we can not call
> synchronize_sched_expedited(). get_online_cpus() narrows
> synchronize_sched_expedited()'s usages.

One thread calls _cpu_down(), which does "mutex_lock(&cpu_hotplug.lock)"
and then does "Lock a-kernel-lock"; the other thread calls
synchronize_sched_expedited() with a-kernel-lock held, and an
ABBA DEADLOCK happens:

thread 1                            |        thread 2
_cpu_down()                         |    Lock a-kernel-lock. 
  mutex_lock(&cpu_hotplug.lock)     |    synchronize_sched_expedited()
------------------------------------------------------------------------
  Lock a-kernel-lock.(wait thread2) |       mutex_lock(&cpu_hotplug.lock)
                                            (wait thread 1)


cpuset_lock() is an example of the a-kernel-lock described above:
cpuset_lock() is acquired in a CPU notifier.

But some work in cpuset needs get_online_cpus().
(That work does cpuset_lock() and then get_online_cpus(), and we
cannot release cpuset_lock() temporarily.)

The fix is to move that work to a workqueue, so that the order
becomes get_online_cpus() and then cpuset_lock().
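
A minimal sketch of that fix (the work item and its names are
hypothetical; only the lock ordering matters):

	/* Take the locks in the safe order: CPU hotplug first, cpuset second. */
	static void cpuset_hotplug_work_fn(struct work_struct *unused)
	{
		get_online_cpus();
		cpuset_lock();
		/* ... do the cpuset work that needs both locks ... */
		cpuset_unlock();
		put_online_cpus();
	}
	static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_work_fn);

	/* Callers defer the work instead of doing it inline: */
	schedule_work(&cpuset_hotplug_work);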

Thanx.
Lai

Mathieu Desnoyers May 27, 2009, 2:45 p.m. UTC | #13
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, May 26, 2009 at 09:47:26PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Tue, May 26, 2009 at 12:41:29PM -0400, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > > > > > Paul E. McKenney wrote:
> > > > > > > > 
> > > > > > > > Good point -- I should at the very least add a comment to
> > > > > > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > > > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > > > > > causes any problems, then your approach seems like a promising fix.
> > > > > > > 
> > > > > > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > > > > 
> > > > > > Thank you very much for your review and comments!!!
> > > > > > 
> > > > > > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > > > > > >> is largely increased:
> > > > > > > >>
> > > > > > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > > > > > >>    See migration_call::CPU_DEAD
> > > > > > > > 
> > > > > > > > Good.  ;-)
> > > > > > > > 
> > > > > > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > > > > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > > > > > >>    It'll not cause DEADLOCK.
> > > > > > > > 
> > > > > > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > > > > > missing something, the current get_online_cpus() approach would deadlock
> > > > > > > > in this case.
> > > > > > > 
> > > > > > > Yes, I mean if using my preempt_disable() approach. The current
> > > > > > > get_online_cpus() approach would NOT deadlock in this case also,
> > > > > > > we can require get_online_cpus() in cpu notifiers.
> > > > > > 
> > > > > > I have added the comment for the time being, but should people need to
> > > > > > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > > > > > approach looks to be a promising fix.
> > > > > 
> > > > > I looked more closely at your preempt_disable() suggestion, which you
> > > > > presented earlier as follows:
> > > > > 
> > > > > > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > > > > > (and use preempt_disable() and for_each_possible_cpu())
> > > > > > 
> > > > > > req->dest_cpu = -2 means @req is not queued
> > > > > > req->dest_cpu = -1 means @req is queued
> > > > > > 
> > > > > > a little like this code:
> > > > > > 
> > > > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > > > 	for_each_possible_cpu(cpu) {
> > > > > > 		preempt_disable()
> > > > > > 		if (cpu is not online)
> > > > > > 			just set req->dest_cpu to -2;
> > > > > > 		else
> > > > > > 			init and queue req, and wake_up_process().
> > > > > > 		preempt_enable()
> > > > > > 	}
> > > > > > 	for_each_possible_cpu(cpu) {
> > > > > > 		if (req is queued)
> > > > > > 			wait_for_completion().
> > > > > > 	}
> > > > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > > > 
> > > > > I am concerned about the following sequence of events:
> > > > > 
> > > > > o	synchronize_sched_expedited() disables preemption, thus blocking
> > > > > 	offlining operations.
> > > > > 
> > > > > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > > > > 	and proceeds, and is now waiting for preemption to be enabled.
> > > > > 
> > > > > o	synchronize_sched_expedited() disables preemption, sees
> > > > > 	that CPU 0 is online, so initializes and queues a request,
> > > > > 	does a wake-up-process(), and finally does a preempt_enable().
> > > > > 
> > > > > o	CPU 0 is currently running a high-priority real-time process,
> > > > > 	so the wakeup does not immediately happen.
> > > > > 
> > > > > o	The offlining process completes, including the kthread_stop()
> > > > > 	to the migration task.
> > > > > 
> > > > > o	The migration task wakes up, sees kthread_should_stop(),
> > > > > 	and so exits without checking its queue.
> > > > > 
> > > > > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > > > > 
> > > > > I suppose that one way to handle this would be to check for the CPU
> > > > > going offline before doing the wait_for_completion(), but I am concerned
> > > > > about races affecting this check as well.
> > > > > 
> > > > > Or is there something in the CPU-offline process that makes the above
> > > > > sequence of events impossible?
> > > > > 
> > > > 
> > > > I think you are right, there is a problem there. The simple fact that
> > > > this needs to disable preemption to protect against cpu hotplug seems a
> > > > bit strange. If I may propose an alternate solution, which assumes that
> > > > threads pinned to a CPU are migrated to a different CPU when a CPU goes
> > > > offline (and will therefore execute anyway), and that a CPU brought
> > > > online after the first iteration on online cpus was already quiescent
> > > > (hopefully my assumptions are right). Preemption is left enabled during
> > > > all the critical section.
> > > > 
> > > > It looks a lot like Lai's approach, except that I use a cpumask (I
> > > > thought it looked cleaner and typically involves less operations than
> > > > looping on each possible cpu). I also don't disable preemption and
> > > > assume that cpu hotplug can happen at any point during this critical
> > > > section.
> > > > 
> > > > Something along the lines of :
> > > > 
> > > > static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
> > > > const struct cpumask *const cpu_wait_expedited_mask =
> > > > 			to_cpumask(cpu_wait_expedited_bits);
> > > > 
> > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > 	cpumask_clear(cpu_wait_expedited_mask);
> > > > 	for_each_online_cpu(cpu) {
> > > > 		init and queue cpu req, and wake_up_process().
> > > > 		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
> > > > 	}
> > > > 	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
> > > > 		wait_for_completion(cpu req);
> > > > 	}
> > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > > 
> > > > There is one concern with this approach : if a CPU is hotunplugged and
> > > > hotplugged during the critical section, I think the scheduler would
> > > > migrate the thread to a different CPU (upon hotunplug) and let the
> > > > thread run on this other CPU. If the target CPU is hotplugged again,
> > > > this would mean the thread would have run on a different CPU than the
> > > > target. I think we can argue that a CPU going offline and online again
> > > > will meet quiescent state requirements, so this should not be a problem.
> > > 
> > > Having the task runnable on some other CPU is very scary to me.  If the
> > > CPU comes back online, and synchronize_sched_expedited() manages to
> > > run before the task gets migrated back onto that CPU, then the grace
> > > period could be ended too soon.
> > > 
> > 
> > Well, the idea is that we want all in-flight preempt off sections (as
> > seen at the beginning of synchronize_sched_expedited()) to be over
> > before we consider the grace period as ended, right ?
> > 
> > Let's say we read the cpu online mask at a given time (potentially non
> > atomically, we don't really care).
> > 
> > If, at any point in time while we read the cpu online mask, a CPU
> > appears to be offline, this means that it cannot hold any in-flight
> > preempt off section.
> > 
> > Even if that specific CPU comes back online after this moment, and
> > starts scheduling threads again, these threads cannot ever possibly be
> > in-flight in the old grace period.
> > 
> > Therefore, my argument is that for rcu_sched (classic rcu), a CPU going
> > back online while we wait for quiescent state cannot possibly ever start
> > running a thread in the previous grace period.
> > 
> > My second argument is that if a CPU is hotunplugging while we wait for
> > QS, either :
> > 
> > - It lets the completion thread run before it goes offline. That's fine
> > - It goes offline and the completion thread is migrated to another CPU.
> >   This will just make synchronize_sched_expedited() wait for one more
> >   completion that will execute on the CPU the thread has migrated to.
> >   Again, we don't care.
> > - It goes offline/online/offline/online/... : We go back to my first
> >   argument, which states that if a CPU is out of the cpu online mask at
> >   any given time after we started the synchronize_sched_expedited()
> >   execution, it cannot possibly hold an in-flight preempt off section
> >   belonging to the old GP.
> > 
> > Or am I missing something ?
> 
> I am worried (perhaps unnecessarily) about the CPU coming online,
> its kthread still running on some other CPU, someone doing a
> synchronize_sched_expedited(), which then might possibly complete before
> the kthread migrates back where it belongs.  If the newly onlined CPU is
> in an extended RCU read-side critical section, we might end the expedited
> grace period too soon.
> 
> My turn.  Am I missing something?  ;-)
> 

If the completion kthreads only live within the boundary of the 
rcu_sched_expedited_mutex critical section, we never face this problem.

Therefore, all kthreads created for a given expedited grace period
should be waited for before the rcu_sched_expedited_mutex is released.

Here, I don't know the specific behavior of the threads you are willing
to use, but I see two possibilities when facing cpu hotplug :

- Either those threads are always active in the system, _really_ tied to
  a CPU and the hotunplug event kills them and sends their completion.
  (this is I think what Lai described)
- Or we create them while holding the rcu_sched_expedited_mutex, pin
  them to a CPU. They are less strictly bound to a CPU and get migrated
  to a different CPU upon hotunplug. Note that their life-span is
  limited to the rcu_sched_expedited_mutex section, because we expect
  them to die after completion. This implies a thread creation overhead.

Hopefully one of the solutions I describe above matches the current
implementation. :)
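
To be concrete, here is a rough sketch of the second option (all names
are hypothetical, and the kthread_create()/kthread_stop() pairs are
exactly the thread-creation overhead I mention above):

	static DEFINE_PER_CPU(struct completion, rcu_exp_done);
	static DEFINE_PER_CPU(struct task_struct *, rcu_exp_task);

	static int rcu_exp_thread_fn(void *cpu_done)
	{
		/* Being scheduled on the target CPU is the quiescent state. */
		complete((struct completion *)cpu_done);
		while (!kthread_should_stop())
			schedule_timeout_interruptible(1);
		return 0;
	}

	/* Inside the rcu_sched_expedited_mutex critical section
	 * ("cpu" is an int; error handling omitted): */
	for_each_possible_cpu(cpu)
		per_cpu(rcu_exp_task, cpu) = NULL;
	for_each_online_cpu(cpu) {
		struct task_struct *t;

		init_completion(&per_cpu(rcu_exp_done, cpu));
		t = kthread_create(rcu_exp_thread_fn,
				   &per_cpu(rcu_exp_done, cpu),
				   "rcu_exp/%d", cpu);
		kthread_bind(t, cpu);
		per_cpu(rcu_exp_task, cpu) = t;
		wake_up_process(t);
	}
	for_each_possible_cpu(cpu) {
		if (!per_cpu(rcu_exp_task, cpu))
			continue;
		wait_for_completion(&per_cpu(rcu_exp_done, cpu));
		kthread_stop(per_cpu(rcu_exp_task, cpu));
	}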

Mathieu



> 						Thanx, Paul
> 
> > Mathieu
> > 
> > 
> > > All of this is intended to make synchronize_sched_expedited() be able to
> > > run in a CPU hotplug notifier.  Do we have an example where someone
> > > really wants to do this?  If not, I am really starting to like v7 of
> > > the patch.  ;-)
> > > 
> > > If someone really does need to run synchronize_sched_expedited() from a
> > > CPU hotplug notifier, perhaps a simpler approach is to have something
> > > like a try_get_online_cpus(), and just invoke synchronize_sched() upon
> > > failure:
> > > 
> > > 	void synchronize_sched_expedited(void)
> > > 	{
> > > 		int cpu;
> > > 		unsigned long flags;
> > > 		struct rq *rq;
> > > 		struct migration_req *req;
> > > 
> > > 		mutex_lock(&rcu_sched_expedited_mutex);
> > > 		if (!try_get_online_cpus()) {
> > > 			synchronize_sched();
> > > 			return;
> > > 		}
> > > 
> > > 		/* rest of synchronize_sched_expedited()... */
> > > 
> > > But I would want to see a real need for this beforehand.
> > > 
> > > 							Thanx, Paul
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
Paul E. McKenney May 28, 2009, 11:52 p.m. UTC | #14
On Wed, May 27, 2009 at 10:45:54AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, May 26, 2009 at 09:47:26PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Tue, May 26, 2009 at 12:41:29PM -0400, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Mon, May 25, 2009 at 06:28:43PM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, May 26, 2009 at 09:03:55AM +0800, Lai Jiangshan wrote:
> > > > > > > > Paul E. McKenney wrote:
> > > > > > > > > 
> > > > > > > > > Good point -- I should at the very least add a comment to
> > > > > > > > > synchronize_sched_expedited() stating that it cannot be called holding
> > > > > > > > > any lock that is acquired in a CPU hotplug notifier.  If this restriction
> > > > > > > > > causes any problems, then your approach seems like a promising fix.
> > > > > > > > 
> > > > > > > > Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > > > > > 
> > > > > > > Thank you very much for your review and comments!!!
> > > > > > > 
> > > > > > > > >> The coupling of synchronize_sched_expedited() and migration_req
> > > > > > > > >> is largely increased:
> > > > > > > > >>
> > > > > > > > >> 1) The offline cpu's per_cpu(rcu_migration_req, cpu) is handled.
> > > > > > > > >>    See migration_call::CPU_DEAD
> > > > > > > > > 
> > > > > > > > > Good.  ;-)
> > > > > > > > > 
> > > > > > > > >> 2) migration_call() is the highest priority of cpu notifiers,
> > > > > > > > >>    So even any other cpu notifier calls synchronize_sched_expedited(),
> > > > > > > > >>    It'll not cause DEADLOCK.
> > > > > > > > > 
> > > > > > > > > You mean if using your preempt_disable() approach, right?  Unless I am
> > > > > > > > > missing something, the current get_online_cpus() approach would deadlock
> > > > > > > > > in this case.
> > > > > > > > 
> > > > > > > > Yes, I mean if using my preempt_disable() approach. The current
> > > > > > > > get_online_cpus() approach would NOT deadlock in this case also,
> > > > > > > > we can require get_online_cpus() in cpu notifiers.
> > > > > > > 
> > > > > > > I have added the comment for the time being, but should people need to
> > > > > > > use this in CPU-hotplug notifiers, then again your preempt_disable()
> > > > > > > approach looks to be a promising fix.
> > > > > > 
> > > > > > I looked more closely at your preempt_disable() suggestion, which you
> > > > > > presented earlier as follows:
> > > > > > 
> > > > > > > I think we can reuse req->dest_cpu and remove get_online_cpus().
> > > > > > > (and use preempt_disable() and for_each_possible_cpu())
> > > > > > > 
> > > > > > > req->dest_cpu = -2 means @req is not queued
> > > > > > > req->dest_cpu = -1 means @req is queued
> > > > > > > 
> > > > > > > a little like this code:
> > > > > > > 
> > > > > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > > > > 	for_each_possible_cpu(cpu) {
> > > > > > > 		preempt_disable()
> > > > > > > 		if (cpu is not online)
> > > > > > > 			just set req->dest_cpu to -2;
> > > > > > > 		else
> > > > > > > 			init and queue req, and wake_up_process().
> > > > > > > 		preempt_enable()
> > > > > > > 	}
> > > > > > > 	for_each_possible_cpu(cpu) {
> > > > > > > 		if (req is queued)
> > > > > > > 			wait_for_completion().
> > > > > > > 	}
> > > > > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > > > > 
> > > > > > I am concerned about the following sequence of events:
> > > > > > 
> > > > > > o	synchronize_sched_expedited() disables preemption, thus blocking
> > > > > > 	offlining operations.
> > > > > > 
> > > > > > o	CPU 1 starts offlining CPU 0.  It acquires the CPU-hotplug lock,
> > > > > > 	and proceeds, and is now waiting for preemption to be enabled.
> > > > > > 
> > > > > > o	synchronize_sched_expedited() disables preemption, sees
> > > > > > 	that CPU 0 is online, so initializes and queues a request,
> > > > > > 	does a wake-up-process(), and finally does a preempt_enable().
> > > > > > 
> > > > > > o	CPU 0 is currently running a high-priority real-time process,
> > > > > > 	so the wakeup does not immediately happen.
> > > > > > 
> > > > > > o	The offlining process completes, including the kthread_stop()
> > > > > > 	to the migration task.
> > > > > > 
> > > > > > o	The migration task wakes up, sees kthread_should_stop(),
> > > > > > 	and so exits without checking its queue.
> > > > > > 
> > > > > > o	synchronize_sched_expedited() waits forever for CPU 0 to respond.
> > > > > > 
> > > > > > I suppose that one way to handle this would be to check for the CPU
> > > > > > going offline before doing the wait_for_completion(), but I am concerned
> > > > > > about races affecting this check as well.
> > > > > > 
> > > > > > Or is there something in the CPU-offline process that makes the above
> > > > > > sequence of events impossible?
> > > > > > 
> > > > > 
> > > > > I think you are right, there is a problem there. The simple fact that
> > > > > this needs to disable preemption to protect against cpu hotplug seems a
> > > > > bit strange. If I may propose an alternate solution, which assumes that
> > > > > threads pinned to a CPU are migrated to a different CPU when a CPU goes
> > > > > offline (and will therefore execute anyway), and that a CPU brought
> > > > > online after the first iteration on online cpus was already quiescent
> > > > > (hopefully my assumptions are right). Preemption is left enabled during
> > > > > all the critical section.
> > > > > 
> > > > > It looks a lot like Lai's approach, except that I use a cpumask (I
> > > > > thought it looked cleaner and typically involves less operations than
> > > > > looping on each possible cpu). I also don't disable preemption and
> > > > > assume that cpu hotplug can happen at any point during this critical
> > > > > section.
> > > > > 
> > > > > Something along the lines of :
> > > > > 
> > > > > static DECLARE_BITMAP(cpu_wait_expedited_bits, CONFIG_NR_CPUS);
> > > > > const struct cpumask *const cpu_wait_expedited_mask =
> > > > > 			to_cpumask(cpu_wait_expedited_bits);
> > > > > 
> > > > > 	mutex_lock(&rcu_sched_expedited_mutex);
> > > > > 	cpumask_clear(cpu_wait_expedited_mask);
> > > > > 	for_each_online_cpu(cpu) {
> > > > > 		init and queue cpu req, and wake_up_process().
> > > > > 		cpumask_set_cpu(cpu, cpu_wait_expedited_mask);
> > > > > 	}
> > > > > 	for_each_cpu_mask(cpu, cpu_wait_expedited_mask) {
> > > > > 		wait_for_completion(cpu req);
> > > > > 	}
> > > > > 	mutex_unlock(&rcu_sched_expedited_mutex);
> > > > > 
> > > > > There is one concern with this approach : if a CPU is hotunplugged and
> > > > > hotplugged during the critical section, I think the scheduler would
> > > > > migrate the thread to a different CPU (upon hotunplug) and let the
> > > > > thread run on this other CPU. If the target CPU is hotplugged again,
> > > > > this would mean the thread would have run on a different CPU than the
> > > > > target. I think we can argue that a CPU going offline and online again
> > > > > will meet quiescent state requirements, so this should not be a problem.
> > > > 
> > > > Having the task runnable on some other CPU is very scary to me.  If the
> > > > CPU comes back online, and synchronize_sched_expedited() manages to
> > > > run before the task gets migrated back onto that CPU, then the grace
> > > > period could be ended too soon.
> > > > 
> > > 
> > > Well, the idea is that we want all in-flight preempt off sections (as
> > > seen at the beginning of synchronize_sched_expedited()) to be over
> > > before we consider the grace period as ended, right ?
> > > 
> > > Let's say we read the cpu online mask at a given time (potentially non
> > > atomically, we don't really care).
> > > 
> > > If, at any point in time while we read the cpu online mask, a CPU
> > > appears to be offline, this means that it cannot hold any in-flight
> > > preempt off section.
> > > 
> > > Even if that specific CPU comes back online after this moment, and
> > > starts scheduling threads again, these threads cannot ever possibly be
> > > in-flight in the old grace period.
> > > 
> > > Therefore, my argument is that for rcu_sched (classic rcu), a CPU going
> > > back online while we wait for quiescent state cannot possibly ever start
> > > running a thread in the previous grace period.
> > > 
> > > My second argument is that if a CPU is hotunplugging while we wait for
> > > QS, either :
> > > 
> > > - It lets the completion thread run before it goes offline. That's fine
> > > - It goes offline and the completion thread is migrated to another CPU.
> > >   This will just make synchronize_sched_expedited() wait for one more
> > >   completion that will execute on the CPU the thread has migrated to.
> > >   Again, we don't care.
> > > - It goes offline/online/offline/online/... : We go back to my first
> > >   argument, which states that if a CPU is out of the cpu online mask at
> > >   any given time after we started the synchronize_sched_expedited()
> > >   execution, it cannot possibly hold an in-flight preempt off section
> > >   belonging to the old GP.
> > > 
> > > Or am I missing something ?
> > 
> > I am worried (perhaps unnecessarily) about the CPU coming online,
> > its kthread still running on some other CPU, someone doing a
> > synchronize_sched_expedited(), which then might possibly complete before
> > the kthread migrates back where it belongs.  If the newly onlined CPU is
> > in an extended RCU read-side critical section, we might end the expedited
> > grace period too soon.
> > 
> > My turn.  Am I missing something?  ;-)
> > 
> 
> If the completion kthreads only live within the boundary of the 
> rcu_sched_expedited_mutex critical section, we never face this problem.
> 
> Therefore, all kthreads created for a given expedited grace period
> should be waited for before the rcu_sched_expedited_mutex is released.

True, but the overhead of creating (say) 100 kthreads is not so consistent
with the "expedited" in synchronize_sched_expedited().

> Here, I don't know the specific behavior of the threads you are willing
> to use, but I see two possibilities when facing cpu hotplug :
> 
> - Either those threads are always active in the system, _really_ tied to
>   a CPU and the hotunplug event kills them and sends their completion.
>   (this is I think what Lai described)

Yep.  And this is what v7 of the patch did.  Lai would like me to get
rid of the get_online_cpus(), replacing it with preempt_disable().
Which I find to be incredibly scary, given the possibility of the
kthread somehow executing on some other CPU, thus breaking RCU.

But see my next reply to Lai.

> - Or we create them while holding the rcu_sched_expedited_mutex, pin
>   them to a CPU. They are less strictly bound to a CPU and get migrated
>   to a different CPU upon hotunplug. Note that their life-span is
>   limited to the rcu_sched_expedited_mutex section, because we expect
>   them to die after completion. This implies a thread creation overhead.

And the thread-creation/destruction overhead is fatal, I believe.

> Hopefully one of the solutions I describe above matches the current
> implementation. :)

#1.  ;-)

							Thanx, Paul

> Mathieu
> 
> 
> 
> > 						Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > 
> > > > All of this is intended to make synchronize_sched_expedited() be able to
> > > > run in a CPU hotplug notifier.  Do we have an example where someone
> > > > really wants to do this?  If not, I am really starting to like v7 of
> > > > the patch.  ;-)
> > > > 
> > > > If someone really does need to run synchronize_sched_expedited() from a
> > > > CPU hotplug notifier, perhaps a simpler approach is to have something
> > > > like a try_get_online_cpus(), and just invoke synchronize_sched() upon
> > > > failure:
> > > > 
> > > > 	void synchronize_sched_expedited(void)
> > > > 	{
> > > > 		int cpu;
> > > > 		unsigned long flags;
> > > > 		struct rq *rq;
> > > > 		struct migration_req *req;
> > > > 
> > > > 		mutex_lock(&rcu_sched_expedited_mutex);
> > > > 		if (!try_get_online_cpus()) {
> > > > 			synchronize_sched();
> > > > 			return;
> > > > 		}
> > > > 
> > > > 		/* rest of synchronize_sched_expedited()... */
> > > > 
> > > > But I would want to see a real need for this beforehand.
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
Paul E. McKenney May 29, 2009, 12:08 a.m. UTC | #15
On Wed, May 27, 2009 at 01:37:03PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > OK, good point!  I do need to think about this.
> > 
> > In the meantime, where do you see a need to run
> > synchronize_sched_expedited() from within a hotplug CPU notifier?
> > 
> > 						Thanx, Paul
> > 
> 
> I don't worry about synchronize_sched_expedited() being called
> from within a CPU hotplug notifier:
> 
> 1st: synchronize_sched_expedited() is new; nobody calls it yet.
> 2nd: get_online_cpus() will not cause a DEADLOCK in a CPU notifier:
> 	when get_online_cpus() finds that it already owns cpu_hotplug.lock,
> 	it will not take it again.
> 
> What I worry about is a DEADLOCK like this (an ABBA DEADLOCK):

Good point -- you had in fact mentioned this earlier.

> > get_online_cpus() is a large lock, a lot's of lock in kernel is required
> > after cpu_hotplug.lock.
> > 
> > _cpu_down()
> > 	cpu_hotplug_begin()
> > 		mutex_lock(&cpu_hotplug.lock)
> > 	__raw_notifier_call_chain(CPU_DOWN_PREPARE)
> > 		Lock a-kernel-lock.
> > 
> > It means when we have held a-kernel-lock, we can not call
> > synchronize_sched_expedited(). get_online_cpus() narrows
> > synchronize_sched_expedited()'s usages.
> 
> One thread calls _cpu_down(), which does "mutex_lock(&cpu_hotplug.lock)"
> and then does "Lock a-kernel-lock"; the other thread calls
> synchronize_sched_expedited() with a-kernel-lock held, and an
> ABBA DEADLOCK happens:
> 
> thread 1                            |        thread 2
> _cpu_down()                         |    Lock a-kernel-lock. 
>   mutex_lock(&cpu_hotplug.lock)     |    synchronize_sched_expedited()
> ------------------------------------------------------------------------
>   Lock a-kernel-lock.(wait thread2) |       mutex_lock(&cpu_hotplug.lock)
>                                             (wait thread 1)
> 
> 
> cpuset_lock() is an example of the a-kernel-lock described above:
> cpuset_lock() is acquired in a CPU notifier.
> 
> But some work in cpuset needs get_online_cpus().
> (That work does cpuset_lock() and then get_online_cpus(), and we
> cannot release cpuset_lock() temporarily.)
> 
> The fix is to move that work to a workqueue, so that the order
> becomes get_online_cpus() and then cpuset_lock().

But there is another way.

Continue to use the migration kthreads, given that they already exist,
already are created and destroyed by CPU hotplug operations, and given
that they run at maximal priority.

My main concern with moving from get_online_cpus() to preempt_disable()
has been the thought that somehow, sometime, in some future release
of Linux, it will be possible for the migration threads to execute
on the wrong CPU, perhaps only occasionally and perhaps only for
very short time periods.  If this were to happen, there would be the
possibility that the grace period would end too soon, which would be
silently fatal.  My fingers simply refused to code something with this
potential vulnerability.

But it is easy to insert a check into migration_thread() to see if it
is running on the wrong CPU.  If it is, I can do a WARN_ONCE() and also
set a state variable to tell synchronize_sched_expedited() to invoke
synchronize_sched(), thus avoiding messing up RCU.  On the next call
to synchronize_sched_expedited(), it would again try relying on the
migration threads.
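
Something like the following, as a rough sketch only (the flag name and
the exact placement of the checks are guesses at this point, not
necessarily what the next version of the patch will look like):

	/* Hypothetical flag: set if a migration kthread is ever caught
	 * running on the wrong CPU. */
	static int rcu_expedited_need_fallback;

	/* In migration_thread(), where "cpu" is the CPU it is bound to: */
	if (unlikely(raw_smp_processor_id() != cpu)) {
		WARN_ONCE(1, "migration/%d ran on CPU %d\n",
			  cpu, raw_smp_processor_id());
		rcu_expedited_need_fallback = 1;
	}

	/* In synchronize_sched_expedited(), after waiting for the kthreads: */
	if (unlikely(rcu_expedited_need_fallback)) {
		rcu_expedited_need_fallback = 0;  /* retry the kthreads next time */
		synchronize_sched();  /* don't trust this expedited round */
	}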

I am putting together yet another patch, but constructed along these
lines, and will let you know how it turns out.

							Thanx, Paul
diff mbox

Patch

diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index bfd92e1..b4b4fe1 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -158,14 +158,27 @@  extern struct lockdep_map rcu_lock_map;
 
 #define call_rcu_sched(head, func) call_rcu(head, func)
 
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
 extern void __rcu_init(void);
-#define rcu_init_sched()	do { } while (0)
 extern void rcu_check_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
 
 extern long rcu_batches_completed(void);
 extern long rcu_batches_completed_bh(void);
 
+static inline void rcu_init_sched(void)
+{
+}
+
 #define rcu_enter_nohz()	do { } while (0)
 #define rcu_exit_nohz()		do { } while (0)
 
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 15fbb3c..d67dfce 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -51,7 +51,18 @@  struct rcu_head {
 	void (*func)(struct rcu_head *head);
 };
 
-/* Internal to kernel, but needed by rcupreempt.h. */
+/* Exported common interfaces */
+extern void synchronize_rcu(void);
+extern void rcu_barrier(void);
+extern void rcu_barrier_bh(void);
+extern void rcu_barrier_sched(void);
+extern void synchronize_sched_expedited(void);
+extern int sched_expedited_torture_stats(char *page);
+
+/* Internal to kernel */
+extern void rcu_init(void);
+extern void rcu_scheduler_starting(void);
+extern int rcu_needs_cpu(int cpu);
 extern int rcu_scheduler_active;
 
 #if defined(CONFIG_CLASSIC_RCU)
@@ -259,15 +270,4 @@  extern void call_rcu(struct rcu_head *head,
 extern void call_rcu_bh(struct rcu_head *head,
 			void (*func)(struct rcu_head *head));
 
-/* Exported common interfaces */
-extern void synchronize_rcu(void);
-extern void rcu_barrier(void);
-extern void rcu_barrier_bh(void);
-extern void rcu_barrier_sched(void);
-
-/* Internal to kernel */
-extern void rcu_init(void);
-extern void rcu_scheduler_starting(void);
-extern int rcu_needs_cpu(int cpu);
-
 #endif /* __LINUX_RCUPDATE_H */
diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h
index fce5227..78117ed 100644
--- a/include/linux/rcupreempt.h
+++ b/include/linux/rcupreempt.h
@@ -74,6 +74,16 @@  extern int rcu_needs_cpu(int cpu);
 
 extern void __synchronize_sched(void);
 
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
+{
+	synchronize_rcu();  /* Placeholder for new rcupreempt implementation. */
+}
+
 extern void __rcu_init(void);
 extern void rcu_init_sched(void);
 extern void rcu_check_callbacks(int cpu, int user);
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 58b2aa5..41c23cb 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -279,8 +279,14 @@  static inline void __rcu_read_unlock_bh(void)
 
 #define call_rcu_sched(head, func) call_rcu(head, func)
 
-static inline void rcu_init_sched(void)
+static inline void synchronize_rcu_expedited(void)
+{
+	synchronize_sched_expedited();
+}
+
+static inline void synchronize_rcu_bh_expedited(void)
 {
+	synchronize_sched_expedited();
 }
 
 extern void __rcu_init(void);
@@ -290,6 +296,10 @@  extern void rcu_restart_cpu(int cpu);
 extern long rcu_batches_completed(void);
 extern long rcu_batches_completed_bh(void);
 
+static inline void rcu_init_sched(void)
+{
+}
+
 #ifdef CONFIG_NO_HZ
 void rcu_enter_nohz(void);
 void rcu_exit_nohz(void);
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a967c9f..454995f 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -45,6 +45,8 @@ 
 #include <linux/mutex.h>
 #include <linux/module.h>
 #include <linux/kernel_stat.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
 
 enum rcu_barrier {
 	RCU_BARRIER_STD,
@@ -98,6 +100,30 @@  void synchronize_rcu(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+/**
+ * synchronize_rcu_bh - wait until an rcu_bh grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full rcu_bh grace
+ * period has elapsed, in other words after all currently executing rcu_bh
+ * read-side critical sections have completed.  RCU read-side critical
+ * sections are delimited by rcu_read_lock_bh() and rcu_read_unlock_bh(),
+ * and may be nested.
+ */
+void synchronize_rcu_bh(void)
+{
+	struct rcu_synchronize rcu;
+
+	if (rcu_blocking_is_gp())
+		return;
+
+	init_completion(&rcu.completion);
+	/* Will wake me after RCU finished. */
+	call_rcu_bh(&rcu.head, wakeme_after_rcu);
+	/* Wait for it. */
+	wait_for_completion(&rcu.completion);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu_bh);
+
 static void rcu_barrier_callback(struct rcu_head *notused)
 {
 	if (atomic_dec_and_test(&rcu_barrier_cpu_count))
@@ -129,6 +155,7 @@  static void rcu_barrier_func(void *type)
 static inline void wait_migrated_callbacks(void)
 {
 	wait_event(rcu_migrate_wq, !atomic_read(&rcu_migrate_type_count));
+	smp_mb(); /* In case we didn't sleep. */
 }
 
 /*
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 9b4a975..d3a1e56 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -257,14 +257,14 @@  struct rcu_torture_ops {
 	void (*init)(void);
 	void (*cleanup)(void);
 	int (*readlock)(void);
-	void (*readdelay)(struct rcu_random_state *rrsp);
+	void (*read_delay)(struct rcu_random_state *rrsp);
 	void (*readunlock)(int idx);
 	int (*completed)(void);
-	void (*deferredfree)(struct rcu_torture *p);
+	void (*deferred_free)(struct rcu_torture *p);
 	void (*sync)(void);
 	void (*cb_barrier)(void);
 	int (*stats)(char *page);
-	int irqcapable;
+	int irq_capable;
 	char *name;
 };
 static struct rcu_torture_ops *cur_ops = NULL;
@@ -320,7 +320,7 @@  rcu_torture_cb(struct rcu_head *p)
 		rp->rtort_mbtest = 0;
 		rcu_torture_free(rp);
 	} else
-		cur_ops->deferredfree(rp);
+		cur_ops->deferred_free(rp);
 }
 
 static void rcu_torture_deferred_free(struct rcu_torture *p)
@@ -329,18 +329,18 @@  static void rcu_torture_deferred_free(struct rcu_torture *p)
 }
 
 static struct rcu_torture_ops rcu_ops = {
-	.init = NULL,
-	.cleanup = NULL,
-	.readlock = rcu_torture_read_lock,
-	.readdelay = rcu_read_delay,
-	.readunlock = rcu_torture_read_unlock,
-	.completed = rcu_torture_completed,
-	.deferredfree = rcu_torture_deferred_free,
-	.sync = synchronize_rcu,
-	.cb_barrier = rcu_barrier,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu"
+	.init		= NULL,
+	.cleanup	= NULL,
+	.readlock	= rcu_torture_read_lock,
+	.read_delay	= rcu_read_delay,
+	.readunlock	= rcu_torture_read_unlock,
+	.completed	= rcu_torture_completed,
+	.deferred_free	= rcu_torture_deferred_free,
+	.sync		= synchronize_rcu,
+	.cb_barrier	= rcu_barrier,
+	.stats		= NULL,
+	.irq_capable 	= 1,
+	.name 		= "rcu"
 };
 
 static void rcu_sync_torture_deferred_free(struct rcu_torture *p)
@@ -370,18 +370,18 @@  static void rcu_sync_torture_init(void)
 }
 
 static struct rcu_torture_ops rcu_sync_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = rcu_torture_read_lock,
-	.readdelay = rcu_read_delay,
-	.readunlock = rcu_torture_read_unlock,
-	.completed = rcu_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = synchronize_rcu,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= rcu_torture_read_lock,
+	.read_delay	= rcu_read_delay,
+	.readunlock	= rcu_torture_read_unlock,
+	.completed	= rcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= synchronize_rcu,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_sync"
 };
 
 /*
@@ -432,33 +432,33 @@  static void rcu_bh_torture_synchronize(void)
 }
 
 static struct rcu_torture_ops rcu_bh_ops = {
-	.init = NULL,
-	.cleanup = NULL,
-	.readlock = rcu_bh_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = rcu_bh_torture_read_unlock,
-	.completed = rcu_bh_torture_completed,
-	.deferredfree = rcu_bh_torture_deferred_free,
-	.sync = rcu_bh_torture_synchronize,
-	.cb_barrier = rcu_barrier_bh,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_bh"
+	.init		= NULL,
+	.cleanup	= NULL,
+	.readlock	= rcu_bh_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_bh_torture_read_unlock,
+	.completed	= rcu_bh_torture_completed,
+	.deferred_free	= rcu_bh_torture_deferred_free,
+	.sync		= rcu_bh_torture_synchronize,
+	.cb_barrier	= rcu_barrier_bh,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_bh"
 };
 
 static struct rcu_torture_ops rcu_bh_sync_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = rcu_bh_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = rcu_bh_torture_read_unlock,
-	.completed = rcu_bh_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = rcu_bh_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "rcu_bh_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= rcu_bh_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= rcu_bh_torture_read_unlock,
+	.completed	= rcu_bh_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= rcu_bh_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "rcu_bh_sync"
 };
 
 /*
@@ -530,17 +530,17 @@  static int srcu_torture_stats(char *page)
 }
 
 static struct rcu_torture_ops srcu_ops = {
-	.init = srcu_torture_init,
-	.cleanup = srcu_torture_cleanup,
-	.readlock = srcu_torture_read_lock,
-	.readdelay = srcu_read_delay,
-	.readunlock = srcu_torture_read_unlock,
-	.completed = srcu_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = srcu_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = srcu_torture_stats,
-	.name = "srcu"
+	.init		= srcu_torture_init,
+	.cleanup	= srcu_torture_cleanup,
+	.readlock	= srcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= srcu_torture_read_unlock,
+	.completed	= srcu_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= srcu_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= srcu_torture_stats,
+	.name		= "srcu"
 };
 
 /*
@@ -574,32 +574,47 @@  static void sched_torture_synchronize(void)
 }
 
 static struct rcu_torture_ops sched_ops = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = sched_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = sched_torture_read_unlock,
-	.completed = sched_torture_completed,
-	.deferredfree = rcu_sched_torture_deferred_free,
-	.sync = sched_torture_synchronize,
-	.cb_barrier = rcu_barrier_sched,
-	.stats = NULL,
-	.irqcapable = 1,
-	.name = "sched"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sched_torture_deferred_free,
+	.sync		= sched_torture_synchronize,
+	.cb_barrier	= rcu_barrier_sched,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "sched"
 };
 
 static struct rcu_torture_ops sched_ops_sync = {
-	.init = rcu_sync_torture_init,
-	.cleanup = NULL,
-	.readlock = sched_torture_read_lock,
-	.readdelay = rcu_read_delay,  /* just reuse rcu's version. */
-	.readunlock = sched_torture_read_unlock,
-	.completed = sched_torture_completed,
-	.deferredfree = rcu_sync_torture_deferred_free,
-	.sync = sched_torture_synchronize,
-	.cb_barrier = NULL,
-	.stats = NULL,
-	.name = "sched_sync"
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= sched_torture_synchronize,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.name		= "sched_sync"
+};
+
+static struct rcu_torture_ops sched_expedited_ops = {
+	.init		= rcu_sync_torture_init,
+	.cleanup	= NULL,
+	.readlock	= sched_torture_read_lock,
+	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
+	.readunlock	= sched_torture_read_unlock,
+	.completed	= sched_torture_completed,
+	.deferred_free	= rcu_sync_torture_deferred_free,
+	.sync		= synchronize_sched_expedited,
+	.cb_barrier	= NULL,
+	.stats		= NULL,
+	.irq_capable	= 1,
+	.name		= "sched_expedited"
 };
 
 /*
@@ -635,7 +650,7 @@  rcu_torture_writer(void *arg)
 				i = RCU_TORTURE_PIPE_LEN;
 			atomic_inc(&rcu_torture_wcount[i]);
 			old_rp->rtort_pipe_count++;
-			cur_ops->deferredfree(old_rp);
+			cur_ops->deferred_free(old_rp);
 		}
 		rcu_torture_current_version++;
 		oldbatch = cur_ops->completed();
@@ -700,7 +715,7 @@  static void rcu_torture_timer(unsigned long unused)
 	if (p->rtort_mbtest == 0)
 		atomic_inc(&n_rcu_torture_mberror);
 	spin_lock(&rand_lock);
-	cur_ops->readdelay(&rand);
+	cur_ops->read_delay(&rand);
 	n_rcu_torture_timers++;
 	spin_unlock(&rand_lock);
 	preempt_disable();
@@ -738,11 +753,11 @@  rcu_torture_reader(void *arg)
 
 	VERBOSE_PRINTK_STRING("rcu_torture_reader task started");
 	set_user_nice(current, 19);
-	if (irqreader && cur_ops->irqcapable)
+	if (irqreader && cur_ops->irq_capable)
 		setup_timer_on_stack(&t, rcu_torture_timer, 0);
 
 	do {
-		if (irqreader && cur_ops->irqcapable) {
+		if (irqreader && cur_ops->irq_capable) {
 			if (!timer_pending(&t))
 				mod_timer(&t, 1);
 		}
@@ -757,7 +772,7 @@  rcu_torture_reader(void *arg)
 		}
 		if (p->rtort_mbtest == 0)
 			atomic_inc(&n_rcu_torture_mberror);
-		cur_ops->readdelay(&rand);
+		cur_ops->read_delay(&rand);
 		preempt_disable();
 		pipe_count = p->rtort_pipe_count;
 		if (pipe_count > RCU_TORTURE_PIPE_LEN) {
@@ -778,7 +793,7 @@  rcu_torture_reader(void *arg)
 	} while (!kthread_should_stop() && fullstop == FULLSTOP_DONTSTOP);
 	VERBOSE_PRINTK_STRING("rcu_torture_reader task stopping");
 	rcutorture_shutdown_absorb("rcu_torture_reader");
-	if (irqreader && cur_ops->irqcapable)
+	if (irqreader && cur_ops->irq_capable)
 		del_timer_sync(&t);
 	while (!kthread_should_stop())
 		schedule_timeout_uninterruptible(1);
@@ -1078,6 +1093,7 @@  rcu_torture_init(void)
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] =
 		{ &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+		  &sched_expedited_ops,
 		  &srcu_ops, &sched_ops, &sched_ops_sync, };
 
 	mutex_lock(&fullstop_mutex);
diff --git a/kernel/sched.c b/kernel/sched.c
index 26efa47..d6d4fc3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6772,7 +6772,8 @@  static int migration_thread(void *data)
 		list_del_init(head->next);
 
 		spin_unlock(&rq->lock);
-		__migrate_task(req->task, cpu, req->dest_cpu);
+		if (req->task != NULL)
+			__migrate_task(req->task, cpu, req->dest_cpu);
 		local_irq_enable();
 
 		complete(&req->done);
@@ -10239,3 +10240,52 @@  struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
+
+#ifndef CONFIG_SMP
+
+void synchronize_sched_expedited(void)
+{
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#else /* #ifndef CONFIG_SMP */
+
+static DEFINE_PER_CPU(struct migration_req, rcu_migration_req);
+static DEFINE_MUTEX(rcu_sched_expedited_mutex);
+
+/*
+ * Wait for an rcu-sched grace period to elapse, but use "big hammer"
+ * approach to force grace period to end quickly.  This consumes
+ * significant time on all CPUs, and is thus not recommended for
+ * any sort of common-case code.
+ */
+void synchronize_sched_expedited(void)
+{
+	int cpu;
+	unsigned long flags;
+	struct rq *rq;
+	struct migration_req *req;
+
+	mutex_lock(&rcu_sched_expedited_mutex);
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		rq = cpu_rq(cpu);
+		req = &per_cpu(rcu_migration_req, cpu);
+		init_completion(&req->done);
+		req->task = NULL;
+		req->dest_cpu = -1;
+		spin_lock_irqsave(&rq->lock, flags);
+		list_add(&req->list, &rq->migration_queue);
+		spin_unlock_irqrestore(&rq->lock, flags);
+		wake_up_process(rq->migration_thread);
+	}
+	for_each_online_cpu(cpu) {
+		req = &per_cpu(rcu_migration_req, cpu);
+		wait_for_completion(&req->done);
+	}
+	put_online_cpus();
+	mutex_unlock(&rcu_sched_expedited_mutex);
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#endif /* #else #ifndef CONFIG_SMP */
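
For completeness, a hypothetical caller of the new primitive (gp,
gp_lock, new, and struct foo are made up); the pattern is the usual RCU
update sequence, with synchronize_sched_expedited() standing in for
synchronize_sched() on an infrequent but latency-sensitive update path:

	struct foo *old;

	spin_lock(&gp_lock);		/* hypothetical update-side lock */
	old = gp;
	rcu_assign_pointer(gp, new);
	spin_unlock(&gp_lock);
	synchronize_sched_expedited();	/* wait for all preempt-off readers */
	kfree(old);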