diff mbox

kthread: kthread_bind fails to enforce CPU affinity (fixes kernel BUG at kernel/smpboot.c:134!)

Message ID 1418009221-12719-1-git-send-email-anton@samba.org (mailing list archive)
State Rejected
Headers show

Commit Message

Anton Blanchard Dec. 8, 2014, 3:27 a.m. UTC
I have a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:

BUG_ON(td->cpu != smp_processor_id());

Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:

CPU: 0
Comm: watchdog/130

The issue is in kthread_bind where we set the cpus_allowed mask, but do
not touch task_thread_info(p)->cpu. The scheduler assumes the previously
scheduled CPU is in the cpus_allowed mask, but in this case we are
moving a thread to another CPU so it is not.

We used to call set_task_cpu which sets task_thread_info(p)->cpu (in fact
kthread_bind still has a comment suggesting this). That was removed in
e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked
tasks").

Since we cannot call set_task_cpu (the task is in a sleeping state),
just do an explicit set of task_thread_info(p)->cpu.

Fixes: e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Blanchard <anton@samba.org>
---
 kernel/kthread.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Dec. 8, 2014, 4:28 a.m. UTC | #1
On Sun, Dec 7, 2014 at 7:27 PM, Anton Blanchard <anton@samba.org> wrote:
>
> Since we cannot call set_task_cpu (the task is in a sleeping state),
> just do an explicit set of task_thread_info(p)->cpu.

Scheduler people: is this sufficient and ok?

The __set_task_cpu() function does various other things too:

        set_task_rq(p, cpu);
  #ifdef CONFIG_SMP
        /*
         * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
         * successfuly executed on another CPU. We must ensure that updates of
         * per-task data have been completed by this moment.
         */
        smp_wmb();
        task_thread_info(p)->cpu = cpu;
        p->wake_cpu = cpu;
  #endif

which makes me worry about just setting the thread_info->cpu value.
That set_task_rq() initializes various group scheduling things, an
dthat whole "wake_cpu" thing seems relevant too.

I'm not saying the patch is wrong, I just want confirmation/ack for
it. Although to be honest, I'd rather see this come in through the
scheduler tree anyway.

Hmm? This seems to go back to 2.6.33 if that "Fixes" line is accurate,
so it's been around forever (almost exactly five years). Comments?

                    Linus
Anton Blanchard Dec. 8, 2014, 4:46 a.m. UTC | #2
Hi Linus,

> The __set_task_cpu() function does various other things too:
> 
>         set_task_rq(p, cpu);
>   #ifdef CONFIG_SMP
>         /*
>          * After ->cpu is set up to a new value, task_rq_lock(p, ...)
> can be
>          * successfuly executed on another CPU. We must ensure that
> updates of
>          * per-task data have been completed by this moment.
>          */
>         smp_wmb();
>         task_thread_info(p)->cpu = cpu;
>         p->wake_cpu = cpu;
>   #endif
> 
> which makes me worry about just setting the thread_info->cpu value.
> That set_task_rq() initializes various group scheduling things, an
> dthat whole "wake_cpu" thing seems relevant too.

Yeah, I would definitely like the scheduler guys to weigh in on this,
especially considering how difficult it can be to hit.

Anton
Ingo Molnar Dec. 8, 2014, 8:34 a.m. UTC | #3
* Anton Blanchard <anton@samba.org> wrote:

> I have a busy ppc64le KVM box where guests sometimes hit the 
> infamous "kernel BUG at kernel/smpboot.c:134!" issue during 
> boot:
> 
> BUG_ON(td->cpu != smp_processor_id());
> 
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
> 
> CPU: 0
> Comm: watchdog/130
> 
> The issue is in kthread_bind where we set the cpus_allowed 
> mask, but do not touch task_thread_info(p)->cpu. The scheduler 
> assumes the previously scheduled CPU is in the cpus_allowed 
> mask, but in this case we are moving a thread to another CPU so 
> it is not.
> 
> We used to call set_task_cpu which sets 
> task_thread_info(p)->cpu (in fact kthread_bind still has a 
> comment suggesting this). That was removed in e2912009fb7b 
> ("sched: Ensure set_task_cpu() is never called on blocked 
> tasks").
> 
> Since we cannot call set_task_cpu (the task is in a sleeping 
> state), just do an explicit set of task_thread_info(p)->cpu.

So we cannot call set_task_cpu() because in the normal life time 
of a task the ->cpu value gets set on wakeup. So if a task is 
blocked right now, and its affinity changes, it ought to get a 
correct ->cpu selected on wakeup. The affinity mask and the 
current value of ->cpu getting out of sync is thus 'normal'.

(Check for example how set_cpus_allowed_ptr() works: we first set 
the new allowed mask, then do we migrate the task away if 
necessary.)

In the kthread_bind() case this is explicitly assumed: it only 
calls do_set_cpus_allowed().

But obviously the bug triggers in kernel/smpboot.c, and that 
assert shows a real bug - and your patch makes the assert go 
away, so the question is, how did the kthread get woken up and 
put on a runqueue without its ->cpu getting set?

One possibility is a generic scheduler bug in ttwu(), resulting 
in ->cpu not getting set properly. If this was the case then 
other places would be blowing up as well, and I don't think we 
are seeing this currently, especially not over such a long 
timespan.

Another possibility would be that kthread_bind()'s assumption 
that the task is inactive is false: if the task activates when we 
think it's blocked and we just hotplug-migrate it away while its 
running (setting its td->cpu?), the assert could trigger I think 
- and the patch would make the assert go away.

A third possibility would be, if this is a freshly created 
thread, some sort of initialization race - either in the kthread 
or in the scheduler code.

Weird.

Thanks,

	Ingo
Anton Blanchard Dec. 8, 2014, 10:18 a.m. UTC | #4
Hi Ingo,

> So we cannot call set_task_cpu() because in the normal life time 
> of a task the ->cpu value gets set on wakeup. So if a task is 
> blocked right now, and its affinity changes, it ought to get a 
> correct ->cpu selected on wakeup. The affinity mask and the 
> current value of ->cpu getting out of sync is thus 'normal'.
> 
> (Check for example how set_cpus_allowed_ptr() works: we first set 
> the new allowed mask, then do we migrate the task away if 
> necessary.)
> 
> In the kthread_bind() case this is explicitly assumed: it only 
> calls do_set_cpus_allowed().
> 
> But obviously the bug triggers in kernel/smpboot.c, and that 
> assert shows a real bug - and your patch makes the assert go 
> away, so the question is, how did the kthread get woken up and 
> put on a runqueue without its ->cpu getting set?

I started going down this line earlier today, and found things like:

select_task_rq_fair:

        if (p->nr_cpus_allowed == 1)
                return prev_cpu;

I tried returning cpumask_first(tsk_cpus_allowed()) instead, and while
I couldn't hit the BUG I did manage to get a scheduler lockup during
testing.

At that point I thought the previous task_cpu() was somewhat ingrained
in the scheduler and came up with the patch. If not, we could go on a
hunt to see what else needs fixing.

Anton
diff mbox

Patch

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 10e489c..e40ab1d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -327,13 +327,14 @@  EXPORT_SYMBOL(kthread_create_on_node);
 
 static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
 {
-	/* Must have done schedule() in kthread() before we set_task_cpu */
+	/* Must have done schedule() in kthread() before we change affinity */
 	if (!wait_task_inactive(p, state)) {
 		WARN_ON(1);
 		return;
 	}
 	/* It's safe because the task is inactive. */
 	do_set_cpus_allowed(p, cpumask_of(cpu));
+	task_thread_info(p)->cpu = cpu;
 	p->flags |= PF_NO_SETAFFINITY;
 }