Message ID | 1418009221-12719-1-git-send-email-anton@samba.org (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
On Sun, Dec 7, 2014 at 7:27 PM, Anton Blanchard <anton@samba.org> wrote: > > Since we cannot call set_task_cpu (the task is in a sleeping state), > just do an explicit set of task_thread_info(p)->cpu. Scheduler people: is this sufficient and ok? The __set_task_cpu() function does various other things too: set_task_rq(p, cpu); #ifdef CONFIG_SMP /* * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be * successfuly executed on another CPU. We must ensure that updates of * per-task data have been completed by this moment. */ smp_wmb(); task_thread_info(p)->cpu = cpu; p->wake_cpu = cpu; #endif which makes me worry about just setting the thread_info->cpu value. That set_task_rq() initializes various group scheduling things, an dthat whole "wake_cpu" thing seems relevant too. I'm not saying the patch is wrong, I just want confirmation/ack for it. Although to be honest, I'd rather see this come in through the scheduler tree anyway. Hmm? This seems to go back to 2.6.33 if that "Fixes" line is accurate, so it's been around forever (almost exactly five years). Comments? Linus
Hi Linus, > The __set_task_cpu() function does various other things too: > > set_task_rq(p, cpu); > #ifdef CONFIG_SMP > /* > * After ->cpu is set up to a new value, task_rq_lock(p, ...) > can be > * successfuly executed on another CPU. We must ensure that > updates of > * per-task data have been completed by this moment. > */ > smp_wmb(); > task_thread_info(p)->cpu = cpu; > p->wake_cpu = cpu; > #endif > > which makes me worry about just setting the thread_info->cpu value. > That set_task_rq() initializes various group scheduling things, an > dthat whole "wake_cpu" thing seems relevant too. Yeah, I would definitely like the scheduler guys to weigh in on this, especially considering how difficult it can be to hit. Anton
* Anton Blanchard <anton@samba.org> wrote: > I have a busy ppc64le KVM box where guests sometimes hit the > infamous "kernel BUG at kernel/smpboot.c:134!" issue during > boot: > > BUG_ON(td->cpu != smp_processor_id()); > > Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops > output confirms it: > > CPU: 0 > Comm: watchdog/130 > > The issue is in kthread_bind where we set the cpus_allowed > mask, but do not touch task_thread_info(p)->cpu. The scheduler > assumes the previously scheduled CPU is in the cpus_allowed > mask, but in this case we are moving a thread to another CPU so > it is not. > > We used to call set_task_cpu which sets > task_thread_info(p)->cpu (in fact kthread_bind still has a > comment suggesting this). That was removed in e2912009fb7b > ("sched: Ensure set_task_cpu() is never called on blocked > tasks"). > > Since we cannot call set_task_cpu (the task is in a sleeping > state), just do an explicit set of task_thread_info(p)->cpu. So we cannot call set_task_cpu() because in the normal life time of a task the ->cpu value gets set on wakeup. So if a task is blocked right now, and its affinity changes, it ought to get a correct ->cpu selected on wakeup. The affinity mask and the current value of ->cpu getting out of sync is thus 'normal'. (Check for example how set_cpus_allowed_ptr() works: we first set the new allowed mask, then do we migrate the task away if necessary.) In the kthread_bind() case this is explicitly assumed: it only calls do_set_cpus_allowed(). But obviously the bug triggers in kernel/smpboot.c, and that assert shows a real bug - and your patch makes the assert go away, so the question is, how did the kthread get woken up and put on a runqueue without its ->cpu getting set? One possibility is a generic scheduler bug in ttwu(), resulting in ->cpu not getting set properly. If this was the case then other places would be blowing up as well, and I don't think we are seeing this currently, especially not over such a long timespan. Another possibility would be that kthread_bind()'s assumption that the task is inactive is false: if the task activates when we think it's blocked and we just hotplug-migrate it away while its running (setting its td->cpu?), the assert could trigger I think - and the patch would make the assert go away. A third possibility would be, if this is a freshly created thread, some sort of initialization race - either in the kthread or in the scheduler code. Weird. Thanks, Ingo
Hi Ingo, > So we cannot call set_task_cpu() because in the normal life time > of a task the ->cpu value gets set on wakeup. So if a task is > blocked right now, and its affinity changes, it ought to get a > correct ->cpu selected on wakeup. The affinity mask and the > current value of ->cpu getting out of sync is thus 'normal'. > > (Check for example how set_cpus_allowed_ptr() works: we first set > the new allowed mask, then do we migrate the task away if > necessary.) > > In the kthread_bind() case this is explicitly assumed: it only > calls do_set_cpus_allowed(). > > But obviously the bug triggers in kernel/smpboot.c, and that > assert shows a real bug - and your patch makes the assert go > away, so the question is, how did the kthread get woken up and > put on a runqueue without its ->cpu getting set? I started going down this line earlier today, and found things like: select_task_rq_fair: if (p->nr_cpus_allowed == 1) return prev_cpu; I tried returning cpumask_first(tsk_cpus_allowed()) instead, and while I couldn't hit the BUG I did manage to get a scheduler lockup during testing. At that point I thought the previous task_cpu() was somewhat ingrained in the scheduler and came up with the patch. If not, we could go on a hunt to see what else needs fixing. Anton
diff --git a/kernel/kthread.c b/kernel/kthread.c index 10e489c..e40ab1d 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -327,13 +327,14 @@ EXPORT_SYMBOL(kthread_create_on_node); static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state) { - /* Must have done schedule() in kthread() before we set_task_cpu */ + /* Must have done schedule() in kthread() before we change affinity */ if (!wait_task_inactive(p, state)) { WARN_ON(1); return; } /* It's safe because the task is inactive. */ do_set_cpus_allowed(p, cpumask_of(cpu)); + task_thread_info(p)->cpu = cpu; p->flags |= PF_NO_SETAFFINITY; }
I have a busy ppc64le KVM box where guests sometimes hit the infamous "kernel BUG at kernel/smpboot.c:134!" issue during boot: BUG_ON(td->cpu != smp_processor_id()); Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops output confirms it: CPU: 0 Comm: watchdog/130 The issue is in kthread_bind where we set the cpus_allowed mask, but do not touch task_thread_info(p)->cpu. The scheduler assumes the previously scheduled CPU is in the cpus_allowed mask, but in this case we are moving a thread to another CPU so it is not. We used to call set_task_cpu which sets task_thread_info(p)->cpu (in fact kthread_bind still has a comment suggesting this). That was removed in e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked tasks"). Since we cannot call set_task_cpu (the task is in a sleeping state), just do an explicit set of task_thread_info(p)->cpu. Fixes: e2912009fb7b ("sched: Ensure set_task_cpu() is never called on blocked tasks") Cc: stable@vger.kernel.org Signed-off-by: Anton Blanchard <anton@samba.org> --- kernel/kthread.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)