diff mbox

Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

Message ID 87a8eb5dwa.fsf@concordia.ellerman.id.au
State New
Headers show

Commit Message

Michael Ellerman Oct. 11, 2016, 11:22 a.m. UTC
Tejun Heo <tj@kernel.org> writes:

> Hello, Michael.
>
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>> This patch seems to be causing one of my Power8 boxes not to boot.
>> 
>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>> early during boot") in linux-next.
>> 
>> If I revert this on top of next-20161005 then the machine boots again.
>> 
>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> Hah, weird that it's arch dependent, or maybe it's just different
> config options.  Most likely, it's caused by workqueue_init() call
> being moved too early.  Can you please try the following patch and see
> whether the problem goes away?

No that doesn't help.

What does is this:


The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
is NULL.

The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
set_task_rq() and happen to get NULL.

We never should have done set_task_rq(p, 2048), because 2048 is >=
nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
doesn't cope with that.

The reason we're calling set_task_rq() with CPU 2048 is because
in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
tsk_cpus_allowed(p) is an empty cpu mask.

That means we do in select_task_rq():
  cpu = cpumask_any(tsk_cpus_allowed(p));                                                                                                                                    

And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
causing cpu to be set to 2048 in my case.

select_task_rq() then does the check to see if it should use a fallback
rq:

if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||                                                                                                                        
	     !cpu_online(cpu)))
	cpu = select_fallback_rq(task_cpu(p), p);


But in both those checks we end up indexing off the end of the cpu mask,
because cpu is >= nr_cpu_ids. At least on my system they both return
true and so we return cpu == 2048.

The patch above is pretty clearly not the right fix, though maybe it's a
good safety measure.

Presumably we shouldn't be ending up with tsk_cpus_allowed() being
empty, but I haven't had time to track down why that's happening.

cheers

Comments

Balbir Singh Oct. 11, 2016, 12:21 p.m. UTC | #1
On 11/10/16 22:22, Michael Ellerman wrote:
> Tejun Heo <tj@kernel.org> writes:
> 
>> Hello, Michael.
>>
>> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>>> This patch seems to be causing one of my Power8 boxes not to boot.
>>>
>>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>>> early during boot") in linux-next.
>>>
>>> If I revert this on top of next-20161005 then the machine boots again.
>>>
>>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>>
>> Hah, weird that it's arch dependent, or maybe it's just different
>> config options.  Most likely, it's caused by workqueue_init() call
>> being moved too early.  Can you please try the following patch and see
>> whether the problem goes away?
> 
> No that doesn't help.
> 
> What does is this:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1ab00a..4e79549d242f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>  	 * [ this allows ->select_task() to simply return task_cpu(p) and
>  	 *   not worry about this generic constraint ]
>  	 */
> -	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> +	if (unlikely(cpu >= nr_cpu_ids ||
> +		     !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
>  		     !cpu_online(cpu)))
>  		cpu = select_fallback_rq(task_cpu(p), p);
>  
> 
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.
> 
> The reason we're calling set_task_rq() with CPU 2048 is because
> in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
> tsk_cpus_allowed(p) is an empty cpu mask.
> 
> That means we do in select_task_rq():
>   cpu = cpumask_any(tsk_cpus_allowed(p));                                                                                                                                    
> 
> And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
> causing cpu to be set to 2048 in my case.
> 
> select_task_rq() then does the check to see if it should use a fallback
> rq:
> 
> if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||                                                                                                                        
> 	     !cpu_online(cpu)))
> 	cpu = select_fallback_rq(task_cpu(p), p);
> 
> 
> But in both those checks we end up indexing off the end of the cpu mask,
> because cpu is >= nr_cpu_ids. At least on my system they both return
> true and so we return cpu == 2048.
> 
> The patch above is pretty clearly not the right fix, though maybe it's a
> good safety measure.
> 
> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.
> 
> cheers
> 

+peterz

FYI: I see the samething on my cpu as well, its just that I get lucky
and cpu_online(cpu) returns false.

I think from a functional perspective we may want to get some additional
debug checks for places where the cpumask is empty early during boot.

Looks like there is a dependency between cpumasks and cpus coming online.
I wonder if we can hit similar issues during hotplug

FWIW, your patch looks correct to me, though one might argue that
cpumask_test_cpu() is a better place to fix it

Balbir Singh
Tejun Heo Oct. 14, 2016, 3:07 p.m. UTC | #2
Hello, Michael.

On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.

Hmm... it doesn't reproduce it here and can't see how the commit would
affect this given that it doesn't really change when the kworker
kthreads are being created.

> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.

Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
select_task_rq() and post what that says?

Thanks.
Tejun Heo Oct. 14, 2016, 3:08 p.m. UTC | #3
Hello, Balbir.

On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
> FYI: I see the samething on my cpu as well, its just that I get lucky
> and cpu_online(cpu) returns false.

Are you seeing this on x86 or is your test setup also a power machine?

Thanks.
Balbir Singh Oct. 15, 2016, 1:25 a.m. UTC | #4
On 15/10/16 02:07, Tejun Heo wrote:
> Hello, Michael.
> 
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>>
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>>
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
> 
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.
> 
>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
> 

I think the basic analysis shows the change to creation of unbounded
workqueues from the unbound_hash, but those have a pool cpumask empty.

> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?
> 
> Thanks.
> 

Balbir Singh.
Balbir Singh Oct. 15, 2016, 3:43 a.m. UTC | #5
On 15/10/16 02:08, Tejun Heo wrote:
> Hello, Balbir.
> 
> On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
>> FYI: I see the samething on my cpu as well, its just that I get lucky
>> and cpu_online(cpu) returns false.
> 
> Are you seeing this on x86 or is your test setup also a power machine?
> 
> Thanks.
> 

I saw this on a powerpc box

Balbir Singh.
Michael Ellerman Oct. 15, 2016, 9:48 a.m. UTC | #6
Tejun Heo <tj@kernel.org> writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?

That will warn if you're indexing off the end of a cpu mask and just
getting lucky with the result.

>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
>
> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?

It says:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 try_to_wake_up+0x3f4/0x5c0
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
task: c000000ff9200000 task.stack: c000001ffc084000
NIP: c0000000000f1ba4 LR: c0000000000f180c CTR: 0000000000000000
REGS: c000001ffc0878f0 TRAP: 0700   Not tainted  (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28000422  XER: 00000000
CFAR: c0000000000f18bc SOFTE: 0 
GPR00: c0000000000f180c c000001ffc087b70 c000000000e83400 0000000000000000 
GPR04: 0000000000000002 0000000000000000 0000000000000000 0000000000000000 
GPR08: c000000000dc3400 0000000000000001 0000000000000002 0000000000000000 
GPR12: 0000000000000000 c00000000fb80000 c00000000000e0c8 0000000000000000 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000eb8960 
GPR24: 0000000000000000 c000000000d8ce00 0000000000000000 0000000000000000 
GPR28: c0000007f54050f4 0000000000000000 0000000000000000 c0000007f5404900 
NIP [c0000000000f1ba4] try_to_wake_up+0x3f4/0x5c0
LR [c0000000000f180c] try_to_wake_up+0x5c/0x5c0
Call Trace:
[c000001ffc087b70] [c0000000000f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
[c000001ffc087bf0] [c0000000000d53e4] create_worker+0x144/0x250
[c000001ffc087c90] [c000000000cf7930] workqueue_init+0x170/0x19c
[c000001ffc087d00] [c000000000ce0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
e8790890 4bff6ed9 2fa30000 419e00dc 60000000 4bfffe54 3d02fff4 8928d7f9 
2f890000 409e0018 39200001 9928d7f9 <0fe00000> 60000000 60420000 3b5f0368 
---[ end trace 0000000000000000 ]---


But I'm not sure that tells us anything new?

cheers
Michael Ellerman Oct. 17, 2016, 12:24 p.m. UTC | #7
Tejun Heo <tj@kernel.org> writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

It changes when the pool attributes are created, which is the source of
the bug.

The original crash happens because we have a task with an empty cpus_allowed
mask. That mask originally comes from pool->attrs->cpumask.

The attrs for the pool are created early via workqueue_init_early() in
apply_wqattrs_prepare():

  start_here_common
  -> start_kernel
     -> workqueue_init_early
        -> __alloc_workqueue_key
           -> apply_workqueue_attrs
              -> apply_workqueue_attrs_locked
                 -> apply_wqattrs_prepare
	          
In there we do:

	copy_workqueue_attrs(new_attrs, attrs);
	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
	if (unlikely(cpumask_empty(new_attrs->cpumask)))
		cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
	...
	copy_workqueue_attrs(tmp_attrs, new_attrs);
	...
	for_each_node(node) {
		if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
+			BUG_ON(cpumask_empty(tmp_attrs->cpumask));
			ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);


The bad case (where we hit the BUG_ON I added above) is where we are
creating a wq for node 1.

In wq_calc_node_cpumask() we do:

	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
	return !cpumask_equal(cpumask, attrs->cpumask);

Which with the arguments inserted is:

	cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, wq_numa_possible_cpumask[1]);
	return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);

And that results in tmp_attrs->cpumask being empty, because
wq_numa_possible_cpumask[1] is an empty cpumask.

The reason wq_numa_possible_cpumask[1] is an empty mask is because in
wq_numa_init() we did:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (WARN_ON(node == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			/* happens iff arch is bonkers, let's just proceed */
			return;
		}
		cpumask_set_cpu(cpu, tbl[node]);
	}

And cpu_to_node() returned node 0 for every CPU in the system, despite there
being multiple nodes.

That happened because we haven't yet called set_cpu_numa_node() for the non-boot
cpus, because that happens in smp_prepare_cpus(), and
workqueue_init_early() is called much earlier than that.

This doesn't trigger on x86 because it does set_cpu_numa_node() in
setup_per_cpu_areas(), which is called prior to workqueue_init_early().

We can (should) probably do the same on powerpc, I'll look at that
tomorrow. But other arches may have a similar problem, and at the very
least we need to document that workqueue_init_early() relies on
cpu_to_node() working.

cheers
Balbir Singh Oct. 17, 2016, 12:51 p.m. UTC | #8
On 17/10/16 23:24, Michael Ellerman wrote:
> Tejun Heo <tj@kernel.org> writes:
> 
>> Hello, Michael.
>>
>> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>>> is NULL.
>>>
>>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>>> set_task_rq() and happen to get NULL.
>>>
>>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>>> doesn't cope with that.
>>
>> Hmm... it doesn't reproduce it here and can't see how the commit would
>> affect this given that it doesn't really change when the kworker
>> kthreads are being created.
> 
> It changes when the pool attributes are created, which is the source of
> the bug.
> 
> The original crash happens because we have a task with an empty cpus_allowed
> mask. That mask originally comes from pool->attrs->cpumask.
> 
> The attrs for the pool are created early via workqueue_init_early() in
> apply_wqattrs_prepare():
> 
>   start_here_common
>   -> start_kernel
>      -> workqueue_init_early
>         -> __alloc_workqueue_key
>            -> apply_workqueue_attrs
>               -> apply_workqueue_attrs_locked
>                  -> apply_wqattrs_prepare
> 	          
> In there we do:
> 
> 	copy_workqueue_attrs(new_attrs, attrs);
> 	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
> 	if (unlikely(cpumask_empty(new_attrs->cpumask)))
> 		cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
> 	...
> 	copy_workqueue_attrs(tmp_attrs, new_attrs);
> 	...
> 	for_each_node(node) {
> 		if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
> +			BUG_ON(cpumask_empty(tmp_attrs->cpumask));
> 			ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
> 
> 
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
> 	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> 	return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
> 	cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, wq_numa_possible_cpumask[1]);
> 	return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.
> 
> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
> 	for_each_possible_cpu(cpu) {
> 		node = cpu_to_node(cpu);
> 		if (WARN_ON(node == NUMA_NO_NODE)) {
> 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 			/* happens iff arch is bonkers, let's just proceed */
> 			return;
> 		}
> 		cpumask_set_cpu(cpu, tbl[node]);
> 	}
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

Don't we do the setup cpu->node mapings in initmem_init()?
Ideally we have setup_arch->intmem_init->numa_setup_cpu

Will look at it tomorrow
Balbir Singh
Tejun Heo Oct. 17, 2016, 6:13 p.m. UTC | #9
Hello,

On Sat, Oct 15, 2016 at 08:48:01PM +1100, Michael Ellerman wrote:
> > Hmm... it doesn't reproduce it here and can't see how the commit would
> > affect this given that it doesn't really change when the kworker
> > kthreads are being created.
> 
> Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?
> 
> That will warn if you're indexing off the end of a cpu mask and just
> getting lucky with the result.

That's not happening on x86.  That could mean that powerpc is
initializing cpu_possible_mask after workqueue_init_early().  Looking
into it.

> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 try_to_wake_up+0x3f4/0x5c0
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
> task: c000000ff9200000 task.stack: c000001ffc084000
> NIP: c0000000000f1ba4 LR: c0000000000f180c CTR: 0000000000000000
> REGS: c000001ffc0878f0 TRAP: 0700   Not tainted  (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
> MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28000422  XER: 00000000
> CFAR: c0000000000f18bc SOFTE: 0 
> GPR00: c0000000000f180c c000001ffc087b70 c000000000e83400 0000000000000000 
> GPR04: 0000000000000002 0000000000000000 0000000000000000 0000000000000000 
> GPR08: c000000000dc3400 0000000000000001 0000000000000002 0000000000000000 
> GPR12: 0000000000000000 c00000000fb80000 c00000000000e0c8 0000000000000000 
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
> GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000eb8960 
> GPR24: 0000000000000000 c000000000d8ce00 0000000000000000 0000000000000000 
> GPR28: c0000007f54050f4 0000000000000000 0000000000000000 c0000007f5404900 
> NIP [c0000000000f1ba4] try_to_wake_up+0x3f4/0x5c0
> LR [c0000000000f180c] try_to_wake_up+0x5c/0x5c0
> Call Trace:
> [c000001ffc087b70] [c0000000000f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
> [c000001ffc087bf0] [c0000000000d53e4] create_worker+0x144/0x250
> [c000001ffc087c90] [c000000000cf7930] workqueue_init+0x170/0x19c
> [c000001ffc087d00] [c000000000ce0e74] kernel_init_freeable+0x158/0x360
> [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
> [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
> Instruction dump:
> e8790890 4bff6ed9 2fa30000 419e00dc 60000000 4bfffe54 3d02fff4 8928d7f9 
> 2f890000 409e0018 39200001 9928d7f9 <0fe00000> 60000000 60420000 3b5f0368 
> ---[ end trace 0000000000000000 ]---
> 
> But I'm not sure that tells us anything new?

Yeah, I should have asked to print out information of the target task
but it looks like we have enough information now.

Thanks.
Tejun Heo Oct. 17, 2016, 6:15 p.m. UTC | #10
Hello, Michael.

On Mon, Oct 17, 2016 at 11:24:34PM +1100, Michael Ellerman wrote:
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
> 	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> 	return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
> 	cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, wq_numa_possible_cpumask[1]);
> 	return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.

Ah, should have read this before replying to the previous mail, so
it's the numa mask, not the cpu_possible_mask.

> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
> 	for_each_possible_cpu(cpu) {
> 		node = cpu_to_node(cpu);
> 		if (WARN_ON(node == NUMA_NO_NODE)) {
> 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 			/* happens iff arch is bonkers, let's just proceed */
> 			return;
> 		}
> 		cpumask_set_cpu(cpu, tbl[node]);
> 	}
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

I should be able to move the numa part of initialization to the later
init function.  Working on it.

Thanks.
Michael Ellerman Oct. 18, 2016, 2:35 a.m. UTC | #11
Balbir Singh <bsingharora@gmail.com> writes:
> On 17/10/16 23:24, Michael Ellerman wrote:
>> That happened because we haven't yet called set_cpu_numa_node() for the non-boot
>> cpus, because that happens in smp_prepare_cpus(), and
>> workqueue_init_early() is called much earlier than that.
>> 
>> This doesn't trigger on x86 because it does set_cpu_numa_node() in
>> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
>> 
>> We can (should) probably do the same on powerpc, I'll look at that
>> tomorrow. But other arches may have a similar problem, and at the very
>> least we need to document that workqueue_init_early() relies on
>> cpu_to_node() working.
>
> Don't we do the setup cpu->node mapings in initmem_init()?
> Ideally we have setup_arch->intmem_init->numa_setup_cpu

That sets up numa_cpu_lookup_table, which is a powerpc only data
structure.

But it doesn't setup the percpu numa_node variables, used by
cpu_to_node(), because percpu areas are not setup yet.

cheers
diff mbox

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..4e79549d242f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,8 @@  int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 	 * [ this allows ->select_task() to simply return task_cpu(p) and
 	 *   not worry about this generic constraint ]
 	 */
-	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
+	if (unlikely(cpu >= nr_cpu_ids ||
+		     !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);