Patchwork [RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

Submitter Michael Neuling
Date Sept. 2, 2010, 1:02 a.m.
Message ID <11579.1283389322@neuling.org>
Permalink /patch/63432/
State Not Applicable

Comments

Michael Neuling - Sept. 2, 2010, 1:02 a.m.
In message <4C7EBAA8.7030601@us.ibm.com> you wrote:
> On 09/01/2010 12:59 PM, Steven Rostedt wrote:
> > On Wed, 2010-09-01 at 11:47 -0700, Darren Hart wrote:
> > 
> >> from tip/rt/2.6.33 causes the preempt_count() to change across the cede
> >> call.  This patch appears to prevent the proxy preempt_count assignment
> >> from happening. This non-local-cpu assignment to 0 would cause an
> >> underrun of preempt_count() if the local-cpu had disabled preemption
> >> prior to the assignment and then later tried to enable it. This appears
> >> to be the case with the stack of __trace_hcall* calls preceding the
> >> return from extended_cede_processor() in the latency format trace-cmd
> >> report:
> >>
> >>   <idle>-0       1d....   201.252737: function:             .cpu_die
> > 
> > Note, the above 1d.... is a series of values. The first being the CPU,
> > the next if interrupts are disabled, the next if the NEED_RESCHED flag
> > is set, the next is softirqs enabled or disabled, next the
> > preempt_count, and finally the lockdepth count.
> > 
> > Here we only care about the preempt_count, which is zero when '.' and a
> > number if it is something else. It is the second to last field in that
> > list.
> > 
> > 
> >>   <idle>-0       1d....   201.252738: function:                .pseries_mach_cpu_die
> >>   <idle>-0       1d....   201.252740: function:                   .idle_task_exit
> >>   <idle>-0       1d....   201.252741: function:                      .switch_slb
> >>   <idle>-0       1d....   201.252742: function:                   .xics_teardown_cpu
> >>   <idle>-0       1d....   201.252743: function:                      .xics_set_cpu_priority
> >>   <idle>-0       1d....   201.252744: function:             .__trace_hcall_entry
> >>   <idle>-0       1d..1.   201.252745: function:                .probe_hcall_entry
> > 
> >                        ^
> >                 preempt_count set to 1
> > 
> >>   <idle>-0       1d..1.   201.252746: function:             .__trace_hcall_exit
> >>   <idle>-0       1d..2.   201.252747: function:                .probe_hcall_exit
> >>   <idle>-0       1d....   201.252748: function:             .__trace_hcall_entry
> >>   <idle>-0       1d..1.   201.252748: function:                .probe_hcall_entry
> >>   <idle>-0       1d..1.   201.252750: function:             .__trace_hcall_exit
> >>   <idle>-0       1d..2.   201.252751: function:                .probe_hcall_exit
> >>   <idle>-0       1d....   201.252752: function:             .__trace_hcall_entry
> >>   <idle>-0       1d..1.   201.252753: function:                .probe_hcall_entry
> >                    ^   ^
> >                   CPU  preempt_count
> > 
> > Entering the function probe_hcall_entry() the preempt_count is 1 (see
> > below). But probe_hcall_entry does:
> > 
> > 	h = &get_cpu_var(hcall_stats)[opcode / 4];
> > 
> > Without doing the put (which it does in probe_hcall_exit())
> > 
> > So exiting the probe_hcall_entry() the preempt_count is 2.
> > The trace_hcall_entry() will do a preempt_enable() making it leave as 1.
> > 
> > 
> >>   offon.sh-3684  6.....   201.466488: bprint:               .smp_pSeries_kick_cpu : resetting pcnt to 0 for cpu 1
> > 
> > This is CPU 6, changing the preempt count from 1 to 0.
> > 
> >>
> >> preempt_count() is reset from 1 to 0 by smp_startup_cpu() without the
> >> QCSS_NOT_STOPPED check from the patch above.
> >>
> >>   <idle>-0       1d....   201.466503: function:             .__trace_hcall_exit
> > 
> > Note: __trace_hcall_exit() and __trace_hcall_entry() basically do:
> > 
> >  preempt_disable();
> >  call probe();
> >  preempt_enable();
> > 
> > 
> >>   <idle>-0       1d..1.   201.466505: function:                .probe_hcall_exit
> > 
> > The preempt_count of 1 entering the probe_hcall_exit() is because of the
> > preempt_disable() shown above. It should have been entered as a 2.
> > 
> > But then it does:
> > 
> > 
> > 	put_cpu_var(hcall_stats);
> > 
> > making preempt_count 0.
> > 
> > But the preempt_enable() in the trace_hcall_exit() causes this to be -1.
> > 
> > 
> >>   <idle>-0       1d.Hff.   201.466507: bprint:               .pseries_mach_cpu_die : after cede: ffffffff
> >>
> >> With the preempt_count() being one less than it should be, the final
> >> preempt_enable() in the trace_hcall path drops preempt_count to
> >> 0xffffffff, which of course is an illegal value and leads to a crash.
> > 
> > I'm confused as to how this works in mainline?
> 
> Turns out it didn't. 2.6.33.5 with CONFIG_PREEMPT=y sees this exact same
> behavior. The following, part of the 2.6.33.6 stable release, prevents
> this from happening:
> 
> aef40e87d866355ffd279ab21021de733242d0d5
> powerpc/pseries: Only call start-cpu when a CPU is stopped
> 
> --- a/arch/powerpc/platforms/pseries/smp.c
> +++ b/arch/powerpc/platforms/pseries/smp.c
> @@ -82,6 +82,12 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
> 
>         pcpu = get_hard_smp_processor_id(lcpu);
> 
> +       /* Check to see if the CPU out of FW already for kexec */
> +       if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
> +               cpu_set(lcpu, of_spin_map);
> +               return 1;
> +       }
> +
>         /* Fixup atomic count: it exited inside IRQ handler. */
>         task_thread_info(paca[lcpu].__current)->preempt_count   = 0;
> 
> The question is now, Is this the right fix? If so, perhaps we can update
> the comment to be a bit more clear and not refer solely to kexec.
> 
> Michael Neuling, can you offer any thoughts here? We hit this EVERY
> TIME, which makes me wonder if the offline/online path could do this
> without calling smp_startup_cpu at all.

We need to call smp_startup_cpu on boot when the cpus are still in
FW.  smp_startup_cpu does this for us on boot.

I'm wondering if we just need to move the test down a bit to make sure
the preempt_count is set.  I've not been following this thread, but
maybe this might work?

Untested patch below...

Mikey
Steven Rostedt - Sept. 2, 2010, 4:06 a.m.
On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:

> We need to call smp_startup_cpu on boot when we the cpus are still in
> FW.  smp_startup_cpu does this for us on boot.
> 
> I'm wondering if we just need to move the test down a bit to make sure
> the preempt_count is set.  I've not been following this thread, but
> maybe this might work?

Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
even said that adding the exit prevented the bug (although now he's
hitting a hard lockup someplace else). The original code he was using
did not have the condition to return for kexec. It was just a
coincidence that this code helped in bringing a CPU back online.

> 
> Untested patch below...
> 
> Mikey
> 
> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
> index 0317cce..3afaba4 100644
> --- a/arch/powerpc/platforms/pseries/smp.c
> +++ b/arch/powerpc/platforms/pseries/smp.c
> @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
>  
>  	pcpu = get_hard_smp_processor_id(lcpu);
>  
> -	/* Check to see if the CPU out of FW already for kexec */
> -	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
> -		cpumask_set_cpu(lcpu, of_spin_mask);
> -		return 1;
> -	}
> -
>  	/* Fixup atomic count: it exited inside IRQ handler. */
>  	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;

We DON'T want to do the above. It's nasty! This is one CPU's task
touching an intimate part of another CPU's task. It's the equivalent of me
putting my hand down your wife's blouse. It's offensive, and rude.

OK, if the CPU was never online, then you can do what you want. But what
we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
into this cede_processor() call. When you wake it up, suddenly the task
on that woken CPU has its preempt count fscked up.  This was really
really hard to debug. We thought it was stack corruption or something.
But it ended up being that this code has one CPU touching the breasts of
another CPU. This code is a pervert!

What the trace clearly showed was that we take down a CPU, and in doing
so, the code on that CPU set the preempt count to 1, expecting it to
still be 1 when it returned. But the code that kick-started this CPU
back to life (bringing the CPU back online) set the preempt count on the
task of that CPU to 0, and screwed everything up.

-- Steve


>  
>  	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
>  		goto out;
>  
> +	/* Check to see if the CPU out of FW already for kexec */
> +	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
> +		cpumask_set_cpu(lcpu, of_spin_mask);
> +		return 1;
> +	}
> +
>  	/* 
>  	 * If the RTAS start-cpu token does not exist then presume the
>  	 * cpu is already spinning.
>
Darren Hart - Sept. 2, 2010, 6:04 a.m.
On 09/01/2010 09:06 PM, Steven Rostedt wrote:
> On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
> 
>> We need to call smp_startup_cpu on boot when we the cpus are still in
>> FW.  smp_startup_cpu does this for us on boot.
>>
>> I'm wondering if we just need to move the test down a bit to make sure
>> the preempt_count is set.  I've not been following this thread, but
>> maybe this might work?
> 
> Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
> even said that adding the exit prevented the bug (although now he's
> hitting a hard lockup someplace else). The original code he was using
> did not have the condition to return for kexec. It was just a
> coincidence that this code helped in bringing a CPU back online.
> 
>>
>> Untested patch below...
>>
>> Mikey
>>
>> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
>> index 0317cce..3afaba4 100644
>> --- a/arch/powerpc/platforms/pseries/smp.c
>> +++ b/arch/powerpc/platforms/pseries/smp.c
>> @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
>>  
>>  	pcpu = get_hard_smp_processor_id(lcpu);
>>  
>> -	/* Check to see if the CPU out of FW already for kexec */
>> -	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
>> -		cpumask_set_cpu(lcpu, of_spin_mask);
>> -		return 1;
>> -	}
>> -
>>  	/* Fixup atomic count: it exited inside IRQ handler. */
>>  	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
> 
> We DON'T want to do the above. It's nasty! This is one CPU's task
> touching an intimate part of another CPU's task. It's equivalent of me
> putting my hand down you wife's blouse. It's offensive, and rude.
> 
> OK, if the CPU was never online, then you can do what you want. But what
> we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
> into this cede_processor() call. When you wake it up, suddenly the task
> on that woken CPU has its preempt count fscked up.  This was really
> really hard to debug. We thought it was stack corruption or something.
> But it ended up being that this code has one CPU touching the breasts of
> another CPU. This code is a pervert!
> 
> What the trace clearly showed, was that we take down a CPU, and in doing
> so, the code on that CPU set the preempt count to 1, and it expected to
> have it as 1 when it returned. But the code that kicked started this CPU
> back to life (bring the CPU back online), set the preempt count on the
> task of that CPU to 0, and screwed everything up.


Right, what Steve said.

This patch resolved the problem, but it did so inadvertently while
trying to fix kexec. I'm wondering if the offline/online code needs to
be updated to never call smp_startup_cpu(). Or perhaps this is the right
fix, and the comment needs to be cleaned up so that it isn't only for
kexec.

With this in place, we no longer see the preempt_count dropping below
zero. However, if I offline/online a CPU about 246 times I hit the
opposite problem, a preempt_count() overflow. There appears to be a
missing preempt_enable() somewhere in the offline/online paths.

On 2.6.33.7 with CONFIG_PREEMPT=y I run the following test:

# cat offon_loop.sh 
#!/bin/sh
for i in `seq 100`; do
        echo "Iteration $i"
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 1 > /sys/devices/system/cpu/cpu1/online
done

And see the overflow:

Iteration 238
Iteration 239
Iteration 240

Message from syslogd@igoort1 at Sep  1 17:36:45 ...
 kernel:------------[ cut here ]------------
Iteration 241
Iteration 242
Iteration 243
Iteration 244
Iteration 245

With the following on the console. This is the

	/*
	 * Spinlock count overflowing soon?
	 */
	DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
				PREEMPT_MASK - 10);

test firing. The max preempt count is 256.

Somewhere we are now MISSING a sub_preempt_count() or preempt_enable().

Badness at kernel/sched.c:5372
NIP: c000000000694f14 LR: c000000000694ef8 CTR: c00000000005b0a0
REGS: c00000008e776ae0 TRAP: 0700   Not tainted  (2.6.33.7-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 28000082  XER: 00000000
TASK = c00000010e6fc040[0] 'swapper' THREAD: c00000008e774000 CPU: 1
GPR00: 0000000000000000 c00000008e776d60 c000000000af2ab8 0000000000000001
GPR04: c00000008e776fb8 0000000000000004 0000000000000001 ffffffffff676980
GPR08: 0000000000000400 c000000000bcd930 0000000000000001 c000000000b2d360
GPR12: 0000000000000002 c000000000b0f680 0000000000000000 000000000f394acc
GPR16: 0000000000000000 0000000000000000 0000000000000000 c00000008e777880
GPR20: c000000000e05160 c00000008e777870 7fffffffffffffff c000000000b0f480
GPR24: c0000000009da400 0000000000000002 0000000000000000 0000000000000001
GPR28: c00000008e776fb8 0000000000000001 c000000000a75bf0 c00000008e776d60
NIP [c000000000694f14] .add_preempt_count+0xc0/0xe0
LR [c000000000694ef8] .add_preempt_count+0xa4/0xe0
Call Trace:
[c00000008e776d60] [c00000008e776e00] 0xc00000008e776e00 (unreliable)
[c00000008e776df0] [c00000000005b0d8] .probe_hcall_entry+0x38/0x94
[c00000008e776e80] [c00000000004ef70] .__trace_hcall_entry+0x80/0xd4
[c00000008e776f10] [c00000000004f96c] .plpar_hcall_norets+0x50/0xd0
[c00000008e776f80] [c000000000055044] .smp_xics_message_pass+0x110/0x244
[c00000008e777030] [c000000000034824] .smp_send_reschedule+0x5c/0x78
[c00000008e7770c0] [c00000000006a6bc] .resched_task+0xb4/0xd8
[c00000008e777150] [c00000000006a840] .check_preempt_curr_idle+0x2c/0x44
[c00000008e7771e0] [c00000000007a868] .try_to_wake_up+0x460/0x548
[c00000008e7772b0] [c00000000007a98c] .default_wake_function+0x3c/0x54
[c00000008e777350] [c0000000000ab3b0] .autoremove_wake_function+0x44/0x84
[c00000008e777400] [c00000000006449c] .__wake_up_common+0x7c/0xf4
[c00000008e7774c0] [c00000000006a5d8] .__wake_up+0x60/0x90
[c00000008e777570] [c0000000000810dc] .printk_tick+0x84/0xa8
[c00000008e777600] [c000000000095c90] .update_process_times+0x64/0xa4
[c00000008e7776a0] [c0000000000bdcec] .tick_sched_timer+0xd0/0x120
[c00000008e777750] [c0000000000afe7c] .__run_hrtimer+0x1a0/0x29c
[c00000008e777800] [c0000000000b02a4] .hrtimer_interrupt+0x124/0x278
[c00000008e777900] [c00000000002ea90] .timer_interrupt+0x1dc/0x2e4
[c00000008e7779a0] [c000000000003700] decrementer_common+0x100/0x180
--- Exception: 901 at .raw_local_irq_restore+0x48/0x54
    LR = .cpu_idle+0x12c/0x208
[c00000008e777c90] [c000000000b21130] ppc_md+0x0/0x240 (unreliable)
[c00000008e777cd0] [c000000000016a60] .cpu_idle+0x120/0x208
[c00000008e777d70] [c00000000069dbec] .start_secondary+0x394/0x3d4
[c00000008e777e30] [c000000000008278] .start_secondary_resume+0x10/0x14
Instruction dump:
409e0034 78630620 2ba300f4 40fd0028 4bd0b711 60000000 2fa30000 419e0018
e93e8880 80090000 2f800000 409e0008 <0fe00000> 383f0090 e8010010 7c0803a6

BUG: scheduling while atomic: swapper/0/0x000000f0
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000008e7779a0] [c000000000014280] .show_stack+0xd8/0x218 (unreliable)
[c00000008e777a80] [c0000000006980b0] .dump_stack+0x28/0x3c
[c00000008e777b00] [c000000000071954] .__schedule_bug+0x94/0xb8
[c00000008e777b90] [c00000000068db40] .schedule+0xf8/0xbf4
[c00000008e777cd0] [c000000000016b34] .cpu_idle+0x1f4/0x208
[c00000008e777d70] [c00000000069dbec] .start_secondary+0x394/0x3d4
[c00000008e777e30] [c000000000008278] .start_secondary_resume+0x10/0x14

I'll spend tomorrow collecting traces and trying to see who's groping who this time...

--
Darren



> 
> -- Steve
> 
> 
>>  
>>  	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
>>  		goto out;
>>  
>> +	/* Check to see if the CPU out of FW already for kexec */
>> +	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
>> +		cpumask_set_cpu(lcpu, of_spin_mask);
>> +		return 1;
>> +	}
>> +
>>  	/* 
>>  	 * If the RTAS start-cpu token does not exist then presume the
>>  	 * cpu is already spinning.
>>
> 
>
Michael Neuling - Sept. 2, 2010, 11:04 p.m.
In message <1283400367.2356.69.camel@gandalf.stny.rr.com> you wrote:
> On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
> 
> > We need to call smp_startup_cpu on boot when we the cpus are still in
> > FW.  smp_startup_cpu does this for us on boot.
> > 
> > I'm wondering if we just need to move the test down a bit to make sure
> > the preempt_count is set.  I've not been following this thread, but
> > maybe this might work?
> 
> Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
> even said that adding the exit prevented the bug (although now he's
> hitting a hard lockup someplace else). The original code he was using
> did not have the condition to return for kexec. It was just a
> coincidence that this code helped in bringing a CPU back online.
> 
> > 
> > Untested patch below...
> > 
> > Mikey
> > 
> > diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
> > index 0317cce..3afaba4 100644
> > --- a/arch/powerpc/platforms/pseries/smp.c
> > +++ b/arch/powerpc/platforms/pseries/smp.c
> > @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
> >  
> >  	pcpu = get_hard_smp_processor_id(lcpu);
> >  
> > -	/* Check to see if the CPU out of FW already for kexec */
> > -	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
> > -		cpumask_set_cpu(lcpu, of_spin_mask);
> > -		return 1;
> > -	}
> > -
> >  	/* Fixup atomic count: it exited inside IRQ handler. */
> >  	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
> 
> We DON'T want to do the above. It's nasty! This is one CPU's task
> touching an intimate part of another CPU's task. It's equivalent of me
> putting my hand down you wife's blouse. It's offensive, and rude.
> 
> OK, if the CPU was never online, then you can do what you want. But what
> we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
> into this cede_processor() call. When you wake it up, suddenly the task
> on that woken CPU has its preempt count fscked up.  This was really
> really hard to debug. We thought it was stack corruption or something.
> But it ended up being that this code has one CPU touching the breasts of
> another CPU. This code is a pervert!
> 
> What the trace clearly showed, was that we take down a CPU, and in doing
> so, the code on that CPU set the preempt count to 1, and it expected to
> have it as 1 when it returned. But the code that kicked started this CPU
> back to life (bring the CPU back online), set the preempt count on the
> task of that CPU to 0, and screwed everything up.

/me goes to check where this came from...

It's been in the kernel since hotplug CPU support was added to ppc64
back in 2004, so I guess we are all at fault for letting this pervert
get away with this stuff for so long in plain sight. :-)

So I guess we should remove this but we need to audit all the different
paths that go through here to make sure they are OK with preempt.
Normal boot, kexec boot, hotplug with FW stop and hotplug with
extended_cede all hit this.

Mikey
Darren Hart - Sept. 3, 2010, 12:08 a.m.
On 09/02/2010 04:04 PM, Michael Neuling wrote:
> In message <1283400367.2356.69.camel@gandalf.stny.rr.com> you wrote:
>> On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
>>
>>> We need to call smp_startup_cpu on boot when we the cpus are still in
>>> FW.  smp_startup_cpu does this for us on boot.
>>>
>>> I'm wondering if we just need to move the test down a bit to make sure
>>> the preempt_count is set.  I've not been following this thread, but
>>> maybe this might work?
>>
>> Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
>> even said that adding the exit prevented the bug (although now he's
>> hitting a hard lockup someplace else). The original code he was using
>> did not have the condition to return for kexec. It was just a
>> coincidence that this code helped in bringing a CPU back online.
>>
>>>
>>> Untested patch below...
>>>
>>> Mikey
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
>>> index 0317cce..3afaba4 100644
>>> --- a/arch/powerpc/platforms/pseries/smp.c
>>> +++ b/arch/powerpc/platforms/pseries/smp.c
>>> @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
>>>  
>>>  	pcpu = get_hard_smp_processor_id(lcpu);
>>>  
>>> -	/* Check to see if the CPU out of FW already for kexec */
>>> -	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
>>> -		cpumask_set_cpu(lcpu, of_spin_mask);
>>> -		return 1;
>>> -	}
>>> -
>>>  	/* Fixup atomic count: it exited inside IRQ handler. */
>>>  	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
>>
>> We DON'T want to do the above. It's nasty! This is one CPU's task
>> touching an intimate part of another CPU's task. It's equivalent of me
>> putting my hand down you wife's blouse. It's offensive, and rude.
>>
>> OK, if the CPU was never online, then you can do what you want. But what
>> we see is that this fails on CPU hotplug.  You stop a CPU, and it goes
>> into this cede_processor() call. When you wake it up, suddenly the task
>> on that woken CPU has its preempt count fscked up.  This was really
>> really hard to debug. We thought it was stack corruption or something.
>> But it ended up being that this code has one CPU touching the breasts of
>> another CPU. This code is a pervert!
>>
>> What the trace clearly showed, was that we take down a CPU, and in doing
>> so, the code on that CPU set the preempt count to 1, and it expected to
>> have it as 1 when it returned. But the code that kicked started this CPU
>> back to life (bring the CPU back online), set the preempt count on the
>> task of that CPU to 0, and screwed everything up.
> 
> /me goes to checks where this came from...
> 
> It's been in the kernel since hotplug CPU support was added to ppc64
> back in 2004, so I guess we are all at fault for letting this pervert
> get away with this stuff for so long in plain sight. :-)
> 
> So I guess we should remove this but we need to audit all the different
> paths that go through here to make sure they are OK with preempt.
> Normal boot, kexec boot, hotplug with FW stop and hotplug with
> extended_cede all hit this.
> 
> Mikey

CC'ing my alter ego.
Will Schmidt - Sept. 3, 2010, 8:10 p.m.
dvhltc@linux.vnet.ibm.com wrote on 09/02/2010 01:04:28 AM:

> Subject
>
> Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
> CONFIG_PREEMPT on pseries

> With this in place, we no longer see the preempt_count dropping below
> zero. However, if I offline/online a CPU about 246 times I hit the
> opposite problem, a preempt_count() overflow. There appears to be a
> missing preempt_enable() somewhere in the offline/online paths.

This (preempt_count overflow) also occurred in mainline (with
CONFIG_PREEMPT=y) in 2.6.35, but not in 2.6.36-rc3.   A
bisect seems to indicate it was fixed with
a7c2bb8279d20d853e43c34584eaf2b039de8026   "powerpc: Re-enable preemption
before cpu_die()".

Which may look familiar.  :-)

It looks like this patch went to mainline (likely via the powerpc tree),
but may not have gotten back into the -rt branch.

-Will

>
> --
> Darren Hart
> IBM Linux Technology Center
> Real-Time Linux Team

Patch

diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 0317cce..3afaba4 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -104,18 +104,18 @@  static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 
 	pcpu = get_hard_smp_processor_id(lcpu);
 
-	/* Check to see if the CPU out of FW already for kexec */
-	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
-		cpumask_set_cpu(lcpu, of_spin_mask);
-		return 1;
-	}
-
 	/* Fixup atomic count: it exited inside IRQ handler. */
 	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
 
 	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
 		goto out;
 
+	/* Check to see if the CPU out of FW already for kexec */
+	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
+		cpumask_set_cpu(lcpu, of_spin_mask);
+		return 1;
+	}
+
 	/* 
 	 * If the RTAS start-cpu token does not exist then presume the
 	 * cpu is already spinning.