Patchwork [RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries

Submitter Darren Hart
Date July 22, 2010, 6:24 p.m.
Message ID <4C488CCD.60004@us.ibm.com>
Permalink /patch/59622/
State Rejected

Comments

Darren Hart - July 22, 2010, 6:24 p.m.
While testing CPU offline/online, we hit various preempt_count related
bugs. Various hacks have been employed for several theoretical corner
cases. One situation however is perfectly repeatable on 2.6.33.6 with
CONFIG_PREEMPT=y.

BUG: scheduling while atomic: swapper/0/0x00000065
Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log
dm_mod ehea sg ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic
scsi_transport_srp scsi_tgt [last unloaded: scsi_wait_scan]
Call Trace:
[c00000010e9e39f0] [c0000000000144d4] .show_stack+0x74/0x1c0 (unreliable)
[c00000010e9e3aa0] [c00000000007a680] .__schedule_bug+0xa0/0xb0
[c00000010e9e3b30] [c00000000056dea4] .schedule+0x7a4/0xd60
[c00000010e9e3cd0] [c000000000016be8] .cpu_idle+0x1f8/0x220
[c00000010e9e3d80] [c00000000057d858] .start_secondary+0x388/0x3c0
[c00000010e9e3e30] [c000000000008278] .start_secondary_resume+0x10/0x14

With some instrumentation we were able to determine that the
preempt_count() appears to change across the extended_cede_processor()
call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
PREEMPT_RT we call this with preempt_count=1 and return with
preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
is different (0x65) but is still incorrect.

Also of interest is that this path
cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
enters with a preempt_count=1 if it wasn't corrupted across the hcall.
The early boot path from _start however appears to call
start_secondary() with a preempt_count of 0.

The following patch is most certainly not correct, but it does eliminate
the situation on mainline 100% of the time (there is still a 25%
reproduction rate on PREEMPT_RT). Can someone comment on:

1) How can the preempt_count() get mangled across the H_CEDE hcall?
2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

Hacked-up-by: Darren Hart <dvhltc@us.ibm.com>
Darren Hart - July 22, 2010, 6:36 p.m.
On 07/22/2010 11:24 AM, Darren Hart wrote:

> The following patch is most certainly not correct, but it does eliminate
> the situation on mainline 100% of the time (there is still a 25%
> reproduction rate on PREEMPT_RT). Can someone comment on:

Apologies. This particular issue is also 100% eliminated on PREEMPT_RT. 
We hit another issue possibly unrelated to this 25% of time. Please 
disregard the comment regarding 25% failure on PREEMPT_RT.
Thomas Gleixner - July 22, 2010, 6:38 p.m.
On Thu, 22 Jul 2010, Darren Hart wrote:
 
> Also of interest is that this path
> cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
> enters with a preempt_count=1 if it wasn't corrupted across the hcall.

That triggers the problem as well. preempt_count needs to be 0 when
entering start_secondary(). So I really wonder how that ever worked.

> The early boot path from _start however appears to call
> start_secondary() with a preempt_count of 0.

Which is correct.
 
> The following patch is most certainly not correct, but it does eliminate

It is correct, but I think it is incomplete as other portions of the
thread_info on the stack might be in some weird state as well.

> the situation on mainline 100% of the time (there is still a 25%
> reproduction rate on PREEMPT_RT).

But those are different issues, for which we have reasonable
explanations and patches/workarounds.

> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

No
 
Thanks,

	tglx
Benjamin Herrenschmidt - July 22, 2010, 10:25 p.m.
On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
> 
> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

The preempt count is on the thread info at the bottom of the stack.

Can you check the stack pointers ?

Cheers,
Ben.
Darren Hart - July 22, 2010, 11:57 p.m.
On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
>>
>> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
>> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> 
> The preempt count is on the thread info at the bottom of the stack.
> 
> Can you check the stack pointers ?

Hi Ben, thanks for looking.

I instrumented the area around extended_cede_processor() as follows
(please confirm I'm getting the stack pointer correctly).

while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
	asm("mr %0,1" : "=r" (sp));
	printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());  
	extended_cede_processor(cede_latency_hint);
	asm("mr %0,1" : "=r" (sp));
	printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
 }


On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c00000010e9e3ce0, pcnt: 1

This surprised me as preempt_count is 1 before and after, so no
corruption appears to occur on mainline. This makes the pcnt of 65 I see
without the preempt_count()=0 hack very strange. I ran several hundred
off/on cycles. The issue of preempt_count being 1 is still addressed by
this patch however.

On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c000000089bcfcf0, pcnt: 1
Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c000000089bcfcf0, pcnt: ffffffff

In both cases the stack pointer appears unchanged.

Note: there is a BUG triggered in between these statements as the
preempt_count causes the printk to trigger:
Badness at kernel/sched.c:5572

Thanks,
Darren Hart - July 23, 2010, 4:44 a.m.
On 07/22/2010 04:57 PM, Darren Hart wrote:
> On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
>> On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
>>>
>>> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
>>> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
>>
>> The preempt count is on the thread info at the bottom of the stack.
>>
>> Can you check the stack pointers ?
> 
> Hi Ben, thanks for looking.
> 
> I instrumented the area around extended_cede_processor() as follows
> (please confirm I'm getting the stack pointer correctly).
> 
> while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
> 	asm("mr %0,1" : "=r" (sp));
> 	printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> 	extended_cede_processor(cede_latency_hint);
> 	asm("mr %0,1" : "=r" (sp));
> 	printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
>   }
> 
> 
> On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
> Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c00000010e9e3ce0, pcnt: 1
> 
> This surprised me as preempt_count is 1 before and after, so no
> corruption appears to occur on mainline. This makes the pcnt of 65 I see
> without the preempt_count()=0 hack very strange. I ran several hundred
> off/on cycles. The issue of preempt_count being 1 is still addressed by
> this patch however.
> 
> On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
> Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c000000089bcfcf0, pcnt: 1
> Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c000000089bcfcf0, pcnt: ffffffff
> 
> In both cases the stack pointer appears unchanged.
> 
> Note: there is a BUG triggered in between these statements as the
> preempt_count causes the printk to trigger:
> Badness at kernel/sched.c:5572

At Steven's suggestion I updated the instrumentation to display the
local_save_flags and irqs_disabled_flags:

while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
        local_save_flags(flags);
        printk("local flags: %lx, irqs disabled: %d\n", flags, irqs_disabled_flags(flags));
        asm("mr %0,1" : "=r" (sp));
        printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
        extended_cede_processor(cede_latency_hint);
        asm("mr %0,1" : "=r" (sp));
        printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
}


Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1

I'm not sure if I'm reading that right, but I believe interrupts are
intended to be disabled here. If accomplished via the
spin_lock_irqsave() this would behave differently on RT. However, this
path disables the interrupts handled by xics, all but the IPIs anyway.
On RT I disabled the decrementer as well.

Is it possible for RT to be receiving other interrupts here?
Vaidyanathan Srinivasan - July 23, 2010, 5:08 a.m.
* Darren Hart <dvhltc@us.ibm.com> [2010-07-22 21:44:04]:

> On 07/22/2010 04:57 PM, Darren Hart wrote:
> > On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
> >> On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
> >>>
> >>> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> >>> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> >>
> >> The preempt count is on the thread info at the bottom of the stack.
> >>
> >> Can you check the stack pointers ?
> > 
> > Hi Ben, thanks for looking.
> > 
> > I instrumented the area around extended_cede_processor() as follows
> > (please confirm I'm getting the stack pointer correctly).
> > 
> > while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
> > 	asm("mr %0,1" : "=r" (sp));
> > 	printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> > 	extended_cede_processor(cede_latency_hint);
> > 	asm("mr %0,1" : "=r" (sp));
> > 	printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> >   }
> > 
> > 
> > On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
> > Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> > Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c00000010e9e3ce0, pcnt: 1
> > 
> > This surprised me as preempt_count is 1 before and after, so no
> > corruption appears to occur on mainline. This makes the pcnt of 65 I see
> > without the preempt_count()=0 hack very strange. I ran several hundred
> > off/on cycles. The issue of preempt_count being 1 is still addressed by
> > this patch however.
> > 
> > On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
> > Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c000000089bcfcf0, pcnt: 1
> > Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c000000089bcfcf0, pcnt: ffffffff
> > 
> > In both cases the stack pointer appears unchanged.
> > 
> > Note: there is a BUG triggered in between these statements as the
> > preempt_count causes the printk to trigger:
> > Badness at kernel/sched.c:5572
> 
> At Steven's suggestion I updated the instrumentation to display the
> local_save_flags and irqs_disabled_flags:
> 
> while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
>         local_save_flags(flags);
>         printk("local flags: %lx, irqs disabled: %d\n", flags, irqs_disabled_flags(flags));
>         asm("mr %0,1" : "=r" (sp));
>         printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
>         extended_cede_processor(cede_latency_hint);
>         asm("mr %0,1" : "=r" (sp));
>         printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> }
> 
> 
> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> 
> I'm not sure if I'm reading that right, but I believe interrupts are
> intended to be disabled here. If accomplished via the
> spin_lock_irqsave() this would behave differently on RT. However, this
> path disables the interrupts handled by xics, all but the IPIs anyway.
> On RT I disabled the decrementer as well.
> 
> Is it possible for RT to be receiving other interrupts here?

Yes.  extended_cede_processor() will return with interrupts enabled in
the cpu. (This is done by the hypervisor).  Under normal cases we
cannot be interrupted because no IO interrupts are routed to us after
xics_teardown_cpu() and since the CPU is out of the map, nobody will
send us IPIs.

Though H_CEDE will return with interrupts enabled, it is unlikely that
an interrupt can be delivered in this context.

--Vaidy
Benjamin Herrenschmidt - July 23, 2010, 5:09 a.m.
On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:

>  suggestion I updated the instrumentation to display the
> local_save_flags and irqs_disabled_flags:

> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> 
> I'm not sure if I'm reading that right, but I believe interrupts are
> intended to be disabled here. If accomplished via the
> spin_lock_irqsave() this would behave differently on RT. However, this
> path disables the interrupts handled by xics, all but the IPIs anyway.
> On RT I disabled the decrementer as well.
> 
> Is it possible for RT to be receiving other interrupts here?

Also you may want to call hard_irq_disable() to -really- disable
interrupts ... since we do lazy-disable on powerpc

Cheers,
Ben.
Benjamin Herrenschmidt - July 23, 2010, 5:11 a.m.
On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
> Yes.  extended_cede_processor() will return with interrupts enabled in
> the cpu. (This is done by the hypervisor).  Under normal cases we
> cannot be interrupted because no IO interrupts are routed to us after
> xics_teardown_cpu() and since the CPU is out of the map, nobody will
> send us IPIs.

What about decrementer ?

> Though H_CEDE will return with interrupts enabled, it is unlikely that
> an interrupt can be delivered in this context. 

Well, if interrupts are soft-disabled, even if one occurs, we will just
mask and return, so that at least should be ok.

Cheers,
Ben.
Vaidyanathan Srinivasan - July 23, 2010, 7:07 a.m.
* Benjamin Herrenschmidt <benh@kernel.crashing.org> [2010-07-23 15:11:00]:

> On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
> > Yes.  extended_cede_processor() will return with interrupts enabled in
> > the cpu. (This is done by the hypervisor).  Under normal cases we
> > cannot be interrupted because no IO interrupts are routed to us after
> > xics_teardown_cpu() and since the CPU is out of the map, nobody will
> > send us IPIs.
> 
> What about decrementer ?

Decrementer expiry event handling is a bit complex.  The event as such
may not bring back the extended_cede_processor() cpu, but may be
marked pending when we get out of this state eventually.  I will find
more information on this event and update.

> > Though H_CEDE will return with interrupts enabled, it is unlikely that
> > an interrupt can be delivered in this context. 
> 
> Well, if interrupts are soft-disabled, even if one occurs, we will just
> mask and return, so that at least should be ok.

Yes.  We will immediately return to the extended_cede_processor() in
the while loop until the preferred_offline_state is changed.

--Vaidy
Will Schmidt - July 23, 2010, 2:39 p.m.
dvhltc@linux.vnet.ibm.com wrote on 07/22/2010 06:57:18 PM:
> Subject
>
> Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
> CONFIG_PREEMPT on pseries
>
> On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
> >>
> >> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> >> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> >
> > The preempt count is on the thread info at the bottom of the stack.
> >
> > Can you check the stack pointers ?
>
> Hi Ben, thanks for looking.
>
> I instrumented the area around extended_cede_processor() as follows
> (please confirm I'm getting the stack pointer correctly).
>
> while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
>    asm("mr %0,1" : "=r" (sp));
>    printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
>    extended_cede_processor(cede_latency_hint);
>    asm("mr %0,1" : "=r" (sp));
>    printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
>  }
>
>
> On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
> Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack:  c00000010e9e3ce0, pcnt: 1
>
> This surprised me as preempt_count is 1 before and after, so no
> corruption appears to occur on mainline. This makes the pcnt of 65 I see
> without the preempt_count()=0 hack very strange. I ran several hundred
> off/on cycles. The issue of preempt_count being 1 is still addressed by
> this patch however.
>
> On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
> Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c000000089bcfcf0, pcnt: 1
> Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack:  c000000089bcfcf0, pcnt: ffffffff

I'm not seeing the preempt_count value corrupted with my current set of
debug instrumentation; however, I have added buffers to the thread_info
struct, so I wonder if I've moved the preempt_count variable out of the
way of the corruption.  (Still investigating that point.)

<Why? Because I had been trying to set a DABR on the preempt_count
value to catch the corrupter, and due to hits on the nearby _flags
fields I was getting false positives.>


struct thread_info {
	struct task_struct *task;		/* main task structure */
	struct exec_domain *exec_domain;	/* execution domain */
	int		cpu;			/* cpu we're on */
	int		pad_buffer[64];
	int		preempt_count;		/* 0 => preemptable,
						   <0 => BUG */
	int		pad_buffer2[256];
	struct restart_block restart_block;
	unsigned long	local_flags;		/* private flags for thread */

	/* low level flags - has atomic operations done on it */
	unsigned long	flags ____cacheline_aligned_in_smp;
};


>
> In both cases the stack pointer appears unchanged.
>
> Note: there is a BUG triggered in between these statements as the
> preempt_count causes the printk to trigger:
> Badness at kernel/sched.c:5572
>
> Thanks,
>
> --
> Darren Hart
> IBM Linux Technology Center
> Real-Time Linux Team
Darren Hart - Aug. 4, 2010, 1:44 p.m.
On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
>>
>> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
>> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
>
> The preempt count is on the thread info at the bottom of the stack.
>
> Can you check the stack pointers ?

Hi Ben, sorry if I didn't get back to you on this already. I checked the 
stack pointer before and after the cede call and they match.
Darren Hart - Aug. 5, 2010, 4:45 a.m.
On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
> * Benjamin Herrenschmidt<benh@kernel.crashing.org>  [2010-07-23 15:11:00]:
>
>> On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
>>> Yes.  extended_cede_processor() will return with interrupts enabled in
>>> the cpu. (This is done by the hypervisor).  Under normal cases we
>>> cannot be interrupted because no IO interrupts are routed to us after
>>> xics_teardown_cpu() and since the CPU is out of the map, nobody will
>>> send us IPIs.
>>
>> What about decrementer ?
>
> Decrementer expiry event handling is bit complex.  The event as such
> may not bring back the extended_cede_processor() cpu, but may be
> marked pending when we get out of this state eventually.  I will find
> more information on this event and update.

Hi Vaidy, have you been able to dig anything up about the decrementer 
expiry?

Thanks,
Vaidyanathan Srinivasan - Aug. 5, 2010, 11:06 a.m.
* Darren Hart <dvhltc@us.ibm.com> [2010-08-04 21:45:51]:

> On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
> >* Benjamin Herrenschmidt<benh@kernel.crashing.org>  [2010-07-23 15:11:00]:
> >
> >>On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
> >>>Yes.  extended_cede_processor() will return with interrupts enabled in
> >>>the cpu. (This is done by the hypervisor).  Under normal cases we
> >>>cannot be interrupted because no IO interrupts are routed to us after
> >>>xics_teardown_cpu() and since the CPU is out of the map, nobody will
> >>>send us IPIs.
> >>
> >>What about decrementer ?
> >
> >Decrementer expiry event handling is bit complex.  The event as such
> >may not bring back the extended_cede_processor() cpu, but may be
> >marked pending when we get out of this state eventually.  I will find
> >more information on this event and update.
> 
> Hi Vaidy, have you been able to dig anything up about the
> decrementer expiry?

Hi Darren,

Yes, it is possible that the cpu in extended_cede_processor() can be
woken up by the decrementer.  But the expectation is that we will
return back to this context since we will not have any pending timers.

Our stack should not change even if we get the decrementer interrupts.

--Vaidy
Thomas Gleixner - Aug. 5, 2010, 12:26 p.m.
On Thu, 5 Aug 2010, Vaidyanathan Srinivasan wrote:

> * Darren Hart <dvhltc@us.ibm.com> [2010-08-04 21:45:51]:
> 
> > On 07/23/2010 12:07 AM, Vaidyanathan Srinivasan wrote:
> > >* Benjamin Herrenschmidt<benh@kernel.crashing.org>  [2010-07-23 15:11:00]:
> > >
> > >>On Fri, 2010-07-23 at 10:38 +0530, Vaidyanathan Srinivasan wrote:
> > >>>Yes.  extended_cede_processor() will return with interrupts enabled in
> > >>>the cpu. (This is done by the hypervisor).  Under normal cases we
> > >>>cannot be interrupted because no IO interrupts are routed to us after
> > >>>xics_teardown_cpu() and since the CPU is out of the map, nobody will
> > >>>send us IPIs.
> > >>
> > >>What about decrementer ?
> > >
> > >Decrementer expiry event handling is bit complex.  The event as such
> > >may not bring back the extended_cede_processor() cpu, but may be
> > >marked pending when we get out of this state eventually.  I will find
> > >more information on this event and update.
> > 
> > Hi Vaidy, have you been able to dig anything up about the
> > decrementer expiry?
> 
> Hi Darren,
> 
> Yes, it is possible that the cpu in extended_cede_processor() can be
> woken up by the decrementer.  But the expectation is that we will
> return back to this context since we will not have any pending timers.

The problem is that you might get woken _before_ the timers have been
migrated away. 
 
> Our stack should not change even if we get the decrementer interrupts.

But you should not execute kernel code when you got offlined either.

Thanks,

	tglx
Darren Hart - Aug. 6, 2010, 2:19 a.m.
On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
> 
>>   suggestion I updated the instrumentation to display the
>> local_save_flags and irqs_disabled_flags:
> 
>> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
>> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>
>> I'm not sure if I'm reading that right, but I believe interrupts are
>> intended to be disabled here. If accomplished via the
>> spin_lock_irqsave() this would behave differently on RT. However, this
>> path disables the interrupts handled by xics, all but the IPIs anyway.
>> On RT I disabled the decrementer as well.
>>
>> Is it possible for RT to be receiving other interrupts here?
> 
> Also you may want to call hard_irq_disable() to -really- disable
> interrupts ... since we do lazy-disable on powerpc
> 

A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer.

I've added patches one at a time, eliminating bugs with preempt_rt
(tip/rt/head). With my current patchset I have no crashes with a single run of
random_online.sh (stress testing to hit the hang Will sees is my todo for
tomorrow).

Current patchset includes:
patches/0001-wms-fix01.patch # P7 lazy flushing thing

# next four are sent to / queued for mainline
patches/powerpc-increase-pseries_cpu_die-delay.patch
patches/powerpc-enable-preemption-before-cpu_die.patch
patches/powerpc-silence-__cpu_up-under-normal-operation.patch
patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch

# this one needs to be cleaned up and sent to mainline
patches/powerpc-wait-for-cpu-to-go-inactive.patch

patches/powerpc-disable-decrementer-on-offline.patch
patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
patches/powerpc-cede-processor-inst.patch # instrumentation
patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
patches/powerpc-pad-thread_info.patch

I didn't include all the patches as the relevant bits are included below in
code form.

With the instrumentation, it's clear the change to preempt_count() is targeted
(since I added padding before and after preempt_count in the thread_info
struct) and preempt_count is still changed. It is also clear that it isn't just
a stray decrement since I set preempt_count() to 4 prior to calling cede and it
still is 0xffffffff after the hcall. Also note that the stack pointer doesn't
change across the cede call and neither does any other part of the thread_info
structure.

Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
serialized list of before/after thread_info prints around cede, I see
several befores then several afters. But the preempt_count is still modified.

The relevant code looks like this (from pseries_mach_cpu_die())

                        hard_irq_disable(); /* this doesn't fix things... but
			                       does change behavior a bit */
                        preempt_count() = 0x4; 
                        asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
                        printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count()); 
                        print_thread_info(); 
                        extended_cede_processor(cede_latency_hint); 
                        asm("mr %0,1" : "=r" (sp)); 
                        printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count()); 
                        print_thread_info(); 
                        preempt_count() = 0; 


With these patches applied, the output across cede looks like:

before cede: sp=c000000000b57150 pcnt=4
*** current->thread_info ***
ti->task: c000000000aa1410
ti->exec_domain: c000000000a59958
ti->cpu: 0
ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
ti->preempt_count: 4
ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
ti->local_flags: 0
*** end thread_info ***
after cede: sp=c000000000b57150 pcnt=ffffffff
*** current->thread_info ***
ti->task: c000000000aa1410
ti->exec_domain: c000000000a59958
ti->cpu: 0
ti->stomp_on_me: 57005
ti->preempt_count: ffffffff
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***

Are there any additional thoughts on what might be causing preempt_count to change?
I was thinking that cede might somehow force it to 0 (or perhaps one of the
preempt_count explicit value setters in irq.c) and then decrement it - but that dec
should trigger an error in the dec_preempt_count() as I have CONFIG_DEBUG_PREEMPT=y.

...
Vaidyanathan Srinivasan - Aug. 6, 2010, 5:09 a.m.
* Darren Hart <dvhltc@us.ibm.com> [2010-08-05 19:19:00]:

> On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
> > 
> >>   suggestion I updated the instrumentation to display the
> >> local_save_flags and irqs_disabled_flags:
> > 
> >> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
> >> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> >> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
> >>
> >> I'm not sure if I'm reading that right, but I believe interrupts are
> >> intended to be disabled here. If accomplished via the
> >> spin_lock_irqsave() this would behave differently on RT. However, this
> >> path disables the interrupts handled by xics, all but the IPIs anyway.
> >> On RT I disabled the decrementer as well.
> >>
> >> Is it possible for RT to be receiving other interrupts here?
> > 
> > Also you may want to call hard_irq_disable() to -really- disable
> > interrupts ... since we do lazy-disable on powerpc
> > 
> 
> A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer.
> 
> I've added patches one at a time, eliminating bugs with preempt_rt
> (tip/rt/head). With my current patchset I have no crashes with a single run of
> random_online.sh (stress testing to hit the hang will sees is my todo for
> tomorrow).
> 
> Current patchset includes:
> patches/0001-wms-fix01.patch # P7 lazy flushing thing
> 
> # next four are sent to / queued for mainline
> patches/powerpc-increase-pseries_cpu_die-delay.patch
> patches/powerpc-enable-preemption-before-cpu_die.patch
> patches/powerpc-silence-__cpu_up-under-normal-operation.patch
> patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
> 
> # this one needs to be cleaned up and sent to mainline
> patches/powerpc-wait-for-cpu-to-go-inactive.patch
> 
> patches/powerpc-disable-decrementer-on-offline.patch
> patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
> patches/powerpc-cede-processor-inst.patch # instrumentation
> patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
> patches/powerpc-pad-thread_info.patch
> 
> I didn't include all the patches as the relevant bits are included below in
> code form.
> 
> With the instrumentation, it's clear the change to preempt_count() is targeted
> (since I added padding before and after preempt_count in the thread_info
> struct) and preempt_count is still changed. It is also clear that it isn't just
> a stray decrement since I set preempt_count() to 4 prior to calling cede and it
> still is 0xffffffff after the hcall. Also note that the stack pointer doesn't
> change across the cede call and neither does any other part of the thread_info
> structure.
> 
> Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
> serialized list of before/after thread_info prints around cede, I see
> several befores then several afters. But the preempt_count is still modified.
> 
> The relevant code looks like this (from pseries_mach_cpu_die())
> 
>                         hard_irq_disable(); /* this doesn't fix things... but
> 			                       does change behavior a bit */
>                         preempt_count() = 0x4; 
>                         asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
>                         printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count()); 
>                         print_thread_info(); 
>                         extended_cede_processor(cede_latency_hint); 
>                         asm("mr %0,1" : "=r" (sp)); 
>                         printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count()); 
>                         print_thread_info(); 
>                         preempt_count() = 0; 
> 
> 
> With these patches applied, the output across cede looks like:
> 
> before cede: sp=c000000000b57150 pcnt=4
> *** current->thread_info ***
> ti->task: c000000000aa1410
> ti->exec_domain: c000000000a59958
> ti->cpu: 0
> ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
> ti->preempt_count: 4
> ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
> ti->local_flags: 0
> *** end thread_info ***
> after cede: sp=c000000000b57150 pcnt=ffffffff
> *** current->thread_info ***
> ti->task: c000000000aa1410
> ti->exec_domain: c000000000a59958
> ti->cpu: 0

Is this print sufficient to prove that we did not jump CPU?

We agree that decrementer can possibly expire and we could have gone
to the handler and come back just like we would do in an idle loop.
This will not happen always, but I could see a window of time during
which this may happen.  But that should not change the preempt_count
unconditionally to -1 with the same stack pointer.

> ti->stomp_on_me: 57005
> ti->preempt_count: ffffffff
> ti->stomp_on_me_too: 48879
> ti->local_flags: 0

Is there a method to identify whether we had executed any of the
interrupt handler while in this code?

> *** end thread_info ***
> 
> Are there any additional thoughts on what might be causing preempt_count to change?
> I was thinking that cede might somehow force it to 0 (or perhaps one of the
> preempt_count explicit value setters in irq.c) and then decrement it - but that dec
> should trigger an error in the dec_preempt_count() as I have CONFIG_DEBUG_PREEMPT=y.

The decrementer is only a suspect; we have not yet proved that the dec
handler is being executed.  Could somebody be using our stack?  The
pointer is the same... but do we still own it?  I cannot think of how
somebody else could be using this cpu's stack... but it's just a thought
to explain this anomaly.

--Vaidy
Darren Hart - Aug. 6, 2010, 7:13 a.m.
On 08/05/2010 10:09 PM, Vaidyanathan Srinivasan wrote:
> * Darren Hart<dvhltc@us.ibm.com>  [2010-08-05 19:19:00]:
> 
>> On 07/22/2010 10:09 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2010-07-22 at 21:44 -0700, Darren Hart wrote:
>>>
>>>>    suggestion I updated the instrumentation to display the
>>>> local_save_flags and irqs_disabled_flags:
>>>
>>>> Jul 22 23:36:58 igoort1 kernel: local flags: 0, irqs disabled: 1
>>>> Jul 22 23:36:58 igoort1 kernel: before H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>> Jul 22 23:36:58 igoort1 kernel: after H_CEDE current->stack: c00000010e9e3ce0, pcnt: 1
>>>>
>>>> I'm not sure if I'm reading that right, but I believe interrupts are
>>>> intended to be disabled here. If accomplished via the
>>>> spin_lock_irqsave() this would behave differently on RT. However, this
>>>> path disables the interrupts handled by xics, all but the IPIs anyway.
>>>> On RT I disabled the decrementer as well.
>>>>
>>>> Is it possible for RT to be receiving other interrupts here?
>>>
>>> Also you may want to call hard_irq_disable() to -really- disable
>>> interrupts ... since we do lazy-disable on powerpc
>>>
>>
>> A quick update and request for direction wrt the cede hcall, interrupt handling, and the stack pointer.
>>
>> I've added patches one at a time, eliminating bugs with preempt_rt
>> (tip/rt/head). With my current patchset I have no crashes with a single run of
>> random_online.sh (stress testing to hit the hang Will sees is my todo for
>> tomorrow).
>>
>> Current patchset includes:
>> patches/0001-wms-fix01.patch # P7 lazy flushing thing
>>
>> # next four are sent to / queued for mainline
>> patches/powerpc-increase-pseries_cpu_die-delay.patch
>> patches/powerpc-enable-preemption-before-cpu_die.patch
>> patches/powerpc-silence-__cpu_up-under-normal-operation.patch
>> patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
>>
>> # this one needs to be cleaned up and sent to mainline
>> patches/powerpc-wait-for-cpu-to-go-inactive.patch
>>
>> patches/powerpc-disable-decrementer-on-offline.patch
>> patches/powerpc-cpu_die-preempt-hack.patch # reset preempt_count to 0 after cede
>> patches/powerpc-cede-processor-inst.patch # instrumentation
>> patches/powerpc-hard_irq_disable.patch # hard_irq_disable before cede
>> patches/powerpc-pad-thread_info.patch
>>
>> I didn't include all the patches as the relevant bits are included below in
>> code form.
>>
>> With the instrumentation, it's clear the change to preempt_count() is targeted
>> (since I added padding before and after preempt_count in the thread_info
>> struct) and preempt_count is still changed. It is also clear that it isn't just
>> a stray decrement since I set preempt_count() to 4 prior to calling cede and it
>> still is 0xffffffff after the hcall. Also note that the stack pointer doesn't
>> change across the cede call and neither does any other part of the thread_info
>> structure.
>>
>> Adding hard_irq_disable() did seem to affect things somewhat. Rather than a
>> serialized list of before/after thread_info prints around cede, I see
>> several befores then several afters. But the preempt_count is still modified.
>>
>> The relevant code looks like this (from pseries_mach_cpu_die())
>>
>>                          hard_irq_disable(); /* this doesn't fix things... but
>> 			                       does change behavior a bit */
>>                          preempt_count() = 0x4;
>>                          asm("mr %0,1" : "=r" (sp));  /* stack pointer is in R1 */
>>                          printk("before cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          extended_cede_processor(cede_latency_hint);
>>                          asm("mr %0,1" : "=r" (sp));
>>                          printk("after cede: sp=%lx pcnt=%x\n", sp, preempt_count());
>>                          print_thread_info();
>>                          preempt_count() = 0;
>>
>>
>> With these patches applied, the output across cede looks like:
>>
>> before cede: sp=c000000000b57150 pcnt=4
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
>> ti->stomp_on_me: 57005 /* 0xDEAD - forgot to print in hex */
>> ti->preempt_count: 4
>> ti->stomp_on_me_too: 48879 /* 0xBEEF - forgot to print in hex */
>> ti->local_flags: 0
>> *** end thread_info ***
>> after cede: sp=c000000000b57150 pcnt=ffffffff
>> *** current->thread_info ***
>> ti->task: c000000000aa1410
>> ti->exec_domain: c000000000a59958
>> ti->cpu: 0
> 
> Is this print sufficient to prove that we did not jump CPU?

Probably not, but the following should be sufficient. Note that the
dump occurs on "CPU: 6" which matches the CPU noted in the ti->cpu
before and after the cede call.

*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: 4
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***
------------[ cut here ]------------
Badness at kernel/sched.c:3698
NIP: c0000000006987e4 LR: c0000000006987c8 CTR: c00000000005c668
REGS: c00000010e713800 TRAP: 0700   Not tainted  (2.6.33-rt-dvhrt05-02358-g61223dd-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 24000082  XER: 20000000
TASK = c00000008e7b2240[0] 'swapper' THREAD: c00000010e710000 CPU: 6
GPR00: 0000000000000000 c00000010e713a80 c000000000b54f88 0000000000000001 
GPR04: c00000010e713d08 ffffffffffffffff 00000000000000e0 800000000d049138 
GPR08: 0000000000000000 c000000000ca5420 0000000000000001 c000000000bc22e8 
GPR12: 0000000000000002 c000000000ba5080 0000000000000000 000000000f394b6c 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: c000000000ba50c0 0000000000000004 0000000000000002 0000000000000000 
GPR24: 0000000000000004 c00000010e713cd0 c0000000009f1ecc c00000000088aaff 
GPR28: 0000000000000000 0000000000000001 c000000000ad7610 c00000010e713a80 
NIP [c0000000006987e4] .add_preempt_count+0x6c/0xe0
LR [c0000000006987c8] .add_preempt_count+0x50/0xe0
Call Trace:
[c00000010e713a80] [c00000010e713b30] 0xc00000010e713b30 (unreliable)
[c00000010e713b10] [c0000000000824b4] .vprintk+0xac/0x480
[c00000010e713c40] [c00000000069bcbc] .printk+0x48/0x5c
[c00000010e713cd0] [c00000000005c190] .pseries_mach_cpu_die+0x1c4/0x39c
[c00000010e713db0] [c00000000001e7e0] .cpu_die+0x4c/0x68
[c00000010e713e30] [c000000000017de0] .cpu_idle+0x1f8/0x220
[c00000010e713ed0] [c0000000006a17c4] .start_secondary+0x394/0x3d4
[c00000010e713f90] [c000000000008264] .start_secondary_prolog+0x10/0x14
Instruction dump:
78290464 80090018 2f800000 40fc002c 4bd089c1 60000000 2fa30000 419e006c 
e93e87e0 80090000 2f800000 409e005c <0fe00000> 48000054 e97e8398 78290464 
after cede: sp=c00000010e713cd0 pcnt=ffffffff
*** current->thread_info ***
ti->task: c00000008e7b2240
ti->exec_domain: c000000000a59958
ti->cpu: 6
ti->stomp_on_me: 57005
ti->preempt_count: fffffffe
ti->stomp_on_me_too: 48879
ti->local_flags: 0
*** end thread_info ***


> We agree that decrementer can possibly expire and we could have gone
> to the handler and come back just like we would do in an idle loop.

I disable the decrementer in the stop_cpu path... that may not be
sufficient for the pseries_mach_cpu_die() path, I'll have to experiment.

However, even if it did, it shouldn't be just changing the value,
especially not to something illegal.

> This will not happen always, but I could see a window of time during
> which this may happen.  But that should not change the preempt_count
> unconditionally to -1 with the same stack pointer.
> 
>> ti->stomp_on_me: 57005
>> ti->preempt_count: ffffffff
>> ti->stomp_on_me_too: 48879
>> ti->local_flags: 0
> 
> Is there a method to identify whether we had executed any of the
> interrupt handler while in this code?

I'd like to know as well. If anyone knows, please share. Otherwise
I'll be trying to sort that out tomorrow.

> 
>> *** end thread_info ***
>>
>> Are there any additional thoughts on what might be causing preempt_count to change?
>> I was thinking that cede might somehow force it to 0 (or perhaps one of the
>> preempt_count explicit value setters in irq.c) and then decrement it - but that dec
>> should trigger an error in the dec_preempt_count() as I have CONFIG_DEBUG_PREEMPT=y.
> 
> The decrementer is only a suspect; we have not yet proved that the dec
> handler is being executed.  Could somebody be using our stack?  The
> pointer is the same... but do we still own it?  I cannot think of how
> somebody else could be using this cpu's stack... but it's just a thought
> to explain this anomaly.
> 
> --Vaidy


Thanks for the thoughts,
Darren Hart - Aug. 10, 2010, 10:36 p.m.
On 07/22/2010 11:38 AM, Thomas Gleixner wrote:
> On Thu, 22 Jul 2010, Darren Hart wrote:
>
>> Also of interest is that this path
>> cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
>> enters with a preempt_count=1 if it wasn't corrupted across the hcall.
>
> That triggers the problem as well. preempt_count needs to be 0 when
> entering start_secondary(). So I really wonder how that ever worked.
>
>> The early boot path from _start however appears to call
>> start_secondary() with a preempt_count of 0.
>
> Which is correct.
>
>> The following patch is most certainly not correct, but it does eliminate
>
> It is correct, but i think it is incomplete as other portions of the
> thread_info on the stack might be in some weird state as well.

Just FYI, I took a look at the stack pointers as well as all the fields 
in the thread_info struct. The only one that changes is preempt_count. 
The previous value of preempt_count doesn't impact the value after cede. 
An initial value of 0, 1, or 4 all result in an after-cede value of 
0xffffffff. I also added 32 bits of padding on either side of the 
preempt_count in case the change was accidental - it wasn't: the padded
values remained unchanged across the cede call while the preempt_count 
still changed to 0xffffffff.
Ankita Garg - Aug. 19, 2010, 3:58 p.m.
Hi Darren,

On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
> 
> With some instrumentation we were able to determine that the
> preempt_count() appears to change across the extended_cede_processor()
> call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
> PREEMPT_RT we call this with preempt_count=1 and return with
> preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
> is different (0x65) but is still incorrect.

I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
easily reproduce this on the RT kernel and not the non-RT kernel.
However, I hit it every single time I did a cpu online operation. I also
noticed that the issue persists even when I disable H_CEDE by passing
the "cede_offline=0" kernel command-line parameter. Could you please
confirm if you observe the same in your setup?

However, the issue still remains. I will spend a few cycles looking into
this issue.

> 
> Also of interest is that this path
> cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
> enters with a preempt_count=1 if it wasn't corrupted across the hcall.
> The early boot path from _start however appears to call
> start_secondary() with a preempt_count of 0.
> 
> The following patch is most certainly not correct, but it does eliminate
> the situation on mainline 100% of the time (there is still a 25%
> reproduction rate on PREEMPT_RT). Can someone comment on:
> 
> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> 
> Hacked-up-by: Darren Hart <dvhltc@us.ibm.com>
> 
> Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
> ===================================================================
> --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
>  			 * Kernel stack will be reset and start_secondary()
>  			 * will be called to continue the online operation.
>  			 */
> +			preempt_count() = 0;
>  			start_secondary_resume();
>  		}
>  	}
> 
>
Will Schmidt - Aug. 19, 2010, 6:58 p.m.
Ankita Garg <ankita@in.ibm.com> wrote on 08/19/2010 10:58:24 AM:
> Subject
> Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
> CONFIG_PREEMPT on pseries
>
> Hi Darren,
>
> On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
> >
> > With some instrumentation we were able to determine that the
> > preempt_count() appears to change across the extended_cede_processor()
> > call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
> > PREEMPT_RT we call this with preempt_count=1 and return with
> > preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
> > is different (0x65) but is still incorrect.
>
> I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
> easily reproduce this on the RT kernel and not the non-RT kernel.
> However, I hit it every single time I did a cpu online operation. I also
> noticed that the issue persists even when I disable H_CEDE by passing
> the "cede_offline=0" kernel command-line parameter. Could you please
> confirm if you observe the same in your setup?

If you see it every time, double-check that you have
http://patchwork.ozlabs.org/patch/60922/ or an equivalent in your tree.
(The patch moves the preempt_enable_no_resched() call up above the call
to cpu_die(). An earlier variation called preempt_enable() before calling
start_secondary_resume().)

Darren and I have been seeing different problems (on different
dedicated-processor LPARs on the same physical system). He's seeing
scrambled preempt_count values; I'm tending to see a system hang/death
[with processor backtraces showing .pseries_mach_cpu_die or
.pseries_dedicated_idle_sleep as expected, but no processes running].

I'll be curious which you end up seeing. With 2.6.33.7-rt27 as of a few
minutes ago, I'm still seeing a system hang/death.



>
> However, the issue still remains. I will spend a few cycles looking into
> this issue.
>
> >
> > Also of interest is that this path
> > cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
> > enters with a preempt_count=1 if it wasn't corrupted across the hcall.
> > The early boot path from _start however appears to call
> > start_secondary() with a preempt_count of 0.
> >
> > The following patch is most certainly not correct, but it does eliminate
> > the situation on mainline 100% of the time (there is still a 25%
> > reproduction rate on PREEMPT_RT). Can someone comment on:
> >
> > 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> > 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> >
> > Hacked-up-by: Darren Hart <dvhltc@us.ibm.com>
> >
> > Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > ===================================================================
> > --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
> >            * Kernel stack will be reset and start_secondary()
> >            * will be called to continue the online operation.
> >            */
> > +         preempt_count() = 0;
> >           start_secondary_resume();
> >        }
> >     }
> >
> >
>
> --
> Regards,
> Ankita Garg (ankita@in.ibm.com)
> Linux Technology Center
> IBM India Systems & Technology Labs,
> Bangalore, India
Darren Hart - Aug. 23, 2010, 10:20 p.m.
On 08/19/2010 08:58 AM, Ankita Garg wrote:
> Hi Darren,
>
> On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
>>
>> With some instrumentation we were able to determine that the
>> preempt_count() appears to change across the extended_cede_processor()
>> call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
>> PREEMPT_RT we call this with preempt_count=1 and return with
>> preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
>> is different (0x65) but is still incorrect.
>
> I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
> easily reproduce this on the RT kernel and not the non-RT kernel.
> However, I hit it every single time I did a cpu online operation. I also
> noticed that the issue persists even when I disable H_CEDE by passing
> the "cede_offline=0" kernel command-line parameter. Could you please
> confirm if you observe the same in your setup?

with the following patches:
[root@igoort1 linux-2.6-combined]# quilt applied
patches/0001-wms-fix01.patch
patches/powerpc-increase-pseries_cpu_die-delay.patch
patches/powerpc-enable-preemption-before-cpu_die.patch
patches/powerpc-silence-__cpu_up-under-normal-operation.patch
patches/powerpc-silence-xics_migrate_irqs_away-during-cpu-offline.patch
patches/powerpc-wait-for-cpu-to-go-inactive.patch
patches/powerpc-disable-decrementer-on-offline.patch
patches/powerpc-cpu_die-preempt-hack.patch
patches/powerpc-cede-processor-inst.patch
patches/irq-preempt-inst.patch
patches/disable-decrementer-in-cpu_die.patch
patches/powerpc-hard_irq_disable.patch
[root@igoort1 linux-2.6-combined]# quilt unapplied
patches/powerpc-debug-replace-cede-with-mdelay.patch
patches/powerpc-pad-thread_info.patch

applied to tip/rt/head (2.6.33-rt ish) I will see the following crash 
after 3 or 4 runs:

<3>Badness at kernel/sched.c:3720
<4>NIP: c0000000006986f8 LR: c0000000006986dc CTR: c00000000006ec34
<4>REGS: c00000008e7a7e50 TRAP: 0700   Not tainted 
(2.6.33-rt-dvhrt08-02358-g61223dd-dirty)
<4>MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 28000022  XER: 00000000
<4>TASK = c00000010e7080c0[0] 'swapper' THREAD: c00000008e7a8000 CPU: 3
<4>GPR00: 0000000000000000 c00000008e7a80d0 c000000000b54fa0 
0000000000000001
<4>GPR04: 0000000000000000 0000000000000032 0000000000000000 
000000000000000f
<4>GPR08: c00000008eb68d00 c000000000ca5420 0000000000000001 
c000000000bc22e8
<4>GPR12: 8000000000009032 c000000000ba4a80 c00000008e7a8a70 
0000000000000003
<4>GPR16: fffffffffffff9ba c00000010e7080c0 0000000000000000 
7fffffffffffffff
<4>GPR20: 0000000000000000 0000000000000001 0000000000000003 
c000000001042c80
<4>GPR24: 0000000000000000 c00000008eb686a0 0000000000000003 
0000000000000001
<4>GPR28: 0000000000000000 0000000000000001 c000000000ad7628 
c00000008e7a80d0
<4>NIP [c0000000006986f8] .sub_preempt_count+0x6c/0xdc
<4>LR [c0000000006986dc] .sub_preempt_count+0x50/0xdc
<4>Call Trace:
<4>Instruction dump:
<4>78290464 80090014 7f80e800 40fc002c 4bd08a99 60000000 2fa30000 419e0068
<4>e93e87e0 80090000 2f800000 409e0058 <0fe00000> 48000050 2b9d00fe 41fd0038
<1>Unable to handle kernel paging request for data at address 
0xc000018000ba44c0
<1>Faulting instruction address: 0xc00000000006aae4

This occurs with or without the cede_offline=0 parameter.

Also, a similar experiment seems to corroborate your results, suggesting
the H_CEDE call is not necessarily to blame here.

I replaced the H_CEDE call with an mdelay(2) and still ran into issues.
I didn't see the preempt_count change, but I do see the rtmutex.c:684
bug.

cpu 0x0: Vector: 300 (Data Access) at [c000000000b53ef0]
     pc: c00000000006aa54: .resched_task+0x48/0xd8
     lr: c00000000006ac44: .check_preempt_curr_idle+0x2c/0x44
     sp: c000000000b54170
    msr: 8000000000001032
    dar: c000018000ba44c0
  dsisr: 40000000
   current = 0xc000000000aa1410
   paca    = 0xc000000000ba4480
     pid   = 0, comm = swapper
enter ? for help
[c000000000b54200] c00000000006ac44 .check_preempt_curr_idle+0x2c/0x44
[c000000000b54290] c00000000007b494 .try_to_wake_up+0x430/0x540
[c000000000b54360] c00000000007b754 .wake_up_process+0x34/0x48
[c000000000b543f0] c000000000089fa8 .wakeup_softirqd+0x78/0x9c
[c000000000b54480] c00000000008a2c4 .raise_softirq+0x7c/0xb8
[c000000000b54520] c0000000000977b0 .run_local_timers+0x2c/0x4c
[c000000000b545a0] c000000000097828 .update_process_times+0x58/0x9c
[c000000000b54640] c0000000000beb3c .tick_sched_timer+0xd0/0x120
[c000000000b546f0] c0000000000b08b8 .__run_hrtimer+0x1a0/0x29c
[c000000000b547a0] c0000000000b1258 .hrtimer_interrupt+0x21c/0x394
[c000000000b548d0] c0000000000304c4 .timer_interrupt+0x1ec/0x2f8
[c000000000b54980] c000000000003700 decrementer_common+0x100/0x180
--- Exception: 901 (Decrementer) at c0000000000100f8 
.raw_local_irq_restore+0x74/0x8c
[c000000000b54d00] c000000000017d14 .cpu_idle+0x12c/0x220
[c000000000b54da0] c0000000006a1768 .start_secondary+0x3d8/0x418
[c000000000b54e60] c00000000005c1f0 .pseries_mach_cpu_die+0x244/0x318
[c000000000b54f10] c00000000001e7e0 .cpu_die+0x4c/0x68
[c000000000b54f90] c000000000017de0 .cpu_idle+0x1f8/0x220
... repeated multiple times ...


> However, the issue still remains. I will spend a few cycles looking into
> this issue.

Appreciate it, the more eyes the better on this one.

--
Darren

>>
>> Also of interest is that this path
>> cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary()
>> enters with a preempt_count=1 if it wasn't corrupted across the hcall.
>> The early boot path from _start however appears to call
>> start_secondary() with a preempt_count of 0.
>>
>> The following patch is most certainly not correct, but it does eliminate
>> the situation on mainline 100% of the time (there is still a 25%
>> reproduction rate on PREEMPT_RT). Can someone comment on:
>>
>> 1) How can the preempt_count() get mangled across the H_CEDE hcall?
>> 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
>>
>> Hacked-up-by: Darren Hart<dvhltc@us.ibm.com>
>>
>> Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> ===================================================================
>> --- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
>>   			 * Kernel stack will be reset and start_secondary()
>>   			 * will be called to continue the online operation.
>>   			 */
>> +			preempt_count() = 0;
>>   			start_secondary_resume();
>>   		}
>>   	}
>>
>>
>
Darren Hart - Aug. 31, 2010, 7:12 a.m.
On 08/19/2010 08:58 AM, Ankita Garg wrote:
> Hi Darren,
> 
> On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
>>
>> With some instrumentation we were able to determine that the
>> preempt_count() appears to change across the extended_cede_processor()
>> call.  Specifically across the plpar_hcall_norets(H_CEDE) call. On
>> PREEMPT_RT we call this with preempt_count=1 and return with
>> preempt_count=0xffffffff. On mainline with CONFIG_PREEMPT=y, the value
>> is different (0x65) but is still incorrect.
> 
> I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could
> easily reproduce this on the RT kernel and not the non-RT kernel.

This matches my experience.

> However, I hit it every single time I did a cpu online operation. I also
> noticed that the issue persists even when I disable H_CEDE by passing
> the "cede_offline=0" kernel command-line parameter. Could you please
> confirm if you observe the same in your setup?

Yes, this is my experience as well.

> 
> However, the issue still remains. I will spend a few cycles looking into
> this issue.
> 

I've spent today trying to collect some useful traces. I consistently
see the preempt_count() change to 0xffffffff across the H_CEDE call, but
the irq and sched events (to ftrace) do not indicate anything else
running on the CPU that could change that.

  <idle>-0       1d.h1. 11408us : irq_handler_entry: irq=16 name=IPI
  <idle>-0       1d.h1. 11411us : irq_handler_exit: irq=16 ret=handled
  ...            <other cpus>
  <idle>-0       1d.... 22101us : .pseries_mach_cpu_die: start
  <idle>-0       1d.... 22108us : .pseries_mach_cpu_die: cpu 1 (hwid 1) ceding for offline with hint 2
  ...            <other cpus>
  <idle>-0       1d.Hff. 33804us : .pseries_mach_cpu_die: returned from cede with pcnt: ffffffff
  <idle>-0       1d.Hff. 33805us : .pseries_mach_cpu_die: forcing pcnt to 0
  <idle>-0       1d.... 33807us : .pseries_mach_cpu_die: cpu 1 (hwid 1) returned from cede.
  <idle>-0       1d.... 33808us : .pseries_mach_cpu_die: Decrementer value = 7fa49474 Timebase value = baefee6c88113
  <idle>-0       1d.... 33809us : .pseries_mach_cpu_die: cpu 1 (hwid 1) got prodded to go online
  <idle>-0       1d.... 33816us : .pseries_mach_cpu_die: calling start_seconday, pcnt: 0
  <idle>-0       1d.... 33816us : .pseries_mach_cpu_die: forcing pcnt to 0

---------------------------------
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
NIP: c00000000006aa50 LR: c00000000006ac40 CTR: c00000000006ac14
REGS: c00000008e79ffc0 TRAP: 0300   Not tainted  (2.6.33-rt-dvhrt16-02358-g61223dd-dirty)
MSR: 8000000000001032 <ME,IR,DR>  CR: 28000022  XER: 00000000
DAR: c000018000ba44c0, DSISR: 0000000040000000
TASK = c00000010e6de040[0] 'swapper' THREAD: c00000008e7a0000 CPU: 1
GPR00: 0000018000000040 c00000008e7a0240 c000000000b55008 c00000010e6de040
GPR04: c000000085d800c0 0000000000000000 0000000000000000 000000000000000f
GPR08: 0000018000000000 c00000008e7a0000 c000000000ba4480 c000000000a32c80
GPR12: 8000000000009032 c000000000ba4680 c00000008e7a08f0 0000000000000001
GPR16: fffffffffffff2fa c00000010e6de040 0000000000000000 7fffffffffffffff
GPR20: 0000000000000000 0000000000000001 0000000000000001 c000000000f42c80
GPR24: c0000000850b7638 0000000000000000 0000000000000000 0000000000000001
GPR28: c000000000f42c80 c00000010e6de040 c000000000ad7698 c00000008e7a0240
NIP [c00000000006aa50] .resched_task+0x48/0xd8
LR [c00000000006ac40] .check_preempt_curr_idle+0x2c/0x44
Call Trace:
Instruction dump:
f821ff71 7c3f0b78 ebc2ab28 7c7d1b78 60000000 60000000 e95e8008 e97e8000
e93d0008 81090010 79084da4 38080040 <7d4a002a> 7c0b502e 7c000074 7800d182
---[ end trace 6d3f1cddaa17382c ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Call Trace:
Rebooting in 180 seconds..



When running with the function plugin, I had to stop the trace
immediately before entering start_secondary after an online, or my traces
would not include the pseries_mach_cpu_die function, nor the tracing I
added there (possibly a buffer-size issue; I am using 2048). The following trace
was collected using "trace-cmd record -p function -e irq -e sched" and
has been filtered to only show CPU [001] (the CPU undergoing the
offline/online test, and the one seeing preempt_count (pcnt) go to
ffffffff after cede). The function tracer does not indicate anything
running on the CPU other than the HCALL - unless the __trace_hcall*
commands might be to blame. Can a POWER guru comment?

          <idle>-0     [001]   417.625286: function:             .cpu_die
          <idle>-0     [001]   417.625287: function:                .pseries_mach_cpu_die
          <idle>-0     [001]   417.625290: bprint:               .pseries_mach_cpu_die : start
          <idle>-0     [001]   417.625291: function:                   .idle_task_exit
          <idle>-0     [001]   417.625292: function:                      .switch_slb
          <idle>-0     [001]   417.625294: function:                   .xics_teardown_cpu
          <idle>-0     [001]   417.625294: function:                      .xics_set_cpu_priority
          <idle>-0     [001]   417.625295: function:             .__trace_hcall_entry
          <idle>-0     [001]   417.625296: function:                .probe_hcall_entry
          <idle>-0     [001]   417.625297: function:             .__trace_hcall_exit
          <idle>-0     [001]   417.625298: function:                .probe_hcall_exit
          <idle>-0     [001]   417.625299: function:             .__trace_hcall_entry
          <idle>-0     [001]   417.625300: function:                .probe_hcall_entry
          <idle>-0     [001]   417.625301: function:             .__trace_hcall_exit
          <idle>-0     [001]   417.625302: function:                .probe_hcall_exit
          <idle>-0     [001]   417.625305: bprint:               .pseries_mach_cpu_die : cpu 1 (hwid 1) ceding for offline with hint 2
          <idle>-0     [001]   417.625307: bprint:               .pseries_mach_cpu_die : calling extended cede, pcnt: 0
          <idle>-0     [001]   417.625308: function:             .__trace_hcall_entry
          <idle>-0     [001]   417.625308: function:                .probe_hcall_entry
          <idle>-0     [001]   417.837734: function:             .__trace_hcall_exit
          <idle>-0     [001]   417.837736: function:                .probe_hcall_exit
          <idle>-0     [001]   417.837740: bprint:               .pseries_mach_cpu_die : returned from cede with pcnt: ffffffff
          <idle>-0     [001]   417.837742: bprint:               .pseries_mach_cpu_die : forcing pcnt to 0
          <idle>-0     [001]   417.837743: bprint:               .pseries_mach_cpu_die : cpu 1 (hwid 1) returned from cede.
          <idle>-0     [001]   417.837745: bprint:               .pseries_mach_cpu_die : Decrementer value = 79844cbf Timebase value = bb3cf7ab2771c
          <idle>-0     [001]   417.837747: bprint:               .pseries_mach_cpu_die : cpu 1 (hwid 1) got prodded to go online
          <idle>-0     [001]   417.837749: function:             .__trace_hcall_entry
          <idle>-0     [001]   417.837749: function:                .probe_hcall_entry
          <idle>-0     [001]   417.837755: function:             .__trace_hcall_exit
          <idle>-0     [001]   417.837755: function:                .probe_hcall_exit
          <idle>-0     [001]   417.837757: bprint:               .pseries_mach_cpu_die : calling start_seconday, pcnt: 0
          <idle>-0     [001]   417.837758: bprint:               .pseries_mach_cpu_die : forcing pcnt to 0



I replaced the extended_cede_processor() call with mdelay(2). With that
change I no longer see the preempt_count() change across the call.
However, the system still eventually crashes in sub_preempt_count().
This appears to be independent of whether preempt_count changes across
CEDE (since the crash occurs even without any change across mdelay). I
will try larger mdelay values tomorrow to see whether a longer delay
reproduces the preempt_count change.

Badness at kernel/sched.c:3726
NIP: c000000000698684 LR: c000000000698668 CTR: c00000000007a89c
REGS: c00000008e7a0170 TRAP: 0700   Not tainted  (2.6.33-rt-dvhrt16-02358-g61223dd-dirty)
MSR: 8000000000021032 <ME,CE,IR,DR>  CR: 28000022  XER: 00000000
TASK = c00000010e6de040[0] 'swapper' THREAD: c00000008e7a0000 CPU: 1
GPR00: 0000000000000000 c00000008e7a03f0 c000000000b55008 0000000000000001
GPR04: 0000000000000000 0000000000000000 0000000000000000 000000000000000f
GPR08: 0000000000000200 c000000000ca5420 0000000000000001 c000000000bc22f0
GPR12: 8000000000009032 c000000000ba4680 c00000008e7a0a10 0000000000000001
GPR16: fffffffffffff4d1 c00000010e6de040 0000000000000000 7fffffffffffffff
GPR20: 0000000000000000 0000000000000001 0000000000000001 c000000000f42c80
GPR24: 0000000000000001 0000000000000000 0000000000000000 0000000000000001
GPR28: c000000000f42c80 0000000000000001 c000000000ad7698 c00000008e7a03f0
NIP [c000000000698684] .sub_preempt_count+0xa8/0xdc
LR [c000000000698668] .sub_preempt_count+0x8c/0xdc
Call Trace:
Instruction dump:
41fd0038 78000620 2fa00000 40fe002c 4bd08a61 60000000 2fa30000 419e002c
e93e87e8 80090000 2f800000 409e001c <0fe00000> 48000014 78290464 80090014
Unable to handle kernel paging request for data at address 0xc000018000ba44c0
Faulting instruction address: 0xc00000000006aa1c
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT SMP NR_CPUS=128 NUMA pSeries
last sysfs file:line
Michael Ellerman - Sept. 1, 2010, 5:54 a.m.
On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
..
> 
> When running with the function plugin I had to stop the trace
> immediately before entering start_secondary after an online or my traces
> would not include the pseries_mach_cpu_die function, nor the tracing I
> added there (possibly buffer size, I am using 2048). The following trace
> was collected using "trace-cmd record -p function -e irq -e sched" and
> has been filtered to only show CPU [001] (the CPU undergoing the
> offline/online test, and the one seeing preempt_count (pcnt) go to
> ffffffff after cede. The function tracer does not indicate anything
> running on the CPU other than the HCALL - unless the __trace_hcall*
> commands might be to blame. 

It's not impossible. Though normally they're disabled right, so the only
reason they're running is because you're tracing. So if they are causing
the bug then that doesn't explain why you see it normally.

Still, might be worth disabling just the hcall tracepoints just to be
100% sure.

cheers
Darren Hart - Sept. 1, 2010, 3:10 p.m.
On 08/31/2010 10:54 PM, Michael Ellerman wrote:
> On Tue, 2010-08-31 at 00:12 -0700, Darren Hart wrote:
> ..
>>
>> When running with the function plugin I had to stop the trace
>> immediately before entering start_secondary after an online or my traces
>> would not include the pseries_mach_cpu_die function, nor the tracing I
>> added there (possibly buffer size, I am using 2048). The following trace
>> was collected using "trace-cmd record -p function -e irq -e sched" and
>> has been filtered to only show CPU [001] (the CPU undergoing the
>> offline/online test, and the one seeing preempt_count (pcnt) go to
>> ffffffff after cede. The function tracer does not indicate anything
>> running on the CPU other than the HCALL - unless the __trace_hcall*
>> commands might be to blame. 
> 
> It's not impossible. Though normally they're disabled right, so the only
> reason they're running is because you're tracing. So if they are causing
> the bug then that doesn't explain why you see it normally.
> 
> Still, might be worth disabling just the hcall tracepoints just to be
> 100% sure.

A couple of updates. I was working from tip/rt/head, which has been
stale for some months now. I switched to tip/rt/2.6.33 and the
preempt_count() change across cede went away. I now see the live hang that
Will has reported.

Before I dive into the live hang, I want to understand what fixed the
preempt_count() change. That might start pointing us in the right
direction for the live hang.

I did an inverted git bisect between tip/rt/head and tip/rt/2.6.33 to
try to locate the fix. I've narrowed it down to the 2.6.33.6 merge:

# git show 7e1af1172bbd4109d09ac515c5d376f633da7cff
commit 7e1af1172bbd4109d09ac515c5d376f633da7cff
Merge: d8e94db 9666790
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Jul 13 16:01:16 2010 +0200

    Merge
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y

    Conflicts:
        Makefile

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


Visual inspection yields two patches of interest:

f8b67691828321f5c85bb853283aa101ae673130
powerpc/pseries: Make query-cpu-stopped callable outside hotplug cpu

aef40e87d866355ffd279ab21021de733242d0d5
powerpc/pseries: Only call start-cpu when a CPU is stopped

I'm going to try reverting these today to see whether they addressed
the issue indirectly.

Patch

Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
===================================================================
--- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -138,6 +138,7 @@  static void pseries_mach_cpu_die(void)
 			 * Kernel stack will be reset and start_secondary()
 			 * will be called to continue the online operation.
 			 */
+			preempt_count() = 0;
 			start_secondary_resume();
 		}
 	}