diff mbox

[RFC] powerpc/pseries: Increase busy loop in pseries_cpu_die

Message ID 1486407496-12151-1-git-send-email-bauerman@linux.vnet.ibm.com
State New
Headers show

Commit Message

Thiago Jung Bauermann Feb. 6, 2017, 6:58 p.m. UTC
When testing DLPAR CPU add/remove on a system under stress, pseries_cpu_die
doesn't wait long enough for a CPU to die and the kernel ends up crashing:

[  446.143648] cpu 152 (hwid 152) Ready to die...
[  446.464057] cpu 153 (hwid 153) Ready to die...
[  446.473525] cpu 154 (hwid 154) Ready to die...
[  446.474077] cpu 155 (hwid 155) Ready to die...
[  446.483529] cpu 156 (hwid 156) Ready to die...
[  446.493532] cpu 157 (hwid 157) Ready to die...
[  446.494078] cpu 158 (hwid 158) Ready to die...
[  446.503527] cpu 159 (hwid 159) Ready to die...
[  446.664534] cpu 144 (hwid 144) Ready to die...
[  446.964113] cpu 145 (hwid 145) Ready to die...
[  446.973525] cpu 146 (hwid 146) Ready to die...
[  446.974094] cpu 147 (hwid 147) Ready to die...
[  446.983944] cpu 148 (hwid 148) Ready to die...
[  446.984062] cpu 149 (hwid 149) Ready to die...
[  446.993518] cpu 150 (hwid 150) Ready to die...
[  446.993543] Querying DEAD? cpu 150 (150) shows 2
[  446.994098] cpu 151 (hwid 151) Ready to die...
[  447.133726] cpu 136 (hwid 136) Ready to die...
[  447.403532] cpu 137 (hwid 137) Ready to die...
[  447.403772] cpu 138 (hwid 138) Ready to die...
[  447.403839] cpu 139 (hwid 139) Ready to die...
[  447.403887] cpu 140 (hwid 140) Ready to die...
[  447.403937] cpu 141 (hwid 141) Ready to die...
[  447.403979] cpu 142 (hwid 142) Ready to die...
[  447.404038] cpu 143 (hwid 143) Ready to die...
[  447.513546] cpu 128 (hwid 128) Ready to die...
[  447.693533] cpu 129 (hwid 129) Ready to die...
[  447.693999] cpu 130 (hwid 130) Ready to die...
[  447.703530] cpu 131 (hwid 131) Ready to die...
[  447.704087] Querying DEAD? cpu 132 (132) shows 2
[  447.704102] cpu 132 (hwid 132) Ready to die...
[  447.713534] cpu 133 (hwid 133) Ready to die...
[  447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue

This was reproduced in v4.10-rc6 as well, but I don't have a crash log
handy for that version right now. Sorry.

This is a race between one CPU stopping and another one calling
pseries_cpu_die to wait for it to stop. That function does a short
busy loop calling RTAS query-cpu-stopped-state on the stopping CPU
to verify that it is stopped.

As can be seen in the dmesg right before or after the "Querying DEAD?"
messages, if pseries_cpu_die waited a little longer it would have seen
the CPU in the stopped state.

I see two cases that can be causing this race:

1. It's possible that CPU 134 was inactive at the time it was unplugged.
   In that case, dlpar_offline_cpu calls H_PROD on the CPU and immediately
   calls pseries_cpu_die. Meanwhile, the prodded CPU activates and start
   the process of stopping itself. It's possible that the busy loop is not
   long enough to allow for the CPU to wake up and complete the stopping
   process.
2. If CPU 134 was online at the time it was unplugged, it would have gone
   through the new CPU hotplug state machine in kernel/cpu.c that was
   introduced in v4.6 to get itself stopped. It's possible that the busy
   loop in pseries_cpu_die was long enough for the older hotplug code but
   not for the new hotplug state machine.

Either way, the solution is the same: wait an adequate amount in
pseries_cpu_die.

The simple solution is to increase the number of tries in the loop.
This was done to solve a similar problem in
commit 940ce422a367 ("powerpc/pseries: Increase cpu die timeout"), so
it's not as lame as it sounds. :-)

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
---

Notes:
    A solution that is probably better is to have pseries_cpu_die wait
    on a per-CPU semaphore at the beginning of the function, before doing a
    short busy loop. Then the CPU that is stopping unlocks that semaphore right
    before stopping itself, probably at pseries_mach_cpu_die.
    
    What do you think? I can implement that if there is interest.

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Han Pingtian Feb. 7, 2017, 1:05 a.m. UTC | #1
On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.
> 
This is the crash log of v4.10:

roselp4 login: [  505.097727] sysrq: SysRq : Changing Loglevel
[  505.097743] sysrq: Loglevel set to 9
[  595.183727] iommu: Removing device 0021:01:00.0 from group 0
[  605.213053] pci_bus 0021:01: busn_res: [bus 01-ff] is released
[  605.213204] rpadlpar_io: slot PHB 33 removed
[  624.623544] min_free_kbytes is not updated to 16826 because user defined value 547628 is preferred
[  647.574039] cpu 152 (hwid 152) Ready to die...
[  647.803624] cpu 153 (hwid 153) Ready to die...
[  647.813371] cpu 154 (hwid 154) Ready to die...
[  647.823646] cpu 155 (hwid 155) Ready to die...
[  647.833371] cpu 156 (hwid 156) Ready to die...
[  647.843392] cpu 157 (hwid 157) Ready to die...
[  647.853368] cpu 158 (hwid 158) Ready to die...
[  647.863358] cpu 159 (hwid 159) Ready to die...
[  648.233756] cpu 144 (hwid 144) Ready to die...
[  648.383417] cpu 145 (hwid 145) Ready to die...
[  648.383488] cpu 146 (hwid 146) Ready to die...
[  648.383551] cpu 147 (hwid 147) Ready to die...
[  648.402845] cpu 148 (hwid 148) Ready to die...
[  648.403329] cpu 149 (hwid 149) Ready to die...
[  648.403384] cpu 150 (hwid 150) Ready to die...
[  648.413398] cpu 151 (hwid 151) Ready to die...
[  648.583466] cpu 136 (hwid 136) Ready to die...
[  648.833672] cpu 137 (hwid 137) Ready to die...
[  648.843481] cpu 138 (hwid 138) Ready to die...
[  648.843541] cpu 139 (hwid 139) Ready to die...
[  648.843589] cpu 140 (hwid 140) Ready to die...
[  648.843635] cpu 141 (hwid 141) Ready to die...
[  648.853426] cpu 142 (hwid 142) Ready to die...
[  648.863105] Querying DEAD? cpu 142 (142) shows 2
[  648.863697] cpu 143 (hwid 143) Ready to die...
[  649.013280] cpu 128 (hwid 128) Ready to die...
[  649.253575] cpu 129 (hwid 129) Ready to die...
[  649.253634] cpu 130 (hwid 130) Ready to die...
[  649.253689] cpu 131 (hwid 131) Ready to die...
[  649.253741] cpu 132 (hwid 132) Ready to die...
[  649.253793] cpu 133 (hwid 133) Ready to die...
[  649.253836] cpu 134 (hwid 134) Ready to die...
[  649.253885] cpu 135 (hwid 135) Ready to die...
[  649.353116] cpu 120 (hwid 120) Ready to die...
[  649.553776] cpu 121 (hwid 121) Ready to die...
[  649.563626] Querying DEAD? cpu 122 (122) shows 2
cpu 0x7b: Vector: 400 (Instruction Access) at [c000000007bc7d40]
    pc: f81f0070801f0078
    lr: f81f0070801f0078
    sp: ec20d28
   msr: 8000000042801000
  current = 0xc000000474838600
  paca    = 0xc000000007b65300   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/123
Linux version 4.10.0-rc6 (bauermann@u1604le) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Fri Feb 3 18:55:20 BRST 2017
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (ec20d28) is in userspace
7b:mon> e
cpu 0x7b: Vector: 400 (Instruction Access) at [c000000007bc7d40]
    pc: f81f0070801f0078
    lr: f81f0070801f0078
    sp: ec20d28
   msr: 8000000042801000
  current = 0xc000000474838600
  paca    = 0xc000000007b65300   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/123
Linux version 4.10.0-rc6 (bauermann@u1604le) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Fri Feb 3 18:55:20 BRST 2017
7b:mon>
Michael Ellerman Feb. 7, 2017, 2:10 a.m. UTC | #2
Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> When testing DLPAR CPU add/remove on a system under stress, pseries_cpu_die
> doesn't wait long enough for a CPU to die and the kernel ends up crashing:
>
> [  446.143648] cpu 152 (hwid 152) Ready to die...
> [  446.464057] cpu 153 (hwid 153) Ready to die...
> [  446.473525] cpu 154 (hwid 154) Ready to die...
> [  446.474077] cpu 155 (hwid 155) Ready to die...
> [  446.483529] cpu 156 (hwid 156) Ready to die...
> [  446.493532] cpu 157 (hwid 157) Ready to die...
> [  446.494078] cpu 158 (hwid 158) Ready to die...
> [  446.503527] cpu 159 (hwid 159) Ready to die...
> [  446.664534] cpu 144 (hwid 144) Ready to die...
> [  446.964113] cpu 145 (hwid 145) Ready to die...
> [  446.973525] cpu 146 (hwid 146) Ready to die...
> [  446.974094] cpu 147 (hwid 147) Ready to die...
> [  446.983944] cpu 148 (hwid 148) Ready to die...
> [  446.984062] cpu 149 (hwid 149) Ready to die...
> [  446.993518] cpu 150 (hwid 150) Ready to die...
> [  446.993543] Querying DEAD? cpu 150 (150) shows 2
> [  446.994098] cpu 151 (hwid 151) Ready to die...
> [  447.133726] cpu 136 (hwid 136) Ready to die...
> [  447.403532] cpu 137 (hwid 137) Ready to die...
> [  447.403772] cpu 138 (hwid 138) Ready to die...
> [  447.403839] cpu 139 (hwid 139) Ready to die...
> [  447.403887] cpu 140 (hwid 140) Ready to die...
> [  447.403937] cpu 141 (hwid 141) Ready to die...
> [  447.403979] cpu 142 (hwid 142) Ready to die...
> [  447.404038] cpu 143 (hwid 143) Ready to die...
> [  447.513546] cpu 128 (hwid 128) Ready to die...
> [  447.693533] cpu 129 (hwid 129) Ready to die...
> [  447.693999] cpu 130 (hwid 130) Ready to die...
> [  447.703530] cpu 131 (hwid 131) Ready to die...
> [  447.704087] Querying DEAD? cpu 132 (132) shows 2
> [  447.704102] cpu 132 (hwid 132) Ready to die...
> [  447.713534] cpu 133 (hwid 133) Ready to die...
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
>     pc: 000000001ec3072c
>     lr: 000000001ec2fee0
>     sp: 1faf6bd0
>    msr: 8000000102801000
>    dar: 212d6c1a2a20c
>  dsisr: 42000000
>   current = 0xc000000474c6d600
>   paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
>     pid   = 0, comm = swapper/134
> Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
> WARNING: exception is not recoverable, can't continue
>
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.

We shouldn't be crashing.

So we need to fix that.

We may also need to increase the timeout, though it's pretty gross TBH.

But step one is make sure we don't crash.

cheers
Balbir Singh Feb. 7, 2017, 2:56 a.m. UTC | #3
On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
> [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
>     pc: 000000001ec3072c
>     lr: 000000001ec2fee0
>     sp: 1faf6bd0
>    msr: 8000000102801000
>    dar: 212d6c1a2a20c

This looks like we accessed a bad address, but why?

>  dsisr: 42000000
>   current = 0xc000000474c6d600
>   paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
>     pid   = 0, comm = swapper/134
> Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
> WARNING: exception is not recoverable, can't continue
> 
> This was reproduced in v4.10-rc6 as well, but I don't have a crash log
> handy for that version right now. Sorry.
> 
> This is a race between one CPU stopping and another one calling
> pseries_cpu_die to wait for it to stop. That function does a short
> busy loop calling RTAS query-cpu-stopped-state on the stopping CPU
> to verify that it is stopped.
> 
> As can be seen in the dmesg right before or after the "Querying DEAD?"
> messages, if pseries_cpu_die waited a little longer it would have seen
> the CPU in the stopped state.
> 
> I see two cases that can be causing this race:
> 
> 1. It's possible that CPU 134 was inactive at the time it was unplugged.
>    In that case, dlpar_offline_cpu calls H_PROD on the CPU and immediately
>    calls pseries_cpu_die. Meanwhile, the prodded CPU activates and start
>    the process of stopping itself. It's possible that the busy loop is not
>    long enough to allow for the CPU to wake up and complete the stopping
>    process.
> 2. If CPU 134 was online at the time it was unplugged, it would have gone
>    through the new CPU hotplug state machine in kernel/cpu.c that was
>    introduced in v4.6 to get itself stopped. It's possible that the busy
>    loop in pseries_cpu_die was long enough for the older hotplug code but
>    not for the new hotplug state machine.
> 
> Either way, the solution is the same: wait an adequate amount in
> pseries_cpu_die.
> 
> The simple solution is to increase the number of tries in the loop.
> This was done to solve a similar problem in
> commit 940ce422a367 ("powerpc/pseries: Increase cpu die timeout"), so
> it's not as lame as it sounds. :-)
> 
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
> ---
> 
> Notes:
>     A solution that is probably better is to have pseries_cpu_die wait
>     on a per-CPU semaphore at the beginning of the function, before doing a
>     short busy loop. Then the CPU that is stopping unlocks that semaphore right
>     before stopping itself, probably at pseries_mach_cpu_die.
>     
>     What do you think? I can implement that if there is interest.
> 
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index a1b63e00b2f7..3d43317eec1b 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -206,7 +206,7 @@ static void pseries_cpu_die(unsigned int cpu)
>  		}
>  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>  
> -		for (tries = 0; tries < 25; tries++) {
> +		for (tries = 0; tries < 5000; tries++) {

This fixes some of the asymmetry between handling of CPU_STATE_INACTIVE
and CPU_STATE_OFFLINE, but I think we can probably move the cpu_relax()
to msleep(1). 

Please also see
940ce42 powerpc/pseries: Increase cpu die timeout

Balbir Singh.
Thiago Jung Bauermann Feb. 7, 2017, 3:32 p.m. UTC | #4
Am Dienstag, 7. Februar 2017, 08:26:45 BRST schrieb Balbir Singh:
> On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
> > [  447.714064] Querying DEAD? cpu 134 (134) shows 2
> > cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
> > 
> >     pc: 000000001ec3072c
> >     lr: 000000001ec2fee0
> >     sp: 1faf6bd0
> >    
> >    msr: 8000000102801000
> >    dar: 212d6c1a2a20c
> 
> This looks like we accessed a bad address, but why?

Am Dienstag, 7. Februar 2017, 13:10:22 BRST schrieb Michael Ellerman:
> We shouldn't be crashing.
> 
> So we need to fix that.
> 
> We may also need to increase the timeout, though it's pretty gross TBH.
> 
> But step one is make sure we don't crash.

I didn't analyze exactly what is causing the CPU to crash because the root 
cause is the inconsistency between what the kernel thinks the CPU state is and 
reality. But if we have to be able to handle that inconsistency I will keep 
digging and try to fix that.

> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -206,7 +206,7 @@ static void pseries_cpu_die(unsigned int cpu)
> > 
> >  		}
> >  	
> >  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> > 
> > -		for (tries = 0; tries < 25; tries++) {
> > +		for (tries = 0; tries < 5000; tries++) {
> 
> This fixes some of the asymmetry between handling of CPU_STATE_INACTIVE
> and CPU_STATE_OFFLINE, but I think we can probably move the cpu_relax()
> to msleep(1).

I didn't change it to msleep() because I thought it would introduce a 
regression. commit b906cfa397fd ("powerpc/pseries: Fix cpu hotplug") changed a 
msleep(200) that was there to a cpu_relax() with this explanation:

    Currently, pseries_cpu_die() calls msleep() while polling RTAS for 
    the status of the dying cpu.

    However, if the cpu that is going down also happens to be the one 
    doing the tick then we're hosed as the tick_do_timer_cpu 'baton' is
    only passed later on in tick_shutdown() when _cpu_down() does the 
    CPU_DEAD notification.  Therefore jiffies won't be updated anymore.

    This replaces that msleep() with a cpu_relax() to make sure we're not 
    going to schedule at that point.

    With this patch my test box survives a 100k iterations hotplug stress
    test on _all_ cpus, whereas without it, it quickly dies after ~50 
    iterations.

I can try to add it back and see what happens. Perhaps that situation won't 
happen anymore with today's kernel.

> Please also see
> 940ce42 powerpc/pseries: Increase cpu die timeout

Yes, that is the commit that I mentioned in the patch description.
Michael Ellerman Feb. 8, 2017, 2:59 a.m. UTC | #5
Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> writes:

> Am Dienstag, 7. Februar 2017, 08:26:45 BRST schrieb Balbir Singh:
>> On Mon, Feb 06, 2017 at 04:58:16PM -0200, Thiago Jung Bauermann wrote:
>> > [  447.714064] Querying DEAD? cpu 134 (134) shows 2
>> > cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
>> > 
>> >     pc: 000000001ec3072c
>> >     lr: 000000001ec2fee0
>> >     sp: 1faf6bd0
>> >    
>> >    msr: 8000000102801000
>> >    dar: 212d6c1a2a20c
>> 
>> This looks like we accessed a bad address, but why?
>
> Am Dienstag, 7. Februar 2017, 13:10:22 BRST schrieb Michael Ellerman:
>> We shouldn't be crashing.
>> 
>> So we need to fix that.
>> 
>> We may also need to increase the timeout, though it's pretty gross TBH.
>> 
>> But step one is make sure we don't crash.
>
> I didn't analyze exactly what is causing the CPU to crash because the root 
> cause is the inconsistency between what the kernel thinks the CPU state is and 
> reality. But if we have to be able to handle that inconsistency I will keep 
> digging and try to fix that.

Please do.

It's obviously not a good situation, but it shouldn't crash the kernel.

I realise that may be easier said than done, but that should be the
goal. I see smp_ops->cpu_die() can't fail (returns void), which may be
part of the problem.

>> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> > @@ -206,7 +206,7 @@ static void pseries_cpu_die(unsigned int cpu)
>> > 
>> >  		}
>> >  	
>> >  	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
>> > 
>> > -		for (tries = 0; tries < 25; tries++) {
>> > +		for (tries = 0; tries < 5000; tries++) {
>> 
>> This fixes some of the asymmetry between handling of CPU_STATE_INACTIVE
>> and CPU_STATE_OFFLINE, but I think we can probably move the cpu_relax()
>> to msleep(1).
>
> I didn't change it to msleep() because I thought it would introduce a 
> regression. commit b906cfa397fd ("powerpc/pseries: Fix cpu hotplug") changed a 
> msleep(200) that was there to a cpu_relax() with this explanation:
>
>     Currently, pseries_cpu_die() calls msleep() while polling RTAS for 
>     the status of the dying cpu.
>
>     However, if the cpu that is going down also happens to be the one 
>     doing the tick then we're hosed as the tick_do_timer_cpu 'baton' is
>     only passed later on in tick_shutdown() when _cpu_down() does the 
>     CPU_DEAD notification.  Therefore jiffies won't be updated anymore.
>
>     This replaces that msleep() with a cpu_relax() to make sure we're not 
>     going to schedule at that point.
>
>     With this patch my test box survives a 100k iterations hotplug stress
>     test on _all_ cpus, whereas without it, it quickly dies after ~50 
>     iterations.
>
> I can try to add it back and see what happens. Perhaps that situation won't 
> happen anymore with today's kernel.

The relative timing of the events may have changed. But the fundamental
problem still exists AFAIK.

So we'd need a pretty solid guarantee that the problem can't happy
anymore before I'd change it to msleep().

However it could/should at least use udelay(), which AFAIK is not
reliant on jiffies. That way the loop will actually take a set amount of
wall time, rather than just cycles.

cheers
diff mbox

Patch

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index a1b63e00b2f7..3d43317eec1b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -206,7 +206,7 @@  static void pseries_cpu_die(unsigned int cpu)
 		}
 	} else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
 
-		for (tries = 0; tries < 25; tries++) {
+		for (tries = 0; tries < 5000; tries++) {
 			cpu_status = smp_query_cpu_stopped(pcpu);
 			if (cpu_status == QCSS_STOPPED ||
 			    cpu_status == QCSS_HARDWARE_ERROR)