diff mbox

offlining cpus breakage

Message ID 54B7BEF9.8090603@linux.vnet.ibm.com (mailing list archive)
State Superseded
Delegated to: Michael Ellerman
Headers show

Commit Message

Preeti U Murthy Jan. 15, 2015, 1:22 p.m. UTC
Hi Alexey,

Can you let me know if the following patch fixes the issue for you?
It fixed it on one of the machines we were investigating.

Anton,
Can you let me know if the patch fixes the odd perf report that you observed?
This is the latest fix for it.

-------------------------------START PATCH--------------------------------------

tick/broadcast-hrtimer: Make movement of the broadcast hrtimer robust against cpu offline

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

When a cpu on which the broadcast hrtimer is queued goes offline, the hrtimer
needs to be moved to another cpu. There may be potential race conditions here,
which can lead to the hrtimer not being queued on any cpu.

There was a report of softlockups, with cpus stuck in deep idle states, when
smt mode switching was being done, especially on systems with a smaller number of
cpus [1]. This is hard to reproduce on a large machine because the chances that an
offline cpu was the standby cpu handling the broadcast are smaller. Given the
current code, the only situation that looks capable of triggering scenarios where
broadcast IPIs are not delivered is the cpu offline path. I am
at a loss to pinpoint the precise scenario.

Therefore, to avoid any possible race condition between cpu offline and
movement of the broadcast hrtimer, this patch ensures that the act of keeping
track of broadcasting wakeup IPIs is started afresh after a cpu offline operation.
This is done by resetting the expiry time of the hrtimer to a max value. This
has to be done in the CPU_DYING phase so that it is visible to all cpus right
after exiting stop_machine.

The rationale is that during cpu offline, since all cpus are woken up anyway
to run stop_machine, we are only required to keep track of broadcasting to cpus
that henceforth enter deep idle states. This ensures that the broadcast hrtimer
gets positively queued on a cpu as long as there are cpus in deep idle states.

Note that this change is in addition to the logic we retain today in the
broadcast code on an offline operation. If that logic fails to move the
broadcast hrtimer due to a race condition, this patch handles it correctly.

[1] http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
There is no issue in programming the decrementer, as was presumed and stated in
this link.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    1 +
 2 files changed, 2 insertions(+), 1 deletion(-)


---------------------------------------END PATCH-------------------------
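The handover-and-reset idea in the patch above can be modelled in a few lines. The following is an illustrative userspace sketch, not kernel code: the struct and names only mirror the kernel's broadcast device state, and the deep-idle logic is simplified to the one property the patch relies on, namely that after the handover `next_event` is `KTIME_MAX`, so the first cpu to subsequently enter deep idle re-arms the broadcast timer itself instead of trusting state inherited across the hotplug operation.

```c
/* Simplified userspace model of the broadcast-timer handover in the
 * patch above. Names mirror the kernel's (next_event, KTIME_MAX) but
 * this is a sketch of the idea, not the real tick-broadcast code. */
#include <assert.h>
#include <stdint.h>

#define KTIME_MAX INT64_MAX

struct broadcast_dev {
	int64_t next_event;	/* absolute expiry of the armed broadcast timer */
	int owner_cpu;		/* cpu currently hosting the broadcast hrtimer */
};

/* CPU_DYING handover, as in the patched broadcast_move_bc(): move the
 * assignment to a surviving cpu, then forget the inherited expiry so
 * wakeup tracking starts afresh. */
static void handover_broadcast(struct broadcast_dev *bc, int new_cpu)
{
	bc->owner_cpu = new_cpu;
	bc->next_event = KTIME_MAX;	/* start tracking wakeups afresh */
}

/* A cpu entering deep idle programs the broadcast timer if its own
 * wakeup is earlier than whatever is armed. With next_event reset to
 * KTIME_MAX, the first deep-idle cpu always re-arms it. */
static int enter_deep_idle(struct broadcast_dev *bc, int cpu, int64_t wakeup)
{
	if (wakeup < bc->next_event) {
		bc->next_event = wakeup;
		bc->owner_cpu = cpu;
		return 1;	/* this cpu (re)armed the broadcast timer */
	}
	return 0;
}
```

The point of the reset is visible in the model: without the `KTIME_MAX` assignment, a stale `next_event` inherited from the dead cpu could make `enter_deep_idle()` skip re-arming even though no timer is actually queued anywhere.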
Thanks

Regards
Preeti U Murthy

On 01/07/2015 03:07 PM, Alexey Kardashevskiy wrote:
> Hi!
> 
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream kernel
> (sha1 bdec419):
> 
> NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]
> 
> or
> 
> INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
> 16 17 18 19 20 21 22 23 2
> 4 25 26 27 28 29 30 31} (detected by 6, t=2102 jiffies, g=1617, c=1616,
> q=1441)
> 
> and many others, all about lockups
> 
> I did a bisect and found that reverting these commits helps:
> 
> 77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle
> support for offline cpus
> 7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle
> states management
> 8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline
> CPUs to enter deep idle states
> 
> btw reverting just two of them produces a compile error.
> 
> It is pseries_le_defconfig, POWER8 machine:
> timebase        : 512000000
> platform        : PowerNV
> model           : palmetto
> machine         : PowerNV palmetto
> firmware        : OPAL v3
> 
> 
> Please help to fix it. Thanks.
> 
>

Comments

Alexey Kardashevskiy Jan. 16, 2015, 12:28 a.m. UTC | #1
On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> Hi Alexey,
> 
> Can you let me know if the following patch fixes the issue for you ?
> It did for us on one of our machines that we were investigating on.


This fixes the issue for me as well, thanks!

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	


> Anton,
> Can you let me know if the patch fixes the odd perf report that you observed?
> This is the latest fix that there is to it.
> 
> -------------------------------START PATCH--------------------------------------
> 
> tick/broadcast-hrtimer : Make movement of broadcast hrtimer robust against cpu offline
> 
> From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> 
> When a cpu on which the broadcast hrtimer is queued goes offline, the hrtimer
> needs to be moved to another cpu. There maybe potential race conditions here,
> which can lead to the hrtimer not being queued on any cpu.
> 
> There was a report on softlockups, with cpus being stuck in deep idle states,
> when smt mode switching was being done, especially on systems with smaller number of
> cpus [1]. This is hard to reproduce on a large machine because the chances that an
> offline cpu was the stand by cpu handling broadcast become lesser. Given the
> current code, the only situation that looks capable of triggering scenarios where
> broadcast IPIs are not delivered, is in the cpu offline path. I am
> at a loss to pin point the precise scenario.
> 
> Therefore to avoid any possible race condition between cpu offline and
> movement of the broadcast hrtimer, this patch ensures that the act of keeping
> track of broadcasting wakeup IPIs is started afresh after a cpu offline operation.
> This is done by reseting the expiry time of the hrtimer to a max value. This
> has to be done in the CPU_DYING phase so that it is visible to all cpus right
> after exiting stop_machine.
> 
> The rationale is that during cpu offline, since all cpus are woken up anyway
> to run stop_machine, we are only required to keep track of broadcasting to cpus
> that henceforth enter deep idle states. This ensures that the broadcast hrtimer
> gets positively queued on a cpu as long as there are cpus in deep idle states.
> 
> It has to be noted that this code is in addition to retaining the logic that we
> have today in the broadcast code on an offline operation. If this logic fails to
> move the broadcast hrtimer due to a race condition we have the following patch to
> handle it right.
> 
> [1]http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
> There is no issue in programming the decrementer as was presumed and stated in
> this link.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
>  kernel/time/clockevents.c    |    2 +-
>  kernel/time/tick-broadcast.c |    1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 5544990..f3907c9 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
>  
>  	case CLOCK_EVT_NOTIFY_CPU_DYING:
>  		tick_handover_do_timer(arg);
> +		tick_shutdown_broadcast_oneshot(arg);
>  		break;
>  
>  	case CLOCK_EVT_NOTIFY_SUSPEND:
> @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
>  		break;
>  
>  	case CLOCK_EVT_NOTIFY_CPU_DEAD:
> -		tick_shutdown_broadcast_oneshot(arg);
>  		tick_shutdown_broadcast(arg);
>  		tick_shutdown(arg);
>  		/*
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 066f0ec..1f5eda6 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -677,6 +677,7 @@ static void broadcast_move_bc(int deadcpu)
>  		return;
>  	/* This moves the broadcast assignment to this cpu */
>  	clockevents_program_event(bc, bc->next_event, 1);
> +	bc->next_event.tv64 = KTIME_MAX;
>  }
>  
>  /*
> 
> ---------------------------------------END PATCH-------------------------
> Thanks
> 
> Regards
> Preeti U Murthy
> 
> On 01/07/2015 03:07 PM, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> "ppc64_cpu --smt=off" produces multiple error on the latest upstream kernel
>> (sha1 bdec419):
>>
>> NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]
>>
>> or
>>
>> INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
>> 16 17 18 19 20 21 22 23 2
>> 4 25 26 27 28 29 30 31} (detected by 6, t=2102 jiffies, g=1617, c=1616,
>> q=1441)
>>
>> and many others, all about lockups
>>
>> I did bisecting and found out that reverting these helps:
>>
>> 77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle
>> support for offline cpus
>> 7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle
>> states management
>> 8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline
>> CPUs to enter deep idle states
>>
>> btw reverting just two of them produces a compile error.
>>
>> It is pseries_le_defconfig, POWER8 machine:
>> timebase        : 512000000
>> platform        : PowerNV
>> model           : palmetto
>> machine         : PowerNV palmetto
>> firmware        : OPAL v3
>>
>>
>> Please help to fix it. Thanks.
>>
>>
>
Michael Ellerman Jan. 16, 2015, 3:04 a.m. UTC | #2
On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> > Hi Alexey,
> > 
> > Can you let me know if the following patch fixes the issue for you ?
> > It did for us on one of our machines that we were investigating on.
> 
> This fixes the issue for me as well, thanks!
> 
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	

OK, that's great.

But, I really don't think we can ask upstream to merge this patch to generic
code when we don't have a good explanation for why it's necessary. At least I'm
not going to ask anyone to do that :)

So Preeti, can you either write a 100% convincing explanation of why this patch
is correct in the general case, or (preferably) do some more investigating to
work out what Alexey's bug actually is.

cheers
Preeti U Murthy Jan. 16, 2015, 8:56 a.m. UTC | #3
On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>> Hi Alexey,
>>>
>>> Can you let me know if the following patch fixes the issue for you ?
>>> It did for us on one of our machines that we were investigating on.
>>
>> This fixes the issue for me as well, thanks!
>>
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	
> 
> OK, that's great.
> 
> But, I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least I'm
> not going to ask anyone to do that :)
> 
> So Pretti can you either write a 100% convincing explanation of why this patch
> is correct in the general case, or (preferably) do some more investigating to
> work out what Alexey's bug actually is.

Yes, I will do so. It's better to investigate precisely where the bug is.
This patch helped me narrow down on the buggy scenario.

Regards
Preeti U Murthy
> 
> cheers
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Preeti U Murthy Jan. 16, 2015, 9:10 a.m. UTC | #4
On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> On 01/16/2015 08:34 AM, Michael Ellerman wrote:
>> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>>> Hi Alexey,
>>>>
>>>> Can you let me know if the following patch fixes the issue for you ?
>>>> It did for us on one of our machines that we were investigating on.
>>>
>>> This fixes the issue for me as well, thanks!
>>>
>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	
>>
>> OK, that's great.
>>
>> But, I really don't think we can ask upstream to merge this patch to generic
>> code when we don't have a good explanation for why it's necessary. At least I'm
>> not going to ask anyone to do that :)
>>
>> So Pretti can you either write a 100% convincing explanation of why this patch
>> is correct in the general case, or (preferably) do some more investigating to
>> work out what Alexey's bug actually is.
> 
> Yes will do so. Its better to investigate where precisely is the bug.
> This patch helped me narrow down on the buggy scenario.

On a side note, while I was tracking the race condition, I noticed that
in the final stage of the cpu offline path, after the state of the
hotplugged cpu is set to CPU_DEAD, we check whether any interrupts were
delivered during the soft-disabled state and service them if there were.
It makes sense to check for pending interrupts in the idle path. In the
offline path, however, this did not look right to me at first glance. Am
I missing something?
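For readers unfamiliar with the mechanism being questioned, here is a simplified model of powerpc's lazy (soft) interrupt disabling; the struct and field names are illustrative stand-ins, not the kernel's actual paca layout. While soft-disabled, a hardware interrupt is only latched as pending, and it is replayed when interrupts are soft-enabled again, which is the replay check the offline path performs after CPU_DEAD.

```c
/* Toy model of lazy (soft) interrupt disabling on powerpc. Field
 * names are illustrative, not the kernel's real paca fields. */
#include <assert.h>

struct cpu_state {
	int soft_enabled;	/* 0 = soft-disabled */
	int irq_happened;	/* interrupts latched while soft-disabled */
	int handled;		/* interrupts actually serviced */
};

/* A hardware interrupt arrives: serviced immediately if soft-enabled,
 * otherwise only latched for later replay. */
static void take_irq(struct cpu_state *c)
{
	if (c->soft_enabled)
		c->handled++;
	else
		c->irq_happened++;	/* latch for later replay */
}

/* Soft-enable and replay anything that was latched in the meantime.
 * This is the step the offline path performs after CPU_DEAD. */
static void soft_enable_and_replay(struct cpu_state *c)
{
	c->soft_enabled = 1;
	c->handled += c->irq_happened;	/* replay pending interrupts */
	c->irq_happened = 0;
}
```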

Regards
Preeti U Murthy

> 
> Regards
> Preeti U Murthy
>>
>> cheers
Preeti U Murthy Jan. 17, 2015, 1:39 p.m. UTC | #5
On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>> Hi Alexey,
>>>
>>> Can you let me know if the following patch fixes the issue for you ?
>>> It did for us on one of our machines that we were investigating on.
>>
>> This fixes the issue for me as well, thanks!
>>
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	
> 
> OK, that's great.
> 
> But, I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least I'm
> not going to ask anyone to do that :)
> 
> So Pretti can you either write a 100% convincing explanation of why this patch
> is correct in the general case, or (preferably) do some more investigating to
> work out what Alexey's bug actually is.

On further investigation, I found that the issue lies in the latency of
the cpu hotplug operation, specifically the time taken for the offline cpu
to enter powersave mode.

The time between the beginning of the cpu hotplug operation and the
beginning of the __cpu_die() operation (one of the last stages of
cpu hotplug) is around 40ms at most. Although this is not
causing softlockups, it is quite a large duration.

The more serious issue is the time taken for the __cpu_die() operation to
complete. __cpu_die() waits for the offline cpu to set its
state to CPU_DEAD before entering the powersave state. This time varies from
4s to a maximum of 200s! It is not always this bad, but it does happen
quite a few times, and it is during these times that we observe softlockups.
I added trace prints throughout the cpu hotplug code to measure these
numbers. This delay is causing the softlockup, and here is why.

If the cpu going offline is the one broadcasting wakeups to cpus in
fastsleep, it queues the broadcast timer on another cpu during the
CPU_DEAD phase. The CPU_DEAD notifiers are run only after the
__cpu_die() operation completes, which, as mentioned above, can take a
long time. So between the time irqs are migrated off the about-to-go-offline
cpu and the CPU_DEAD stage, no cpu can be woken up. The numbers above
show that this can be a horridly long time. Hence, the next time
the sleeping cpus are woken up, the unnaturally long idle time is detected
and the softlockup triggers.
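The window described above can be made concrete with a toy timeline model. All names and numbers below are hypothetical, chosen only to illustrate the argument: a wakeup that falls between irq migration and the CPU_DEAD handover is delivered only when the broadcast timer is finally re-queued, so the oversleep grows with the __cpu_die() latency.

```c
/* Toy model of the broadcast-less window during cpu offline.
 * Times are abstract milliseconds; names are illustrative only. */
#include <assert.h>
#include <stdint.h>

/* How long a deep-idle cpu oversleeps past its intended wakeup:
 * a wakeup due inside [irq_migrate, cpu_dead) cannot be delivered
 * until the broadcast timer is re-queued at CPU_DEAD time. */
static int64_t oversleep(int64_t irq_migrate, int64_t cpu_dead,
			 int64_t wakeup)
{
	if (wakeup >= irq_migrate && wakeup < cpu_dead)
		return cpu_dead - wakeup;
	return 0;	/* wakeup outside the window: delivered on time */
}
```

With a __cpu_die() latency on the order of 200s, as measured above, the oversleep for an early wakeup approaches the full window, comfortably exceeding the watchdog's threshold, which is consistent with the 23s soft-lockup reports.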

The patch I proposed on this thread covered up the problem by
allowing the remaining cpus to freshly reevaluate their wakeups after
the stop_machine phase, without having to depend on the previous
broadcast state. So it did not matter what the previously appointed
broadcast cpu was up to. However, there are still corner cases which this
patch cannot solve, and understandably so, because it does not address
the core issue, which is how to get around the latency of cpu hotplug.

There may be ways in which the broadcast timer can be migrated in time
during hotplug to get around the softlockups, but the latency of the cpu
hotplug operation looks like a serious issue in itself. Has anybody observed
or explicitly instrumented the cpu hotplug operation before and happened to
notice the large amount of time required for its completion?

Ccing Paul.

Thanks

Regards
Preeti U Murthy
> 
> cheers
Michael Ellerman Jan. 22, 2015, 5:29 a.m. UTC | #6
On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
> On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> > On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> >> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> >>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> >>>> Hi Alexey,
> >>>>
> >>>> Can you let me know if the following patch fixes the issue for you ?
> >>>> It did for us on one of our machines that we were investigating on.
> >>>
> >>> This fixes the issue for me as well, thanks!
> >>>
> >>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	
> >>
> >> OK, that's great.
> >>
> >> But, I really don't think we can ask upstream to merge this patch to generic
> >> code when we don't have a good explanation for why it's necessary. At least I'm
> >> not going to ask anyone to do that :)
> >>
> >> So Pretti can you either write a 100% convincing explanation of why this patch
> >> is correct in the general case, or (preferably) do some more investigating to
> >> work out what Alexey's bug actually is.
> > 
> > Yes will do so. Its better to investigate where precisely is the bug.
> > This patch helped me narrow down on the buggy scenario.
> 
> On a side note, while I was tracking the race condition, I noticed that
> in the final stage of the cpu offline path, after the state of the
> hotplugged cpu is set to CPU_DEAD, we check if there were interrupts
> delivered during the soft disabled state and service them if there were.
> It makes sense to check for pending interrupts in the idle path. In the
> offline path however, this did not look right to me at first glance. Am
> I missing something ?

That does sound a bit fishy.

I guess we're just assuming that all interrupts have been migrated away prior
to the offline?

cheers
Preeti U Murthy Jan. 22, 2015, 6:31 a.m. UTC | #7
On 01/22/2015 10:59 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
>> On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
>>> On 01/16/2015 08:34 AM, Michael Ellerman wrote:
>>>> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>>>>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>>>>> Hi Alexey,
>>>>>>
>>>>>> Can you let me know if the following patch fixes the issue for you ?
>>>>>> It did for us on one of our machines that we were investigating on.
>>>>>
>>>>> This fixes the issue for me as well, thanks!
>>>>>
>>>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>	
>>>>
>>>> OK, that's great.
>>>>
>>>> But, I really don't think we can ask upstream to merge this patch to generic
>>>> code when we don't have a good explanation for why it's necessary. At least I'm
>>>> not going to ask anyone to do that :)
>>>>
>>>> So Pretti can you either write a 100% convincing explanation of why this patch
>>>> is correct in the general case, or (preferably) do some more investigating to
>>>> work out what Alexey's bug actually is.
>>>
>>> Yes will do so. Its better to investigate where precisely is the bug.
>>> This patch helped me narrow down on the buggy scenario.
>>
>> On a side note, while I was tracking the race condition, I noticed that
>> in the final stage of the cpu offline path, after the state of the
>> hotplugged cpu is set to CPU_DEAD, we check if there were interrupts
>> delivered during the soft disabled state and service them if there were.
>> It makes sense to check for pending interrupts in the idle path. In the
>> offline path however, this did not look right to me at first glance. Am
>> I missing something ?
> 
> That does sound a bit fishy.
> 
> I guess we're just assuming that all interrupts have been migrated away prior
> to the offline?

Yes, they have. Interrupts to start a guest will also not get delivered,
because the hwthread_state is not yet set to nap at that point.

Regards
Preeti U Murthy
> 
> cheers
diff mbox

Patch

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@  int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@  int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..1f5eda6 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -677,6 +677,7 @@  static void broadcast_move_bc(int deadcpu)
 		return;
 	/* This moves the broadcast assignment to this cpu */
 	clockevents_program_event(bc, bc->next_event, 1);
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*