Message ID | 54B7BEF9.8090603@linux.vnet.ibm.com (mailing list archive)
---|---
State | Superseded
Delegated to | Michael Ellerman
On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> Hi Alexey,
>
> Can you let me know if the following patch fixes the issue for you ?
> It did for us on one of our machines that we were investigating on.

This fixes the issue for me as well, thanks!

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

> Anton,
> Can you let me know if the patch fixes the odd perf report that you observed?
> This is the latest fix that there is to it.
>
> -------------------------------START PATCH--------------------------------------
>
> tick/broadcast-hrtimer: Make movement of broadcast hrtimer robust against cpu offline
>
> From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>
> When a cpu on which the broadcast hrtimer is queued goes offline, the hrtimer
> needs to be moved to another cpu. There may be potential race conditions here,
> which can lead to the hrtimer not being queued on any cpu.
>
> There was a report of softlockups, with cpus being stuck in deep idle states,
> when smt mode switching was being done, especially on systems with a smaller
> number of cpus [1]. This is hard to reproduce on a large machine because the
> chances that an offline cpu was the standby cpu handling broadcast are lower.
> Given the current code, the only situation that looks capable of triggering
> scenarios where broadcast IPIs are not delivered is the cpu offline path. I am
> at a loss to pinpoint the precise scenario.
>
> Therefore, to avoid any possible race condition between cpu offline and
> movement of the broadcast hrtimer, this patch ensures that the act of keeping
> track of broadcasting wakeup IPIs is started afresh after a cpu offline
> operation. This is done by resetting the expiry time of the hrtimer to a max
> value. This has to be done in the CPU_DYING phase so that it is visible to
> all cpus right after exiting stop_machine.
> The rationale is that during cpu offline, since all cpus are woken up anyway
> to run stop_machine, we are only required to keep track of broadcasting to
> cpus that henceforth enter deep idle states. This ensures that the broadcast
> hrtimer gets positively queued on a cpu as long as there are cpus in deep
> idle states.
>
> It has to be noted that this code is in addition to retaining the logic that
> we have today in the broadcast code on an offline operation. If this logic
> fails to move the broadcast hrtimer due to a race condition, we have the
> following patch to handle it right.
>
> [1] http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
> There is no issue in programming the decrementer as was presumed and stated
> in this link.
>
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> ---
>  kernel/time/clockevents.c    | 2 +-
>  kernel/time/tick-broadcast.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 5544990..f3907c9 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
>
>  	case CLOCK_EVT_NOTIFY_CPU_DYING:
>  		tick_handover_do_timer(arg);
> +		tick_shutdown_broadcast_oneshot(arg);
>  		break;
>
>  	case CLOCK_EVT_NOTIFY_SUSPEND:
> @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
>  		break;
>
>  	case CLOCK_EVT_NOTIFY_CPU_DEAD:
> -		tick_shutdown_broadcast_oneshot(arg);
>  		tick_shutdown_broadcast(arg);
>  		tick_shutdown(arg);
>  		/*
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 066f0ec..1f5eda6 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -677,6 +677,7 @@ static void broadcast_move_bc(int deadcpu)
>  		return;
>  	/* This moves the broadcast assignment to this cpu */
>  	clockevents_program_event(bc, bc->next_event, 1);
> +	bc->next_event.tv64 = KTIME_MAX;
>  }
>
>  /*
>
> ---------------------------------------END PATCH-------------------------
>
> Thanks
>
> Regards
> Preeti U Murthy
>
> On 01/07/2015 03:07 PM, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream
>> kernel (sha1 bdec419):
>>
>> NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]
>>
>> or
>>
>> INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
>> 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31} (detected by 6, t=2102
>> jiffies, g=1617, c=1616, q=1441)
>>
>> and many others, all about lockups.
>>
>> I did bisecting and found out that reverting these helps:
>>
>> 77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle
>> support for offline cpus
>> 7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle
>> states management
>> 8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline
>> CPUs to enter deep idle states
>>
>> btw reverting just two of them produces a compile error.
>>
>> It is pseries_le_defconfig, POWER8 machine:
>> timebase : 512000000
>> platform : PowerNV
>> model : palmetto
>> machine : PowerNV palmetto
>> firmware : OPAL v3
>>
>> Please help to fix it. Thanks.
On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> > Hi Alexey,
> >
> > Can you let me know if the following patch fixes the issue for you ?
> > It did for us on one of our machines that we were investigating on.
>
> This fixes the issue for me as well, thanks!
>
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

OK, that's great.

But, I really don't think we can ask upstream to merge this patch to generic
code when we don't have a good explanation for why it's necessary. At least
I'm not going to ask anyone to do that :)

So Preeti, can you either write a 100% convincing explanation of why this
patch is correct in the general case, or (preferably) do some more
investigating to work out what Alexey's bug actually is.

cheers
On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>> Hi Alexey,
>>>
>>> Can you let me know if the following patch fixes the issue for you ?
>>> It did for us on one of our machines that we were investigating on.
>>
>> This fixes the issue for me as well, thanks!
>>
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> OK, that's great.
>
> But, I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least
> I'm not going to ask anyone to do that :)
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

Yes, will do so. It's better to investigate where precisely the bug is.
This patch helped me narrow down on the buggy scenario.

Regards
Preeti U Murthy

> cheers
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> On 01/16/2015 08:34 AM, Michael Ellerman wrote:
>> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>>> Hi Alexey,
>>>>
>>>> Can you let me know if the following patch fixes the issue for you ?
>>>> It did for us on one of our machines that we were investigating on.
>>>
>>> This fixes the issue for me as well, thanks!
>>>
>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> OK, that's great.
>>
>> But, I really don't think we can ask upstream to merge this patch to
>> generic code when we don't have a good explanation for why it's necessary.
>> At least I'm not going to ask anyone to do that :)
>>
>> So Preeti, can you either write a 100% convincing explanation of why this
>> patch is correct in the general case, or (preferably) do some more
>> investigating to work out what Alexey's bug actually is.
>
> Yes, will do so. It's better to investigate where precisely the bug is.
> This patch helped me narrow down on the buggy scenario.

On a side note, while I was tracking the race condition, I noticed that in
the final stage of the cpu offline path, after the state of the hotplugged
cpu is set to CPU_DEAD, we check if there were interrupts delivered during
the soft disabled state and service them if there were. It makes sense to
check for pending interrupts in the idle path. In the offline path, however,
this did not look right to me at first glance. Am I missing something?

Regards
Preeti U Murthy

> cheers
On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>> Hi Alexey,
>>>
>>> Can you let me know if the following patch fixes the issue for you ?
>>> It did for us on one of our machines that we were investigating on.
>>
>> This fixes the issue for me as well, thanks!
>>
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> OK, that's great.
>
> But, I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least
> I'm not going to ask anyone to do that :)
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

On further investigation, I found that the issue lies in the latency of the
cpu hotplug operation, specifically the time taken for the offline cpu to
enter powersave mode.

The time between the beginning of the cpu hotplug operation and the beginning
of the __cpu_die() operation (which is one of the last stages of cpu hotplug)
is at most around 40ms. Although this is not causing softlockups, it is quite
a large duration.

The more serious issue is the time taken for the __cpu_die() operation to
complete. The __cpu_die() operation waits for the offline cpu to set its
state to CPU_DEAD before entering powersave state. This time varies from 4s
to a maximum of 200s! It is not always this bad, but it does happen quite a
few times, and it is during these times that we observe softlockups. I added
trace prints throughout the cpu hotplug code to measure these numbers.

This delay is causing the softlockup, and here is why. If the cpu going
offline is the one broadcasting wakeups to cpus in fastsleep, it queues the
broadcast timer on another cpu during the CPU_DEAD phase. The CPU_DEAD
notifiers are run only after the __cpu_die() operation completes, which is
taking a long time as mentioned above. So between the time irqs are migrated
off the about-to-go-offline cpu and the CPU_DEAD stage, no cpu can be woken
up. The above numbers show that this can be a horridly long time. Hence, the
next time they get woken up, the unnatural idle time is detected and the
softlockup triggers.

The patch on this thread that I proposed covered up the problem by allowing
the remaining cpus to freshly reevaluate their wakeups after the stop machine
phase, without having to depend on the previous broadcast state. So it did
not matter what the previously appointed broadcast cpu was up to. However,
there are still corner cases which cannot be solved with this patch, and
understandably so, because it is not addressing the core issue, which is how
to get around the latency of cpu hotplug.

There can be ways in which the broadcast timer could be migrated in time
during hotplug to get around the softlockups, but the latency of the cpu
hotplug operation looks like a serious issue. Has anybody observed or
explicitly instrumented the cpu hotplug operation before and happened to
notice the large time duration required for its completion?

Ccing Paul.

Thanks

Regards
Preeti U Murthy

> cheers
On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
> On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> > On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> >> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> >>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> >>>> Hi Alexey,
> >>>>
> >>>> Can you let me know if the following patch fixes the issue for you ?
> >>>> It did for us on one of our machines that we were investigating on.
> >>>
> >>> This fixes the issue for me as well, thanks!
> >>>
> >>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> OK, that's great.
> >>
> >> But, I really don't think we can ask upstream to merge this patch to
> >> generic code when we don't have a good explanation for why it's
> >> necessary. At least I'm not going to ask anyone to do that :)
> >>
> >> So Preeti, can you either write a 100% convincing explanation of why this
> >> patch is correct in the general case, or (preferably) do some more
> >> investigating to work out what Alexey's bug actually is.
> >
> > Yes, will do so. It's better to investigate where precisely the bug is.
> > This patch helped me narrow down on the buggy scenario.
>
> On a side note, while I was tracking the race condition, I noticed that in
> the final stage of the cpu offline path, after the state of the hotplugged
> cpu is set to CPU_DEAD, we check if there were interrupts delivered during
> the soft disabled state and service them if there were. It makes sense to
> check for pending interrupts in the idle path. In the offline path, however,
> this did not look right to me at first glance. Am I missing something?

That does sound a bit fishy.

I guess we're just assuming that all interrupts have been migrated away prior
to the offline?

cheers
On 01/22/2015 10:59 AM, Michael Ellerman wrote:
> On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
>> On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
>>> On 01/16/2015 08:34 AM, Michael Ellerman wrote:
>>>> On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
>>>>> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
>>>>>> Hi Alexey,
>>>>>>
>>>>>> Can you let me know if the following patch fixes the issue for you ?
>>>>>> It did for us on one of our machines that we were investigating on.
>>>>>
>>>>> This fixes the issue for me as well, thanks!
>>>>>
>>>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> OK, that's great.
>>>>
>>>> But, I really don't think we can ask upstream to merge this patch to
>>>> generic code when we don't have a good explanation for why it's
>>>> necessary. At least I'm not going to ask anyone to do that :)
>>>>
>>>> So Preeti, can you either write a 100% convincing explanation of why
>>>> this patch is correct in the general case, or (preferably) do some more
>>>> investigating to work out what Alexey's bug actually is.
>>>
>>> Yes, will do so. It's better to investigate where precisely the bug is.
>>> This patch helped me narrow down on the buggy scenario.
>>
>> On a side note, while I was tracking the race condition, I noticed that in
>> the final stage of the cpu offline path, after the state of the hotplugged
>> cpu is set to CPU_DEAD, we check if there were interrupts delivered during
>> the soft disabled state and service them if there were. It makes sense to
>> check for pending interrupts in the idle path. In the offline path,
>> however, this did not look right to me at first glance. Am I missing
>> something?
>
> That does sound a bit fishy.
>
> I guess we're just assuming that all interrupts have been migrated away
> prior to the offline?

Yes, they have. Interrupts to start a guest will also not get delivered,
because the hwthread_state is not set to nap yet at that point.

Regards
Preeti U Murthy

> cheers
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..1f5eda6 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -677,6 +677,7 @@ static void broadcast_move_bc(int deadcpu)
 		return;
 	/* This moves the broadcast assignment to this cpu */
 	clockevents_program_event(bc, bc->next_event, 1);
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*