Message ID | 20140509174712.55fe72d0@kryten (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
Hi Anton, On 05/09/2014 01:17 PM, Anton Blanchard wrote: > I am seeing an issue where a CPU running perf eventually hangs. > Traces show timer interrupts happening every 4 seconds even > when a userspace task is running on the CPU. /proc/timer_list > also shows pending hrtimers have not run in over an hour, > including the scheduler. > > Looking closer, decrementers_next_tb is getting set to > 0xffffffffffffffff, and at that point we will never take > a timer interrupt again. > > In __timer_interrupt() we set decrementers_next_tb to > 0xffffffffffffffff and rely on ->event_handler to update it: > > *next_tb = ~(u64)0; > if (evt->event_handler) > evt->event_handler(evt); > > In this case ->event_handler is hrtimer_interrupt. This will eventually > call back through the clockevents code with the next event to be > programmed: > > static int decrementer_set_next_event(unsigned long evt, > struct clock_event_device *dev) > { > /* Don't adjust the decrementer if some irq work is pending */ > if (test_irq_work_pending()) > return 0; > __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; > > If irq work came in between these two points, we will return > before updating decrementers_next_tb and we never process a timer > interrupt again. > > This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races > with irq_work). Fix it by removing the early exit and relying on > code later on in the function to force an early decrementer: > > /* We may have raced with new irq work */ > if (test_irq_work_pending()) > set_dec(1); > There is another scenario we are missing. Its not necessary that on a timer interrupt the event handler will call back through the set_next_event(). If there are no pending timers then the event handler will not bother programming the tick device and simply return.IOW, set_next_event() will not be called. In that case we will miss taking care of pending irq work altogether. __timer_interrupt() -> event_handler -> next_time = KTIME_MAX -> __timer_interrupt(). In __timer_interrupt() we do not check for pending irq anywhere after the call to the event handler and we hence miss servicing irqs in the above scenario. How about you also move the check: if (test_irq_pending()) set_dec(1) in __timer_interrupt() outside the _else_ loop? This will ensure that no matter what, before exiting timer interrupt handler we check for pending irq work. Regards Preeti U Murthy > Signed-off-by: Anton Blanchard <anton@samba.org> > Cc: stable@vger.kernel.org # 3.14+ > --- > > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c > index 122a580..4f0b676 100644 > --- a/arch/powerpc/kernel/time.c > +++ b/arch/powerpc/kernel/time.c > @@ -813,9 +888,6 @@ static void __init clocksource_init(void) > static int decrementer_set_next_event(unsigned long evt, > struct clock_event_device *dev) > { > - /* Don't adjust the decrementer if some irq work is pending */ > - if (test_irq_work_pending()) > - return 0; > __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; > set_dec(evt); How about if you move the test_irq_work_pending Why do we have test_irq_work_pending() later in the function decrementer_set_next_event()? > > _______________________________________________ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: > I am seeing an issue where a CPU running perf eventually hangs. > Traces show timer interrupts happening every 4 seconds even > when a userspace task is running on the CPU. Is this by chance every 4.2 seconds? The reason I ask is that Paul Clarke and I are seeing an interrupt every 4.2 seconds when he runs NO_HZ_FULL, and are trying to get rid of it. ;-) Thanx, Paul > /proc/timer_list > also shows pending hrtimers have not run in over an hour, > including the scheduler. > > Looking closer, decrementers_next_tb is getting set to > 0xffffffffffffffff, and at that point we will never take > a timer interrupt again. > > In __timer_interrupt() we set decrementers_next_tb to > 0xffffffffffffffff and rely on ->event_handler to update it: > > *next_tb = ~(u64)0; > if (evt->event_handler) > evt->event_handler(evt); > > In this case ->event_handler is hrtimer_interrupt. This will eventually > call back through the clockevents code with the next event to be > programmed: > > static int decrementer_set_next_event(unsigned long evt, > struct clock_event_device *dev) > { > /* Don't adjust the decrementer if some irq work is pending */ > if (test_irq_work_pending()) > return 0; > __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; > > If irq work came in between these two points, we will return > before updating decrementers_next_tb and we never process a timer > interrupt again. > > This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races > with irq_work). Fix it by removing the early exit and relying on > code later on in the function to force an early decrementer: > > /* We may have raced with new irq work */ > if (test_irq_work_pending()) > set_dec(1); > > Signed-off-by: Anton Blanchard <anton@samba.org> > Cc: stable@vger.kernel.org # 3.14+ > --- > > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c > index 122a580..4f0b676 100644 > --- a/arch/powerpc/kernel/time.c > +++ b/arch/powerpc/kernel/time.c > @@ -813,9 +888,6 @@ static void __init clocksource_init(void) > static int decrementer_set_next_event(unsigned long evt, > struct clock_event_device *dev) > { > - /* Don't adjust the decrementer if some irq work is pending */ > - if (test_irq_work_pending()) > - return 0; > __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; > set_dec(evt); > >
On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote: > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: > > I am seeing an issue where a CPU running perf eventually hangs. > > Traces show timer interrupts happening every 4 seconds even > > when a userspace task is running on the CPU. > > Is this by chance every 4.2 seconds? The reason I ask is that > Paul Clarke and I are seeing an interrupt every 4.2 seconds when > he runs NO_HZ_FULL, and are trying to get rid of it. ;-) Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious? Gabriel
On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote: > On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote: > > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: > > > I am seeing an issue where a CPU running perf eventually hangs. > > > Traces show timer interrupts happening every 4 seconds even > > > when a userspace task is running on the CPU. > > > > Is this by chance every 4.2 seconds? The reason I ask is that > > Paul Clarke and I are seeing an interrupt every 4.2 seconds when > > he runs NO_HZ_FULL, and are trying to get rid of it. ;-) > > Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious? Now that you mention it... ;-) So you are telling me that we are not succeeding in completely turning off the decrementer interrupt? Thanx, Paul
On Fri, May 09, 2014 at 03:08:45PM -0700, Paul E. McKenney wrote: > On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote: > > On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote: > > > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: > > > > I am seeing an issue where a CPU running perf eventually hangs. > > > > Traces show timer interrupts happening every 4 seconds even > > > > when a userspace task is running on the CPU. > > > > > > Is this by chance every 4.2 seconds? The reason I ask is that > > > Paul Clarke and I are seeing an interrupt every 4.2 seconds when > > > he runs NO_HZ_FULL, and are trying to get rid of it. ;-) > > > > Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious? > > Now that you mention it... ;-) > > So you are telling me that we are not succeeding in completely turning > off the decrementer interrupt? There is no way to turn off the decrementer interrupt without turning off external (device) interrupts. On IBM Power CPUs since POWER6, the decrementer runs at 512MHz. If you set the decrementer to 0x7fffffff it will interrupt in 4.194 seconds, so that would be what you're seeing. The only way to avoid the interrupt becoming pending is to keep on setting it to a large value before it gets to -1. If an interrupt every 4.2 seconds is a problem in some applications, then we need to talk to the Power architects. Regards, Paul.
On Sat, May 10, 2014 at 04:33:37PM +1000, Paul Mackerras wrote: > On Fri, May 09, 2014 at 03:08:45PM -0700, Paul E. McKenney wrote: > > On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote: > > > On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote: > > > > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote: > > > > > I am seeing an issue where a CPU running perf eventually hangs. > > > > > Traces show timer interrupts happening every 4 seconds even > > > > > when a userspace task is running on the CPU. > > > > > > > > Is this by chance every 4.2 seconds? The reason I ask is that > > > > Paul Clarke and I are seeing an interrupt every 4.2 seconds when > > > > he runs NO_HZ_FULL, and are trying to get rid of it. ;-) > > > > > > Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious? > > > > Now that you mention it... ;-) > > > > So you are telling me that we are not succeeding in completely turning > > off the decrementer interrupt? > > There is no way to turn off the decrementer interrupt without turning > off external (device) interrupts. > > On IBM Power CPUs since POWER6, the decrementer runs at 512MHz. If > you set the decrementer to 0x7fffffff it will interrupt in 4.194 > seconds, so that would be what you're seeing. The only way to avoid > the interrupt becoming pending is to keep on setting it to a large > value before it gets to -1. > > If an interrupt every 4.2 seconds is a problem in some applications, > then we need to talk to the Power architects. Thank you for filling me in on this! Might be worth doing just that. Thanx, Paul
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 122a580..4f0b676 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,9 +888,6 @@ static void __init clocksource_init(void) static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev) { - /* Don't adjust the decrementer if some irq work is pending */ - if (test_irq_work_pending()) - return 0; __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; set_dec(evt);
I am seeing an issue where a CPU running perf eventually hangs. Traces show timer interrupts happening every 4 seconds even when a userspace task is running on the CPU. /proc/timer_list also shows pending hrtimers have not run in over an hour, including the scheduler. Looking closer, decrementers_next_tb is getting set to 0xffffffffffffffff, and at that point we will never take a timer interrupt again. In __timer_interrupt() we set decrementers_next_tb to 0xffffffffffffffff and rely on ->event_handler to update it: *next_tb = ~(u64)0; if (evt->event_handler) evt->event_handler(evt); In this case ->event_handler is hrtimer_interrupt. This will eventually call back through the clockevents code with the next event to be programmed: static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev) { /* Don't adjust the decrementer if some irq work is pending */ if (test_irq_work_pending()) return 0; __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt; If irq work came in between these two points, we will return before updating decrementers_next_tb and we never process a timer interrupt again. This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races with irq_work). Fix it by removing the early exit and relying on code later on in the function to force an early decrementer: /* We may have raced with new irq work */ if (test_irq_work_pending()) set_dec(1); Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org # 3.14+ ---