diff mbox

tick/broadcast: Make movement of broadcast hrtimer robust against hotplug

Message ID 20150119102647.23035.13380.stgit@preeti.in.ibm.com (mailing list archive)
State Superseded
Delegated to: Michael Ellerman
Headers show

Commit Message

Preeti U Murthy Jan. 19, 2015, 10:26 a.m. UTC
Today if a cpu handling broadcasting of wakeups goes offline, it hands over
the job of broadcasting to another cpu in the CPU_DEAD phase. The CPU_DEAD
notifiers are run only after the offline cpu sets its state as CPU_DEAD.
Meanwhile, the kthread doing the offline is scheduled out while waiting for
this transition by queuing a timer. This is fatal because if the cpu on which
this kthread was running has no other work queued on it, it can re-enter deep
idle state, since it sees that a broadcast cpu still exists. However the broadcast
wakeup will never come since the cpu which was handling it is offline, and this cpu
never wakes up to see this because its in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the cpus
entering deep idle states henceforth to freshly nominate the broadcast cpu. More
importantly this has to be done in the CPU_DYING phase so that it is visible to
all cpus right after exiting stop_machine, which is when they can re-enter idle.
This ensures that handover of the broadcast duty falls in place on offline, without
having to do it explicitly.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Comments

Michael Ellerman Jan. 20, 2015, 6:09 a.m. UTC | #1
On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote:
> Today if a cpu handling broadcasting of wakeups goes offline, it hands over

It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever
one at a time.

> the job of broadcasting to another cpu in the CPU_DEAD phase. 

I think that would be clearer as "to another cpu, when the cpu going offline
reaches the CPU_DEAD state."

Otherwise it can read as "another cpu (which is) in the CPU_DEAD phase", which
is not what you mean - I think.

> The CPU_DEAD notifiers are run only after the offline cpu sets its state as
> CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while

The kthread which is running on a different cpu from either of the first two
cpus you've mentioned.

> waiting for this transition by queuing a timer. This is fatal because if the
> cpu on which this kthread was running has no other work queued on it, it can
> re-enter deep idle state, since it sees that a broadcast cpu still exists.
> However the broadcast wakeup will never come since the cpu which was handling
> it is offline, and this cpu never wakes up to see this because its in deep
> idle state.

Which cpu is "this cpu"? I think you mean the one running the kthread which is
doing the offline, but it's not 100% clear.

> Fix this by setting the broadcast timer to a max value so as to force the cpus
> entering deep idle states henceforth to freshly nominate the broadcast cpu. More
> importantly this has to be done in the CPU_DYING phase so that it is visible to
> all cpus right after exiting stop_machine, which is when they can re-enter idle.
> This ensures that handover of the broadcast duty falls in place on offline, without
> having to do it explicitly.

OK, I don't know the code well enough to say if that's the right fix.

You say:

+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;

Is that all it does? I see that check in several places in the code.


I assume we're expecting Thomas to merge this?

If so it's probably worth mentioning that it fixes a bug we are seeing on
machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to
stable.

cheers
Preeti U Murthy Jan. 20, 2015, 6:58 a.m. UTC | #2
On 01/20/2015 11:39 AM, Michael Ellerman wrote:
> On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote:
>> Today if a cpu handling broadcasting of wakeups goes offline, it hands over
> 
> It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever
> one at a time.

Right, thanks for pointing this.

> 
>> the job of broadcasting to another cpu in the CPU_DEAD phase. 
> 
> I think that would be clearer as "to another cpu, when the cpu going offline
> reaches the CPU_DEAD state."
> 
> Otherwise it can read as "another cpu (which is) in the CPU_DEAD phase", which
> is not what you mean - I think.

Yes I will word this better.
> 
>> The CPU_DEAD notifiers are run only after the offline cpu sets its state as
>> CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
> 
> The kthread which is running on a different cpu from either of the first two
> cpus you've mentioned.

It could be running on any cpu other than the offline one. The next line
clarifies this - "This is fatal if the cpu on which this kthread was
running". I also say "Meanwhile, the kthread doing the offline" above so
as to clarify that the offline cpu has nothing to do with this kthread.

> 
>> waiting for this transition by queuing a timer. This is fatal because if the
>> cpu on which this kthread was running has no other work queued on it, it can
>> re-enter deep idle state, since it sees that a broadcast cpu still exists.
>> However the broadcast wakeup will never come since the cpu which was handling
>> it is offline, and this cpu never wakes up to see this because its in deep
>> idle state.
> 
> Which cpu is "this cpu"? I think you mean the one running the kthread which is
> doing the offline, but it's not 100% clear.

Right, I will make this correction as well, its ambiguous.

> 
>> Fix this by setting the broadcast timer to a max value so as to force the cpus
>> entering deep idle states henceforth to freshly nominate the broadcast cpu. More
>> importantly this has to be done in the CPU_DYING phase so that it is visible to
>> all cpus right after exiting stop_machine, which is when they can re-enter idle.
>> This ensures that handover of the broadcast duty falls in place on offline, without
>> having to do it explicitly.
> 
> OK, I don't know the code well enough to say if that's the right fix.
> 
> You say:
> 
> +	/* This allows fresh nomination of broadcast cpu */
> +	bc->next_event.tv64 = KTIME_MAX;
> 
> Is that all it does? I see that check in several places in the code.

This change does not affect those parts of the tick broadcast code,
which do not depend on the hrtimer broadcast framework. And for those
parts that do depend on this framework, this plays a critical role in
handing over the broadcast duty.

> 
> 
> I assume we're expecting Thomas to merge this?

Yes, Thomas can you please take this into the next 3.19 rc release ? I
will send out a new patch with the modified changelog. If you find this
acceptable I will port it to the relevant stable kernels.
> 
> If so it's probably worth mentioning that it fixes a bug we are seeing on

I will mention this in the changelog by pointing to the bug report.

Regards
Preeti U Murthy
> machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to
> stable.
> 
> cheers
>
diff mbox

Patch

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@  int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@  int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..e9c1d9b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,8 @@  static void broadcast_move_bc(int deadcpu)
 
 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*