| Submitter | Chase Douglas |
|---|---|
| Date | April 20, 2010, 1:46 p.m. |
| Message ID | <1271771163-2779-1-git-send-email-chase.douglas@canonical.com> |
| Download | mbox | patch |
| Permalink | /patch/50542/ |
| State | Accepted |
| Delegated to: | Stefan Bader |
| Headers | show |
Comments
I think I am a little confused by other discussions going on on this. A bit it sounds like the moment you submitted this there was Peter responding with yet another idea. Or am I just confused? -Stefan
On Tue, Apr 20, 2010 at 2:14 PM, Stefan Bader <stefan.bader@canonical.com> wrote: > I think I am a little confused by other discussions going on on this. A bit it > sounds like the moment you submitted this there was Peter responding with yet > another idea. Or am I just confused? Peter Zijlstra has another idea about using the idle load balancing mechanism to do load accounting. However, his idea is so far not fully thought out, and not tested at all I presume. I pointed out a few concerns with his approach, and he agreed there are issues that would need to be worked around. Right now, the only tested and reasonable solution is my patch. I'm waiting to hear back from him on why he feels using the ILB mechanism is a better approach, but I've also made clear that I don't understand the code and can't create a solution myself. I think you really need to understand ILB deep down to do it right. I think it makes sense to put this in Karmic and leave the patch in Lucid for now until upstream decides its final direction. P.S.: I CC'd kernel-team cause I figured we all would like to see the discussion, but Peter un-CC'd it cause of bounces. Is there a proper way to do this, or should we just not CC our list? -- Chase
On 04/20/2010 12:28 PM, Chase Douglas wrote: > P.S.: I CC'd kernel-team cause I figured we all would like to see the > discussion, but Peter un-CC'd it cause of bounces. Is there a proper > way to do this, or should we just not CC our list? > > -- Chase > I turned off the option in the list serv that annoys unsubscribed email submitters. rtg
Based on previous review and current state of the discussion I have applied this to Karmic master
Patch
diff --git a/kernel/sched.c b/kernel/sched.c index 81ede13..c372249 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2967,6 +2967,7 @@ unsigned long nr_iowait(void) /* Variables and functions for calc_load */ static atomic_long_t calc_load_tasks; +static atomic_long_t calc_load_tasks_deferred; static unsigned long calc_load_update; unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); @@ -3021,7 +3022,7 @@ void calc_global_load(void) */ static void calc_load_account_active(struct rq *this_rq) { - long nr_active, delta; + long nr_active, delta, deferred; nr_active = this_rq->nr_running; nr_active += (long) this_rq->nr_uninterruptible; @@ -3029,6 +3030,25 @@ static void calc_load_account_active(struct rq *this_rq) if (nr_active != this_rq->calc_load_active) { delta = nr_active - this_rq->calc_load_active; this_rq->calc_load_active = nr_active; + + /* + * Update calc_load_tasks only once per cpu in 10 tick update + * window. + */ + if (unlikely(time_before(jiffies, this_rq->calc_load_update) && + time_after_eq(jiffies, calc_load_update))) { + if (delta) + atomic_long_add(delta, + &calc_load_tasks_deferred); + return; + } + + if (atomic_long_read(&calc_load_tasks_deferred)) { + deferred = atomic_long_xchg(&calc_load_tasks_deferred, + 0); + delta += deferred; + } + atomic_long_add(delta, &calc_load_tasks); } } @@ -3072,8 +3092,8 @@ static void update_cpu_load(struct rq *this_rq) } if (time_after_eq(jiffies, this_rq->calc_load_update)) { - this_rq->calc_load_update += LOAD_FREQ; calc_load_account_active(this_rq); + this_rq->calc_load_update += LOAD_FREQ; } }
There's a period of 10 ticks where calc_load_tasks is updated by all the cpus for the load avg. Usually all the cpus do this during the first tick. If any cpus go idle, calc_load_tasks is decremented accordingly. However, if they wake up calc_load_tasks is not incremented. Thus, if cpus go idle during the 10 tick period, calc_load_tasks may be decremented to a non-representative value. This issue can lead to systems having a load avg of exactly 0, even though the real load avg could theoretically be up to NR_CPUS. This change defers calc_load_tasks accounting after each cpu updates the count until after the 10 tick update window. A few points: * A global atomic deferral counter, and not per-cpu vars, is needed because a cpu may go NOHZ idle and not be able to update the global calc_load_tasks variable for subsequent load calculations. * It is not enough to add calls to account for the load when a cpu is awakened: - Load avg calculation must be independent of cpu load. - If a cpu is awakend by one tasks, but then has more scheduled before the end of the update window, only the first task will be accounted. BugLink: http://bugs.launchpad.net/bugs/513848 Stable Release Update Justification: Impact of bug: On low load systems with specific work load characteristics the load average may not be representative of the real load. Given a specific test case it is possible to load a system with NR_CPUS number of tasks and yet have a load avg of 0.00, though this is unlikely to occur in real work loads. How addressed: The attached patch reworks the load accounting mechanism in the kernel scheduler. It ensures that the accounting is strictly dependent on the time (i.e. snapshot taken every 5 seconds) and the number of runnable and uninterruptible tasks at that given time. Previously, the accounting also depended on whether a cpu goes idle shortly after the 5 second snapshot. Reproduction: See attached reproduction test case. Run it once on a non-loaded system (boot to rescue mode works well). Top will report the cpu usage at 90%, but uptime will report a load avg near 0.00 instead of at least 0.90 as expected. Regression potential: The patch has been received well from senior Ubuntu kernel team members and some of the upstream kernel maintainers on lkml. For this reason it is assumed to be a good fix for this issue. The only code path touched by this patch involves the load avg accounting, so potential regressions could include incorrect load avg and/or some unforeseen general bug like a null dereference. However, the likelihood of either is minimal due to proper and thorough patch review. Signed-off-by: Chase Douglas <chase.douglas@canonical.com> Acked-by: Colin King <colin.king@canonical.com> Acked-by: Andy Whitcroft <apw@canonical.com> --- kernel/sched.c | 24 ++++++++++++++++++++++-- 1 files changed, 22 insertions(+), 2 deletions(-)