Patchwork [1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

Submitter Vaidyanathan Srinivasan
Date Oct. 21, 2013, 11:44 a.m.
Message ID <20131021114442.13291.99344.stgit@drishya>
Permalink /patch/285174/
State Not Applicable

Comments

Vaidyanathan Srinivasan - Oct. 21, 2013, 11:44 a.m.
From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

In nohz_kick_needed() there are checks around the flag
SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the
domains with this flag set have more than one busy cpu. At every such
domain, therefore, a check has to be made on the nr_busy of the whole
domain, i.e. the sum of the nr_busy of each group in that domain, since
nr_busy is a parameter associated with a group. However, in the current
implementation of nohz_kick_needed(), nr_busy is checked only for the
group to which the cpu that initiated this check belongs. This gives
the wrong count of the number of busy cpus in that domain.

The following commit, which fixed the sgp->nr_busy_cpus computation,
actually exposed this bug in nohz_kick_needed(), which had worked only
while nr_busy was incorrectly > 1:

25f55d9d01ad7a7ad248fd5af1d22675ffd202c5
sched: Fix init NOHZ_IDLE flag

To illustrate the scenario, consider a core, whose domain will have
SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when we
find that more than one thread in the core is busy. With the current
implementation of nohz_kick_needed(), at this domain (sd), nr_busy will
always be 1, since it is read from sd->groups, which encompasses a
single thread, whereas we want this parameter for sd->parent->groups,
which rightly reflects the number of busy threads in the core.

This patch also ensures that the check for SD_SHARE_PKG_RESOURCES
comes before the check for SD_ASYM_PACKING: priority is given to
avoiding more than one busy thread in a core, as far as possible,
before attempting asymmetric packing.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)
Kamalesh Babulal - Oct. 22, 2013, 2:35 p.m.
* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2013-10-21 17:14:42]:

>  	for_each_domain(cpu, sd) {
> -		struct sched_group *sg = sd->groups;
> -		struct sched_group_power *sgp = sg->sgp;
> -		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> -
> -		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> -			goto need_kick_unlock;
> +		struct sched_domain *sd_parent = sd->parent;
> +		struct sched_group *sg;
> +		struct sched_group_power *sgp;
> +		int nr_busy;
> +
> +		if (sd_parent) {
> +			sg = sd_parent->groups;
> +			sgp = sg->sgp;
> +			nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +
> +			if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> +				goto need_kick_unlock;
> +		}
> 
>  		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>  		    && (cpumask_first_and(nohz.idle_cpus_mask,

CC'ing Suresh Siddha and Vincent Guittot

Please correct me if my understanding of idle balancing is wrong.
With the proposed approach, will the idle load balancer not kick in
even if there are busy cpus across groups, or if there are 2 busy cpus
spread across sockets?

Consider a 2-socket machine with 4 processors each (MC and NUMA
domains). If the machine is partially loaded such that cpus 0,4,5,6,7
are busy, nohz balancing is still triggered, because with this approach
(NUMA)->groups->sgp->nr_busy_cpus is taken into account for the nohz
kick while iterating over the MC domain.

Isn't the idle load balancer supposed to kick in, even in the case of
two busy cpus in a dual-core single-socket system?

Thanks,
Kamalesh.
Preeti U Murthy - Oct. 22, 2013, 4:40 p.m.
Hi Kamalesh,

On 10/22/2013 08:05 PM, Kamalesh Babulal wrote:
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2013-10-21 17:14:42]:
> 
>>  	for_each_domain(cpu, sd) {
>> -		struct sched_group *sg = sd->groups;
>> -		struct sched_group_power *sgp = sg->sgp;
>> -		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> -
>> -		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> -			goto need_kick_unlock;
>> +		struct sched_domain *sd_parent = sd->parent;
>> +		struct sched_group *sg;
>> +		struct sched_group_power *sgp;
>> +		int nr_busy;
>> +
>> +		if (sd_parent) {
>> +			sg = sd_parent->groups;
>> +			sgp = sg->sgp;
>> +			nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> +			if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> +				goto need_kick_unlock;
>> +		}
>>
>>  		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>>  		    && (cpumask_first_and(nohz.idle_cpus_mask,
> 
> CC'ing Suresh Siddha and Vincent Guittot
> 
> Please correct me if my understanding of idle balancing is wrong.
> With the proposed approach, will the idle load balancer not kick in
> even if there are busy cpus across groups, or if there are 2 busy
> cpus spread across sockets?

Yes, load balancing will happen periodically on busy cpus.

Wrt idle balancing there are two points here. One, when a CPU is just
about to go idle, it will enter idle_balance(), and trigger load
balancing with itself being the destination CPU to begin with. It will
load balance at every level of the sched domain that it belongs to. If
it manages to pull tasks, good, else it will enter an idle state.

nohz_idle_balancing is triggered by a busy cpu at every tick if it has
more than one task in its runqueue, or if it belongs to a group that
shares the package resources and has more than one busy cpu. By
"nohz_idle_balance triggered", it is meant that the busy cpu will send
an IPI to the ilb_cpu to do load balancing on behalf of the idle cpus
in the nohz mask.

So to answer your question wrt this patch, if there is one busy cpu with
say 2 tasks in one socket and another busy cpu with 1 task on another
socket, the former busy cpu can kick nohz_idle_balance since it has more
than one task in its runqueue. An idle cpu in either socket could be
woken up to balance tasks with it.

The usual idle load balancer that runs on a CPU about to become idle
could pull from either cpu depending on who is more busy as it begins to
load balance across all levels of sched domain that it belongs to.
> 
> Consider a 2-socket machine with 4 processors each (MC and NUMA
> domains). If the machine is partially loaded such that cpus 0,4,5,6,7
> are busy, nohz balancing is still triggered, because with this
> approach (NUMA)->groups->sgp->nr_busy_cpus is taken into account for
> the nohz kick while iterating over the MC domain.

For the example that you mention, you will have a CPU domain and a NUMA
domain. When the sockets are NUMA nodes, each socket will belong to a
CPU domain. If the sockets are non-numa nodes, then the domain
encompassing both the nodes will be a CPU domain, possibly with each
socket being an MC domain.
> 
> Isn't the idle load balancer supposed to kick in, even in the case
> of two busy cpus in a dual-core single-socket system?

nohz_idle_balancing is a special case. It is triggered when the
conditions mentioned in nohz_kick_needed() are true. A CPU just about to
go idle will trigger load balancing without any pre-conditions.

In a single socket machine, there will be a CPU domain encompassing the
socket and the MC domain will encompass a core. nohz_idle load balancer
will kick in if both the threads in the core have tasks running on them.
This is fair enough because the threads share the resources of the core.

Regards
Preeti U Murthy
> 
> Thanks,
> Kamalesh.
>

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@  static inline int nohz_kick_needed(struct rq *rq, int cpu)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		struct sched_group *sg = sd->groups;
-		struct sched_group_power *sgp = sg->sgp;
-		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
-
-		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
-			goto need_kick_unlock;
+		struct sched_domain *sd_parent = sd->parent;
+		struct sched_group *sg;
+		struct sched_group_power *sgp;
+		int nr_busy;
+
+		if (sd_parent) {
+			sg = sd_parent->groups;
+			sgp = sg->sgp;
+			nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+			if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+				goto need_kick_unlock;
+		}
 
 		if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
 		    && (cpumask_first_and(nohz.idle_cpus_mask,