diff mbox

[V2,2/2] sched: Remove un-necessary iteration over sched domains to update nr_busy_cpus

Message ID 52707B02.7030100@linux.vnet.ibm.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Preeti U Murthy Oct. 30, 2013, 3:20 a.m. UTC
The changelog has missed mentioning the introduction of sd_asym per_cpu sched domain.
Apologies for this. The patch with the changelog including mention of sd_asym is
pasted below.

Regards
Preeti U Murthy

---------------

sched: Remove un-necessary iteration over sched domains to update nr_busy_cpus

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number
of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set.
Therefore instead of updating nr_busy_cpus at every level of sched domain,
since it is irrelevant, we can update this parameter only at the parent
domain of the sd which has this flag set. Introduce a per-cpu parameter
sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter
associated with the groups of sd_busy.

By associating sd_busy with the highest domain which has
SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could
have this flag set and trigger nohz_idle_balancing if any of the levels have
more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been
introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set
so that it can be queried directly when required.

While we are at it, we might as well change the nohz_idle parameter to be
updated at the sd_busy domain level alone and not the base domain level of a CPU.
This will unify the concept of busy cpus at just one level of sched domain
where it is currently used.

Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
---
 kernel/sched/core.c  |    6 ++++++
 kernel/sched/fair.c  |   38 ++++++++++++++++++++------------------
 kernel/sched/sched.h |    2 ++
 3 files changed, 28 insertions(+), 18 deletions(-)

Comments

Kamalesh Babulal Oct. 30, 2013, 9:23 a.m. UTC | #1
Hi Preeti,

> nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number
> of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set.
> Therefore instead of updating nr_busy_cpus at every level of sched domain,
> since it is irrelevant, we can update this parameter only at the parent
> domain of the sd which has this flag set. Introduce a per-cpu parameter
> sd_busy which represents this parent domain.
> 
> In nohz_kick_needed() we directly query the nr_busy_cpus parameter
> associated with the groups of sd_busy.
> 
> By associating sd_busy with the highest domain which has
> SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could
> have this flag set and trigger nohz_idle_balancing if any of the levels have
> more than one busy cpu.
> 
> sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been
> introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set
> so that it can be queried directly when required.
> 
> While we are at it, we might as well change the nohz_idle parameter to be
> updated at the sd_busy domain level alone and not the base domain level of a CPU.
> This will unify the concept of busy cpus at just one level of sched domain
> where it is currently used.
> 
> Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
> ---
>  kernel/sched/core.c  |    6 ++++++
>  kernel/sched/fair.c  |   38 ++++++++++++++++++++------------------
>  kernel/sched/sched.h |    2 ++
>  3 files changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..e6a6244 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,8 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
> +DEFINE_PER_CPU(struct sched_domain *, sd_asym);
> 
>  static void update_top_cache_domain(int cpu)
>  {
> @@ -5282,6 +5284,7 @@ static void update_top_cache_domain(int cpu)
>  	if (sd) {
>  		id = cpumask_first(sched_domain_span(sd));
>  		size = cpumask_weight(sched_domain_span(sd));
> +		rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
>  	}


consider a machine with single socket, dual core with HT enabled. The top most
domain is also the highest domain with SD_SHARE_PKG_RESOURCES flag set,
i.e MC domain (the machine toplogy consist of SIBLING and MC domain).

# lstopo-no-graphics --no-bridges --no-io
Machine (7869MB) + Socket L#0 + L3 L#0 (3072KB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
    PU L#0 (P#0)
    PU L#1 (P#1)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
    PU L#2 (P#2)
    PU L#3 (P#3)

With this approach parent of MC domain is NULL and given that sd_busy is NULL,
nr_busy_cpus of sched domain sd_busy will never be incremented/decremented.
Resulting is nohz_kick_needed returning 0.

Thanks,
Kamalesh.
Preeti U Murthy Oct. 30, 2013, 10:03 a.m. UTC | #2
Hi Kamalesh,

On 10/30/2013 02:53 PM, Kamalesh Babulal wrote:
> Hi Preeti,
> 
>> nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number
>> of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set.
>> Therefore instead of updating nr_busy_cpus at every level of sched domain,
>> since it is irrelevant, we can update this parameter only at the parent
>> domain of the sd which has this flag set. Introduce a per-cpu parameter
>> sd_busy which represents this parent domain.
>>
>> In nohz_kick_needed() we directly query the nr_busy_cpus parameter
>> associated with the groups of sd_busy.
>>
>> By associating sd_busy with the highest domain which has
>> SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could
>> have this flag set and trigger nohz_idle_balancing if any of the levels have
>> more than one busy cpu.
>>
>> sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been
>> introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set
>> so that it can be queried directly when required.
>>
>> While we are at it, we might as well change the nohz_idle parameter to be
>> updated at the sd_busy domain level alone and not the base domain level of a CPU.
>> This will unify the concept of busy cpus at just one level of sched domain
>> where it is currently used.
>>
>> Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
>> ---
>>  kernel/sched/core.c  |    6 ++++++
>>  kernel/sched/fair.c  |   38 ++++++++++++++++++++------------------
>>  kernel/sched/sched.h |    2 ++
>>  3 files changed, 28 insertions(+), 18 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c06b8d3..e6a6244 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5271,6 +5271,8 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>>  DEFINE_PER_CPU(int, sd_llc_size);
>>  DEFINE_PER_CPU(int, sd_llc_id);
>>  DEFINE_PER_CPU(struct sched_domain *, sd_numa);
>> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>> +DEFINE_PER_CPU(struct sched_domain *, sd_asym);
>>
>>  static void update_top_cache_domain(int cpu)
>>  {
>> @@ -5282,6 +5284,7 @@ static void update_top_cache_domain(int cpu)
>>  	if (sd) {
>>  		id = cpumask_first(sched_domain_span(sd));
>>  		size = cpumask_weight(sched_domain_span(sd));
>> +		rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
>>  	}
> 
> 
> consider a machine with single socket, dual core with HT enabled. The top most
> domain is also the highest domain with SD_SHARE_PKG_RESOURCES flag set,
> i.e MC domain (the machine toplogy consist of SIBLING and MC domain).
> 
> # lstopo-no-graphics --no-bridges --no-io
> Machine (7869MB) + Socket L#0 + L3 L#0 (3072KB)
>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>     PU L#0 (P#0)
>     PU L#1 (P#1)
>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>     PU L#2 (P#2)
>     PU L#3 (P#3)
> 
> With this approach parent of MC domain is NULL and given that sd_busy is NULL,
> nr_busy_cpus of sched domain sd_busy will never be incremented/decremented.
> Resulting is nohz_kick_needed returning 0.

Right and it *should* return 0. There is no sibling domain that can
offload tasks from it. Therefore there is no point kicking nohz idle
balance.

Regards
Preeti U Murthy
> 
> Thanks,
> Kamalesh.
>
diff mbox

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c06b8d3..e6a6244 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5271,6 +5271,8 @@  DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain *, sd_busy);
+DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -5282,6 +5284,7 @@  static void update_top_cache_domain(int cpu)
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
+		rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
@@ -5290,6 +5293,9 @@  static void update_top_cache_domain(int cpu)
 
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+
+	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
+	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9c9549..8602b2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6515,16 +6515,16 @@  static inline void nohz_balance_exit_idle(int cpu)
 static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
+	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 
 	if (!sd || !sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 0;
 
-	for (; sd; sd = sd->parent)
-		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+	atomic_inc(&sd->groups->sgp->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -6532,16 +6532,16 @@  unlock:
 void set_cpu_sd_state_idle(void)
 {
 	struct sched_domain *sd;
+	int cpu = smp_processor_id();
 
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 
 	if (!sd || sd->nohz_idle)
 		goto unlock;
 	sd->nohz_idle = 1;
 
-	for (; sd; sd = sd->parent)
-		atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+	atomic_dec(&sd->groups->sgp->nr_busy_cpus);
 unlock:
 	rcu_read_unlock();
 }
@@ -6748,6 +6748,8 @@  static inline int nohz_kick_needed(struct rq *rq, int cpu)
 {
 	unsigned long now = jiffies;
 	struct sched_domain *sd;
+	struct sched_group_power *sgp;
+	int nr_busy;
 
 	if (unlikely(idle_cpu(cpu)))
 		return 0;
@@ -6773,22 +6775,22 @@  static inline int nohz_kick_needed(struct rq *rq, int cpu)
 		goto need_kick;
 
 	rcu_read_lock();
-	for_each_domain(cpu, sd) {
-		struct sched_group *sg = sd->groups;
-		struct sched_group_power *sgp = sg->sgp;
-		int nr_busy = atomic_read(&sgp->nr_busy_cpus);
+	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 
-		if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
-			goto need_kick_unlock;
+	if (sd) {
+		sgp = sd->groups->sgp;
+		nr_busy = atomic_read(&sgp->nr_busy_cpus);
 
-		if (sd->flags & SD_ASYM_PACKING
-		    && (cpumask_first_and(nohz.idle_cpus_mask,
-					  sched_domain_span(sd)) < cpu))
+		if (nr_busy > 1)
 			goto need_kick_unlock;
-
-		if (!(sd->flags & (SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)))
-			break;
 	}
+
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+
+	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
+				  sched_domain_span(sd)) < cpu))
+		goto need_kick_unlock;
+
 	rcu_read_unlock();
 	return 0;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..c8cb145 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -623,6 +623,8 @@  DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
+DECLARE_PER_CPU(struct sched_domain *, sd_busy);
+DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 
 struct sched_group_power {
 	atomic_t ref;