Message ID: 4886.1266991633@neuling.org (mailing list archive)
State: Changes Requested
> > If there's less the group will normally be balanced and we fall out
> > and end up in check_asym_packing().
> >
> > So what I tried doing with that loop is detect if there's a hole in
> > the packing before busiest. Now that I think about it, what we need
> > to check is if this_cpu (the removed cpu argument) is idle and less
> > than busiest.
> >
> > So something like:
> >
> > static int check_asym_packing(struct sched_domain *sd,
> > 				 struct sd_lb_stats *sds,
> > 				 int this_cpu, unsigned long *imbalance)
> > {
> > 	int busiest_cpu;
> >
> > 	if (!(sd->flags & SD_ASYM_PACKING))
> > 		return 0;
> >
> > 	if (!sds->busiest)
> > 		return 0;
> >
> > 	busiest_cpu = group_first_cpu(sds->busiest);
> > 	if (cpu_rq(this_cpu)->nr_running || this_cpu > busiest_cpu)
> > 		return 0;
> >
> > 	*imbalance = (sds->max_load * sds->busiest->cpu_power) /
> > 			SCHED_LOAD_SCALE;
> > 	return 1;
> > }
> >
> > Does that make sense?
>
> I think so.
>
> I'm seeing check_asym_packing() do the right thing with the simple SMT2
> with 1 process case. It marks cpu0 as imbalanced when cpu0 is idle and
> cpu1 is busy.
>
> Unfortunately the process doesn't seem to get migrated down, though.
> Do we need to give *imbalance a higher value?

So with ego's help, I traced this down a bit more.

In my simple test case (SMT2, t0 idle, t1 active), if f_b_g()
(find_busiest_group()) hits our new case in check_asym_packing(),
load_balance() then runs f_b_q() (find_busiest_queue()). f_b_q() has
this:

	if (capacity && rq->nr_running == 1 && wl > imbalance)
		continue;

When check_asym_packing() hits, wl = 1783 and imbalance = 1024, so we
continue and busiest remains NULL.

load_balance() then does "goto out_balanced" and never attempts to move
the task.

Based on this and on ego's suggestion, I pulled in Suresh Siddha's
patch from http://lkml.org/lkml/2010/2/12/352. This fixes the problem:
the process is moved down to t0.

I've only tested SMT2 so far.

Mikey
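The filter Mikey hit in find_busiest_queue() can be modelled in isolation. This is a stand-alone sketch, not the kernel code: `fake_rq` and `fbq_skips_queue` are illustrative names, and only the fields relevant to the trace are kept.

```c
#include <assert.h>

/* Minimal stand-in for the runqueue fields involved in the check. */
struct fake_rq { unsigned long nr_running; };

/* Sketch of the find_busiest_queue() filter discussed in the mail:
 * a full-capacity queue running a single task whose weighted load (wl)
 * exceeds the requested imbalance is skipped ("continue" in the loop). */
static int fbq_skips_queue(int capacity, const struct fake_rq *rq,
			   unsigned long wl, unsigned long imbalance)
{
	return capacity && rq->nr_running == 1 && wl > imbalance;
}
```

With the traced values (wl = 1783, imbalance = 1024) the only candidate queue is skipped, so busiest stays NULL and load_balance() falls through to out_balanced, which is exactly the behaviour reported above.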
In message <11927.1267010024@neuling.org> you wrote:
> > I've only tested SMT2 so far.

SMT4 also works in the simple test case of a single process being
pulled down to thread 0.

As you suspected, though, this unfortunately only works with
CONFIG_NO_HZ off. If I turn NO_HZ on, my single process gets bounced
around the core.

Did you think of any ideas for how to fix the NO_HZ interaction?

Mikey
In message <11927.1267010024@neuling.org> you wrote:
> > I've only tested SMT2 so far.

I'm finding this SMT2 result to be unreliable. Sometimes it doesn't
work for the simple 1-process case, and it seems to change from boot to
boot. Sometimes it works as expected with t0 busy and t1 idle, but
other times it's the other way around.

When it doesn't work, check_asym_packing() is still marking processes
to be pulled down, but it only gets run on about 1 in every 4 calls to
load_balance().

For 2 of the other calls to load_balance(), idle is CPU_NEWLY_IDLE and
hence check_asym_packing() doesn't get called. This results in
sd->nr_balance_failed being reset. When load_balance() is next called
and check_asym_packing() hits, need_active_balance() returns 0 because
sd->nr_balance_failed is too small. This means the migration thread on
t1 is not woken and the process remains there.

So why does thread0 change from NEWLY_IDLE to IDLE and vice versa when
there is nothing running on it? Is this expected?

This is with next-20100225, which pulled in Ingo's tip at
407b4844f2af416cd8c9edd7c44d1545c93a4938 (from 24/2/2010).

Mikey
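The failure mode described here can be sketched with a toy model. The `+ 2` threshold is modelled on the kernel's cache_nice_tries check in need_active_balance(), but everything below is a simplified illustration, not the real scheduler logic:

```c
#include <assert.h>

/* Toy model of the interaction in the mail: newly-idle balance passes
 * that skip check_asym_packing() reset sd->nr_balance_failed, so the
 * active-balance threshold is never reached. */
struct fake_sd { unsigned int nr_balance_failed, cache_nice_tries; };

/* Modelled on need_active_balance(): only trip the migration thread
 * after repeated failed attempts. */
static int need_active_balance(const struct fake_sd *sd)
{
	return sd->nr_balance_failed > sd->cache_nice_tries + 2;
}

/* A pass that finds an imbalance but moves nothing bumps the counter. */
static void failed_balance(struct fake_sd *sd)
{
	sd->nr_balance_failed++;
}

/* A newly-idle pass that bails out early resets it, as described. */
static void newly_idle_reset(struct fake_sd *sd)
{
	sd->nr_balance_failed = 0;
}
```

If roughly three of every four passes reset the counter, it can never climb past the threshold, so the migration thread on t1 is never woken, matching the observed behaviour.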
On Sat, 2010-02-27 at 21:21 +1100, Michael Neuling wrote:
> So why does thread0 change from NEWLY_IDLE to IDLE and vice versa when
> there is nothing running on it? Is this expected?

Ah, yes, you should probably allow both of those.

NEWLY_IDLE is when we are about to schedule the idle thread; IDLE is
when a tick hits the idle thread.

I'm thinking that NEWLY_IDLE should also solve the NO_HZ case, since
we'll have passed through that before we enter tickless state. Just
make sure SD_BALANCE_NEWIDLE is set on the relevant levels (it should
already be).
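Applied to the out_balanced hook in the patch, Peter's suggestion amounts to accepting CPU_NEWLY_IDLE as well as CPU_IDLE before calling check_asym_packing(). A minimal sketch, where the enum is a stand-in for the kernel's cpu_idle_type:

```c
#include <assert.h>

/* Stand-in for the kernel's enum cpu_idle_type. */
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/* Sketch of the gate on the out_balanced path: with Peter's suggestion,
 * both idle flavours are allowed to trigger the asym-packing check,
 * instead of only CPU_IDLE as in the posted patch. */
static int may_check_asym_packing(enum cpu_idle_type idle)
{
	return idle == CPU_IDLE || idle == CPU_NEWLY_IDLE;
}
```

Allowing CPU_NEWLY_IDLE also means the newly-idle passes no longer silently reset the failure counter before the packing logic ever runs, which addresses the 1-in-4 problem reported earlier in the thread.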
In message <1267541076.25158.60.camel@laptop> you wrote:
> I'm thinking that NEWLY_IDLE should also solve the NO_HZ case, since
> we'll have passed through that before we enter tickless state, just make
> sure SD_BALANCE_NEWIDLE is set on the relevant levels (should already be
> so).

OK, thanks.
There seems to be a regression in Linus' latest tree (also in -next)
where new processes usually end up on thread 1 rather than thread 0
(when in SMT2 mode).

This only seems to happen with newly created processes. If you pin a
process to t0 and then unpin it, it stays on t0. Also, if a process is
migrated to another core, it can end up on t0.

This happens with a vanilla Linus or -next tree on ppc64
pseries_defconfig with NO_HZ turned off. I've not tried with NO_HZ on.

Anyway, this regression seems to be causing problems when we apply our
patch. We are trying to pull down to t0, which works, but we
immediately get pulled back up to t1 due to the above regression. This
happens over and over, causing the process to ping-pong every few sched
ticks.

We've not tried to bisect this problem, but that's the next step unless
someone has some insight into it.

Also, we had to change the following to get the pull-down to work
correctly in the original patch:

@@ -2618,8 +2618,8 @@ static int check_asym_packing(struct sch
 	if (this_cpu > busiest_cpu)
 		return 0;

-	*imbalance = (sds->max_load * sds->busiest->cpu_power) /
-		SCHED_LOAD_SCALE;
+	*imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+				       SCHED_LOAD_SCALE);
 	return 1;

We found that imbalance came out as 1023.8, which got truncated down to
1023 and then compared against a wl of 1024 in find_busiest_queue(),
failing the check. Rounding to the closest value fixes this.

Mikey
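The truncation problem can be reproduced with plain integer arithmetic. The macro below matches the kernel's DIV_ROUND_CLOSEST shape for positive operands, and the sample max_load/cpu_power values are made up to land just below a scale boundary, as in the 1023.8 case described in the mail:

```c
#include <assert.h>

/* Same shape as the kernel macro, valid for positive operands. */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))
#define SCHED_LOAD_SCALE	1024UL

/* Plain division truncates: a true imbalance of 1023.5 becomes 1023,
 * which then loses the "wl > imbalance" comparison against wl = 1024
 * in find_busiest_queue(). */
static unsigned long imbalance_truncated(unsigned long max_load,
					 unsigned long cpu_power)
{
	return (max_load * cpu_power) / SCHED_LOAD_SCALE;
}

/* Round-to-closest restores the intended value. */
static unsigned long imbalance_rounded(unsigned long max_load,
				       unsigned long cpu_power)
{
	return DIV_ROUND_CLOSEST(max_load * cpu_power, SCHED_LOAD_SCALE);
}
```

With the hypothetical inputs max_load = 2047 and cpu_power = 512, the exact quotient is 1023.5: truncation gives 1023 while round-to-closest gives 1024, which is enough to pass the wl comparison.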
Index: linux-next/include/linux/sched.h
===================================================================
--- linux-next.orig/include/linux/sched.h
+++ linux-next/include/linux/sched.h
@@ -849,7 +849,7 @@ enum cpu_idle_type {
 #define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING		0x0800	/* Asymmetric SMT packing */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 
 enum powersavings_balance_level {
@@ -881,6 +881,8 @@ static inline int sd_balance_for_package
 	return SD_PREFER_SIBLING;
 }
 
+extern int arch_sd_asym_packing(void);
+
 /*
  * Optimise SD flags for power savings:
  * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
Index: linux-next/include/linux/topology.h
===================================================================
--- linux-next.orig/include/linux/topology.h
+++ linux-next/include/linux/topology.h
@@ -102,6 +102,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_SHARE_PKG_RESOURCES	\
 				| 0*SD_SERIALIZE		\
 				| 0*SD_PREFER_SIBLING		\
+				| arch_sd_asym_packing()	\
 				,				\
 	.last_balance		= jiffies,			\
 	.balance_interval	= 1,				\
Index: linux-next/kernel/sched_fair.c
===================================================================
--- linux-next.orig/kernel/sched_fair.c
+++ linux-next/kernel/sched_fair.c
@@ -2494,6 +2494,32 @@ static inline void update_sg_lb_stats(st
 }
 
 /**
+ * update_sd_pick_busiest - return 1 on busiest group
+ */
+static int update_sd_pick_busiest(struct sched_domain *sd,
+				  struct sd_lb_stats *sds,
+				  struct sched_group *sg,
+				  struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running > sgs->group_capacity)
+		return 1;
+
+	if (sgs->group_imb)
+		return 1;
+
+	/* Check packing mode for this domain */
+	if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
+		if (!sds->busiest)
+			return 1;
+
+		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
+			return 1;
+	}
+
+	return 0;
+}
+
+/**
  * update_sd_lb_stats - Update sched_group's statistics for load balancing.
  * @sd: sched_domain whose statistics are to be updated.
  * @this_cpu: Cpu for which load balance is currently performed.
@@ -2547,9 +2573,8 @@ static inline void update_sd_lb_stats(st
 			sds->this = group;
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_load_per_task = sgs.sum_weighted_load;
-		} else if (sgs.avg_load > sds->max_load &&
-			   (sgs.sum_nr_running > sgs.group_capacity ||
-			    sgs.group_imb)) {
+		} else if (sgs.avg_load >= sds->max_load &&
+			   update_sd_pick_busiest(sd, sds, group, &sgs)) {
 			sds->max_load = sgs.avg_load;
 			sds->busiest = group;
 			sds->busiest_nr_running = sgs.sum_nr_running;
@@ -2562,6 +2587,36 @@ static inline void update_sd_lb_stats(st
 	} while (group != sd->groups);
 }
 
+int __weak arch_sd_asym_packing(void)
+{
+	return 0*SD_ASYM_PACKING;
+}
+
+/**
+ * check_asym_packing - Check to see if the group is packed into
+ *			the sched domain
+ */
+static int check_asym_packing(struct sched_domain *sd,
+			      struct sd_lb_stats *sds,
+			      int this_cpu, unsigned long *imbalance)
+{
+	int busiest_cpu;
+
+	if (!(sd->flags & SD_ASYM_PACKING))
+		return 0;
+
+	if (!sds->busiest)
+		return 0;
+
+	busiest_cpu = group_first_cpu(sds->busiest);
+	if (this_cpu > busiest_cpu)
+		return 0;
+
+	*imbalance = (sds->max_load * sds->busiest->cpu_power) /
+			SCHED_LOAD_SCALE;
+	return 1;
+}
+
 /**
  * fix_small_imbalance - Calculate the minor imbalance that exists
  *			amongst the groups of a sched_domain, during
@@ -2761,6 +2816,10 @@ find_busiest_group(struct sched_domain *
 	return sds.busiest;
 
 out_balanced:
+	if (idle == CPU_IDLE &&
+	    check_asym_packing(sd, &sds, this_cpu, imbalance))
+		return sds.busiest;
+
 	/*
 	 * There is no obvious imbalance. But check if we can do some balancing
 	 * to save power.
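The patch's `__weak` default returns 0, so asym packing stays off unless an architecture overrides arch_sd_asym_packing(). A self-contained sketch of such an override, with the real capability test (e.g. powerpc's cpu_has_feature()) replaced by a hypothetical flag:

```c
/* Matches the flag value the patch adds to include/linux/sched.h. */
#define SD_ASYM_PACKING	0x0800

/* Hypothetical stand-in for a real capability check such as powerpc's
 * cpu_has_feature(); an actual arch override would key off hardware. */
static int arch_wants_asym_packing = 1;

/* Arch override of the __weak kernel default: report SD_ASYM_PACKING
 * so the topology code ORs it into the SMT domain's flags. */
int arch_sd_asym_packing(void)
{
	return arch_wants_asym_packing ? SD_ASYM_PACKING : 0;
}
```

Because the flag is ORed into SD_SIBLING_INIT via `| arch_sd_asym_packing()`, returning 0 from the weak default leaves non-opting architectures completely unaffected.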