Message ID | 1341320518-7772-1-git-send-email-stefan.bader@canonical.com |
---|---|
State | New |
On Tue, Jul 03, 2012 at 03:01:58PM +0200, Stefan Bader wrote:
> Though reproducing this somehow required (or at least was made much
> more likely by) a certain setup (two CPUs and the described loop of
> a certain command), the result (crash) is probably bad enough to
> warrant applying it before it appears upstream. Peter queued it
> up but needs Ingo to push it onwards.
>
> Also this may cause other subtle-to-find issues, as the current
> code calls task_group() four times to assign values to the cfs and
> rt scheduler elements. So while the observed crash happened when
> the cfs elements were inconsistent, there could be other problems when
> the inconsistency hits cfs/rt or rt only.
>
> Quantal still has the same race, but at least the code was changed
> to call task_group() only once. So the value could be old, but at
> least it is always old (or the new one).
>
> This backport should apply to Natty through Precise (maybe with a bit
> of fuzz on Natty). But later than 3.3 all the files moved around.
>
> -Stefan

Ack, verified to fix the issue, etc. Though I would lean toward waiting
until this lands in Linus' tree first.

> From 97dc9f26a508f140cc97eadedddab493619eeee4 Mon Sep 17 00:00:00 2001
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:00 +0200
> Subject: [PATCH] UBUNTU: (pre-stable) sched: Fix race in task_group()
>
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"); he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
>
> Looking at all that I found a lack of serialization and plain wrong
> comments.
>
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.
>
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> BugLink: http://bugs.launchpad.net/bugs/999755
>
> [backported from patch posted on mailing list]
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched.c            |   32 ++++++++++++++++++--------------
>  3 files changed, 33 insertions(+), 16 deletions(-)
>
> [diff snipped; identical to the full patch at the end of this page]
>
> --
> 1.7.9.5
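To make the crash mechanism described above concrete: on the affected kernels, set_task_rq() re-evaluated task_group() once per field it assigned. The sketch below approximates that pre-a3e5d1091c1 code from kernel/sched.c (reconstructed, so treat the exact shape as illustrative rather than a verbatim copy). If another CPU moves the task to a different cgroup between any two of the four calls, the cfs and rt scheduler entities end up wired into different task groups:

/*
 * Approximate pre-a3e5d1091c1 set_task_rq() (illustrative sketch).
 * Each assignment calls task_group() again; a cgroup move racing in
 * between leaves p->se and p->rt pointing into different groups --
 * the inconsistency behind the reported crash.
 */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
	p->se.parent = task_group(p)->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
	p->rt.parent = task_group(p)->rt_se[cpu];
#endif
}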
On 07/03/2012 08:17 AM, Herton Ronaldo Krzesinski wrote:
> On Tue, Jul 03, 2012 at 03:01:58PM +0200, Stefan Bader wrote:
>> Though reproducing this somehow required (or at least was made much
>> more likely by) a certain setup (two CPUs and the described loop of
>> a certain command), the result (crash) is probably bad enough to
>> warrant applying it before it appears upstream.
>>
>> [rest of cover letter snipped]
>>
>> -Stefan

The patch at least made it into linux-next as

commit 7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 22 13:36:05 2012 +0200

    sched: Fix race in task_group()

I would like to see the backport I submitted go into our kernels as
soon as possible. Upstream is unlikely to be in a hurry (so I would
think it gets there no earlier than 3.6), because the issue is covered
up there.

-Stefan

> Ack, verified to fix the issue, etc. Though I would lean toward waiting
> until this lands in Linus' tree first.
>
>> [quoted patch snipped; identical to the full patch at the end of this page]
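For comparison, the Quantal-side change Stefan refers to (upstream a3e5d1091c1, "sched: Don't call task_group() too many times in set_task_rq()") reads the group once into a local, so all four assignments at least agree with each other. Roughly, as a sketch rather than the literal upstream hunk:

/* Approximate post-a3e5d1091c1 set_task_rq() (illustrative sketch). */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
	struct task_group *tg = task_group(p);	/* evaluated only once */
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	p->se.cfs_rq = tg->cfs_rq[cpu];
	p->se.parent = tg->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = tg->rt_rq[cpu];
	p->rt.parent = tg->rt_se[cpu];
#endif
}

The single read makes the fields mutually consistent ("always old, or the new one"), but the value itself can still be stale relative to an in-flight cgroup move, which is why the sched_task_group copy from Peter's patch is still wanted on top of it.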
On Tue, Jul 17, 2012 at 10:22:55AM -0700, Stefan Bader wrote:
> On 07/03/2012 08:17 AM, Herton Ronaldo Krzesinski wrote:
>> [...]
>
> The patch at least made it into linux-next as
>
> commit 7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Jun 22 13:36:05 2012 +0200
>
>     sched: Fix race in task_group()
>
> I would like to see the backport I submitted go into our kernels as
> soon as possible. Upstream is unlikely to be in a hurry (so I would
> think it gets there no earlier than 3.6), because the issue is covered
> up there.

Yes, looks sensible to get this now.

> -Stefan
Doesn't apply to Natty
On 07/17/2012 12:09 PM, Tim Gardner wrote:
> Doesn't apply to Natty
>

Hm, I seem to have confused having tried this on 3.0 with having tried it
on Natty. I have now done the backport for Natty, but I need to test it
before submitting it here.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 32574ee..13b2684 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -117,8 +117,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk) \
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk) \
+# define INIT_PERF_EVENTS(tsk) \
 	.perf_event_mutex = \
 		__MUTEX_INITIALIZER(tsk.perf_event_mutex), \
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -155,6 +164,7 @@ extern struct cred init_cred;
 	}, \
 	.tasks = LIST_HEAD_INIT(tsk.tasks), \
 	INIT_PUSHABLE_TASKS(tsk) \
+	INIT_CGROUP_SCHED(tsk) \
 	.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
 	.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
 	.real_parent = &tsk, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56de5c1..ac51641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *sched_task_group;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2646,7 +2649,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 				struct task_struct *tsk);
diff --git a/kernel/sched.c b/kernel/sched.c
index aae0c1d..b99a61e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -746,22 +746,19 @@ static inline int cpu_of(struct rq *rq)
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2373,7 +2370,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
  * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
  *
  * sched_move_task() holds both and thus holding either pins the cgroup,
- * see set_task_rq().
+ * see task_group().
  *
  * Furthermore, all task_rq users should acquire both locks, see
  * task_rq_lock().
@@ -8765,6 +8762,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -8779,6 +8777,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
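Why the patched scheme is race-free: sched_move_task() writes tsk->sched_task_group only while holding both the task's pi_lock and its runqueue lock, so a reader holding either of the two locks cannot observe the pointer mid-update. A reader that asserts this contract explicitly could look like the hypothetical helper below (task_group_pinned() is not part of the patch; it only spells out the locking rule, in the same style as the existing lockdep check in set_task_cpu()):

static inline struct task_group *task_group_pinned(struct task_struct *p)
{
#ifdef CONFIG_LOCKDEP
	/*
	 * The writer, sched_move_task(), holds BOTH p->pi_lock and
	 * task_rq(p)->lock, so holding EITHER one pins the value.
	 */
	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
				      lockdep_is_held(&task_rq(p)->lock)));
#endif
	return p->sched_task_group;
}

The trade-off is one extra pointer per task in exchange for turning task_group() into a plain load, with no cgroup or RCU bookkeeping left on the scheduler fast paths.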