Message ID | 1341320518-7772-1-git-send-email-stefan.bader@canonical.com |
---|---|
State | New |
On Tue, Jul 03, 2012 at 03:01:58PM +0200, Stefan Bader wrote:
> Though reproducing this somehow required (or at least was made much
> more likely by) a certain setup (two CPUs and the described loop of
> a certain command), the result (crash) is probably bad enough to
> warrant applying it before it appears upstream. Peter queued it
> up but needs Ingo to push it onwards.
>
> Also this may cause other subtle-to-find issues, as the current
> code calls task_group() four times to assign values to the cfs and
> rt scheduler elements. So while the observed crash happened when
> the cfs elements were inconsistent, there could be other problems when
> the inconsistency hits cfs/rt or rt only.
>
> Quantal still has the same race, but at least the code was changed
> to call task_group() only once. So the value could be old, but at
> least it is always old (or the new one).
>
> This backport should apply to Natty through Precise (maybe with a bit
> of fuzz on Natty). But later than 3.3 all the files moved around.
>
> -Stefan

Ack, verified to fix the issue, etc. Though I would lean toward waiting
until this lands in Linus' tree first.

> From 97dc9f26a508f140cc97eadedddab493619eeee4 Mon Sep 17 00:00:00 2001
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 22 Jun 2012 13:36:00 +0200
> Subject: [PATCH] UBUNTU: (pre-stable) sched: Fix race in task_group()
>
> Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't
> call task_group() too many times in set_task_rq()"); he found the reason
> to be that the multiple task_group() invocations in set_task_rq()
> returned different values.
>
> Looking at all that I found a lack of serialization and plain wrong
> comments.
>
> The below tries to fix it using an extra pointer which is updated under
> the appropriate scheduler locks. It's not pretty, but I can't really see
> another way given how all the cgroup stuff works.
>
> Reported-by: Stefan Bader <stefan.bader@canonical.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> BugLink: http://bugs.launchpad.net/bugs/999755
>
> [backported from patch posted on mailing list]
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  include/linux/init_task.h |   12 +++++++++++-
>  include/linux/sched.h     |    5 ++++-
>  kernel/sched.c            |   32 ++++++++++++++++++--------------
>  3 files changed, 33 insertions(+), 16 deletions(-)
>
> [diff snipped; identical to the full patch at the end of this page]
>
> --
> 1.7.9.5
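To make the crash mechanism described above concrete: on the affected kernels, set_task_rq() re-evaluated task_group() once per field it assigned. The sketch below approximates that pre-a3e5d1091c1 code from kernel/sched.c (reconstructed, so treat the exact shape as illustrative rather than a verbatim copy). If another CPU moves the task to a different cgroup between any two of the four calls, the cfs and rt scheduler entities end up wired into different task groups:

/*
 * Approximate pre-a3e5d1091c1 set_task_rq() (illustrative sketch).
 * Each assignment calls task_group() again; a cgroup move racing in
 * between leaves p->se and p->rt pointing into different groups --
 * the inconsistency behind the reported crash.
 */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
	p->se.parent = task_group(p)->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
	p->rt.parent = task_group(p)->rt_se[cpu];
#endif
}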
On 07/03/2012 08:17 AM, Herton Ronaldo Krzesinski wrote:
> On Tue, Jul 03, 2012 at 03:01:58PM +0200, Stefan Bader wrote:
>> Though reproducing this somehow required (or at least was made much
>> more likely by) a certain setup (two CPUs and the described loop of
>> a certain command), the result (crash) is probably bad enough to
>> warrant applying it before it appears upstream.
>>
>> [rest of cover letter snipped]
>>
>> -Stefan

The patch at least made it into linux-next as

commit 7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jun 22 13:36:05 2012 +0200

    sched: Fix race in task_group()

I would like to see the backport I submitted go into our kernels as
soon as possible. Upstream is unlikely to be in a hurry (so I would
think it gets there no earlier than 3.6), because the issue is covered
up there.

-Stefan

> Ack, verified to fix the issue, etc. Though I would lean toward waiting
> until this lands in Linus' tree first.
>
>> [quoted patch snipped; identical to the full patch at the end of this page]
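For comparison, the Quantal-side change Stefan refers to (upstream a3e5d1091c1, "sched: Don't call task_group() too many times in set_task_rq()") reads the group once into a local, so all four assignments at least agree with each other. Roughly, as a sketch rather than the literal upstream hunk:

/* Approximate post-a3e5d1091c1 set_task_rq() (illustrative sketch). */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
	struct task_group *tg = task_group(p);	/* evaluated only once */
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	p->se.cfs_rq = tg->cfs_rq[cpu];
	p->se.parent = tg->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	p->rt.rt_rq  = tg->rt_rq[cpu];
	p->rt.parent = tg->rt_se[cpu];
#endif
}

The single read makes the fields mutually consistent ("always old, or the new one"), but the value itself can still be stale relative to an in-flight cgroup move, which is why the sched_task_group copy from Peter's patch is still wanted on top of it.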
On Tue, Jul 17, 2012 at 10:22:55AM -0700, Stefan Bader wrote:
> On 07/03/2012 08:17 AM, Herton Ronaldo Krzesinski wrote:
>> [...]
>
> The patch at least made it into linux-next as
>
> commit 7fac251a97f36c9aef31b08ce7ad6ef8f06e54d1
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Jun 22 13:36:05 2012 +0200
>
>     sched: Fix race in task_group()
>
> I would like to see the backport I submitted go into our kernels as
> soon as possible. Upstream is unlikely to be in a hurry (so I would
> think it gets there no earlier than 3.6), because the issue is covered
> up there.

Yes, looks sensible to get this now.

> -Stefan
Doesn't apply to Natty
On 07/17/2012 12:09 PM, Tim Gardner wrote:
> Doesn't apply to Natty
>

Hm, I seem to have confused having tried this on 3.0 with having tried it
on Natty. I have now done the backport for Natty, but I need to test it
before submitting it here.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 32574ee..13b2684 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -117,8 +117,17 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
+extern struct task_group root_task_group;
+
+#ifdef CONFIG_CGROUP_SCHED
+# define INIT_CGROUP_SCHED(tsk) \
+	.sched_task_group = &root_task_group,
+#else
+# define INIT_CGROUP_SCHED(tsk)
+#endif
+
 #ifdef CONFIG_PERF_EVENTS
-# define INIT_PERF_EVENTS(tsk) \
+# define INIT_PERF_EVENTS(tsk) \
 	.perf_event_mutex = \
 		__MUTEX_INITIALIZER(tsk.perf_event_mutex), \
 	.perf_event_list = LIST_HEAD_INIT(tsk.perf_event_list),
@@ -155,6 +164,7 @@ extern struct cred init_cred;
 	}, \
 	.tasks = LIST_HEAD_INIT(tsk.tasks), \
 	INIT_PUSHABLE_TASKS(tsk) \
+	INIT_CGROUP_SCHED(tsk) \
 	.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
 	.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
 	.real_parent = &tsk, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56de5c1..ac51641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,9 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *sched_task_group;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2646,7 +2649,7 @@ extern int sched_group_set_rt_period(struct task_group *tg,
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
 #endif
-#endif
+#endif /* CONFIG_CGROUP_SCHED */
 
 extern int task_can_switch_user(struct user_struct *up,
 				struct task_struct *tsk);
diff --git a/kernel/sched.c b/kernel/sched.c
index aae0c1d..b99a61e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -746,22 +746,19 @@ static inline int cpu_of(struct rq *rq)
 /*
  * Return the group to which this tasks belongs.
  *
- * We use task_subsys_state_check() and extend the RCU verification with
- * pi->lock and rq->lock because cpu_cgroup_attach() holds those locks for each
- * task it moves into the cgroup. Therefore by holding either of those locks,
- * we pin the task to the current cgroup.
+ * We cannot use task_subsys_state() and friends because the cgroup
+ * subsystem changes that value before the cgroup_subsys::attach() method
+ * is called, therefore we cannot pin it and might observe the wrong value.
+ *
+ * The same is true for autogroup's p->signal->autogroup->tg, the autogroup
+ * core changes this before calling sched_move_task().
+ *
+ * Instead we use a 'copy' which is updated from sched_move_task() while
+ * holding both task_struct::pi_lock and rq::lock.
  */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
-	struct cgroup_subsys_state *css;
-
-	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
-			lockdep_is_held(&p->pi_lock) ||
-			lockdep_is_held(&task_rq(p)->lock));
-	tg = container_of(css, struct task_group, css);
-
-	return autogroup_task_group(p, tg);
+	return p->sched_task_group;
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2373,7 +2370,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
  * a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
  *
  * sched_move_task() holds both and thus holding either pins the cgroup,
- * see set_task_rq().
+ * see task_group().
  *
  * Furthermore, all task_rq users should acquire both locks, see
  * task_rq_lock().
@@ -8765,6 +8762,7 @@ void sched_destroy_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
+	struct task_group *tg;
 	int on_rq, running;
 	unsigned long flags;
 	struct rq *rq;
@@ -8779,6 +8777,12 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		tsk->sched_class->put_prev_task(rq, tsk);
 
+	tg = container_of(task_subsys_state_check(tsk, cpu_cgroup_subsys_id,
+				lockdep_is_held(&tsk->sighand->siglock)),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_move_group)
 		tsk->sched_class->task_move_group(tsk, on_rq);
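Why the patched scheme is race-free: sched_move_task() writes tsk->sched_task_group only while holding both the task's pi_lock and its runqueue lock, so a reader holding either of the two locks cannot observe the pointer mid-update. A reader that asserts this contract explicitly could look like the hypothetical helper below (task_group_pinned() is not part of the patch; it only spells out the locking rule, in the same style as the existing lockdep check in set_task_cpu()):

static inline struct task_group *task_group_pinned(struct task_struct *p)
{
#ifdef CONFIG_LOCKDEP
	/*
	 * The writer, sched_move_task(), holds BOTH p->pi_lock and
	 * task_rq(p)->lock, so holding EITHER one pins the value.
	 */
	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
				      lockdep_is_held(&task_rq(p)->lock)));
#endif
	return p->sched_task_group;
}

The trade-off is one extra pointer per task in exchange for turning task_group() into a plain load, with no cgroup or RCU bookkeeping left on the scheduler fast paths.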