diff mbox series

[v2,07/12] ptrace: Don't change __state

Message ID 20220429214837.386518-7-ebiederm@xmission.com
State Not Applicable
Headers show
Series ptrace: cleaning up ptrace_stop | expand

Commit Message

Eric W. Biederman April 29, 2022, 9:48 p.m. UTC
Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
command is executing.

Instead TASK_WAKEKILL from the definition of TASK_TRACED, and
implemention a new jobctl flag TASK_PTRACE_FROZEN.  This new This new
flag is set in jobctl_freeze_task and cleared when ptrace_stop is
awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep).

In singal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
when it is indicated a fatal signal is pending.  Skip adding
__TASK_TRACED when TASK_PTRACE_FROZEN is not set.  This has the same
effect as changing TASK_TRACED to __TASK_TRACED as all of the wake_ups
that use TASK_KILLABLE go through signal_wake_up.

Don't set TASK_TRACED if fatal_signal_pending so that the code
continues not to sleep if there was a pending fatal signal before
ptrace_stop is called.  With TASK_WAKEKILL no longer present in
TASK_TRACED signal_pending_state will no longer prevent ptrace_stop
from sleeping if there is a pending fatal signal.

Previously the __state value of __TASK_TRACED was changed to
TASK_RUNNING when woken up or back to TASK_TRACED when the code was
left in ptrace_stop.  Now when woken up ptrace_stop now clears
JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
clears JOBCTL_PTRACE_FROZEN.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched.h        |  2 +-
 include/linux/sched/jobctl.h |  2 ++
 include/linux/sched/signal.h |  8 +++++++-
 kernel/ptrace.c              | 21 ++++++++-------------
 kernel/signal.c              |  9 +++------
 5 files changed, 21 insertions(+), 21 deletions(-)

Comments

Peter Zijlstra April 29, 2022, 10:27 p.m. UTC | #1
On Fri, Apr 29, 2022 at 04:48:32PM -0500, Eric W. Biederman wrote:
> Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
> command is executing.
> 
> Instead TASK_WAKEKILL from the definition of TASK_TRACED, and
> implemention a new jobctl flag TASK_PTRACE_FROZEN.  This new This new
> flag is set in jobctl_freeze_task and cleared when ptrace_stop is
> awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep).
> 
> In singal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
> when it is indicated a fatal signal is pending.  Skip adding
> __TASK_TRACED when TASK_PTRACE_FROZEN is not set.  This has the same
> effect as changing TASK_TRACED to __TASK_TRACED as all of the wake_ups
> that use TASK_KILLABLE go through signal_wake_up.
> 
> Don't set TASK_TRACED if fatal_signal_pending so that the code
> continues not to sleep if there was a pending fatal signal before
> ptrace_stop is called.  With TASK_WAKEKILL no longer present in
> TASK_TRACED signal_pending_state will no longer prevent ptrace_stop
> from sleeping if there is a pending fatal signal.
> 
> Previously the __state value of __TASK_TRACED was changed to
> TASK_RUNNING when woken up or back to TASK_TRACED when the code was
> left in ptrace_stop.  Now when woken up ptrace_stop now clears
> JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
> clears JOBCTL_PTRACE_FROZEN.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched.h        |  2 +-
>  include/linux/sched/jobctl.h |  2 ++
>  include/linux/sched/signal.h |  8 +++++++-
>  kernel/ptrace.c              | 21 ++++++++-------------
>  kernel/signal.c              |  9 +++------
>  5 files changed, 21 insertions(+), 21 deletions(-)

Please fold this hunk:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@ static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 *  - we form a control dependency vs deactivate_task() below.
-	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
Sebastian Andrzej Siewior May 2, 2022, 8:59 a.m. UTC | #2
On 2022-04-29 16:48:32 [-0500], Eric W. Biederman wrote:
> Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
> command is executing.
> 
> Instead TASK_WAKEKILL from the definition of TASK_TRACED, and
> implemention a new jobctl flag TASK_PTRACE_FROZEN.  This new This new

Instead adding TASK_WAKEKILL to the definition of TASK_TRACED, implement
a new jobctl flag TASK_PTRACE_FROZEN for this. This new

> flag is set in jobctl_freeze_task and cleared when ptrace_stop is
> awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep).
> 
> In singal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL
     signal_wake_up

> when it is indicated a fatal signal is pending.  Skip adding
                      +that ?

> __TASK_TRACED when TASK_PTRACE_FROZEN is not set.  This has the same
> effect as changing TASK_TRACED to __TASK_TRACED as all of the wake_ups
                                                                        ,
> that use TASK_KILLABLE go through signal_wake_up.
                        ,

> Don't set TASK_TRACED if fatal_signal_pending so that the code
> continues not to sleep if there was a pending fatal signal before
> ptrace_stop is called.  With TASK_WAKEKILL no longer present in
> TASK_TRACED signal_pending_state will no longer prevent ptrace_stop
> from sleeping if there is a pending fatal signal.
> 
> Previously the __state value of __TASK_TRACED was changed to
> TASK_RUNNING when woken up or back to TASK_TRACED when the code was
> left in ptrace_stop.  Now when woken up ptrace_stop now clears
> JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced
> clears JOBCTL_PTRACE_FROZEN.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Sebastian
Oleg Nesterov May 2, 2022, 3:39 p.m. UTC | #3
On 04/29, Eric W. Biederman wrote:
>
> Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
> command is executing.

Eric, I'll read this patch and the rest of this series tomorrow.
Somehow I failed to force myself to read yet another version after
weekend ;)

plus I don't really understand this one...

>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
>  #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
> -#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)
> +#define TASK_TRACED			__TASK_TRACED
...
>  static inline void signal_wake_up(struct task_struct *t, bool resume)
>  {
> -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> +	unsigned int state = 0;
> +	if (resume) {
> +		state = TASK_WAKEKILL;
> +		if (!(t->jobctl & JOBCTL_PTRACE_FROZEN))
> +			state |= __TASK_TRACED;
> +	}
> +	signal_wake_up_state(t, state);

Can't understand why is this better than the previous version which removed
TASK_WAKEKILL if resume... Looks a bit strange to me. But again, I didn't
look at the next patches yet.

> @@ -2209,11 +2209,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		spin_lock_irq(&current->sighand->siglock);
>  	}
>
> -	/*
> -	 * schedule() will not sleep if there is a pending signal that
> -	 * can awaken the task.
> -	 */
> -	set_special_state(TASK_TRACED);
> +	if (!__fatal_signal_pending(current))
> +		set_special_state(TASK_TRACED);

This is where I stuck. This probably makes sense, but what does it buy
for this particular patch?

And if we check __fatal_signal_pending(), why can't ptrace_stop() simply
return ?

Oleg.
Oleg Nesterov May 2, 2022, 3:47 p.m. UTC | #4
On 04/29, Eric W. Biederman wrote:
>
>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> -		return;
> -
> -	WARN_ON(!task->ptrace || task->parent != current);
> +	unsigned long flags;
>
>  	/*
> -	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
> -	 * Recheck state under the lock to close this race.
> +	 * The child may be awake and may have cleared
> +	 * JOBCTL_PTRACE_FROZEN (see ptrace_resume).  The child will
> +	 * not set JOBCTL_PTRACE_FROZEN or enter __TASK_TRACED anew.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> +	if (lock_task_sighand(task, &flags)) {
> +		task->jobctl &= ~JOBCTL_PTRACE_FROZEN;

Well, I think that the fast-path

	if (!(task->jobctl & JOBCTL_PTRACE_FROZEN))
		return;

at the start makes sense, we can avoid lock_task_sighand() if the tracee
was resumed.

Oleg.
Eric W. Biederman May 2, 2022, 4:35 p.m. UTC | #5
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/29, Eric W. Biederman wrote:
>>
>> Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace
>> command is executing.
>
> Eric, I'll read this patch and the rest of this series tomorrow.
> Somehow I failed to force myself to read yet another version after
> weekend ;)

That is quite alright.

> plus I don't really understand this one...
>
>>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
>>  #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
>> -#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)
>> +#define TASK_TRACED			__TASK_TRACED
> ...
>>  static inline void signal_wake_up(struct task_struct *t, bool resume)
>>  {
>> -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
>> +	unsigned int state = 0;
>> +	if (resume) {
>> +		state = TASK_WAKEKILL;
>> +		if (!(t->jobctl & JOBCTL_PTRACE_FROZEN))
>> +			state |= __TASK_TRACED;
>> +	}
>> +	signal_wake_up_state(t, state);
>
> Can't understand why is this better than the previous version which removed
> TASK_WAKEKILL if resume... Looks a bit strange to me. But again, I didn't
> look at the next patches yet.

The goal is to replace the existing mechanism with an equivalent one,
so that we don't have to be clever and deal with it being slightly
different in one case.

The difference is how does signal_pending_state affect how schedule will
sleep in ptrace_stop.

As the patch is constructed currently (and how the existing code works)
is that signal_pending_state will always sleep if ptrace_freeze_traced
completes successfully.

When TASK_WAKEKILL was included in TASK_TRACED schedule might refuse
to sleep even though ptrace_freeze_traced completed successfully.  As
you pointed out wait_task_inactive would then fail, keeping
ptrace_check_attach from succeeded.

Other than complicating the analysis by adding extra states we need to
consider when reviewing the patch, the practical difference is for
Peter's plans to fix PREEMPT_RT or the freezer wait_task_inactive needs
to cope with the final being changed by something else. (TASK_FROZEN in
the freezer case).  I can only see that happening by removing the
dependency on the final state in wait_task_inactive.  Which we can't do
if we depend on wait_task_inactive failing if the process is in the
wrong state.


>> @@ -2209,11 +2209,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>>  		spin_lock_irq(&current->sighand->siglock);
>>  	}
>>
>> -	/*
>> -	 * schedule() will not sleep if there is a pending signal that
>> -	 * can awaken the task.
>> -	 */
>> -	set_special_state(TASK_TRACED);
>> +	if (!__fatal_signal_pending(current))
>> +		set_special_state(TASK_TRACED);
>
> This is where I stuck. This probably makes sense, but what does it buy
> for this particular patch?
>
> And if we check __fatal_signal_pending(), why can't ptrace_stop() simply
> return ?

Again this is about preserving existing behavior as much as possible to
simplify analsysis of the patch.

The current code depends upon schedule not sleeping if there was a fatal
signal received before ptrace_stop is called.  With TASK_WAKEKILL
removed from TASK_TRACED that no longer happens.  Just not setting
TASK_TRACED when !__fatal_signal_pending has the same effect.


At a practical level I think it also has an impact on patch:
"10/12 ptrace: Only return signr from ptrace_stop if it was provided".

At a minimum the code would need to do something like:
	if (__fatal_signal_pending(current)) {
		return clear_code ? 0 : exit_code;
        }

With a little bit of care put in to ensure everytime the logic changes
that early return changes too.  I think that just complicates things
unnecessarily.

Eric
Oleg Nesterov May 3, 2022, 1:41 p.m. UTC | #6
On 05/02, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
> >>  #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
> >> -#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)
> >> +#define TASK_TRACED			__TASK_TRACED
> > ...
> >>  static inline void signal_wake_up(struct task_struct *t, bool resume)
> >>  {
> >> -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> >> +	unsigned int state = 0;
> >> +	if (resume) {
> >> +		state = TASK_WAKEKILL;
> >> +		if (!(t->jobctl & JOBCTL_PTRACE_FROZEN))
> >> +			state |= __TASK_TRACED;
> >> +	}
> >> +	signal_wake_up_state(t, state);
> >
> > Can't understand why is this better than the previous version which removed
> > TASK_WAKEKILL if resume... Looks a bit strange to me. But again, I didn't
> > look at the next patches yet.
>
> The goal is to replace the existing mechanism with an equivalent one,
> so that we don't have to be clever and deal with it being slightly
> different in one case.
>
> The difference is how does signal_pending_state affect how schedule will
> sleep in ptrace_stop.

But why is it bad if the tracee doesn't sleep in schedule ? If it races
with SIGKILL. I still can't understand this.

Yes, wait_task_inactive() can fail, so you need to remove WARN_ON_ONCE()
in 11/12.

Why is removing TASK_WAKEKILL from TASK_TRACED and complicating
*signal_wake_up() better?

And even if we need to ensure the tracee will always block after
ptrace_freeze_traced(), we can change signal_pending_state() to
return false if JOBCTL_PTRACE_FROZEN. Much simpler, imo. But still
looks unnecessary to me.



> Peter's plans to fix PREEMPT_RT or the freezer wait_task_inactive needs
> to cope with the final being changed by something else. (TASK_FROZEN in
> the freezer case).  I can only see that happening by removing the
> dependency on the final state in wait_task_inactive.  Which we can't do
> if we depend on wait_task_inactive failing if the process is in the
> wrong state.

OK, I guess this is what I do not understand. Could you spell please?

And speaking of RT, wait_task_inactive() still can fail because
cgroup_enter_frozen() takes css_set_lock? And it is called under
preempt_disable() ? I don't understand the plan :/

> At a practical level I think it also has an impact on patch:
> "10/12 ptrace: Only return signr from ptrace_stop if it was provided".

I didn't look at JOBCTL_PTRACE_SIGNR yet. But this looks minor to me,
I mean, I am not sure it worth the trouble.

Oleg.
Eric W. Biederman May 3, 2022, 8:45 p.m. UTC | #7
Oleg Nesterov <oleg@redhat.com> writes:

> On 05/02, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> >>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
>> >>  #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
>> >> -#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)
>> >> +#define TASK_TRACED			__TASK_TRACED
>> > ...
>> >>  static inline void signal_wake_up(struct task_struct *t, bool resume)
>> >>  {
>> >> -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
>> >> +	unsigned int state = 0;
>> >> +	if (resume) {
>> >> +		state = TASK_WAKEKILL;
>> >> +		if (!(t->jobctl & JOBCTL_PTRACE_FROZEN))
>> >> +			state |= __TASK_TRACED;
>> >> +	}
>> >> +	signal_wake_up_state(t, state);
>> >
>> > Can't understand why is this better than the previous version which removed
>> > TASK_WAKEKILL if resume... Looks a bit strange to me. But again, I didn't
>> > look at the next patches yet.
>>
>> The goal is to replace the existing mechanism with an equivalent one,
>> so that we don't have to be clever and deal with it being slightly
>> different in one case.
>>
>> The difference is how does signal_pending_state affect how schedule will
>> sleep in ptrace_stop.
>
> But why is it bad if the tracee doesn't sleep in schedule ? If it races
> with SIGKILL. I still can't understand this.
>
> Yes, wait_task_inactive() can fail, so you need to remove WARN_ON_ONCE()
> in 11/12.


>
> Why is removing TASK_WAKEKILL from TASK_TRACED and complicating
> *signal_wake_up() better?

Not changing __state is better because it removes special cases
from the scheduler that only apply to ptrace.


> And even if we need to ensure the tracee will always block after
> ptrace_freeze_traced(), we can change signal_pending_state() to
> return false if JOBCTL_PTRACE_FROZEN. Much simpler, imo. But still
> looks unnecessary to me.

We still need to change signal_wake_up in that case.  Possibly
signal_wake_up_state.  The choice is for fatal signals is TASK_WAKEKILL
suppressed or is TASK_TRACED added.

With removing TASK_WAKEKILL the resulting code behaves in a very obvious
minimally special case way.  Yes there is a special case in
signal_wake_up but that is the entirety of the special case and it is
easy to read and see what it does.

>> Peter's plans to fix PREEMPT_RT or the freezer wait_task_inactive needs
>> to cope with the final being changed by something else. (TASK_FROZEN in
>> the freezer case).  I can only see that happening by removing the
>> dependency on the final state in wait_task_inactive.  Which we can't do
>> if we depend on wait_task_inactive failing if the process is in the
>> wrong state.
>
> OK, I guess this is what I do not understand. Could you spell please?
>
> And speaking of RT, wait_task_inactive() still can fail because
> cgroup_enter_frozen() takes css_set_lock? And it is called under
> preempt_disable() ? I don't understand the plan :/

Let me describe his freezer change as that is much easier to get to the
final result.  RT has more problems as it turns all spin locks into
sleeping locks.  When a task is frozen it turns it's sleeping state into
TASK_FROZEN.  That is TASK_STOPPED and TASK_TRACED become TASK_FROZEN.
If this races with ptrace_check_attach the wait_task_inactive fail as
the process state has changed.  This makes the freezer userspace
visible.

For ordinary tasks the freezer thaws them just by giving them a spurious
wake-up.  After which they check their conditions and go back to sleep
on their on.  For TASK_STOPPED and TASK_TRACED (which can't handle
spurious wake-ups) the __state value is recovered from task->jobctl.

For RT cgroup_enter_frozen needs fixes that no one has proposed yet.
The problem is that for "preempt_disable()" before
"read_unlock(&tasklist_lock)" is not something that can reasonably be
removed.  It would cause a performance regression.

So my plan is to get the things as far as the Peter's freezer change
working.  That cleans up the code and makes it much closer for
ptrace working in PTREMPT_RT.  That makes the problems left for
the PREEMPT_RT folks much smaller.


>> At a practical level I think it also has an impact on patch:
>> "10/12 ptrace: Only return signr from ptrace_stop if it was provided".
>
> I didn't look at JOBCTL_PTRACE_SIGNR yet. But this looks minor to me,
> I mean, I am not sure it worth the trouble.

The immediate problem the JOBCTL_PTRACE_SIGNR patch solves is:
- stopping in ptrace_report_syscall.
- Not having PT_TRACESYSGOOD set.
- The tracee being killed with a fatal signal
- The tracee sending SIGTRAP to itself.

The larger problem solved by the JOBCTL_PTRACE_SIGNR patch is that
it removes the need for current->ptrace test from ptrace_stop.  Which
in turn is part of what is needed for wait_task_inactive to be
guaranteed a stop in ptrace_stop.


Thinking about it.  I think a reasonable case can be made that it
is weird if not dangerous to play with the task fields (ptrace_message,
last_siginfo, and exit_code) without task_is_traced being true.
So I will adjust my patch to check that.  The difference in behavior
is explicit enough we can think about it easily.

Eric
Oleg Nesterov May 4, 2022, 2:02 p.m. UTC | #8
On 05/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > But why is it bad if the tracee doesn't sleep in schedule ? If it races
> > with SIGKILL. I still can't understand this.
> >
> > Yes, wait_task_inactive() can fail, so you need to remove WARN_ON_ONCE()
> > in 11/12.
>
> >
> > Why is removing TASK_WAKEKILL from TASK_TRACED and complicating
> > *signal_wake_up() better?
>
> Not changing __state is better because it removes special cases
> from the scheduler that only apply to ptrace.

Hmm. But I didn't argue with that? I like the idea of JOBCTL_TASK_FROZEN.

I meant, I do not think that removing KILLABLE from TASK_TRACED (not
from __state) and complicating *signal_wake_up() (I mean, compared
to your previous version) is a good idea.

And. At least in context of this series it is fine if the JOBCTL_TASK_FROZEN
tracee do not block in schedule(), just you need to remove WARN_ON_ONCE()
around wait_task_inactive().

> > And even if we need to ensure the tracee will always block after
> > ptrace_freeze_traced(), we can change signal_pending_state() to
> > return false if JOBCTL_PTRACE_FROZEN. Much simpler, imo. But still
> > looks unnecessary to me.
>
> We still need to change signal_wake_up in that case.  Possibly
> signal_wake_up_state.

Of course. See above.

> >> if we depend on wait_task_inactive failing if the process is in the
> >> wrong state.
> >
> > OK, I guess this is what I do not understand. Could you spell please?
> >
> > And speaking of RT, wait_task_inactive() still can fail because
> > cgroup_enter_frozen() takes css_set_lock? And it is called under
> > preempt_disable() ? I don't understand the plan :/
>
> Let me describe his freezer change as that is much easier to get to the
> final result.  RT has more problems as it turns all spin locks into
> sleeping locks.  When a task is frozen

[...snip...]

Oh, thanks Eric, but I understand this part. But I still can't understand
why is it that critical to block in schedule... OK, I need to think about
it. Lets assume this is really necessary.

Anyway. I'd suggest to not change TASK_TRACED in this series and not
complicate signal_wake_up() more than you did in your previous version:

	static inline void signal_wake_up(struct task_struct *t, bool resume)
	{
		bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
		signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
	}

JOBCTL_PTRACE_FROZEN is fine.

ptrace_check_attach() can do

	if (!ret && !ignore_state &&
	    /*
	     * This can only fail if the frozen tracee races with
	     * SIGKILL and enters schedule() with fatal_signal_pending
	     */
	    !wait_task_inactive(child, __TASK_TRACED))
		ret = -ESRCH;

	return ret;


Now. If/when we really need to ensure that the frozen tracee always
blocks and wait_task_inactive() never fails, we can just do

	- add the fatal_signal_pending() check into ptrace_stop()
	  (like this patch does)

	- say, change signal_pending_state:

	static inline int signal_pending_state(unsigned int state, struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p))
			return 0;
		if (p->jobctl & JOBCTL_TASK_FROZEN)
			return 0;
		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
	}

in a separate patch which should carefully document the need for this
change.

> > I didn't look at JOBCTL_PTRACE_SIGNR yet. But this looks minor to me,
> > I mean, I am not sure it worth the trouble.
>
> The immediate problem the JOBCTL_PTRACE_SIGNR patch solves is:
> - stopping in ptrace_report_syscall.
> - Not having PT_TRACESYSGOOD set.
> - The tracee being killed with a fatal signal
        ^^^^^^
        tracer ?
> - The tracee sending SIGTRAP to itself.

Oh, but this is clear. But do we really care? If the tracer exits
unexpectedly, the tracee can have a lot more problems, I don't think
that this particular one is that important.

Oleg.
Eric W. Biederman May 4, 2022, 5:37 p.m. UTC | #9
Oleg Nesterov <oleg@redhat.com> writes:

> On 05/03, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > But why is it bad if the tracee doesn't sleep in schedule ? If it races
>> > with SIGKILL. I still can't understand this.
>> >
>> > Yes, wait_task_inactive() can fail, so you need to remove WARN_ON_ONCE()
>> > in 11/12.
>>
>> >
>> > Why is removing TASK_WAKEKILL from TASK_TRACED and complicating
>> > *signal_wake_up() better?
>>
>> Not changing __state is better because it removes special cases
>> from the scheduler that only apply to ptrace.
>
> Hmm. But I didn't argue with that? I like the idea of JOBCTL_TASK_FROZEN.
>
> I meant, I do not think that removing KILLABLE from TASK_TRACED (not
> from __state) and complicating *signal_wake_up() (I mean, compared
> to your previous version) is a good idea.
>
> And. At least in context of this series it is fine if the JOBCTL_TASK_FROZEN
> tracee do not block in schedule(), just you need to remove WARN_ON_ONCE()
> around wait_task_inactive().
>
>> > And even if we need to ensure the tracee will always block after
>> > ptrace_freeze_traced(), we can change signal_pending_state() to
>> > return false if JOBCTL_PTRACE_FROZEN. Much simpler, imo. But still
>> > looks unnecessary to me.
>>
>> We still need to change signal_wake_up in that case.  Possibly
>> signal_wake_up_state.
>
> Of course. See above.
>
>> >> if we depend on wait_task_inactive failing if the process is in the
>> >> wrong state.
>> >
>> > OK, I guess this is what I do not understand. Could you spell please?
>> >
>> > And speaking of RT, wait_task_inactive() still can fail because
>> > cgroup_enter_frozen() takes css_set_lock? And it is called under
>> > preempt_disable() ? I don't understand the plan :/
>>
>> Let me describe his freezer change as that is much easier to get to the
>> final result.  RT has more problems as it turns all spin locks into
>> sleeping locks.  When a task is frozen
>
> [...snip...]
>
> Oh, thanks Eric, but I understand this part. But I still can't understand
> why is it that critical to block in schedule... OK, I need to think about
> it. Lets assume this is really necessary.
>
> Anyway. I'd suggest to not change TASK_TRACED in this series and not
> complicate signal_wake_up() more than you did in your previous version:
>
> 	static inline void signal_wake_up(struct task_struct *t, bool resume)
> 	{
> 		bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
> 		signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
> 	}

If your concern is signal_wake_up there is no reason it can't be:

	static inline void signal_wake_up(struct task_struct *t, bool fatal)
        {
        	fatal = fatal && !(t->jobctl & JOBCTL_PTRACE_FROZEN);
                signal_wake_up_state(t, fatal ? TASK_WAKEKILL | TASK_TRACED : 0);
        }

I guess I was more targeted in this version, which lead to more if
statements but as there is only one place in the code that can be
JOBCTL_PTRACE_FROZEN and TASK_TRACED there is no point in setting
TASK_WAKEKILL without also setting TASK_TRACED in the wake-up.

So yes. I can make the code as simple as my earlier version of
signal_wake_up.

> JOBCTL_PTRACE_FROZEN is fine.
>
> ptrace_check_attach() can do
>
> 	if (!ret && !ignore_state &&
> 	    /*
> 	     * This can only fail if the frozen tracee races with
> 	     * SIGKILL and enters schedule() with fatal_signal_pending
> 	     */
> 	    !wait_task_inactive(child, __TASK_TRACED))
> 		ret = -ESRCH;
>
> 	return ret;
>
>
> Now. If/when we really need to ensure that the frozen tracee always
> blocks and wait_task_inactive() never fails, we can just do
>
> 	- add the fatal_signal_pending() check into ptrace_stop()
> 	  (like this patch does)
>
> 	- say, change signal_pending_state:
>
> 	static inline int signal_pending_state(unsigned int state, struct task_struct *p)
> 	{
> 		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
> 			return 0;
> 		if (!signal_pending(p))
> 			return 0;
> 		if (p->jobctl & JOBCTL_TASK_FROZEN)
> 			return 0;
> 		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
> 	}
>
> in a separate patch which should carefully document the need for this
> change.
>
>> > I didn't look at JOBCTL_PTRACE_SIGNR yet. But this looks minor to me,
>> > I mean, I am not sure it worth the trouble.
>>
>> The immediate problem the JOBCTL_PTRACE_SIGNR patch solves is:
>> - stopping in ptrace_report_syscall.
>> - Not having PT_TRACESYSGOOD set.
>> - The tracee being killed with a fatal signal
>         ^^^^^^
>         tracer ?

Both actually.

>> - The tracee sending SIGTRAP to itself.
>
> Oh, but this is clear. But do we really care? If the tracer exits
> unexpectedly, the tracee can have a lot more problems, I don't think
> that this particular one is that important.

I don't know of complaints, and if you haven't heard them either
that that is a good indication that in practice we don't care.

At a practical level I just don't want that silly case that sets
TASK_TRACED to TASK_RUNNING without stopping at all in ptrace_stop to
remain.  It just seems to make everything more complicated for no real
reason anymore.  The deadlocks may_ptrace_stop was guarding against are
gone.

Plus the test is so racy we case can happen after we drop siglock
before we schedule, or shortly after we have stopped so we really
don't reliably catch the condition the code is trying to catch.

I think the case I care most about is ptrace_signal, which pretty much
requires the tracer to wait and clear exit_code before being terminated
to cause problems.  We don't handle that at all today.

So yeah.  I think the code handles so little at this point we can just
remove the code and simplify things, if we actually care we can come
back and implement JOBCTL_PTRACE_SIGNR or the like.

I will chew on that a bit and see if I can find any reasons for keeping
the code in ptrace_stop at all.



As an added data point we can probably remove handling of the signal
from ptrace_report_syscall entirely (not in this patchset!).

I took a quick skim and it appears that sending a signal in
ptrace_report_syscall appears to be a feature introduced with ptrace
support in Linux v1.0 and the comment in ptrace_report_syscall appears
to document the fact that the code has always been dead.


I made it through 13 of 133 pages of debian code search results for
PTRACE_SYSCALL, and the only use I could find of setting the continue
signal was when the signal reported from wait was not SIGTRAP.  Exactly
the same as in the comment in ptrace_report_syscall.

If that pattern holds for all of the uses of ptrace then the code
in ptrace_report_syscall is dead.



Eric
Eric W. Biederman May 4, 2022, 6:28 p.m. UTC | #10
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Oleg Nesterov <oleg@redhat.com> writes:
>
>> On 05/03, Eric W. Biederman wrote:
>>>
>>> Oleg Nesterov <oleg@redhat.com> writes:
>>>
>>> > But why is it bad if the tracee doesn't sleep in schedule ? If it races
>>> > with SIGKILL. I still can't understand this.
>>> >
>>> > Yes, wait_task_inactive() can fail, so you need to remove WARN_ON_ONCE()
>>> > in 11/12.
>>>
>>> >
>>> > Why is removing TASK_WAKEKILL from TASK_TRACED and complicating
>>> > *signal_wake_up() better?
>>>
>>> Not changing __state is better because it removes special cases
>>> from the scheduler that only apply to ptrace.
>>
>> Hmm. But I didn't argue with that? I like the idea of JOBCTL_TASK_FROZEN.
>>
>> I meant, I do not think that removing KILLABLE from TASK_TRACED (not
>> from __state) and complicating *signal_wake_up() (I mean, compared
>> to your previous version) is a good idea.
>>
>> And. At least in context of this series it is fine if the JOBCTL_TASK_FROZEN
>> tracee do not block in schedule(), just you need to remove WARN_ON_ONCE()
>> around wait_task_inactive().
>>
>>> > And even if we need to ensure the tracee will always block after
>>> > ptrace_freeze_traced(), we can change signal_pending_state() to
>>> > return false if JOBCTL_PTRACE_FROZEN. Much simpler, imo. But still
>>> > looks unnecessary to me.
>>>
>>> We still need to change signal_wake_up in that case.  Possibly
>>> signal_wake_up_state.
>>
>> Of course. See above.
>>
>>> >> if we depend on wait_task_inactive failing if the process is in the
>>> >> wrong state.
>>> >
>>> > OK, I guess this is what I do not understand. Could you spell please?
>>> >
>>> > And speaking of RT, wait_task_inactive() still can fail because
>>> > cgroup_enter_frozen() takes css_set_lock? And it is called under
>>> > preempt_disable() ? I don't understand the plan :/
>>>
>>> Let me describe his freezer change as that is much easier to get to the
>>> final result.  RT has more problems as it turns all spin locks into
>>> sleeping locks.  When a task is frozen
>>
>> [...snip...]
>>
>> Oh, thanks Eric, but I understand this part. But I still can't understand
>> why is it that critical to block in schedule... OK, I need to think about
>> it. Lets assume this is really necessary.
>>
>> Anyway. I'd suggest to not change TASK_TRACED in this series and not
>> complicate signal_wake_up() more than you did in your previous version:
>>
>> 	static inline void signal_wake_up(struct task_struct *t, bool resume)
>> 	{
>> 		bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
>> 		signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
>> 	}
>
> If your concern is signal_wake_up there is no reason it can't be:
>
> 	static inline void signal_wake_up(struct task_struct *t, bool fatal)
>         {
>         	fatal = fatal && !(t->jobctl & JOBCTL_PTRACE_FROZEN);
>                 signal_wake_up_state(t, fatal ? TASK_WAKEKILL | TASK_TRACED : 0);
>         }
>
> I guess I was more targeted in this version, which lead to more if
> statements but as there is only one place in the code that can be
> JOBCTL_PTRACE_FROZEN and TASK_TRACED there is no point in setting
> TASK_WAKEKILL without also setting TASK_TRACED in the wake-up.
>
> So yes. I can make the code as simple as my earlier version of
> signal_wake_up.
>
>> JOBCTL_PTRACE_FROZEN is fine.
>>
>> ptrace_check_attach() can do
>>
>> 	if (!ret && !ignore_state &&
>> 	    /*
>> 	     * This can only fail if the frozen tracee races with
>> 	     * SIGKILL and enters schedule() with fatal_signal_pending
>> 	     */
>> 	    !wait_task_inactive(child, __TASK_TRACED))
>> 		ret = -ESRCH;
>>
>> 	return ret;
>>
>>
>> Now. If/when we really need to ensure that the frozen tracee always
>> blocks and wait_task_inactive() never fails, we can just do
>>
>> 	- add the fatal_signal_pending() check into ptrace_stop()
>> 	  (like this patch does)
>>
>> 	- say, change signal_pending_state:
>>
>> 	static inline int signal_pending_state(unsigned int state, struct task_struct *p)
>> 	{
>> 		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
>> 			return 0;
>> 		if (!signal_pending(p))
>> 			return 0;
>> 		if (p->jobctl & JOBCTL_TASK_FROZEN)
>> 			return 0;
>> 		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
>> 	}
>>
>> in a separate patch which should carefully document the need for this
>> change.
>>
>>> > I didn't look at JOBCTL_PTRACE_SIGNR yet. But this looks minor to me,
>>> > I mean, I am not sure it worth the trouble.
>>>
>>> The immediate problem the JOBCTL_PTRACE_SIGNR patch solves is:
>>> - stopping in ptrace_report_syscall.
>>> - Not having PT_TRACESYSGOOD set.
>>> - The tracee being killed with a fatal signal
>>         ^^^^^^
>>         tracer ?
>
> Both actually.
>
>>> - The tracee sending SIGTRAP to itself.
>>
>> Oh, but this is clear. But do we really care? If the tracer exits
>> unexpectedly, the tracee can have a lot more problems, I don't think
>> that this particular one is that important.
>
> I don't know of complaints, and if you haven't heard them either
> that that is a good indication that in practice we don't care.
>
> At a practical level I just don't want that silly case that sets
> TASK_TRACED to TASK_RUNNING without stopping at all in ptrace_stop to
> remain.  It just seems to make everything more complicated for no real
> reason anymore.  The deadlocks may_ptrace_stop was guarding against are
> gone.
>
> Plus the test is so racy we case can happen after we drop siglock
> before we schedule, or shortly after we have stopped so we really
> don't reliably catch the condition the code is trying to catch.
>
> I think the case I care most about is ptrace_signal, which pretty much
> requires the tracer to wait and clear exit_code before being terminated
> to cause problems.  We don't handle that at all today.
>
> So yeah.  I think the code handles so little at this point we can just
> remove the code and simplify things, if we actually care we can come
> back and implement JOBCTL_PTRACE_SIGNR or the like.

The original explanation for handling this is:

commit 66519f549ae516e7ff2f24a8a5134713411a4a58
Author: Roland McGrath <roland@redhat.com>
Date:   Tue Jan 4 05:38:15 2005 -0800

    [PATCH] fix ptracer death race yielding bogus BUG_ON
    
    There is a BUG_ON in ptrace_stop that hits if the thread is not ptraced.
    However, there is no synchronization between a thread deciding to do a
    ptrace stop and so going here, and its ptracer dying and so detaching from
    it and clearing its ->ptrace field.
    
    The RHEL3 2.4-based kernel has a backport of a slightly older version of
    the 2.6 signals code, which has a different but equivalent BUG_ON.  This
    actually bit users in practice (when the debugger dies), but was
    exceedingly difficult to reproduce in contrived circumstances.  We moved
    forward in RHEL3 just by removing the BUG_ON, and that fixed the real user
    problems even though I was never able to reproduce the scenario myself.
    So, to my knowledge this scenario has never actually been seen in practice
    under 2.6.  But it's plain to see from the code that it is indeed possible.
    
    This patch removes that BUG_ON, but also goes further and tries to handle
    this case more gracefully than simply avoiding the crash.  By removing the
    BUG_ON alone, it becomes possible for the real parent of a process to see
    spurious SIGCHLD notifications intended for the debugger that has just
    died, and have its child wind up stopped unexpectedly.  This patch avoids
    that possibility by detecting the case when we are about to do the ptrace
    stop but our ptracer has gone away, and simply eliding that ptrace stop
    altogether as if we hadn't been ptraced when we hit the interesting event
    (signal or ptrace_notify call for syscall tracing or something like that).
    
    Signed-off-by: Roland McGrath <roland@redhat.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

And it was all about
	BUG_ON(!(current->ptrace & PT_PTRACED));
At the beginning of ptrace_stop.

Which seems like a bit of buggy overkill.

>
> I will chew on that a bit and see if I can find any reasons for keeping
> the code in ptrace_stop at all.

Still chewing.

Eric
diff mbox series

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5e3c00b74e1..610f2fdb1e2c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -103,7 +103,7 @@  struct task_group;
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
 #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
-#define TASK_TRACED			(TASK_WAKEKILL | __TASK_TRACED)
+#define TASK_TRACED			__TASK_TRACED
 
 #define TASK_IDLE			(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
 
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..d556c3425963 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@  struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_PTRACE_FROZEN_BIT	24	/* frozen for ptrace */
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@  struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_PTRACE_FROZEN	(1UL << JOBCTL_PTRACE_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3c8b34876744..35af34eeee9e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -437,7 +437,13 @@  extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
 
 static inline void signal_wake_up(struct task_struct *t, bool resume)
 {
-	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
+	unsigned int state = 0;
+	if (resume) {
+		state = TASK_WAKEKILL;
+		if (!(t->jobctl & JOBCTL_PTRACE_FROZEN))
+			state |= __TASK_TRACED;
+	}
+	signal_wake_up_state(t, state);
 }
 static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 {
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43da5764b6f3..644eb7439d01 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -197,7 +197,7 @@  static bool ptrace_freeze_traced(struct task_struct *task)
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		task->jobctl |= JOBCTL_PTRACE_FROZEN;
 		ret = true;
 	}
 	spin_unlock_irq(&task->sighand->siglock);
@@ -207,23 +207,19 @@  static bool ptrace_freeze_traced(struct task_struct *task)
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
-		return;
-
-	WARN_ON(!task->ptrace || task->parent != current);
+	unsigned long flags;
 
 	/*
-	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
-	 * Recheck state under the lock to close this race.
+	 * The child may be awake and may have cleared
+	 * JOBCTL_PTRACE_FROZEN (see ptrace_resume).  The child will
+	 * not set JOBCTL_PTRACE_FROZEN or enter __TASK_TRACED anew.
 	 */
-	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
+	if (lock_task_sighand(task, &flags)) {
+		task->jobctl &= ~JOBCTL_PTRACE_FROZEN;
 		if (__fatal_signal_pending(task))
 			wake_up_state(task, __TASK_TRACED);
-		else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+		unlock_task_sighand(task, &flags);
 	}
-	spin_unlock_irq(&task->sighand->siglock);
 }
 
 /**
@@ -256,7 +252,6 @@  static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 */
 	read_lock(&tasklist_lock);
 	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
diff --git a/kernel/signal.c b/kernel/signal.c
index 3fd2ce133387..5cf268982a7e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2209,11 +2209,8 @@  static int ptrace_stop(int exit_code, int why, int clear_code,
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
-	/*
-	 * schedule() will not sleep if there is a pending signal that
-	 * can awaken the task.
-	 */
-	set_special_state(TASK_TRACED);
+	if (!__fatal_signal_pending(current))
+		set_special_state(TASK_TRACED);
 
 	/*
 	 * We're committing to trapping.  TRACED should be visible before
@@ -2321,7 +2318,7 @@  static int ptrace_stop(int exit_code, int why, int clear_code,
 	current->exit_code = 0;
 
 	/* LISTENING can be set only during STOP traps, clear it */
-	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~(JOBCTL_LISTENING | JOBCTL_PTRACE_FROZEN);
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.