handle_exit_race && PF_EXITING

Message ID 20191105152728.GA5666@redhat.com

Commit Message

Oleg Nesterov Nov. 5, 2019, 3:27 p.m. UTC
On 11/05, Thomas Gleixner wrote:
>
> Out of curiosity, what's the race issue vs. robust list which you are
> trying to solve?

Off-topic, but this reminds me...

	#include <sched.h>
	#include <assert.h>
	#include <unistd.h>
	#include <syscall.h>

	#define FUTEX_LOCK_PI		6

	int main(void)
	{
		struct sched_param sp = {};

		sp.sched_priority = 2;
		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

		int lock = vfork();
		if (!lock) {
			sp.sched_priority = 1;
			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
			_exit(0);
		}

		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
		return 0;
	}

this creates an unkillable RT process spinning in futex_lock_pi() on
a single-CPU machine (or you can use taskset).
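
For the record: the vfork() return value stored in "lock" is the child's
TID, so FUTEX_LOCK_PI treats the already-exiting child as the lock owner.
attach_to_pi_owner() then keeps returning -EAGAIN, and the retry path in
futex_lock_pi() (see the hunk at the end of this thread) just spins:

	case -EAGAIN:
		/*
		 * Two reasons for this:
		 * - Task is exiting and we just wait for the
		 *   exit to complete.
		 * - The user space value changed.
		 */
		queue_unlock(hb);
		put_futex_key(&q.key);
		cond_resched();	/* a no-op for the top-priority RT task */
		goto retry;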

Probably the patch below makes sense anyway, but of course it doesn't
solve the real problem: futex_lock_pi() should not spin in this case.

It seems to me I even sent a fix long ago, but I can't recall what
exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
must simply die; I'll try to recall...

Oleg.

Comments

Thomas Gleixner Nov. 5, 2019, 5:28 p.m. UTC | #1
On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> On 11/05, Thomas Gleixner wrote:
> >
> > Out of curiosity, what's the race issue vs. robust list which you are
> > trying to solve?
> 
> Off-topic, but this reminds me...
> 
> 	#include <sched.h>
> 	#include <assert.h>
> 	#include <unistd.h>
> 	#include <syscall.h>
> 
> 	#define FUTEX_LOCK_PI		6
> 
> 	int main(void)
> 	{
> 		struct sched_param sp = {};
> 
> 		sp.sched_priority = 2;
> 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> 
> 		int lock = vfork();
> 		if (!lock) {
> 			sp.sched_priority = 1;
> 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> 			_exit(0);
> 		}
> 
> 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> 		return 0;
> 	}
> 
> this creates an unkillable RT process spinning in futex_lock_pi() on
> a single-CPU machine (or you can use taskset).

Uuurgh.

> Probably the patch below makes sense anyway, but of course it doesn't
> solve the real problem: futex_lock_pi() should not spin in this case.

Obviously not.

> It seems to me I even sent a fix long ago, but I can't recall what
> exactly it did. Probably the PF_EXITING check in attach_to_pi_owner()
> must simply die; I'll try to recall...

We can't do that. The task might have released the futex already, so the
waiter would operate on inconsistent state :(

The problem with that exit race is that there is no form of serialization
which might be used to address that. We can't abuse any of the task locks
for this.

I was working on a patch set some time ago to fix another more esoteric
futex exit issue. Let me resurrect that.

Vs. the signal pending check. That makes sense on its own, but we should
not restrict it to fatal signals. Futexes are interruptible by definition.
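
Applied to the hunk at the end of this thread, the -EAGAIN path would
then look something like this (untested sketch):

	case -EAGAIN:
		/*
		 * Retry only when no signal is pending. Any signal,
		 * not just a fatal one, should end the loop.
		 */
		if (!signal_pending(current)) {
			queue_unlock(hb);
			put_futex_key(&q.key);
			cond_resched();
			goto retry;
		}
		/* fall through */
	default:
		goto out_unlock_put_key;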

Thanks,

	tglx
Thomas Gleixner Nov. 5, 2019, 5:59 p.m. UTC | #2
On Tue, 5 Nov 2019, Thomas Gleixner wrote:

> On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > On 11/05, Thomas Gleixner wrote:
> > >
> > > Out of curiosity, what's the race issue vs. robust list which you are
> > > trying to solve?
> > 
> > Off-topic, but this reminds me...
> > 
> > 	#include <sched.h>
> > 	#include <assert.h>
> > 	#include <unistd.h>
> > 	#include <syscall.h>
> > 
> > 	#define FUTEX_LOCK_PI		6
> > 
> > 	int main(void)
> > 	{
> > 		struct sched_param sp = {};
> > 
> > 		sp.sched_priority = 2;
> > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > 
> > 		int lock = vfork();
> > 		if (!lock) {
> > 			sp.sched_priority = 1;
> > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > 			_exit(0);
> > 		}
> > 
> > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> > 		return 0;
> > 	}
> > 
> > this creates an unkillable RT process spinning in futex_lock_pi() on
> > a single-CPU machine (or you can use taskset).
> 
> Uuurgh.

But staring more at it. That's a scheduler bug.

parent	    	    	child

 set FIFO prio 2

 fork()	         ->	set FIFO prio 1
 		 	 sched_setscheduler(...)
			   return from syscall		<= BUG

 		 	_exit()

When the child lowers its priority from 2 to 1, then the parent _must_
preempt the child simply because the parent is now the top priority task on
that CPU. Child should never reach exit before the parent blocks on the
futex.

Peter?

What's even more disturbing is that even with that bug happening the child
is able to set PF_EXITING, but not PF_EXITPIDONE. That doesn't make sense.

Thanks,

	tglx
Thomas Gleixner Nov. 5, 2019, 6:56 p.m. UTC | #3
On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> > On Tue, 5 Nov 2019, Oleg Nesterov wrote:
> > > On 11/05, Thomas Gleixner wrote:
> > > >
> > > > Out of curiosity, what's the race issue vs. robust list which you are
> > > > trying to solve?
> > > 
> > > Off-topic, but this reminds me...
> > > 
> > > 	#include <sched.h>
> > > 	#include <assert.h>
> > > 	#include <unistd.h>
> > > 	#include <syscall.h>
> > > 
> > > 	#define FUTEX_LOCK_PI		6
> > > 
> > > 	int main(void)
> > > 	{
> > > 		struct sched_param sp = {};
> > > 
> > > 		sp.sched_priority = 2;
> > > 		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > > 
> > > 		int lock = vfork();
> > > 		if (!lock) {
> > > 			sp.sched_priority = 1;
> > > 			assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
> > > 			_exit(0);
> > > 		}
> > > 
> > > 		syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
> > > 		return 0;
> > > 	}
> > > 
> > > this creates an unkillable RT process spinning in futex_lock_pi() on
> > > a single-CPU machine (or you can use taskset).
> > 
> > Uuurgh.
> 
> But staring more at it. That's a scheduler bug.
> 
> parent	    	    	child
> 
>  set FIFO prio 2
> 
>  fork()	         ->	set FIFO prio 1
>  		 	 sched_setscheduler(...)
> 			   return from syscall		<= BUG
> 
>  		 	_exit()
> 
> When the child lowers its priority from 2 to 1, then the parent _must_
> preempt the child simply because the parent is now the top priority task on
> that CPU. Child should never reach exit before the parent blocks on the
> futex.

I'm a moron. It's vfork() not fork() so the behaviour is expected.

Staring more at the trace shows me where this goes down the drain.

Thanks,

	tglx
Thomas Gleixner Nov. 5, 2019, 7:19 p.m. UTC | #4
On Tue, 5 Nov 2019, Thomas Gleixner wrote:
> 
> I'm a moron. It's vfork() not fork() so the behaviour is expected.
> 
> Staring more at the trace shows me where this goes down the drain.

 parent	    	    	child
 
  set FIFO prio 2
 
  vfork()			->	set FIFO prio 1
   implies wait_for_child()	 	sched_setscheduler(...)
 			   		exit()
					do_exit()
					tsk->flags |= PF_EXITING;
 					....
					mm_release()
					  exit_futex(); (NOOP in this case)
					  complete() --> wakes parent
 sys_futex()
    loop infinite because
    	 PF_EXITING is set,
	 but PF_EXITPIDONE not

So the obvious question is why PF_EXITPIDONE is set way after the futex
exit cleanup has run, but moving this right after exit_futex() would not
solve the exit race completely because the code after setting PF_EXITING is
preemptible. So the same crap could happen just by preemption:

  task holds futex
  ...
  do_exit()
    tsk->flags |= PF_EXITING;

preemption (unrelated wakeup of some other higher prio task, e.g. timer)

  switch_to(other_task)

  return to user
  sys_futex()
	loop infinite as above

And just for the fun of it the futex exit cleanup could trigger the wakeup
itself before PF_EXITPIDONE is set.

There is some other issue which I need to look up again. That's a slightly
different problem, but related to futex exit race conditions.

The way we can deal with that is:

    do_exit()
    tsk->flags |= PF_EXITING;
    ...
    mutex_lock(&tsk->futex_exit_mutex);
    futex_exit();
    tsk->flags |= PF_EXITPIDONE;
    mutex_unlock(&tsk->futex_exit_mutex);
    
and on the futex lock_pi side:

    if (!(tsk->flags & PF_EXITING))
    	return 0;		<- All good

    if (tsk->flags & PF_EXITPIDONE)
        return -EOWNERDEAD;	<- Locker can take over

    mutex_lock(&tsk->futex_exit_mutex);
    if (tsk->flags & PF_EXITPIDONE) {
        mutex_unlock(&tsk->futex_exit_mutex);
        return -EOWNERDEAD;	<- Locker can take over
    }

    queue_futex();
    mutex_unlock(&tsk->futex_exit_mutex);

Not that I think it's pretty, but it plugs all holes AFAICT.
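
A quick user space model of that handshake, purely for illustration (the
PF_* bits and the mutex are stand-ins for the real task state):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define PF_EXITING	0x1
	#define PF_EXITPIDONE	0x2

	static atomic_int flags;
	static pthread_mutex_t futex_exit_mutex = PTHREAD_MUTEX_INITIALIZER;

	static void *exiting_task(void *arg)	/* plays do_exit() */
	{
		(void)arg;
		atomic_fetch_or(&flags, PF_EXITING);
		/* the preemptible window which livelocks today */
		pthread_mutex_lock(&futex_exit_mutex);
		/* futex_exit(): pi state cleanup would run here */
		atomic_fetch_or(&flags, PF_EXITPIDONE);
		pthread_mutex_unlock(&futex_exit_mutex);
		return NULL;
	}

	/* plays attach_to_pi_owner(): 0 = queue, -1 = take over */
	static int locker_attach(void)
	{
		int f = atomic_load(&flags);

		if (!(f & PF_EXITING))
			return 0;		/* all good */
		if (f & PF_EXITPIDONE)
			return -1;		/* locker can take over */

		/* owner is exiting: block instead of spinning */
		pthread_mutex_lock(&futex_exit_mutex);
		f = atomic_load(&flags);
		pthread_mutex_unlock(&futex_exit_mutex);
		return (f & PF_EXITPIDONE) ? -1 : 0;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, exiting_task, NULL);
		pthread_join(t, NULL);
		printf("attach: %d\n", locker_attach());
		return 0;
	}

The point is the blocking mutex_lock(): the waiter sleeps, so even a
lower priority owner gets to finish its cleanup.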

Thanks,

	tglx
Oleg Nesterov Nov. 6, 2019, 8:55 a.m. UTC | #5
On 11/05, Thomas Gleixner wrote:
>
>  sys_futex()
>     loop infinite because
>     	 PF_EXITING is set,
> 	 but PF_EXITPIDONE not

Yes.

IOW, the problem is very simple. An RT task preempts the exiting lock owner
after it sets PF_EXITING but before it sets PF_EXITPIDONE; if they run on
the same CPU, futex_lock_pi() will spin forever.

> So the obvious question is why PF_EXITPIDONE is set way after the futex
> exit cleanup has run,

Another obvious question is why this code checks PF_EXITING. I still think
it should not.

> The way we can deal with that is:
>
>     do_exit()
>     tsk->flags |= PF_EXITING;
>     ...
>     mutex_lock(&tsk->futex_exit_mutex);
>     futex_exit();
>     tsk->flags |= PF_EXITPIDONE;
>     mutex_unlock(&tsk->futex_exit_mutex);
>
> and on the futex lock_pi side:
>
>     if (!(tsk->flags & PF_EXITING))
>     	return 0;		<- All good
>
>     if (tsk->flags & PF_EXITPIDONE)
>         return -EOWNERDEAD;	<- Locker can take over
>
>     mutex_lock(&tsk->futex_exit_mutex);
>     if (tsk->flags & PF_EXITPIDONE) {
>         mutex_unlock(&tsk->futex_exit_mutex);
>         return -EOWNERDEAD;	<- Locker can take over
>     }
>
>     queue_futex();
>     mutex_unlock(&tsk->futex_exit_mutex);
>
> Not that I think it's pretty, but it plugs all holes AFAICT.

I have found the fix I sent in 2015, attached below. I forgot everything
I knew about futex.c, so I need some time to adapt it to the current code.

But I think it is clear what this patch tries to do, do you see any hole?

Oleg.

[PATCH] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition

It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN, which
triggers "retry", if the lock owner is PF_EXITING but not PF_EXITPIDONE.
This burns CPU for no reason and can even livelock if the rt_task()
caller preempts a PF_EXITING owner.

Remove the PF_EXITING check altogether. We do not care if it is exiting;
all we need to know is whether we can rely on exit_pi_state_list() or not.
So we also need to set PF_EXITPIDONE before we flush ->pi_state_list, and
call exit_pi_state_list() unconditionally.

Perhaps we can add the fast-path list_empty() check in mm_release() back,
but let's fix the problem first. Besides, in theory this check is already
incorrect; at least it should be list_empty_careful() to avoid the race
with a free_pi_state() in progress.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c  |   22 +---------------------
 kernel/fork.c  |    3 +--
 kernel/futex.c |   40 ++++++++++------------------------------
 3 files changed, 12 insertions(+), 53 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55..bc969ed 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -683,27 +683,13 @@ void do_exit(long code)
 	 */
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
-		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
-		 */
+		/* Avoid the new additions to ->pi_state_list at least */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
 
 	exit_signals(tsk);  /* sets PF_EXITING */
-	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic()))
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -779,12 +765,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ec3208e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index b101381..c1104a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
 
 	if (!futex_cmpxchg_enabled)
 		return;
+
 	/*
-	 * We are a ZOMBIE and nobody can enqueue itself on
-	 * pi_state_list anymore, but we have to be careful
-	 * versus waiters unqueueing themselves:
+	 * attach_to_pi_owner() can no longer add the new entry. But
+	 * we have to be careful versus waiters unqueueing themselves.
 	 */
+	curr->flags |= PF_EXITPIDONE;
+
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
@@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		return -EPERM;
 	}
 
-	/*
-	 * We need to look at the task state flags to figure out,
-	 * whether the task is exiting. To protect against the do_exit
-	 * change of the task flags, we do this protected by
-	 * p->pi_lock:
-	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
-		/*
-		 * The task is on the way out. When PF_EXITPIDONE is
-		 * set, we know that the task has finished the
-		 * cleanup:
-		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
+		/* exit_pi_state_list() was already called */
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*
@@ -1633,12 +1623,7 @@ retry_private:
 				goto retry;
 			goto out;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Owner is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			free_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
@@ -2295,12 +2280,7 @@ retry_private:
 		case -EFAULT:
 			goto uaddr_faulted;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Task is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
Thomas Gleixner Nov. 6, 2019, 9:53 a.m. UTC | #6
Oleg,

On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> I have found the fix I sent in 2015, attached below. I forgot everything
> I knew about futex.c, so I need some time to adapt it to the current code.
> 
> But I think it is clear what this patch tries to do, do you see any hole?

> @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
>  
>  	if (!futex_cmpxchg_enabled)
>  		return;
> +
>  	/*
> -	 * We are a ZOMBIE and nobody can enqueue itself on
> -	 * pi_state_list anymore, but we have to be careful
> -	 * versus waiters unqueueing themselves:
> +	 * attach_to_pi_owner() can no longer add the new entry. But
> +	 * we have to be careful versus waiters unqueueing themselves.
>  	 */
> +	curr->flags |= PF_EXITPIDONE;

This obviously would need a barrier or would have to be moved inside of the
pi_lock region.

>  	raw_spin_lock_irq(&curr->pi_lock);
>  	while (!list_empty(head)) {
>  
> @@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
>  		return -EPERM;
>  	}
>  
> -	/*
> -	 * We need to look at the task state flags to figure out,
> -	 * whether the task is exiting. To protect against the do_exit
> -	 * change of the task flags, we do this protected by
> -	 * p->pi_lock:
> -	 */
>  	raw_spin_lock_irq(&p->pi_lock);
> -	if (unlikely(p->flags & PF_EXITING)) {
> -		/*
> -		 * The task is on the way out. When PF_EXITPIDONE is
> -		 * set, we know that the task has finished the
> -		 * cleanup:
> -		 */
> -		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
> -
> +	if (unlikely(p->flags & PF_EXITPIDONE)) {
> +		/* exit_pi_state_list() was already called */
>  		raw_spin_unlock_irq(&p->pi_lock);
>  		put_task_struct(p);
> -		return ret;
> +		return -ESRCH;

But this is incorrect, because we'd return -ESRCH to user space while the
futex value still has the TID of the exiting task set, and that task will
subsequently clean out the futex and set the owner-died bit.

The result is inconsistent state and will trigger the asserts in the futex
test suite and in the pthread_mutex implementation.

The only reason why -ESRCH can be returned is when the user space value of
the futex contains garbage. But in this case it does not contain garbage
and returning -ESRCH violates the implicit robustness guarantee of PI
futexes and causes unexpected havoc.

See da791a667536 ("futex: Cure exit race") for example.

The futex PI contract between kernel and user space relies on consistent
state. Guess why that code has more corner case handling than actual
functionality. :)
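
For illustration, this is the user space contract at stake: with a robust
PI mutex a dying owner must surface as EOWNERDEAD so the next locker can
recover. A minimal sketch (not from this thread):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <errno.h>
	#include <stdio.h>

	static pthread_mutex_t m;

	static void *owner(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&m);
		return NULL;		/* dies holding the lock */
	}

	int main(void)
	{
		pthread_mutexattr_t a;
		pthread_t t;
		int ret;

		pthread_mutexattr_init(&a);
		pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
		pthread_mutexattr_setprotocol(&a, PTHREAD_PRIO_INHERIT);
		pthread_mutex_init(&m, &a);

		pthread_create(&t, NULL, owner, NULL);
		pthread_join(t, NULL);

		ret = pthread_mutex_lock(&m);
		printf("lock: %s\n", ret == EOWNERDEAD ? "EOWNERDEAD" : "?");
		if (ret == EOWNERDEAD)
			pthread_mutex_consistent(&m);	/* recover it */
		pthread_mutex_unlock(&m);
		return 0;
	}

-ESRCH with a stale TID in the futex word gives user space nothing it can
act on; EOWNERDEAD does.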

Thanks,

	tglx
Oleg Nesterov Nov. 6, 2019, 10:35 a.m. UTC | #7
On 11/06, Thomas Gleixner wrote:
>
> > @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
> >
> >  	if (!futex_cmpxchg_enabled)
> >  		return;
> > +
> >  	/*
> > -	 * We are a ZOMBIE and nobody can enqueue itself on
> > -	 * pi_state_list anymore, but we have to be careful
> > -	 * versus waiters unqueueing themselves:
> > +	 * attach_to_pi_owner() can no longer add the new entry. But
> > +	 * we have to be careful versus waiters unqueueing themselves.
> >  	 */
> > +	curr->flags |= PF_EXITPIDONE;
>
> This obviously would need a barrier or would have to be moved inside of the
> pi_lock region.

probably yes,

> > +	if (unlikely(p->flags & PF_EXITPIDONE)) {
> > +		/* exit_pi_state_list() was already called */
> >  		raw_spin_unlock_irq(&p->pi_lock);
> >  		put_task_struct(p);
> > -		return ret;
> > +		return -ESRCH;
>
> But this is incorrect, because we'd return -ESRCH to user space while the
> futex value still has the TID of the exiting task set, and that task will
> subsequently clean out the futex and set the owner-died bit.

Heh. Of course this is not correct. As I said, this patch should be adapted
to the current code. See below.

> See da791a667536 ("futex: Cure exit race") for example.

Thomas, I simply can't resist ;)

I reported this race when I sent this patch in 2015,

https://lore.kernel.org/lkml/20150205181014.GA20244@redhat.com/

but somehow that discussion died with no result.

> Guess why that code has more corner case handling than actual
> functionality. :)

I know why. To confuse me!

Oleg.
Thomas Gleixner Nov. 6, 2019, 11:07 a.m. UTC | #8
On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> On 11/06, Thomas Gleixner wrote:
> > > +	if (unlikely(p->flags & PF_EXITPIDONE)) {
> > > +		/* exit_pi_state_list() was already called */
> > >  		raw_spin_unlock_irq(&p->pi_lock);
> > >  		put_task_struct(p);
> > > -		return ret;
> > > +		return -ESRCH;
> >
> > But this is incorrect, because we'd return -ESRCH to user space while the
> > futex value still has the TID of the exiting task set, and that task will
> > subsequently clean out the futex and set the owner-died bit.
> 
> Heh. Of course this is not correct. As I said, this patch should be adapted
> to the current code. See below.
> 
> > See da791a667536 ("futex: Cure exit race") for example.
> 
> Thomas, I simply can't resist ;)
> 
> I reported this race when I sent this patch in 2015,
> 
> https://lore.kernel.org/lkml/20150205181014.GA20244@redhat.com/
> 
> but somehow that discussion died with no result.

Yes. I was not paying attention for some reason. Don't ask me what happened
in Feb. 2015 :)

But even if we adapt that patch to the current code it won't solve the
-ESRCH issue I described above.

> > Guess why that code has more corner case handling than actual
> > functionality. :)
> 
> I know why. To confuse me!

Of course. As Rusty said: "Futexes are also cursed"

Thanks,

	tglx
Oleg Nesterov Nov. 6, 2019, 12:11 p.m. UTC | #9
On 11/06, Thomas Gleixner wrote:
>
> But even if we adapt that patch to the current code it won't solve the
> -ESRCH issue I described above.

Hmm. I must have missed something. Why?

Please see the unfinished/untested patch below.

I think that (with or without this fix) the handle_exit_race() logic needs
cleanups; there is no reason for get_futex_value_locked(), and we can drop
->pi_lock right after we see PF_EXITPIDONE. Let's discuss this later.
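
For reference, handle_exit_race() currently looks roughly like this (a
sketch from memory, details may differ):

	static int handle_exit_race(u32 __user *uaddr, u32 uval,
				    struct task_struct *tsk)
	{
		u32 uval2;

		/* If PF_EXITPIDONE is not yet set, then try again. */
		if (tsk && !(tsk->flags & PF_EXITPIDONE))
			return -EAGAIN;

		/*
		 * Reread the user space value; exit_robust_list() may
		 * already have rewritten the futex word.
		 */
		if (get_futex_value_locked(&uval2, uaddr))
			return -EFAULT;

		if (uval2 != uval)
			return -EAGAIN;	/* value changed, retry */

		return -ESRCH;		/* owner gone, word is stale */
	}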

But why can we not rely on handle_exit_race() without the PF_EXITING check?

Yes, exit_robust_list() is called before exit_pi_state_list(), which (with
this patch) sets PF_EXITPIDONE. But this is fine? handle_futex_death()
doesn't wake up pi futexes; exit_pi_state_list() does this.

Could you explain?

Oleg.


diff --git a/kernel/exit.c b/kernel/exit.c
index a46a50d..a6c788c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -761,17 +761,6 @@ void __noreturn do_exit(long code)
 	}
 
 	exit_signals(tsk);  /* sets PF_EXITING */
-	/*
-	 * Ensure that all new tsk->pi_lock acquisitions must observe
-	 * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
-	 */
-	smp_mb();
-	/*
-	 * Ensure that we must observe the pi_state in exit_mm() ->
-	 * mm_release() -> exit_pi_state_list().
-	 */
-	raw_spin_lock_irq(&tsk->pi_lock);
-	raw_spin_unlock_irq(&tsk->pi_lock);
 
 	if (unlikely(in_atomic())) {
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -846,12 +835,6 @@ void __noreturn do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf531..0d8edcf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1297,8 +1297,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index bd18f60..c3b5d33 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -905,6 +905,7 @@ void exit_pi_state_list(struct task_struct *curr)
 	 * versus waiters unqueueing themselves:
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
+	curr->flags |= PF_EXITPIDONE;
 	while (!list_empty(head)) {
 		next = head->next;
 		pi_state = list_entry(next, struct futex_pi_state, list);
@@ -1169,18 +1170,11 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 	return ret;
 }
 
-static int handle_exit_race(u32 __user *uaddr, u32 uval,
-			    struct task_struct *tsk)
+static int handle_exit_race(u32 __user *uaddr, u32 uval)
 {
 	u32 uval2;
 
 	/*
-	 * If PF_EXITPIDONE is not yet set, then try again.
-	 */
-	if (tsk && !(tsk->flags & PF_EXITPIDONE))
-		return -EAGAIN;
-
-	/*
 	 * Reread the user space value to handle the following situation:
 	 *
 	 * CPU0				CPU1
@@ -1245,7 +1239,7 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
 	if (!p)
-		return handle_exit_race(uaddr, uval, NULL);
+		return handle_exit_race(uaddr, uval);
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
@@ -1259,13 +1253,13 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 	 * p->pi_lock:
 	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
 		/*
 		 * The task is on the way out. When PF_EXITPIDONE is
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = handle_exit_race(uaddr, uval, p);
+		int ret = handle_exit_race(uaddr, uval);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
Thomas Gleixner Nov. 6, 2019, 1:38 p.m. UTC | #10
On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> But why can we not rely on handle_exit_race() without the PF_EXITING check?
> 
> Yes, exit_robust_list() is called before exit_pi_state_list(), which (with
> this patch) sets PF_EXITPIDONE. But this is fine? handle_futex_death()
> doesn't wake up pi futexes; exit_pi_state_list() does this.

I know. You still create inconsistent state because of this:

>  	raw_spin_lock_irq(&p->pi_lock);
> -	if (unlikely(p->flags & PF_EXITING)) {
> +	if (unlikely(p->flags & PF_EXITPIDONE)) {
>  		/*
>  		 * The task is on the way out. When PF_EXITPIDONE is
>  		 * set, we know that the task has finished the
>  		 * cleanup:
>  		 */
> -		int ret = handle_exit_race(uaddr, uval, p);
> +		int ret = handle_exit_race(uaddr, uval);
>  
>  		raw_spin_unlock_irq(&p->pi_lock);
>  		put_task_struct(p);

Same explanation as before, just not prose this time:

exit()					lock_pi(futex2)
  exit_pi_state_list()
   lock(tsk->pi_lock)
   tsk->flags |= PF_EXITPIDONE;		 attach_to_pi_owner()
					  ...
  // Loop unrolled for clarity
  while(!list_empty())			  lock(tsk->pi_lock);
     cleanup(futex1)
       unlock(tsk->pi_lock)
     ...			          if (tsk->flags & PF_EXITPIDONE)
					     ret = handle_exit_race()
					       if (uval != orig_uval)
					           return -EAGAIN;
					       return -ESRCH;

    cleanup(futex2)			  return to userspace err = -ESRCH
      update futex2->uval
      with new owner TID and set
      OWNER_DIED

    					 userspace handles -ESRCH

					 but futex2->uval has a valid TID
					 and OWNER_DIED set.

That's inconsistent state; the futex became a SNAFUtex, and user space
cannot recover from that. At least not existing user space, and we cannot
break that, right?

If the kernel returns -ESRCH then the futex value must be preserved (except
for the waiters bit being set, but that's not part of the problem at hand).

You cannot just look at the kernel state with futexes. We have to guarantee
consistent state between kernel and user space no matter what. And of
course we have to be careful about user space creating inconsistent state
for stupid or malicious reasons. See the whole state table above
attach_to_pi_state().

The only way to fix this live lock issue is to gracefully wait for the
cleanup to complete instead of busy looping.

Yes, it sucks, but futexes suck by definition.
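
Roughly, the attach side of such a graceful wait could look like this (a
sketch along the lines of the futex_exit_mutex scheme above, not actual
kernel code):

	static int wait_for_futex_exit(struct task_struct *exiting)
	{
		/*
		 * Sleeping on the exit mutex lets even a lower priority
		 * owner finish its cleanup; the RT waiter no longer
		 * monopolizes the CPU.
		 */
		mutex_lock(&exiting->futex_exit_mutex);
		mutex_unlock(&exiting->futex_exit_mutex);
		put_task_struct(exiting);
		return -EAGAIN;	/* retry; the cleanup is done by now */
	}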

Thanks,

	tglx
Thomas Gleixner Nov. 6, 2019, 5:42 p.m. UTC | #11
On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> 
> I think that (with or without this fix) the handle_exit_race() logic needs
> cleanups; there is no reason for get_futex_value_locked(), and we can drop
> ->pi_lock right after we see PF_EXITPIDONE. Let's discuss this later.

Which is still in atomic context because the hash bucket lock is held, ergo
get_futex_value_locked() needs to stay for now.

So the only thing we could do is to reduce the pi_lock held section a bit.
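
For reference, get_futex_value_locked() is roughly (a sketch, details may
differ):

	static int get_futex_value_locked(u32 *dest, u32 __user *from)
	{
		int ret;

		pagefault_disable();	/* hb lock held, must not sleep */
		ret = __get_user(*dest, from);
		pagefault_enable();

		return ret ? -EFAULT : 0;
	}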

Thanks,

	tglx
Oleg Nesterov Nov. 7, 2019, 3:51 p.m. UTC | #12
On 11/06, Thomas Gleixner wrote:
>
> On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> >
> > I think that (with or without this fix) the handle_exit_race() logic needs
> > cleanups; there is no reason for get_futex_value_locked(), and we can drop
> > ->pi_lock right after we see PF_EXITPIDONE. Let's discuss this later.
>
> Which is still in atomic context because the hash bucket lock is held, ergo
> get_futex_value_locked() needs to stay for now.

Indeed, you are right.

> Same explanation as before, just not prose this time:
>
> exit()					lock_pi(futex2)
>   exit_pi_state_list()
>    lock(tsk->pi_lock)
>    tsk->flags |= PF_EXITPIDONE;		 attach_to_pi_owner()
> 					  ...
>   // Loop unrolled for clarity
>   while(!list_empty())			  lock(tsk->pi_lock);
>      cleanup(futex1)
>        unlock(tsk->pi_lock)
         ^^^^^^^^^^^^^^^^^^^^
Ah! Thanks.


Hmm. In particular, exit_pi_state() drops pi_lock if refcount_inc_not_zero() fails.

Isn't this another potential source of livelock?

Suppose that a realtime lock owner X sleeps somewhere while another task T
calls put_pi_state() and refcount_dec_and_test() succeeds.

What if, say, X is killed right after that and preempts T on the same
CPU?

Oleg.
Patch

--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -2842,10 +2842,12 @@  static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 *   exit to complete.
 			 * - The user space value changed.
 			 */
-			queue_unlock(hb);
-			put_futex_key(&q.key);
-			cond_resched();
-			goto retry;
+			if (!fatal_signal_pending(current)) {
+				queue_unlock(hb);
+				put_futex_key(&q.key);
+				cond_resched();
+				goto retry;
+			}
 		default:
 			goto out_unlock_put_key;
 		}