Patchwork Tasks stuck in futex code (in 3.14-rc6)

login
register
mail settings
Submitter Peter Zijlstra
Date March 19, 2014, 5:08 p.m.
Message ID <20140319170829.GD8557@laptop.programming.kicks-ass.net>
Download mbox | patch
Permalink /patch/331798/
State Not Applicable
Headers show

Comments

Peter Zijlstra - March 19, 2014, 5:08 p.m.
On Wed, Mar 19, 2014 at 04:47:05PM +0100, Peter Zijlstra wrote:
> > I reverted b0c29f79ecea0b6fbcefc999e70f2843ae8306db on top of v3.14-rc6 and confirmed that
> > reverting the commit solved the problem.
> 
> Joy,.. let me look at that with ppc in mind.

OK; so while pretty much all the comments from that patch are utter
nonsense (what was I thinking), I cannot actually find a real bug.

But could you try the below which replaces a control dependency with a
full barrier. The control flow is plenty convoluted that I think the
control barrier isn't actually valid anymore and that might indeed
explain the fail.
Davidlohr Bueso - March 19, 2014, 6:06 p.m.
On Wed, 2014-03-19 at 18:08 +0100, Peter Zijlstra wrote:
> On Wed, Mar 19, 2014 at 04:47:05PM +0100, Peter Zijlstra wrote:
> > > I reverted b0c29f79ecea0b6fbcefc999e70f2843ae8306db on top of v3.14-rc6 and confirmed that
> > > reverting the commit solved the problem.
> > 
> > Joy,.. let me look at that with ppc in mind.

errr... just sat down to check email this morning. CC'ing Paul as for
any subtle barrier issues.
tulasidhar Dronamraju - March 20, 2014, 5:33 a.m.
> > Joy,.. let me look at that with ppc in mind.
> 
> OK; so while pretty much all the comments from that patch are utter
> nonsense (what was I thinking), I cannot actually find a real bug.
> 
> But could you try the below which replaces a control dependency with a
> full barrier. The control flow is plenty convoluted that I think the
> control barrier isn't actually valid anymore and that might indeed
> explain the fail.
> 

Unfortunately the patch didnt help. Still seeing tasks stuck

# ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
14680 pts/0    root     java         - 0 futex_wait_queue_me
14797 pts/0    root     java         - 0 futex_wait_queue_me
# :> /var/log/messages
# echo t > /proc/sysrq-trigger 
# grep futex_wait_queue_me /var/log/messages | wc -l 
334
#

[ 6904.211478] Call Trace:
[ 6904.211481] [c000000fa1f1b4d0] [0000000000000020] 0x20 (unreliable)
[ 6904.211486] [c000000fa1f1b6a0] [c000000000015208] .__switch_to+0x1e8/0x330
[ 6904.211491] [c000000fa1f1b750] [c000000000702f00] .__schedule+0x360/0x8b0
[ 6904.211495] [c000000fa1f1b9d0] [c000000000147348] .futex_wait_queue_me+0xf8/0x1a0
[ 6904.211500] [c000000fa1f1ba60] [c0000000001486dc] .futex_wait+0x17c/0x2a0
[ 6904.211505] [c000000fa1f1bc10] [c00000000014a614] .do_futex+0x254/0xd80
[ 6904.211510] [c000000fa1f1bd60] [c00000000014b25c] .SyS_futex+0x11c/0x1d0
[ 6904.238874] [c000000fa1f1be30] [c00000000000a0fc] syscall_exit+0x0/0x7c
[ 6904.238879] java            S 00003fff825f6044     0 14682  14076 0x00000080

Is there any other information that I provide that can help?
Davidlohr Bueso - March 20, 2014, 5:56 a.m.
On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote:
> > > Joy,.. let me look at that with ppc in mind.
> > 
> > OK; so while pretty much all the comments from that patch are utter
> > nonsense (what was I thinking), I cannot actually find a real bug.
> > 
> > But could you try the below which replaces a control dependency with a
> > full barrier. The control flow is plenty convoluted that I think the
> > control barrier isn't actually valid anymore and that might indeed
> > explain the fail.
> > 
> 
> Unfortunately the patch didnt help. Still seeing tasks stuck
> 
> # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
> 14680 pts/0    root     java         - 0 futex_wait_queue_me
> 14797 pts/0    root     java         - 0 futex_wait_queue_me
> # :> /var/log/messages
> # echo t > /proc/sysrq-trigger 
> # grep futex_wait_queue_me /var/log/messages | wc -l 
> 334
> #
> 
> [ 6904.211478] Call Trace:
> [ 6904.211481] [c000000fa1f1b4d0] [0000000000000020] 0x20 (unreliable)
> [ 6904.211486] [c000000fa1f1b6a0] [c000000000015208] .__switch_to+0x1e8/0x330
> [ 6904.211491] [c000000fa1f1b750] [c000000000702f00] .__schedule+0x360/0x8b0
> [ 6904.211495] [c000000fa1f1b9d0] [c000000000147348] .futex_wait_queue_me+0xf8/0x1a0
> [ 6904.211500] [c000000fa1f1ba60] [c0000000001486dc] .futex_wait+0x17c/0x2a0
> [ 6904.211505] [c000000fa1f1bc10] [c00000000014a614] .do_futex+0x254/0xd80
> [ 6904.211510] [c000000fa1f1bd60] [c00000000014b25c] .SyS_futex+0x11c/0x1d0
> [ 6904.238874] [c000000fa1f1be30] [c00000000000a0fc] syscall_exit+0x0/0x7c
> [ 6904.238879] java            S 00003fff825f6044     0 14682  14076 0x00000080
> 
> Is there any other information that I provide that can help?

This problem suggests that we missed a wakeup for a task that was adding
itself to the queue in a wait path. And the only place that can happen
is with the hb spinlock check for any pending waiters. Just in case we
missed some assumption about checking the hash bucket spinlock as a way
of detecting any waiters (powerpc?), could you revert this commit and
try the original atomic operations variant:

https://lkml.org/lkml/2013/12/19/630
Peter Zijlstra - March 20, 2014, 7:23 a.m.
On Thu, Mar 20, 2014 at 11:03:50AM +0530, Srikar Dronamraju wrote:
> > > Joy,.. let me look at that with ppc in mind.
> > 
> > OK; so while pretty much all the comments from that patch are utter
> > nonsense (what was I thinking), I cannot actually find a real bug.
> > 
> > But could you try the below which replaces a control dependency with a
> > full barrier. The control flow is plenty convoluted that I think the
> > control barrier isn't actually valid anymore and that might indeed
> > explain the fail.
> > 
> 
> Unfortunately the patch didnt help. Still seeing tasks stuck

Aww bugger. I'll be traveling tomorrow and today is wasted getting
ready. So unless Davidlohr has anything we'll need to scrap this change.
tulasidhar Dronamraju - March 20, 2014, 10:08 a.m.
> This problem suggests that we missed a wakeup for a task that was adding
> itself to the queue in a wait path. And the only place that can happen
> is with the hb spinlock check for any pending waiters. Just in case we
> missed some assumption about checking the hash bucket spinlock as a way
> of detecting any waiters (powerpc?), could you revert this commit and
> try the original atomic operations variant:
> 
> https://lkml.org/lkml/2013/12/19/630

I think the above url and the commit id that I reverted i.e
git://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999
are the same.

Or am I missing something?
Davidlohr Bueso - March 20, 2014, 4:31 p.m.
On Wed, 2014-03-19 at 22:56 -0700, Davidlohr Bueso wrote:
> On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote:
> > > > Joy,.. let me look at that with ppc in mind.
> > > 
> > > OK; so while pretty much all the comments from that patch are utter
> > > nonsense (what was I thinking), I cannot actually find a real bug.
> > > 
> > > But could you try the below which replaces a control dependency with a
> > > full barrier. The control flow is plenty convoluted that I think the
> > > control barrier isn't actually valid anymore and that might indeed
> > > explain the fail.
> > > 
> > 
> > Unfortunately the patch didnt help. Still seeing tasks stuck
> > 
> > # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
> > 14680 pts/0    root     java         - 0 futex_wait_queue_me
> > 14797 pts/0    root     java         - 0 futex_wait_queue_me
> > # :> /var/log/messages
> > # echo t > /proc/sysrq-trigger 
> > # grep futex_wait_queue_me /var/log/messages | wc -l 
> > 334
> > #
> > 
> > [ 6904.211478] Call Trace:
> > [ 6904.211481] [c000000fa1f1b4d0] [0000000000000020] 0x20 (unreliable)
> > [ 6904.211486] [c000000fa1f1b6a0] [c000000000015208] .__switch_to+0x1e8/0x330
> > [ 6904.211491] [c000000fa1f1b750] [c000000000702f00] .__schedule+0x360/0x8b0
> > [ 6904.211495] [c000000fa1f1b9d0] [c000000000147348] .futex_wait_queue_me+0xf8/0x1a0
> > [ 6904.211500] [c000000fa1f1ba60] [c0000000001486dc] .futex_wait+0x17c/0x2a0
> > [ 6904.211505] [c000000fa1f1bc10] [c00000000014a614] .do_futex+0x254/0xd80
> > [ 6904.211510] [c000000fa1f1bd60] [c00000000014b25c] .SyS_futex+0x11c/0x1d0
> > [ 6904.238874] [c000000fa1f1be30] [c00000000000a0fc] syscall_exit+0x0/0x7c
> > [ 6904.238879] java            S 00003fff825f6044     0 14682  14076 0x00000080
> > 
> > Is there any other information that I provide that can help?
> 
> This problem suggests that we missed a wakeup for a task that was adding
> itself to the queue in a wait path. And the only place that can happen
> is with the hb spinlock check for any pending waiters. Just in case we
> missed some assumption about checking the hash bucket spinlock as a way
> of detecting any waiters (powerpc?), could you revert this commit and
> try the original atomic operations variant:
> 
> https://lkml.org/lkml/2013/12/19/630

hmmm looking at ppc spinlock code, it seems that it doesn't have ticket
spinlocks -- in fact Torsten Duwe has been trying to get them upstream
very recently. Since we rely on the counter for detecting waiters, this
might explain the issue. Could someone confirm this spinlock
implementation difference?
Benjamin Herrenschmidt - March 20, 2014, 8:23 p.m.
On Thu, 2014-03-20 at 09:31 -0700, Davidlohr Bueso wrote:
> hmmm looking at ppc spinlock code, it seems that it doesn't have ticket
> spinlocks -- in fact Torsten Duwe has been trying to get them upstream
> very recently. Since we rely on the counter for detecting waiters, this
> might explain the issue. Could someone confirm this spinlock
> implementation difference? 

Indeed. I haven't merged ticket locks because they break lockref :-(

We have a problem here because we need to store the lock holder so we
can yield to the lock holder partition on contention and we are running
out of space in the spinlock.

The lock holder doesn't have to be atomic, so in theory we could have
the tickets and the lockref in the same 64-bit and the holder separately
but the way the layers are stacked at the moment that's not workable,
at least not without duplicating the whole lockref implementation and
breaking the spinlock in two, a "base" lock without older and the separate
variant with holder field. A mess...

I want to try sorting that out at some stage but haven't had a chance yet.

Cheers,
Ben.

Patch

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -119,42 +119,32 @@ 
  * sys_futex(WAIT, futex, val);
  *   futex_wait(futex, val);
  *
- *   waiters++;
- *   mb(); (A) <-- paired with -.
- *                              |
- *   lock(hash_bucket(futex));  |
- *                              |
- *   uval = *futex;             |
- *                              |        *futex = newval;
- *                              |        sys_futex(WAKE, futex);
- *                              |          futex_wake(futex);
- *                              |
- *                              `------->  mb(); (B)
- *   if (uval == val)
+ *
+ *   lock(hash_bucket(futex)); (A)
+ *
+ *   uval = *futex;
+ *                                       *futex = newval;
+ *                                       sys_futex(WAKE, futex);
+ *                                         futex_wake(futex);
+ *
+ *   if (uval == val) (B)		   smp_mb(); (D)
  *     queue();
- *     unlock(hash_bucket(futex));
- *     schedule();                         if (waiters)
+ *     unlock(hash_bucket(futex)); (C)
+ *     schedule();                         if (spin_is_locked(&hb_lock) ||
+ *					       (smp_rmb(), !plist_empty))) (E)
  *                                           lock(hash_bucket(futex));
  *                                           wake_waiters(futex);
  *                                           unlock(hash_bucket(futex));
  *
- * Where (A) orders the waiters increment and the futex value read -- this
- * is guaranteed by the head counter in the hb spinlock; and where (B)
- * orders the write to futex and the waiters read -- this is done by the
- * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
- * depending on the futex type.
- *
- * This yields the following case (where X:=waiters, Y:=futex):
- *
- *	X = Y = 0
- *
- *	w[X]=1		w[Y]=1
- *	MB		MB
- *	r[Y]=y		r[X]=x
- *
- * Which guarantees that x==0 && y==0 is impossible; which translates back into
- * the guarantee that we cannot both miss the futex variable change and the
- * enqueue.
+ *
+ * Because of the acquire (A) and release (C) the futex value load and the 
+ * plist_add are guaranteed to be inside the locked region. Furthermore, the
+ * control dependency (B) ensures the futex load happens before the plist_add().
+ *
+ * On the wakeup side, the full barrier (D) separates the futex value write
+ * from the hb_lock load, and matches with the control dependency. The rmb (E)
+ * separates the spin_is_locked() read and the plist_head_empty() read, such
+ * that ..., matches with the release barrier (C).
  */
 
 #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
@@ -250,7 +240,7 @@  static inline void futex_get_mm(union fu
 	/*
 	 * Ensure futex_get_mm() implies a full barrier such that
 	 * get_futex_key() implies a full barrier. This is relied upon
-	 * as full barrier (B), see the ordering comment above.
+	 * as full barrier (D), see the ordering comment above.
 	 */
 	smp_mb__after_atomic();
 }
@@ -308,10 +298,10 @@  static void get_futex_key_refs(union fut
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode); /* implies MB (B) */
+		ihold(key->shared.inode); /* implies MB (D) */
 		break;
 	case FUT_OFF_MMSHARED:
-		futex_get_mm(key); /* implies MB (B) */
+		futex_get_mm(key); /* implies MB (D) */
 		break;
 	}
 }
@@ -385,7 +375,7 @@  get_futex_key(u32 __user *uaddr, int fsh
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);  /* implies MB (B) */
+		get_futex_key_refs(key);  /* implies MB (D) */
 		return 0;
 	}
 
@@ -492,7 +482,7 @@  get_futex_key(u32 __user *uaddr, int fsh
 		key->shared.pgoff = basepage_index(page);
 	}
 
-	get_futex_key_refs(key); /* implies MB (B) */
+	get_futex_key_refs(key); /* implies MB (D) */
 
 out:
 	unlock_page(page_head);
@@ -1604,7 +1594,7 @@  static inline struct futex_hash_bucket *
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock); /* implies MB (A) */
+	spin_lock(&hb->lock);
 	return hb;
 }
 
@@ -1995,6 +1985,8 @@  static int futex_wait_setup(u32 __user *
 		goto retry;
 	}
 
+	smp_mb();
+
 	if (uval != val) {
 		queue_unlock(*hb);
 		ret = -EWOULDBLOCK;