
mm: slub: Ensure that slab_unlock() is atomic

Message ID 20160309101349.GJ6344@twins.programming.kicks-ass.net
State Superseded

Commit Message

Peter Zijlstra March 9, 2016, 10:13 a.m. UTC
On Wed, Mar 09, 2016 at 12:13:16PM +0530, Vineet Gupta wrote:
> +CC linux-arch, parisc folks, PeterZ
> 
> On Wednesday 09 March 2016 02:10 AM, Christoph Lameter wrote:
> > On Tue, 8 Mar 2016, Vineet Gupta wrote:
> > 
> >> # set the bit
> >> 80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
> >> 80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
> >> 80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)
> > 
> > Duh. Guess you need to take the spinlock also in the arch-specific
> > implementation of __bit_spin_unlock(). This is certainly not the only case
> > in which we use the __ op to unlock.
> 
> __bit_spin_lock() by definition is *not* required to be atomic, bit_spin_lock() is
> - so I don't think we need a spinlock there.

Agreed. The double-underscore prefixed ops are not required to be
atomic in any way, shape or form.

> There is clearly a problem in slub code that it is pairing a test_and_set_bit()
> with a __clear_bit(). Latter can obviously clobber former if they are not a single
> instruction each unlike x86 or they use llock/scond kind of instructions where the
> interim store from other core is detected and causes a retry of whole llock/scond
> sequence.

Yes, test_and_set_bit() + __clear_bit() is broken.
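
To make the failure concrete, here is a userspace sketch of that pairing.
It is illustrative only: the emulated_* names are made up, and the pthread
spinlock stands in for whatever a !LLSC arch (such as ARC here) uses
internally to make test_and_set_bit() atomic; pthread_spin_init() is omitted.

#include <pthread.h>

static unsigned long page_flags;            /* stands in for page->flags */
static pthread_spinlock_t bitops_lock;      /* arch-internal lock        */

static int emulated_test_and_set_bit(unsigned long nr, unsigned long *addr)
{
        unsigned long old;

        pthread_spin_lock(&bitops_lock);
        old = *addr;                        /* (A) read the whole word   */
        *addr = old | (1UL << nr);          /* (C) write back old | bit  */
        pthread_spin_unlock(&bitops_lock);

        return (old >> nr) & 1;
}

static void emulated___clear_bit(unsigned long nr, unsigned long *addr)
{
        *addr &= ~(1UL << nr);              /* (B) plain RMW, no lock    */
}

If the plain store of emulated___clear_bit() on one CPU lands between (A)
and (C) on another CPU, (C) writes back the stale value with the lock bit
still set and the unlock is lost -- the (A)/(B)/(C) sequence from the
disassembly quoted above.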

> > If you take the lock in __bit_spin_unlock
> > then the race cannot happen.
> 
> Of course it won't, but that means we penalize all non-atomic callers of the API
> with a superfluous spinlock which is not required in the first place, given the
> definition of the API.

Quite. _However_, your arch is still broken, though not through any fault
of yours. It's the asm-generic code that is wrong.

The thing is that __bit_spin_unlock() uses __clear_bit_unlock(), which
defaults to __clear_bit(). Which is wrong.

---
Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()

__clear_bit_unlock() is a special little snowflake. While it carries the
non-atomic '__' prefix, it is specifically documented to pair with
test_and_set_bit() and therefore should be 'somewhat' atomic.

Therefore the generic implementation of __clear_bit_unlock() cannot use
the fully non-atomic __clear_bit() as a default.

If an arch is able to do better, it must provide an implementation of
__clear_bit_unlock() itself.

Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/bitops/lock.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Comments

Peter Zijlstra March 9, 2016, 10:31 a.m. UTC | #1
On Wed, Mar 09, 2016 at 11:13:49AM +0100, Peter Zijlstra wrote:
> ---
> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
> 
> __clear_bit_unlock() is a special little snowflake. While it carries the
> non-atomic '__' prefix, it is specifically documented to pair with
> test_and_set_bit() and therefore should be 'somewhat' atomic.
> 
> Therefore the generic implementation of __clear_bit_unlock() cannot use
> the fully non-atomic __clear_bit() as a default.
> 
> If an arch is able to do better, it must provide an implementation of
> __clear_bit_unlock() itself.
> 

FWIW, we could probably undo this if we unified all the spinlock based
atomic ops implementations (there's a whole bunch doing essentially the
same), and special cased __clear_bit_unlock() for that.

Collapsing them is probably a good idea anyway, just a fair bit of
non-trivial work to figure out all the differences and if they matter
etc..
Vineet Gupta March 9, 2016, 11 a.m. UTC | #2
On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
>>> If you take the lock in __bit_spin_unlock
>>> then the race cannot happen.
>>
>> Of course it won't, but that means we penalize all non-atomic callers of the API
>> with a superfluous spinlock which is not required in the first place, given the
>> definition of the API.
> 
> Quite. _However_, your arch is still broken, though not through any fault
> of yours. It's the asm-generic code that is wrong.
> 
> The thing is that __bit_spin_unlock() uses __clear_bit_unlock(), which
> defaults to __clear_bit(). Which is wrong.
> 
> ---
> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
> 
> __clear_bit_unlock() is a special little snowflake. While it carries the
> non-atomic '__' prefix, it is specifically documented to pair with
> test_and_set_bit() and therefore should be 'somewhat' atomic.
> 
> Therefore the generic implementation of __clear_bit_unlock() cannot use
> the fully non-atomic __clear_bit() as a default.
> 
> If an arch is able to do better, it must provide an implementation of
> __clear_bit_unlock() itself.
> 
> Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>

This needs to be CCed stable as it fixes a real bug for ARC.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>

FWIW, could we add some background to commit log, specifically what prompted this.
Something like below...

---->8------
This came up as a result of hackbench livelock'ing in slab_lock() on ARC with SMP
+ SLUB + !LLSC.

The issue was incorrect pairing of atomic ops.

slab_lock() -> bit_spin_lock() -> test_and_set_bit()
slab_unlock() -> __bit_spin_unlock() -> __clear_bit()

The non-serializing __clear_bit() was getting "lost":

80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)

Fixes ARC STAR 9000817404
---->8------

> ---
>  include/asm-generic/bitops/lock.h | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
> index c30266e94806..8ef0ccbf8167 100644
> --- a/include/asm-generic/bitops/lock.h
> +++ b/include/asm-generic/bitops/lock.h
> @@ -29,16 +29,16 @@ do {					\
>   * @nr: the bit to set
>   * @addr: the address to start counting from
>   *
> - * This operation is like clear_bit_unlock, however it is not atomic.
> - * It does provide release barrier semantics so it can be used to unlock
> - * a bit lock, however it would only be used if no other CPU can modify
> - * any bits in the memory until the lock is released (a good example is
> - * if the bit lock itself protects access to the other bits in the word).
> + * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
> + * the bits in the word are protected by this lock some archs can use weaker
> + * ops to safely unlock.
> + *
> + * See for example x86's implementation.
>   */

To be able to override / use the generic version, don't we need an #ifndef ... guard? (A sketch of that idiom is at the end of this mail.)

>  #define __clear_bit_unlock(nr, addr)	\
>  do {					\
> -	smp_mb();			\
> -	__clear_bit(nr, addr);		\
> +	smp_mb__before_atomic();	\
> +	clear_bit(nr, addr);		\
>  } while (0)
>  
>  #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */
>
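
Re the #ifndef question above, the override idiom would typically look
something like this sketch (an illustration of the guard pattern only, not
a quote of the actual header):

/*
 * Sketch: guard the generic definition so an arch header included
 * earlier can supply its own (weaker but still safe) version first.
 */
#ifndef __clear_bit_unlock
#define __clear_bit_unlock(nr, addr)    \
do {                                    \
        smp_mb__before_atomic();        \
        clear_bit(nr, addr);            \
} while (0)
#endif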
Vineet Gupta March 9, 2016, 11:12 a.m. UTC | #3
On Wednesday 09 March 2016 04:01 PM, Peter Zijlstra wrote:
> On Wed, Mar 09, 2016 at 11:13:49AM +0100, Peter Zijlstra wrote:
>> ---
>> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
>>
>> __clear_bit_unlock() is a special little snowflake. While it carries the
>> non-atomic '__' prefix, it is specifically documented to pair with
>> test_and_set_bit() and therefore should be 'somewhat' atomic.
>>
>> Therefore the generic implementation of __clear_bit_unlock() cannot use
>> the fully non-atomic __clear_bit() as a default.
>>
>> If an arch is able to do better, it must provide an implementation of
>> __clear_bit_unlock() itself.
>>
> 
> FWIW, we could probably undo this if we unified all the spinlock based
> atomic ops implementations (there's a whole bunch doing essentially the
> same), and special cased __clear_bit_unlock() for that.
> 
> Collapsing them is probably a good idea anyway, just a fair bit of
> non-trivial work to figure out all the differences and if they matter
> etc..

Indeed, I thought about this when we first did the SMP port. The only issue was
somewhat more generated code with the hashed spinlocks (vs. my dumb 2 spinlocks,
which as I now see will also cause false sharing, likely ending up in the same
cache line), but I was more of a micro-optimization freak then than I am now :-)

So yeah I agree !
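
For context, the spinlock-backed fallback being discussed looks roughly like
the sketch below (userspace flavour; the hash, table size and names are made
up for illustration):

#include <pthread.h>

#define NR_BITOP_LOCKS  16                  /* arbitrary for the sketch */

static pthread_spinlock_t bitop_locks[NR_BITOP_LOCKS];

static pthread_spinlock_t *bitop_lock_for(unsigned long *addr)
{
        /* hash the word address onto one of the locks so unrelated
         * words rarely contend
         */
        return &bitop_locks[((unsigned long)addr / sizeof(long)) % NR_BITOP_LOCKS];
}

static void hashed_clear_bit(unsigned long nr, unsigned long *addr)
{
        pthread_spinlock_t *lock = bitop_lock_for(addr);

        pthread_spin_lock(lock);            /* serializes every atomic  */
        *addr &= ~(1UL << nr);              /* bitop on this word, so   */
        pthread_spin_unlock(lock);          /* the plain RMW is safe    */
}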
Vineet Gupta March 9, 2016, 1:22 p.m. UTC | #4
On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
>> There is clearly a problem in slub code that it is pairing a test_and_set_bit()
>> with a __clear_bit(). Latter can obviously clobber former if they are not a single
>> instruction each unlike x86 or they use llock/scond kind of instructions where the
>> interim store from other core is detected and causes a retry of whole llock/scond
>> sequence.
> 
> Yes, test_and_set_bit() + __clear_bit() is broken.

But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
(ignoring the performance thing for discussion sake, which is a side effect of
this implementation).

So despite the comment below in bit_spinlock.h I don't quite comprehend how this
is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
atomic lock with non-atomic unlock ? I'm missing something !

| /*
|  *  bit-based spin_unlock()
|  *  non-atomic version, which can be used eg. if the bit lock itself is
|  *  protecting the rest of the flags in the word.
|  */
| static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
Peter Zijlstra March 9, 2016, 2:51 p.m. UTC | #5
On Wed, Mar 09, 2016 at 06:52:45PM +0530, Vineet Gupta wrote:
> On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
> >> There is clearly a problem in slub code that it is pairing a test_and_set_bit()
> >> with a __clear_bit(). Latter can obviously clobber former if they are not a single
> >> instruction each unlike x86 or they use llock/scond kind of instructions where the
> >> interim store from other core is detected and causes a retry of whole llock/scond
> >> sequence.
> > 
> > Yes, test_and_set_bit() + __clear_bit() is broken.
> 
> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
> (ignoring the performance thing for discussion sake, which is a side effect of
> this implementation).

The short answer is: Per definition. They are defined to work together,
which is what makes __clear_bit_unlock() such a special function.

> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
> atomic lock with non-atomic unlock ? I'm missing something !

x86 (and others) do in fact use non-atomic instructions for
spin_unlock(). But as this is all arch specific, we can make these
assumptions. It's just that generic code cannot rely on it.

So let me try and explain.


The problem as identified is:

CPU0						CPU1

bit_spin_lock()					__bit_spin_unlock()
1:
	/* fetch_or, r1 holds the old value */
	spin_lock
	load	r1, addr
						load	r1, addr
						bclr	r2, r1, 1
						store	r2, addr
	or	r2, r1, 1
	store	r2, addr	/* lost the store from CPU1 */
	spin_unlock

	and	r1, 1
	bnz	2	/* it was set, go wait */
	ret

2:
	load	r1, addr
	and	r1, 1
	bnz	2	/* wait until its not set */

	b	1	/* try again */



For LL/SC we replace:

	spin_lock
	load	r1, addr

	...

	store	r2, addr
	spin_unlock

With the (obvious):

1:
	load-locked	r1, addr

	...

	store-cond	r2, addr
	bnz		1 /* or whatever branch instruction is required to retry */


In this case the failure cannot happen, because the store from CPU1
would have invalidated the lock from CPU0 and caused the
store-cond to fail and retry the loop, observing the new value.
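
Expressed in C rather than pseudo-assembly, the LL/SC retry loop above
corresponds roughly to a compare-and-swap loop. A userspace sketch with C11
atomics (illustrative only, the name is made up):

#include <stdatomic.h>

static int llsc_style_test_and_set_bit(unsigned long nr,
                                       _Atomic unsigned long *addr)
{
        unsigned long old, new;

        do {
                old = atomic_load_explicit(addr, memory_order_relaxed);
                new = old | (1UL << nr);
                /* If the word changed between the load and the CAS
                 * (e.g. CPU1's unlock cleared the bit), the CAS fails
                 * and we retry with the fresh value, so the concurrent
                 * unlock is never lost.
                 */
        } while (!atomic_compare_exchange_weak_explicit(addr, &old, new,
                                                        memory_order_acquire,
                                                        memory_order_relaxed));

        return (old >> nr) & 1;
}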
Vineet Gupta March 10, 2016, 5:51 a.m. UTC | #6
On Wednesday 09 March 2016 08:21 PM, Peter Zijlstra wrote:
>> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
>> (ignoring the performance thing for discussion sake, which is a side effect of
>> this implementation).
> 
> The short answer is: Per definition. They are defined to work together,
> which is what makes __clear_bit_unlock() such a special function.
> 
>> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
>> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
>> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
>> atomic lock with non-atomic unlock ? I'm missing something !
> 
> x86 (and others) do in fact use non-atomic instructions for
> spin_unlock(). But as this is all arch specific, we can make these
> assumptions. It's just that generic code cannot rely on it.

OK despite being obvious now, I was not seeing the similarity between spin_*lock()
and bit_spin*lock() :-(

ARC also uses standard ST for spin_unlock() so by analogy __bit_spin_unlock() (for
LLSC case) would be correctly paired with bit_spin_lock().

But then why would anyone need bit_spin_unlock() at all? Especially after this
patch from you, which tightens __bit_spin_unlock() even more for the general case.

Thing is, if the API exists, the majority of people would use the more
conservative version w/o understanding all these nuances. Can we pursue the path of
moving bit_spin_unlock() over to __bit_spin_unlock(): first changing the backend
only, and if proven stable, replacing the call-sites themselves?

> 
> So let me try and explain.
> 
> 
> The problem as identified is:
> 
> CPU0						CPU1
> 
> bit_spin_lock()					__bit_spin_unlock()
> 1:
> 	/* fetch_or, r1 holds the old value */
> 	spin_lock
> 	load	r1, addr
> 						load	r1, addr
> 						bclr	r2, r1, 1
> 						store	r2, addr
> 	or	r2, r1, 1
> 	store	r2, addr	/* lost the store from CPU1 */
> 	spin_unlock
> 
> 	and	r1, 1
> 	bnz	2	/* it was set, go wait */
> 	ret
> 
> 2:
> 	load	r1, addr
> 	and	r1, 1
> 	bnz	2	/* wait until its not set */
> 
> 	b	1	/* try again */
> 
> 
> 
> For LL/SC we replace:
> 
> 	spin_lock
> 	load	r1, addr
> 
> 	...
> 
> 	store	r2, addr
> 	spin_unlock
> 
> With the (obvious):
> 
> 1:
> 	load-locked	r1, addr
> 
> 	...
> 
> 	store-cond	r2, addr
> 	bnz		1 /* or whatever branch instruction is required to retry */
> 
> 
> In this case the failure cannot happen, because the store from CPU1
> would have invalidated the lock from CPU0 and caused the
> store-cond to fail and retry the loop, observing the new value. 

You did it again, A picture is worth thousand words !

Thx,
-Vineet
Peter Zijlstra March 10, 2016, 9:10 a.m. UTC | #7
On Thu, Mar 10, 2016 at 11:21:21AM +0530, Vineet Gupta wrote:
> On Wednesday 09 March 2016 08:21 PM, Peter Zijlstra wrote:
> >> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
> >> (ignoring the performance thing for discussion sake, which is a side effect of
> >> this implementation).
> > 
> > The short answer is: Per definition. They are defined to work together,
> > which is what makes __clear_bit_unlock() such a special function.
> > 
> >> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
> >> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
> >> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
> >> atomic lock with non-atomic unlock ? I'm missing something !
> > 
> > x86 (and others) do in fact use non-atomic instructions for
> > spin_unlock(). But as this is all arch specific, we can make these
> > assumptions. It's just that generic code cannot rely on it.
> 
> OK despite being obvious now, I was not seeing the similarity between spin_*lock()
> and bit_spin*lock() :-(
> 
> ARC also uses standard ST for spin_unlock() so by analogy __bit_spin_unlock() (for
> LLSC case) would be correctly paired with bit_spin_lock().
> 
> But then why would anyone need bit_spin_unlock() at all? Especially after this
> patch from you, which tightens __bit_spin_unlock() even more for the general case.
> 
> Thing is, if the API exists, the majority of people would use the more
> conservative version w/o understanding all these nuances. Can we pursue the path of
> moving bit_spin_unlock() over to __bit_spin_unlock(): first changing the backend
> only, and if proven stable, replacing the call-sites themselves?

So the thing is, __bit_spin_unlock() is not safe if other bits in that
word can have concurrent modifications.

Only if the bitlock locks the whole word (or something else ensures no
other bits will change) can you use __bit_spin_unlock() to clear the
lock bit.
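
Concretely, the two cases look like this (a userspace sketch with C11
atomics; the names are made up):

#include <stdatomic.h>

#define LOCK_BIT        0UL

static _Atomic unsigned long word;          /* bit 0: lock, rest: flags */

/* Safe form of __bit_spin_unlock(): every other bit in the word is only
 * ever modified while holding the lock, so nothing can race with this
 * non-atomic read-modify-write and a release store is enough.
 */
static void unlock_whole_word_protected(void)
{
        unsigned long v = atomic_load_explicit(&word, memory_order_relaxed);

        atomic_store_explicit(&word, v & ~(1UL << LOCK_BIT),
                              memory_order_release);
}

/* Required form when other CPUs may update other bits of the same word
 * concurrently: the unlock itself must be an atomic RMW, otherwise the
 * plain read-modify-write above could overwrite their updates.
 */
static void unlock_shared_word(void)
{
        atomic_fetch_and_explicit(&word, ~(1UL << LOCK_BIT),
                                  memory_order_release);
}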

Patch

diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
index c30266e94806..8ef0ccbf8167 100644
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -29,16 +29,16 @@  do {					\
  * @nr: the bit to set
  * @addr: the address to start counting from
  *
- * This operation is like clear_bit_unlock, however it is not atomic.
- * It does provide release barrier semantics so it can be used to unlock
- * a bit lock, however it would only be used if no other CPU can modify
- * any bits in the memory until the lock is released (a good example is
- * if the bit lock itself protects access to the other bits in the word).
+ * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
+ * the bits in the word are protected by this lock some archs can use weaker
+ * ops to safely unlock.
+ *
+ * See for example x86's implementation.
  */
 #define __clear_bit_unlock(nr, addr)	\
 do {					\
-	smp_mb();			\
-	__clear_bit(nr, addr);		\
+	smp_mb__before_atomic();	\
+	clear_bit(nr, addr);		\
 } while (0)
 
 #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */