
mm: slub: Ensure that slab_unlock() is atomic

Message ID 1457447457-25878-1-git-send-email-vgupta@synopsys.com
State Superseded

Commit Message

Vineet Gupta March 8, 2016, 2:30 p.m. UTC
We observed livelocks on an ARC SMP setup when running hackbench with SLUB.
This hardware configuration lacks atomic instructions (LLOCK/SCOND), thus the
kernel resorts to a central @smp_bitops_lock to protect any R-M-W ops
such as test_and_set_bit()

The spinlock itself is implemented using Atomic [EX]change instruction
which is always available.

The race happened when both cores tried to slab_lock() the same page.

   c1		    c0
-----------	-----------
slab_lock
		slab_lock
slab_unlock
		Not observing the unlock

This in turn happened because slab_unlock() doesn't serialize properly
(doesn't use an atomic clear) with a concurrently running
slab_lock()->test_and_set_bit()
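
For illustration, the pairing at issue looks roughly like this (a simplified
sketch of the generic bit_spinlock helpers, not the exact upstream source;
the sketch_ names and the omission of preemption handling are mine):

/* Simplified sketch: the lock side is an atomic RMW, the unlock side is not */
static inline void sketch_bit_spin_lock(int bitnum, unsigned long *addr)
{
	/* atomic RMW: on !LLSC ARC this goes through @smp_bitops_lock */
	while (test_and_set_bit(bitnum, addr))
		while (test_bit(bitnum, addr))
			cpu_relax();
}

static inline void sketch___bit_spin_unlock(int bitnum, unsigned long *addr)
{
	/* non-atomic clear: plain load/modify/store, no @smp_bitops_lock */
	__clear_bit(bitnum, addr);
}

slab_lock() uses the former, slab_unlock() currently uses the latter, so the
unlocked store can race with the locked RMW on !LLSC hardware.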

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Noam Camus <noamc@ezchip.com>
Cc: <stable@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>
Cc: <linux-snps-arc@lists.infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christoph Lameter (Ampere) March 8, 2016, 3 p.m. UTC | #1
On Tue, 8 Mar 2016, Vineet Gupta wrote:

> This in turn happened because slab_unlock() doesn't serialize properly
> (doesn't use an atomic clear) with a concurrently running
> slab_lock()->test_and_set_bit()

This is intentional because of the increased latency of atomic
instructions. Why would the unlock need to be atomic? This patch will
cause regressions.

Guess this is an architecture specific issue of modified
cachelines not becoming visible to other processors?
Vlastimil Babka March 8, 2016, 3:32 p.m. UTC | #2
On 03/08/2016 03:30 PM, Vineet Gupta wrote:
> We observed livelocks on an ARC SMP setup when running hackbench with SLUB.
> This hardware configuration lacks atomic instructions (LLOCK/SCOND), thus the
> kernel resorts to a central @smp_bitops_lock to protect any R-M-W ops
> such as test_and_set_bit()

Sounds like this architecture should then redefine __clear_bit_unlock
and perhaps other non-atomic __X_bit() variants to be atomic, and not
defer this requirement to places that use the API?
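
Something along these lines in the arch code, perhaps (purely a hypothetical
sketch; the exact guard, placement, and the mechanics of overriding the
generic definition are glossed over here):

/* hypothetical arch override: make the "__" unlock variant fully serialized
 * on configs that lack LLOCK/SCOND, by reusing the atomic clear path */
#ifndef CONFIG_ARC_HAS_LLSC
#define __clear_bit_unlock(nr, addr)	clear_bit_unlock(nr, addr)
#endif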

> The spinlock itself is implemented using Atomic [EX]change instruction
> which is always available.
> 
> The race happened when both cores tried to slab_lock() the same page.
> 
>    c1		    c0
> -----------	-----------
> slab_lock
> 		slab_lock
> slab_unlock
> 		Not observing the unlock
> 
> This in turn happened because slab_unlock() doesn't serialize properly
> (doesn't use an atomic clear) with a concurrently running
> slab_lock()->test_and_set_bit()
> 
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Noam Camus <noamc@ezchip.com>
> Cc: <stable@vger.kernel.org>
> Cc: <linux-mm@kvack.org>
> Cc: <linux-kernel@vger.kernel.org>
> Cc: <linux-snps-arc@lists.infradead.org>
> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
> ---
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index d8fbd4a6ed59..b7d345a508dc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -345,7 +345,7 @@ static __always_inline void slab_lock(struct page *page)
>  static __always_inline void slab_unlock(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(PageTail(page), page);
> -	__bit_spin_unlock(PG_locked, &page->flags);
> +	bit_spin_unlock(PG_locked, &page->flags);
>  }
>  
>  static inline void set_page_slub_counters(struct page *page, unsigned long counters_new)
>
Vineet Gupta March 8, 2016, 3:46 p.m. UTC | #3
On Tuesday 08 March 2016 08:30 PM, Christoph Lameter wrote:
> On Tue, 8 Mar 2016, Vineet Gupta wrote:
> 
>> This in turn happened because slab_unlock() doesn't serialize properly
>> (doesn't use an atomic clear) with a concurrently running
>> slab_lock()->test_and_set_bit()
> 
> This is intentional because of the increased latency of atomic
> instructions. Why would the unlock need to be atomic? This patch will
> cause regressions.
> 
> Guess this is an architecture specific issue of modified
> cachelines not becoming visible to other processors?

Absolutely not - we verified with hardware coherency tracing that there was no
foul play there. And I would not dare point a finger at code which was last
updated in 2011 without being absolutely sure.

Let me explain this in a bit more detail. As I mentioned in the commitlog, this
config of ARC doesn't have exclusive load/stores (LLOCK/SCOND), so atomic ops are
implemented using a "central" spin lock. The spin lock itself is implemented using
the EX instruction (an atomic exchange).
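
In C terms, the !LLSC bitops path is roughly this shape (a paraphrased sketch
of the pattern, not the literal arch/arc source; the (A)/(B)/(C) labels match
the disassembly below):

/* sketch of the !LLSC flavour: every "atomic" op serializes on one lock */
static DEFINE_SPINLOCK(smp_bitops_lock);

static inline int test_and_set_bit_sketch(int nr, volatile unsigned long *addr)
{
	unsigned long old, flags;

	spin_lock_irqsave(&smp_bitops_lock, flags);	/* EX-based spinlock */
	old = *addr;				/* (A) load the flags word */
	*addr = old | BIT(nr);			/* (C) plain store of the set bit */
	spin_unlock_irqrestore(&smp_bitops_lock, flags);

	return old & BIT(nr);
}

static inline void __clear_bit_sketch(int nr, volatile unsigned long *addr)
{
	/* "non-atomic" variant: never takes smp_bitops_lock, so its store
	 * can land between (A) and (C) above - point (B) - and be overwritten */
	*addr &= ~BIT(nr);
}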

Generated code for slab_lock() - essentially bit_spin_lock() - is below (I've
removed the generated code for CONFIG_PREEMPT for simplicity):

80543b0c <slab_lock>:
80543b0c:	push_s     blink
...
80543b3a:	mov_s      r15,0x809de168   <-- @smp_bitops_lock
80543b40:	mov_s      r17,1
80543b46:	mov_s      r16,0

# spin lock() inside test_and_set_bit() - see arc bitops.h (!LLSC code)
80543b78:	clri       r4
80543b7c:	dmb        3
80543b80:	mov_s      r2,r17
80543b82:	ex         r2,[r15]
80543b86:	breq       r2,1,80543b82
80543b8a:	dmb        3

# set the bit
80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)

# spin_unlock
80543b96:	dmb        3
80543b9a:	mov_s      r3,r16
80543b9c:	ex         r3,[r15]
80543ba0:	dmb        3
80543ba4:	seti       r4

# check the old bit
80543ba8:	bbit0      r2,0,80543bb8   <--- bit was set, branch not taken
80543bac:	b_s        80543b68        <--- enter the test_bit() loop

   80543b68:	ld_s       r2,[r13,0]	   <-- (C) reads the bit, set by SELF
   80543b6a:	bbit1    r2,0,80543b68              spins infinitely

...


Now, using hardware coherency tracing (and the cycle timestamps), we verified
the ordering of (A) and (B) above.

The thing is, with exclusive load/stores this race simply can't happen, since the
intervening ST will cause the ST in (C) to NOT commit, and the LD/ST sequence will
be retried.
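
In pseudo-code, the LLSC flavour would look something like this (illustrative
only; llock()/scond() stand in for the exclusive load/store pair, the real
thing is a short asm loop):

do {
	old = llock(addr);		/* load-locked: start monitoring addr */
	new = old | BIT(nr);
} while (!scond(addr, new));		/* store-conditional fails if any other
					 * store hit addr since the llock, so an
					 * interim __clear_bit forces a retry */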

And there will be very few production systems which are SMP but lack exclusive
load/stores.

Are you convinced now !

-Vineet
Christoph Lameter (Ampere) March 8, 2016, 8:40 p.m. UTC | #4
On Tue, 8 Mar 2016, Vineet Gupta wrote:

> # set the bit
> 80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
> 80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
> 80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)

Duh. Guess you  need to take the spinlock also in the arch specific
implementation of __bit_spin_unlock(). This is certainly not the only case
in which we use the __ op to unlock.

You need a true atomic op or you need to take the "spinlock" in all
cases where you modify the bit. If you take the lock in __bit_spin_unlock
then the race cannot happen.

> Are you convinced now !

Yes, please fix your arch specific code.
Vineet Gupta March 9, 2016, 6:43 a.m. UTC | #5
+CC linux-arch, parisc folks, PeterZ

On Wednesday 09 March 2016 02:10 AM, Christoph Lameter wrote:
> On Tue, 8 Mar 2016, Vineet Gupta wrote:
> 
>> # set the bit
>> 80543b8e:	ld_s       r2,[r13,0] <--- (A) Finds PG_locked is set
>> 80543b90:	or         r3,r2,1    <--- (B) other core unlocks right here
>> 80543b94:	st_s       r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)
> 
> Duh. Guess you  need to take the spinlock also in the arch specific
> implementation of __bit_spin_unlock(). This is certainly not the only case
> in which we use the __ op to unlock.

__bit_spin_unlock() by definition is *not* required to be atomic, bit_spin_unlock()
is - so I don't think we need a spinlock there.

There is clearly a problem in the slub code in that it pairs a test_and_set_bit()
with a __clear_bit(). The latter can obviously clobber the former if they are not
single instructions each (unlike on x86), or unless they use llock/scond-style
instructions where the interim store from the other core is detected and causes a
retry of the whole llock/scond sequence.

BTW, ARC is not the only arch which suffers from this - other arches are
potentially affected too. AFAIK PARISC also doesn't have atomic R-M-W instructions
and likewise uses a set of external hashed spinlocks to protect its R-M-W sequences.

https://lkml.org/lkml/2014/6/1/178

So there too we have the same race, because the outer spin lock is not taken for
slab_unlock() -> __bit_spin_unlock() -> __clear_bit().

Arguably I could fix the ARC !LLSC variant of test_and_set_bit() to not set the
bit unconditionally but only if it was clear (PARISC does the same). That would be
a slight micro-optimization, as we wouldn't need another snoop transaction to make
the line writable, and it would also elide this problem. But I think there is a
fundamental problem here in slub: pairing atomic and non-atomic ops for performance
reasons simply doesn't work on all arches and/or configurations.
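
The test_and_set_bit() tweak mentioned above would be something like this (a
sketch of the idea only, reusing the shape of the hypothetical !LLSC sketch
from earlier in the thread):

	spin_lock_irqsave(&smp_bitops_lock, flags);
	old = *addr;
	if (!(old & BIT(nr)))		/* only store when the bit was clear, so an
					 * already-set bit is never written back over
					 * a concurrent (unlocked) clear */
		*addr = old | BIT(nr);
	spin_unlock_irqrestore(&smp_bitops_lock, flags);
	return old & BIT(nr);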

> You need a true atomic op or you need to take the "spinlock" in all
> cases where you modify the bit.

No we don't in __bit_spin_unlock() and we already do in bit_spin_lock().

> If you take the lock in __bit_spin_unlock
> then the race cannot happen.

Of course it won't, but that means we penalize all non-atomic callers of the API
with a superfluous spinlock which is not required in the first place, given the
definition of the API.


>> Are you convinced now !
> 
> Yes, please fix your arch specific code.

Patch

diff --git a/mm/slub.c b/mm/slub.c
index d8fbd4a6ed59..b7d345a508dc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -345,7 +345,7 @@  static __always_inline void slab_lock(struct page *page)
 static __always_inline void slab_unlock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	__bit_spin_unlock(PG_locked, &page->flags);
+	bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static inline void set_page_slub_counters(struct page *page, unsigned long counters_new)