Hung task when calling clone() due to netfilter/slab

Message ID alpine.DEB.2.00.1201171620590.14697@router.home
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Christoph Lameter (Ampere) Jan. 17, 2012, 10:22 p.m. UTC
Another version that drops the slub lock for both invocations of sysfs
functions from kmem_cache_create. The invocation from slab_sysfs_init
is not a problem since user space is not active at that point.


Subject: slub: Do not take the slub lock while calling into sysfs

This patch avoids holding the slub_lock during kmem_cache_create()
when calling sysfs. It is possible because kmem_cache_create()
allocates the kmem_cache object and is therefore the only context
that can access the newly created object. It is therefore possible
to drop the slub_lock early. We defer the adding of the new kmem_cache
to the end of processing because the new kmem_cache structure would
be reachable otherwise via scans over slabs. This allows sysfs_slab_add()
to run without holding any locks.

The case is different if we are creating an alias instead of a new
kmem_cache structure. In that case we can also drop the slub lock
early because we have taken a refcount on the kmem_cache structure.
It therefore cannot vanish from under us.
But if the sysfs_slab_alias() call fails we can no longer simply
decrement the refcount since the other references may have gone
away in the meantime. Call kmem_cache_destroy() to cause the
refcount to be decremented and the kmem_cache structure to be
freed if all references are gone.

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |   25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)
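
The ordering described in the changelog (set up the object privately, call the registration step with no lock held, and only then link the object into the global list) can be sketched in userspace C. This is a minimal illustration, not the mm/slub.c code: struct cache, registry_lock, registry_head, register_external(), cache_create() and cache_count() are hypothetical stand-ins for struct kmem_cache, slub_lock, slab_caches, sysfs_slab_add() and kmem_cache_create(), and a pthread mutex takes the place of the kernel rw_semaphore.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
	char *name;
	struct cache *next;	/* link in the global list */
};

/* Hypothetical stand-ins for slub_lock and slab_caches. */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct cache *registry_head;

/*
 * Stand-in for sysfs_slab_add(): assumed to be able to block, so it is
 * called with no lock held. Returns 0 on success, nonzero on failure.
 * For the sake of the example, empty names are rejected.
 */
static int register_external(const struct cache *c)
{
	return c->name[0] == '\0';
}

struct cache *cache_create(const char *name)
{
	struct cache *c;
	size_t len;

	assert(name != NULL);
	len = strlen(name) + 1;

	c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	c->name = malloc(len);
	if (!c->name) {
		free(c);
		return NULL;
	}
	memcpy(c->name, name, len);
	c->next = NULL;

	/* The object is still private to this caller: no lock is needed. */
	if (register_external(c) != 0) {
		free(c->name);
		free(c);
		return NULL;
	}

	/* Publish last, holding the lock only for the list update. */
	pthread_mutex_lock(&registry_lock);
	c->next = registry_head;
	registry_head = c;
	pthread_mutex_unlock(&registry_lock);
	return c;
}

/* Number of published caches, i.e. what other contexts can reach. */
size_t cache_count(void)
{
	size_t n = 0;
	struct cache *c;

	pthread_mutex_lock(&registry_lock);
	for (c = registry_head; c; c = c->next)
		n++;
	pthread_mutex_unlock(&registry_lock);
	return n;
}
```

Because a failed registration happens before the object is published, the error path in this sketch only frees private memory and never has to unlink anything from the list, which mirrors the removal of the list_del() call in the patch.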

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric W. Biederman Jan. 19, 2012, 9:43 p.m. UTC | #1
Christoph Lameter <cl@linux.com> writes:

> Another version that drops the slub lock for both invocations of sysfs
> functions from kmem_cache_create. The invocation from slab_sysfs_init
> is not a problem since user space is not active at that point.
>
>
> Subject: slub: Do not take the slub lock while calling into sysfs
>
> This patch avoids holding the slub_lock during kmem_cache_create()
> when calling sysfs. It is possible because kmem_cache_create()
> allocates the kmem_cache object and is therefore the only context
> that can access the newly created object. It is therefore possible
> to drop the slub_lock early. We defer the adding of the new kmem_cache
> to the end of processing because the new kmem_cache structure would
> be reachable otherwise via scans over slabs. This allows sysfs_slab_add()
> to run without holding any locks.
>
> The case is different if we are creating an alias instead of a new
> kmem_cache structure. In that case we can also drop the slub lock
> early because we have taken a refcount on the kmem_cache structure.
> It therefore cannot vanish from under us.
> But if the sysfs_slab_alias() call fails we can no longer simply
> decrement the refcount since the other references may have gone
> away in the meantime. Call kmem_cache_destroy() to cause the
> refcount to be decremented and the kmem_cache structure to be
> freed if all references are gone.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

I am dense.  Is the deadlock you are fixing here slub calling sysfs
with the slub_lock held, and sysfs then calling kmem_cache_zalloc?

I don't see what sysfs is doing in the creation path that would cause
a deadlock except for using slab.

> ---
>  mm/slub.c |   25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2012-01-17 09:53:26.599505365 -0600
> +++ linux-2.6/mm/slub.c	2012-01-17 09:59:57.131497273 -0600
> @@ -3912,13 +3912,14 @@ struct kmem_cache *kmem_cache_create(con
>  		s->objsize = max(s->objsize, (int)size);
>  		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
>
> +		up_write(&slub_lock);
>  		if (sysfs_slab_alias(s, name)) {
> -			s->refcount--;
> +			kmem_cache_destroy(s);
>  			goto err;
>  		}
> -		up_write(&slub_lock);
>  		return s;
>  	}
> +	up_write(&slub_lock);
>
>  	n = kstrdup(name, GFP_KERNEL);
>  	if (!n)
> @@ -3928,27 +3929,23 @@ struct kmem_cache *kmem_cache_create(con
>  	if (s) {
>  		if (kmem_cache_open(s, n,
>  				size, align, flags, ctor)) {
> -			list_add(&s->list, &slab_caches);
> -			if (sysfs_slab_add(s)) {
> -				list_del(&s->list);
> -				kfree(n);
> -				kfree(s);
> -				goto err;
> +
> +			if (sysfs_slab_add(s) == 0) {
> +				down_write(&slub_lock);
> +				list_add(&s->list, &slab_caches);
> +				up_write(&slub_lock);
> +				return s;
>  			}
> -			up_write(&slub_lock);
> -			return s;
>  		}
>  		kfree(n);
>  		kfree(s);
>  	}
>  err:
> -	up_write(&slub_lock);
>
>  	if (flags & SLAB_PANIC)
>  		panic("Cannot create slabcache %s\n", name);
> -	else
> -		s = NULL;
> -	return s;
> +
> +	return NULL;
>  }
>  EXPORT_SYMBOL(kmem_cache_create);
>
Eric W. Biederman Jan. 19, 2012, 10:15 p.m. UTC | #2
ebiederm@xmission.com (Eric W. Biederman) writes:

> Christoph Lameter <cl@linux.com> writes:
>
>> Another version that drops the slub lock for both invocations of sysfs
>> functions from kmem_cache_create. The invocation from slab_sysfs_init
>> is not a problem since user space is not active at that point.
>>
>>
>> Subject: slub: Do not take the slub lock while calling into sysfs
>>
>> This patch avoids holding the slub_lock during kmem_cache_create()
>> when calling sysfs. It is possible because kmem_cache_create()
>> allocates the kmem_cache object and is therefore the only context
>> that can access the newly created object. It is therefore possible
>> to drop the slub_lock early. We defer the adding of the new kmem_cache
>> to the end of processing because the new kmem_cache structure would
>> be reachable otherwise via scans over slabs. This allows sysfs_slab_add()
>> to run without holding any locks.
>>
>> The case is different if we are creating an alias instead of a new
>> kmem_cache structure. In that case we can also drop the slub lock
>> early because we have taken a refcount on the kmem_cache structure.
>> It therefore cannot vanish from under us.
>> But if the sysfs_slab_alias() call fails we can no longer simply
>> decrement the refcount since the other references may have gone
>> away in the meantime. Call kmem_cache_destroy() to cause the
>> refcount to be decremented and the kmem_cache structure to be
>> freed if all references are gone.
>>
>> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> I am dense.  Is the deadlock you are fixing here slub calling sysfs
> with the slub_lock held, and sysfs then calling kmem_cache_zalloc?
>
> I don't see what sysfs is doing in the creation path that would cause
> a deadlock except for using slab.

Oh.  I see.  The problem is calling kobject_uevent (which happens to
live in slub's sysfs_slab_add) with a lock held, because kobject_uevent
makes a blocking call to userspace.

No locks held seems to be a good policy on that one.

Eric
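
The hazard described above can be reproduced in miniature in userspace C. This is only a sketch under stated assumptions: registry_lock stands in for slub_lock, listener() stands in for whatever userspace does in response to the uevent (assumed here to need the same lock, as reading /proc/slabinfo would), and notify_and_wait() stands in for the blocking call made by kobject_uevent. All of these names are hypothetical. The listener uses pthread_mutex_trylock() so that the conflict is reported as EBUSY instead of hanging the program.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for slub_lock. */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for the userspace listener. It needs registry_lock to make
 * progress; trylock turns "would block forever" into an EBUSY result.
 */
static void *listener(void *arg)
{
	int *result = arg;

	assert(result != NULL);
	*result = pthread_mutex_trylock(&registry_lock);
	if (*result == 0)
		pthread_mutex_unlock(&registry_lock);
	return NULL;
}

/*
 * Stand-in for a blocking notification: start the listener and wait for
 * it to finish. Returns 0 if the listener got the lock, EBUSY if not,
 * and -1 if the listener thread could not be started.
 */
static int notify_and_wait(void)
{
	pthread_t t;
	int result = -1;

	if (pthread_create(&t, NULL, listener, &result) != 0)
		return -1;
	pthread_join(t, NULL);
	return result;
}

/* Old ordering: the notification is sent while the lock is still held. */
int create_notify_locked(void)
{
	int r;

	pthread_mutex_lock(&registry_lock);
	r = notify_and_wait();
	pthread_mutex_unlock(&registry_lock);
	return r;
}

/* New ordering: the lock is dropped before the blocking notification. */
int create_notify_unlocked(void)
{
	pthread_mutex_lock(&registry_lock);
	/* short critical section that really needs the lock */
	pthread_mutex_unlock(&registry_lock);
	return notify_and_wait();
}
```

With a blocking lock acquisition in the listener instead of trylock, create_notify_locked() would never return: the creator waits for the listener, and the listener waits for the lock the creator holds.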
Christoph Lameter (Ampere) Jan. 20, 2012, 2:03 a.m. UTC | #3
On Thu, 19 Jan 2012, Eric W. Biederman wrote:

> Oh.  I see.  The problem is calling kobject_uevent (which happens to
> live in slub's sysfs_slab_add) with a lock held, because kobject_uevent
> makes a blocking call to userspace.
>
> No locks held seems to be a good policy on that one.

Well we can just remove that call to kobject_uevent instead then. Does it
do anything useful? Cannot remember why we put that in there.

Eric W. Biederman Jan. 20, 2012, 2:31 a.m. UTC | #4
Christoph Lameter <cl@linux.com> writes:

> On Thu, 19 Jan 2012, Eric W. Biederman wrote:
>
>> Oh.  I see.  The problem is calling kobject_uevent (which happens to
>> live in slub's sysfs_slab_add) with a lock held, because kobject_uevent
>> makes a blocking call to userspace.
>>
>> No locks held seems to be a good policy on that one.
>
> Well we can just remove that call to kobject_uevent instead then. Does it
> do anything useful? Cannot remember why we put that in there.

Empirically it sounds like something is listening for it and doing cat
/proc/slabinfo.  Something like that would have to happen for the
observed deadlock to occur.

On the flip side, removing from sysfs with locks held must be done
carefully, and as a default I would recommend not holding locks while
removing things from sysfs, because removal blocks waiting for all of
the callers into those sysfs attributes to complete.

It looks like you are ok on the removal because none of the sysfs
attributes appear to take the slub_lock, just /proc/slabinfo.  But
it does look like playing with fire.

Eric
Christoph Lameter (Ampere) Jan. 20, 2012, 2:49 p.m. UTC | #5
On Thu, 19 Jan 2012, Eric W. Biederman wrote:

> On the flip side, removing from sysfs with locks held must be done
> carefully, and as a default I would recommend not holding locks while
> removing things from sysfs, because removal blocks waiting for all of
> the callers into those sysfs attributes to complete.
>
> It looks like you are ok on the removal because none of the sysfs
> attributes appear to take the slub_lock, just /proc/slabinfo.  But
> it does look like playing with fire.

Ok then I guess my last patch is needed to make sysfs operations safe.

It may be good to audit the kernel for locks being held while calling
sysfs functions. Isn't there a lockdep check that ensures that no locks
are held?


Eric W. Biederman Jan. 20, 2012, 8:40 p.m. UTC | #6
Christoph Lameter <cl@linux.com> writes:

> On Thu, 19 Jan 2012, Eric W. Biederman wrote:
>
>> On the flip side, removing from sysfs with locks held must be done
>> carefully, and as a default I would recommend not holding locks while
>> removing things from sysfs, because removal blocks waiting for all of
>> the callers into those sysfs attributes to complete.
>>
>> It looks like you are ok on the removal because none of the sysfs
>> attributes appear to take the slub_lock, just /proc/slabinfo.  But
>> it does look like playing with fire.
>
> Ok then I guess my last patch is needed to make sysfs operations safe.
>
> It may be good to audit the kernel for locks being held while calling
> sysfs functions. Isn't there a lockdep check that ensures that no locks
> are held?

I don't see a "no locks are held" check, but call_usermodehelper in the
blocking case could certainly use one.

For the sysfs remove case lockdep should work.

Eric
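
A "no locks are held" check of the kind discussed here could look roughly like the following userspace sketch. It does not show how lockdep works or any existing kernel interface: tracked_lock(), tracked_unlock(), safe_to_block(), call_helper_blocking() and sample_helper() are hypothetical names, and a per-thread counter stands in for lockdep's bookkeeping of held locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Per-thread count of locks taken through the tracked wrappers. */
static _Thread_local int locks_held;

void tracked_lock(pthread_mutex_t *m)
{
	pthread_mutex_lock(m);
	locks_held++;
}

void tracked_unlock(pthread_mutex_t *m)
{
	assert(locks_held > 0);	/* unlock without a matching tracked_lock */
	locks_held--;
	pthread_mutex_unlock(m);
}

/* True when the calling thread holds no tracked locks. */
bool safe_to_block(void)
{
	return locks_held == 0;
}

/*
 * Stand-in for a blocking call such as call_usermodehelper in its
 * waiting mode: refuse to run the helper if any tracked lock is held.
 * Returns -1 on refusal, otherwise the helper's return value.
 */
int call_helper_blocking(int (*helper)(void))
{
	if (!safe_to_block())
		return -1;
	return helper();
}

/* Hypothetical helper used only to exercise the check. */
int sample_helper(void)
{
	return 0;
}
```

A real check would more likely warn and continue than refuse the call; returning -1 here just makes the violation easy to observe from a caller.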
Pekka Enberg Feb. 1, 2012, 8:05 a.m. UTC | #7
On Fri, Jan 20, 2012 at 4:49 PM, Christoph Lameter <cl@linux.com> wrote:
> Ok then I guess my last patch is needed to make sysfs operations safe.

Hmm. So is the latter patch needed or not?
Eric W. Biederman Feb. 1, 2012, 5:32 p.m. UTC | #8
Pekka Enberg <penberg@kernel.org> writes:

> On Fri, Jan 20, 2012 at 4:49 PM, Christoph Lameter <cl@linux.com> wrote:
>> Ok then I guess my last patch is needed to make sysfs operations safe.
>
> Hmm. So is the latter patch needed or not?

Imperfect changelog but yes.

Eric

Patch

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-01-17 09:53:26.599505365 -0600
+++ linux-2.6/mm/slub.c	2012-01-17 09:59:57.131497273 -0600
@@ -3912,13 +3912,14 @@  struct kmem_cache *kmem_cache_create(con
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));

+		up_write(&slub_lock);
 		if (sysfs_slab_alias(s, name)) {
-			s->refcount--;
+			kmem_cache_destroy(s);
 			goto err;
 		}
-		up_write(&slub_lock);
 		return s;
 	}
+	up_write(&slub_lock);

 	n = kstrdup(name, GFP_KERNEL);
 	if (!n)
@@ -3928,27 +3929,23 @@  struct kmem_cache *kmem_cache_create(con
 	if (s) {
 		if (kmem_cache_open(s, n,
 				size, align, flags, ctor)) {
-			list_add(&s->list, &slab_caches);
-			if (sysfs_slab_add(s)) {
-				list_del(&s->list);
-				kfree(n);
-				kfree(s);
-				goto err;
+
+			if (sysfs_slab_add(s) == 0) {
+				down_write(&slub_lock);
+				list_add(&s->list, &slab_caches);
+				up_write(&slub_lock);
+				return s;
 			}
-			up_write(&slub_lock);
-			return s;
 		}
 		kfree(n);
 		kfree(s);
 	}
 err:
-	up_write(&slub_lock);

 	if (flags & SLAB_PANIC)
 		panic("Cannot create slabcache %s\n", name);
-	else
-		s = NULL;
-	return s;
+
+	return NULL;
 }
 EXPORT_SYMBOL(kmem_cache_create);