UNIX: Do not loop forever at unix_autobind().

Message ID: 201008302227.DJH30258.OQFMFtFJOOVSHL@I-love.SAKURA.ne.jp
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Tetsuo Handa Aug. 30, 2010, 1:27 p.m. UTC
I tried to create and autobind 1048576 UNIX domain sockets using
http://I-love.SAKURA.ne.jp/tmp/unixloop.asm on 2.6.18-194.11.1.el5.x86_64,
and found that it needs about 2GB of RAM. Thus, on systems where
/proc/sys/fs/file-max is larger than 1048576, a local user can create and
autobind 1048576 UNIX domain sockets in order to make applications fall into a

  while (1)
  	yield();

loop. Below is the patch (compile-tested only) to avoid falling into this loop.
Is there any reason a name has to be '\0' + 5 letters?
Should I give up after checking 1048576 names rather than after checking
4294967296 names?
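
For reference, here is a minimal C equivalent of the reproducer idea (the
original unixloop.asm is not reproduced here; this sketch assumes the usual
autobind trigger, bind() with addr_len == sizeof(short), and a real exploit
would fork() once the per-process fd limit is hit):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  int main(void)
  {
  	struct sockaddr_un addr = { .sun_family = AF_UNIX };
  	unsigned long n = 0;

  	for (;;) {
  		int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  		if (fd < 0)
  			break;
  		/* addr_len == sizeof(short) makes the kernel autobind,
  		 * consuming one '\0' + 5 hex digit abstract name. */
  		if (bind(fd, (struct sockaddr *)&addr,
  			 sizeof(addr.sun_family)) < 0)
  			break;
  		n++;	/* keep fd open so the name stays bound */
  	}
  	printf("autobound %lu sockets\n", n);
  	return 0;
  }
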
----------------------------------------
From 703b0677bf03afda19665f05fae4850ecf88afe3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 30 Aug 2010 22:07:13 +0900
Subject: [PATCH] UNIX: Do not loop forever at unix_autobind().

We assumed that unix_autobind() never fails if kzalloc() succeeded.
But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
consume all names using fork()/socket()/bind().

If all names are in use, processes that call bind() with addr_len == sizeof(short)
or connect()/sendmsg() with setsockopt(SO_PASSCRED) will spin in a

  while (1)
  	yield();

loop at unix_autobind() until a name becomes available.

This patch changes unix_autobind() to check up to 4294967296 names and to give
up if all of them are in use.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 net/unix/af_unix.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Sept. 1, 2010, 7:47 p.m. UTC | #1
On Monday, August 30, 2010, at 22:27 +0900, Tetsuo Handa wrote:
> I tried to create and autobind 1048576 UNIX domain sockets using
> http://I-love.SAKURA.ne.jp/tmp/unixloop.asm on 2.6.18-194.11.1.el5.x86_64,
> and found that it needs about 2GB of RAM. Thus, on systems where
> /proc/sys/fs/file-max is larger than 1048576, a local user can create and
> autobind 1048576 UNIX domain sockets in order to make applications fall into a
> 
>   while (1)
>   	yield();
> 
> loop. Below is the patch (compile-tested only) to avoid falling into this loop.
> Is there any reason a name has to be '\0' + 5 letters?
> Should I give up after checking 1048576 names rather than after checking
> 4294967296 names?

Yes, please. Fix the bug first.

Then a follow-up patch can increase the current limit, if necessary.

> ----------------------------------------

>  
>  	err = -ENOMEM;
> -	addr = kzalloc(sizeof(*addr) + sizeof(short) + 16, GFP_KERNEL);
> +	addr = kzalloc(sizeof(*addr) + sizeof(short) + 19, GFP_KERNEL);

IMHO, this 16 or 19 value is wrong; we need less memory than that.

(But this will be addressed in a 2nd patch.)
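
For what it's worth, a sketch of the size arithmetic behind that remark (the
breakdown is my inference from the '\0' + 5 hex digit format, not spelled out
in the thread):

  /* sprintf(..., "%05x", ...) writes 5 digits plus a trailing NUL
   * after the leading '\0' marker, so sun_path needs only
   * 1 + 5 + 1 = 7 bytes; "+ 16" (or "+ 19") over-allocates. */
  addr = kzalloc(sizeof(*addr)	/* refcnt, len, hash header */
  	       + sizeof(short)	/* sun_family */
  	       + 1 + 5 + 1,	/* '\0' marker, "%05x", NUL */
  	       GFP_KERNEL);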

Thanks


Tetsuo Handa Sept. 1, 2010, 9:33 p.m. UTC | #2
I contacted Brad Spengler regarding the topic below

  http://kerneltrap.org/mailarchive/linux-netdev/2010/8/21/6283491

and received some comments.

Brad Spengler wrote:
> Tetsuo Handa wrote:
> > Brad Spengler wrote:
> > > Nice find, seems like the problem isn't just the exhaustion of the space 
> > > itself, but that a single non-root user can completely exhaust it with 
> > > no limitation (given enough RAM).  There should be some per-user 
> > > resource limit for this (similar to RLIMIT_NPROC) or the space should be 
> > > made less rigid.
> > > 
> > I posted a patch in the third message of that thread, but no response so far.
> > I don't know whether we should allow 4294967296 names or not,
> > because an attacker will consume more RAM if more names are allowed.
> 
> The max size for sun_path is considerable; why couldn't we encode
> the uid/namespace into it somehow and reduce the amount each user can
> allocate?  (conceptually it'd be something like <namespace
> uuid><uid><hex to operate on as previously>)  The OOM killer needs to
> make sure creators of large numbers of these get killed, but 
> implementing a per-user limit on this keeps it from turning into a DoS.
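
A hypothetical sketch of that encoding (the helper name and field widths below
are invented for illustration, not from any posted patch):

  #include <stddef.h>
  #include <stdio.h>

  /* Render "\0<ns:8 hex><uid:8 hex><seq:5 hex>" into sun_path, so each
   * (namespace, uid) pair cycles through its own 2^20 sequence values.
   * Returns the number of sun_path bytes used, excluding the NUL. */
  static int format_autobind_name(char *sun_path, size_t size,
  				unsigned int ns_id, unsigned int uid,
  				unsigned int seq)
  {
  	sun_path[0] = '\0';	/* abstract-namespace marker */
  	return 1 + snprintf(sun_path + 1, size - 1, "%08x%08x%05x",
  			    ns_id, uid, seq & 0xFFFFF);
  }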

Is it possible to make the OOM killer kill processes consuming kernel memory
rather than userspace memory?

I would be happy if the OOM killer could select the victim process based on
both kernel memory usage and userspace memory usage. For example, opening a
file requires kernel memory allocation (e.g. /sys/kernel/security/tomoyo/self_domain
allocates 8KB of kernel memory). Therefore, I refuse to allow everybody to
open that file even if the content is public (because an attacker could open
/sys/kernel/security/tomoyo/self_domain as many times as possible using
fork() and open() in order to exhaust kernel memory).
David Rientjes Sept. 1, 2010, 10:25 p.m. UTC | #3
On Thu, 2 Sep 2010, Tetsuo Handa wrote:

> Is it possible to make the OOM killer kill processes consuming kernel memory
> rather than userspace memory?
> 
> I would be happy if the OOM killer could select the victim process based on
> both kernel memory usage and userspace memory usage. For example, opening a
> file requires kernel memory allocation (e.g. /sys/kernel/security/tomoyo/self_domain
> allocates 8KB of kernel memory). Therefore, I refuse to allow everybody to
> open that file even if the content is public (because an attacker could open
> /sys/kernel/security/tomoyo/self_domain as many times as possible using
> fork() and open() in order to exhaust kernel memory).

The oom killer has been rewritten for 2.6.36, so it'd probably help to 
frame it in the context of its new behavior, which you can try out in 
2.6.36-rc3.

The case you're describing would be difficult to kill with the oom killer 
because each fork() and exec() would be considered a separate user of 
memory, so an aggregate of kernel memory amongst these threads wouldn't be 
considered.

The oom killer isn't really concerned with whether the memory was allocated in 
userspace or kernelspace; the only thing that matters is that it's used 
and is unreclaimable (and non-migratable).

The memory controller accounts for user memory, so it's probably not a 
sufficient alternative either in this case.  Is it not possible for the 
kernel to intelligently limit the amount of memory that it can allocate 
through an interface at any given time?
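
A minimal sketch of what such a cap might look like (entirely hypothetical;
nothing like this is proposed in the thread, and the names are invented):

  /* Per-interface allocation budget; a matching atomic_sub() would be
   * needed wherever the allocation is freed. */
  static atomic_t autobind_bytes = ATOMIC_INIT(0);
  #define AUTOBIND_BYTES_MAX (1 << 20)	/* 1MB, arbitrary */

  static void *capped_kzalloc(size_t size, gfp_t gfp)
  {
  	if (atomic_add_return(size, &autobind_bytes) > AUTOBIND_BYTES_MAX) {
  		atomic_sub(size, &autobind_bytes);
  		return NULL;
  	}
  	return kzalloc(size, gfp);
  }
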
KOSAKI Motohiro Sept. 3, 2010, 6:32 a.m. UTC | #4
> Is it possible to make the OOM killer kill processes consuming kernel memory
> rather than userspace memory?
> 
> I would be happy if the OOM killer could select the victim process based on
> both kernel memory usage and userspace memory usage. For example, opening a
> file requires kernel memory allocation (e.g. /sys/kernel/security/tomoyo/self_domain
> allocates 8KB of kernel memory). Therefore, I refuse to allow everybody to
> open that file even if the content is public (because an attacker could open
> /sys/kernel/security/tomoyo/self_domain as many times as possible using
> fork() and open() in order to exhaust kernel memory).

Unfortunately, it's impossible. We have no information about which kernel
memory should be bound to the calling task. Almost all kernel memory is
task-unrelated, so we can't blindly bind it to the calling task.

Thanks.

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4414a18..1271e98 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -692,6 +692,7 @@  static int unix_autobind(struct socket *sock)
 	static u32 ordernum = 1;
 	struct unix_address *addr;
 	int err;
+	u32 loop = 0;
 
 	mutex_lock(&u->readlock);
 
@@ -700,7 +701,7 @@  static int unix_autobind(struct socket *sock)
 		goto out;
 
 	err = -ENOMEM;
-	addr = kzalloc(sizeof(*addr) + sizeof(short) + 16, GFP_KERNEL);
+	addr = kzalloc(sizeof(*addr) + sizeof(short) + 19, GFP_KERNEL);
 	if (!addr)
 		goto out;
 
@@ -708,11 +709,11 @@  static int unix_autobind(struct socket *sock)
 	atomic_set(&addr->refcnt, 1);
 
 retry:
-	addr->len = sprintf(addr->name->sun_path+1, "%05x", ordernum) + 1 + sizeof(short);
+	addr->len = sprintf(addr->name->sun_path+1, "%08x", ordernum) + 1 + sizeof(short);
 	addr->hash = unix_hash_fold(csum_partial(addr->name, addr->len, 0));
 
 	spin_lock(&unix_table_lock);
-	ordernum = (ordernum+1)&0xFFFFF;
+	ordernum++;
 
 	if (__unix_find_socket_byname(net, addr->name, addr->len, sock->type,
 				      addr->hash)) {
@@ -720,6 +721,12 @@  retry:
 		/* Sanity yield. It is unusual case, but yet... */
 		if (!(ordernum&0xFF))
 			yield();
+		/* Give up if all names are in use. */
+		if (unlikely(!++loop)) {
+			err = -EAGAIN;
+			kfree(addr);
+			goto out;
+		}
 		goto retry;
 	}
 	addr->hash ^= sk->sk_type;