Message ID | CAOssrKcfncAYsQWkfLGFgoOxAQJVT2hYVWdBA6Cw7hhO8RJ_wQ@mail.gmail.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On 08/30/2016 12:18 PM, Miklos Szeredi wrote: > On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: >> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: > >> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight | >> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}' >> 130 >> crash> p unix_tot_inflight >> unix_tot_inflight = $2 = 135 >> >> We've lost track of a total of five inflight sockets, so it's not a >> one-off thing. Really weird... Now off to sleep, maybe I'll dream of >> the solution. > > Okay, found one bug: gc assumes that in-flight sockets that don't have > an external ref can't gain one while unix_gc_lock is held. That is > true because unix_notinflight() will be called before detaching fds, > which takes unix_gc_lock. Only MSG_PEEK was somehow overlooked. That > one also clones the fds, also keeping them in the skb. But through > MSG_PEEK an external reference can definitely be gained without ever > touching unix_gc_lock. > > Not sure whether the reported bug can be explained by this. Can you > confirm the MSG_PEEK was used in the setup? > > Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK? > > Anyway, attaching a fix that works by acquiring unix_gc_lock in case > of MSG_PEEK also. It is trivially correct, but I haven't tested it. I have no way of being 100% sure but looking through nginx's source code it seems they do utilize MSG_PEEK on several occasions. This issue has been apparently very hard to reproduce since I have 100s of servers running a lot of NGINX processes and this has been triggered only once. On a different note - if I inspect a live node without this patch should the discrepancy between the gc_inflight_list and the unix_tot_inflight be present VS with this patch applied? > > Thanks, > Miklos >
On Tue, Aug 30, 2016 at 11:31 AM, Nikolay Borisov <kernel@kyup.com> wrote: > > > On 08/30/2016 12:18 PM, Miklos Szeredi wrote: >> On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: >>> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: >> >>> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight | >>> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}' >>> 130 >>> crash> p unix_tot_inflight >>> unix_tot_inflight = $2 = 135 >>> >>> We've lost track of a total of five inflight sockets, so it's not a >>> one-off thing. Really weird... Now off to sleep, maybe I'll dream of >>> the solution. >> >> Okay, found one bug: gc assumes that in-flight sockets that don't have >> an external ref can't gain one while unix_gc_lock is held. That is >> true because unix_notinflight() will be called before detaching fds, >> which takes unix_gc_lock. Only MSG_PEEK was somehow overlooked. That >> one also clones the fds, also keeping them in the skb. But through >> MSG_PEEK an external reference can definitely be gained without ever >> touching unix_gc_lock. >> >> Not sure whether the reported bug can be explained by this. Can you >> confirm the MSG_PEEK was used in the setup? >> >> Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK? >> >> Anyway, attaching a fix that works by acquiring unix_gc_lock in case >> of MSG_PEEK also. It is trivially correct, but I haven't tested it. > > I have no way of being 100% sure but looking through nginx's source code > it seems they do utilize MSG_PEEK on several occasions. This issue has > been apparently very hard to reproduce since I have 100s of servers > running a lot of NGINX processes and this has been triggered only once. > > On a different note - if I inspect a live node without this patch should > the discrepancy between the gc_inflight_list and the unix_tot_inflight > be present VS with this patch applied? May well be, since in the vmcore 4 in-flight sockets were "lost" before triggering the bug. I guess the best way to check is with a systemtap script that walks the list with the gc lock. Thanks, Miklos
On 30.08.2016 11:18, Miklos Szeredi wrote: > On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: >> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: > >> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight | >> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}' >> 130 >> crash> p unix_tot_inflight >> unix_tot_inflight = $2 = 135 >> >> We've lost track of a total of five inflight sockets, so it's not a >> one-off thing. Really weird... Now off to sleep, maybe I'll dream of >> the solution. > > Okay, found one bug: gc assumes that in-flight sockets that don't have > an external ref can't gain one while unix_gc_lock is held. That is > true because unix_notinflight() will be called before detaching fds, > which takes unix_gc_lock. Only MSG_PEEK was somehow overlooked. That > one also clones the fds, also keeping them in the skb. But through > MSG_PEEK an external reference can definitely be gained without ever > touching unix_gc_lock. > > Not sure whether the reported bug can be explained by this. Can you > confirm the MSG_PEEK was used in the setup? > > Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK? > > Anyway, attaching a fix that works by acquiring unix_gc_lock in case > of MSG_PEEK also. It is trivially correct, but I haven't tested it. You can use spin_unlock_wait in unix_gc_barrier to make it a bit more lightweight. Anyway, all of the scans on the socket receive queues are actually protected by the appropriate locks again, I didn't see a way were we could result in such a crash because of concurrent modification of the receive queue. Do you have any hints or looked into this more closely? Thanks, Hannes
[Added Dave Miller to see what's the status of this patch] On 08/30/2016 12:18 PM, Miklos Szeredi wrote: > On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: >> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszeredi@redhat.com> wrote: > >> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight | >> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}' >> 130 >> crash> p unix_tot_inflight >> unix_tot_inflight = $2 = 135 >> >> We've lost track of a total of five inflight sockets, so it's not a >> one-off thing. Really weird... Now off to sleep, maybe I'll dream of >> the solution. > > Okay, found one bug: gc assumes that in-flight sockets that don't have > an external ref can't gain one while unix_gc_lock is held. That is > true because unix_notinflight() will be called before detaching fds, > which takes unix_gc_lock. Only MSG_PEEK was somehow overlooked. That > one also clones the fds, also keeping them in the skb. But through > MSG_PEEK an external reference can definitely be gained without ever > touching unix_gc_lock. > > Not sure whether the reported bug can be explained by this. Can you > confirm the MSG_PEEK was used in the setup? > > Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK? > > Anyway, attaching a fix that works by acquiring unix_gc_lock in case > of MSG_PEEK also. It is trivially correct, but I haven't tested it. > > Thanks, > Miklos > Dave, What's the status of https://patchwork.ozlabs.org/patch/664062/ , is this going to be picked up ? Regards, Nikolay
On Tue, Sep 27, 2016, at 16:16, Nikolay Borisov wrote: > Dave, > > What's the status of https://patchwork.ozlabs.org/patch/664062/ , is > this going to be picked up ? Not sure if we actually fix a bug with this. Miklos could you maybe enhance the changelog then? Thanks, Hannes
From: Nikolay Borisov <kernel@kyup.com> Date: Tue, 27 Sep 2016 17:16:27 +0300 > What's the status of https://patchwork.ozlabs.org/patch/664062/ , is > this going to be picked up ? Why would I apply a patch that's an RFC, doesn't have a proper commit message, lacks a proper signoff, and also lacks ACK's and feedback from other knowledgable developers?
From: Miklos Szeredi <mszeredi@redhat.com> Subject: af_unix: fix garbage collect vs. MSG_PEEK Gc assumes that in-flight sockets that don't have an external ref can't gain one while unix_gc_lock is held. That is true because unix_notinflight() will be called before detaching fds, which takes unix_gc_lock. Only MSG_PEEK was somehow overlooked. That one also clones the fds, also keeping them in the skb. But through MSG_PEEK an external reference can definitely be gained without ever touching unix_gc_lock. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> --- include/net/af_unix.h | 1 + net/unix/af_unix.c | 15 +++++++++++++-- net/unix/garbage.c | 6 ++++++ 3 files changed, 20 insertions(+), 2 deletions(-) --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -10,6 +10,7 @@ void unix_inflight(struct user_struct *u void unix_notinflight(struct user_struct *user, struct file *fp); void unix_gc(void); void wait_for_unix_gc(void); +void unix_gc_barrier(void); struct sock *unix_get_socket(struct file *filp); struct sock *unix_peer_get(struct sock *); --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1563,6 +1563,17 @@ static int unix_attach_fds(struct scm_co return max_level; } +static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb) +{ + scm->fp = scm_fp_dup(UNIXCB(skb).fp); + /* + * During garbage collection it is assumed that in-flight sockets don't + * get a new external reference. So we need to wait until current run + * finishes. + */ + unix_gc_barrier(); +} + static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds) { int err = 0; @@ -2195,7 +2206,7 @@ static int unix_dgram_recvmsg(struct soc sk_peek_offset_fwd(sk, size); if (UNIXCB(skb).fp) - scm.fp = scm_fp_dup(UNIXCB(skb).fp); + unix_peek_fds(&scm, skb); } err = (flags & MSG_TRUNC) ? skb->len - skip : size; @@ -2435,7 +2446,7 @@ static int unix_stream_read_generic(stru /* It is questionable, see note in unix_dgram_recvmsg. */ if (UNIXCB(skb).fp) - scm.fp = scm_fp_dup(UNIXCB(skb).fp); + unix_peek_fds(&scm, skb); sk_peek_offset_fwd(sk, chunk); --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -266,6 +266,12 @@ void wait_for_unix_gc(void) wait_event(unix_gc_wait, gc_in_progress == false); } +void unix_gc_barrier(void) +{ + spin_lock(&unix_gc_lock); + spin_unlock(&unix_gc_lock); +} + /* The external entry point: unix_gc() */ void unix_gc(void) {