Message ID | 1316447547.2539.34.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On Mon, 2011-09-19 at 17:52 +0200, Eric Dumazet wrote: > Since commit 7361c36c5224 (af_unix: Allow credentials to work across > user and pid namespaces) af_unix performance dropped a lot. > > This is because we now take a reference on pid and cred in each write(), > and release them in read(), usually done from another process, > eventually from another cpu. This triggers false sharing. > > # Events: 154K cycles > # > # Overhead Command Shared Object Symbol > # ........ ....... .................. ......................... > # > 10.40% hackbench [kernel.kallsyms] [k] put_pid > 8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg > 7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg > 6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock > 4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb > 4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns > 4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred > 2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm > 2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count > 1.75% hackbench [kernel.kallsyms] [k] fget_light > 1.51% hackbench [kernel.kallsyms] [k] > __mutex_lock_interruptible_slowpath > 1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb > > > This patch includes SCM_CREDENTIALS information in a af_unix message/skb > only if requested by the sender, [man 7 unix for details how to include > ancillary data using sendmsg() system call] > > Note: This might break buggy applications that expected SCM_CREDENTIAL > from an unaware write() system call, and receiver not using SO_PASSCRED > socket option. > > If SOCK_PASSCRED is set on source or destination socket, we still > include credentials for mere write() syscalls. > > Performance boost in hackbench : more than 50% gain on a 16 thread > machine (2 quad-core cpus, 2 threads per core) > > hackbench 20 thread 2000 > > 4.228 sec instead of 9.102 sec > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > --- Do we have to worry about the case where peer socket changes its flag to SOCK_PASSCRED while packets are in flight? If there isn't such pathological use case, the patch looks fine to me. Acked-by: Tim Chen <tim.c.chen@linux.intel.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said: > Do we have to worry about the case where peer socket changes its flag > to SOCK_PASSCRED while packets are in flight? If there isn't such > pathological use case, the patch looks fine to me. I wouldn't think so - if you're sending a packet, and retroactively trying to change the flag and expect it to work, your program is too ugly to live. After all, if the scheduler had cut off your timeslice and scheduledthe receiving process before you set the flag, that packet would be delivered and done with anyhow, and no amount of wishing will set that flag on an already-delivered packet. What *is* worth checking is that we DTRT if a process/thread is doing a send on one CPU, and another process/thread with a shared file descriptor for that socket is diddling the flag. But if we just define it as "atomic op to change the flag and other observers get whatever value their CPU sees at that instant", I'm OK with that too.. ;)
Le lundi 19 septembre 2011 à 22:10 -0400, Valdis.Kletnieks@vt.edu a écrit : > On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said: > > Do we have to worry about the case where peer socket changes its flag > > to SOCK_PASSCRED while packets are in flight? If there isn't such > > pathological use case, the patch looks fine to me. > > I wouldn't think so - if you're sending a packet, and retroactively trying to > change the flag and expect it to work, your program is too ugly to live. After > all, if the scheduler had cut off your timeslice and scheduledthe receiving > process before you set the flag, that packet would be delivered and done with > anyhow, and no amount of wishing will set that flag on an already-delivered > packet. > > What *is* worth checking is that we DTRT if a process/thread is doing a send on > one CPU, and another process/thread with a shared file descriptor for that > socket is diddling the flag. But if we just define it as "atomic op to change > the flag and other observers get whatever value their CPU sees at that > instant", I'm OK with that too.. ;) > Note : The man page does states : "To receive a struct ucred message the SO_PASSCRED option must be enabled on the socket." But it doesnt say if the SO_PASSCRED option must be enabled before the sender sends its message, or before receiver attempts to read it. Once a message is queued on an unix socket, flipping SO_PASSCRED cant change its content (adding or removing credentials), since sender might already have disappeared. So current code includes credentials in all sent messages, just in case receiver actually fetch credentials. There are probably programs that assume they can set SO_PASSCRED right before calling recvmsg(). Are we taking risk to break them, or are we gentle and provide a sysctl option to ease the transition, I dont know... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2011-09-20 at 06:16 +0200, Eric Dumazet wrote: > Le lundi 19 septembre 2011 à 22:10 -0400, Valdis.Kletnieks@vt.edu a > écrit : > > On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said: > > > Do we have to worry about the case where peer socket changes its flag > > > to SOCK_PASSCRED while packets are in flight? If there isn't such > > > pathological use case, the patch looks fine to me. > > > > I wouldn't think so - if you're sending a packet, and retroactively trying to > > change the flag and expect it to work, your program is too ugly to live. After > > all, if the scheduler had cut off your timeslice and scheduledthe receiving > > process before you set the flag, that packet would be delivered and done with > > anyhow, and no amount of wishing will set that flag on an already-delivered > > packet. > > > > What *is* worth checking is that we DTRT if a process/thread is doing a send on > > one CPU, and another process/thread with a shared file descriptor for that > > socket is diddling the flag. But if we just define it as "atomic op to change > > the flag and other observers get whatever value their CPU sees at that > > instant", I'm OK with that too.. ;) > > > > Note : The man page does states : > > "To receive a struct ucred message the SO_PASSCRED option must be > enabled on the socket." > > But it doesnt say if the SO_PASSCRED option must be enabled before the > sender sends its message, or before receiver attempts to read it. > > Once a message is queued on an unix socket, flipping SO_PASSCRED cant > change its content (adding or removing credentials), since sender might > already have disappeared. > > So current code includes credentials in all sent messages, just in case > receiver actually fetch credentials. > > There are probably programs that assume they can set SO_PASSCRED right > before calling recvmsg(). Are we taking risk to break them, or are we > gentle and provide a sysctl option to ease the transition, I dont > know... > Should we reconsider the original approach of reducing the pid/credential references, with your fixes to correct its flaws in the streaming msg case? Tim -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 19 Sep 2011 17:52:27 +0200 > This patch includes SCM_CREDENTIALS information in a af_unix message/skb > only if requested by the sender, [man 7 unix for details how to include > ancillary data using sendmsg() system call] > > Note: This might break buggy applications that expected SCM_CREDENTIAL > from an unaware write() system call, and receiver not using SO_PASSCRED > socket option. > > If SOCK_PASSCRED is set on source or destination socket, we still > include credentials for mere write() syscalls. I thought a lot about this and I think we should be able to get away with this trick, so I've added this patch to net-next, thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/20/2011 06:16 AM, Eric Dumazet wrote: > Note : The man page does states : > > "To receive a struct ucred message the SO_PASSCRED option must be > enabled on the socket." > > But it doesnt say if the SO_PASSCRED option must be enabled before the > sender sends its message, or before receiver attempts to read it. > > Once a message is queued on an unix socket, flipping SO_PASSCRED cant > change its content (adding or removing credentials), since sender might > already have disappeared. > > So current code includes credentials in all sent messages, just in case > receiver actually fetch credentials. > > There are probably programs that assume they can set SO_PASSCRED right > before calling recvmsg(). Are we taking risk to break them, or are we > gentle and provide a sysctl option to ease the transition, I dont > know... Such a case has just appeared: https://bugzilla.redhat.com/show_bug.cgi?id=757628 systemd allows on-demand socket activation of services. It creates a listening socket without the SO_PASSCRED flag. When the first message arrives to the socket, systemd spawns the service and passes the socket's fd to it. The service sets SO_PASSCRED before actually receiving the message. I can fix that in systemd, but there may be more cases like this. Michal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 28 novembre 2011 à 14:23 +0100, Michal Schmidt a écrit : > On 09/20/2011 06:16 AM, Eric Dumazet wrote: > > Note : The man page does states : > > > > "To receive a struct ucred message the SO_PASSCRED option must be > > enabled on the socket." > > > > But it doesnt say if the SO_PASSCRED option must be enabled before the > > sender sends its message, or before receiver attempts to read it. > > > > Once a message is queued on an unix socket, flipping SO_PASSCRED cant > > change its content (adding or removing credentials), since sender might > > already have disappeared. > > > > So current code includes credentials in all sent messages, just in case > > receiver actually fetch credentials. > > > > There are probably programs that assume they can set SO_PASSCRED right > > before calling recvmsg(). Are we taking risk to break them, or are we > > gentle and provide a sysctl option to ease the transition, I dont > > know... > > Such a case has just appeared: > https://bugzilla.redhat.com/show_bug.cgi?id=757628 > > systemd allows on-demand socket activation of services. It creates a > listening socket without the SO_PASSCRED flag. When the first message > arrives to the socket, systemd spawns the service and passes the > socket's fd to it. The service sets SO_PASSCRED before actually > receiving the message. > > I can fix that in systemd, but there may be more cases like this. Yes, we were afraid of this. Performance drop is really huge and deserves some fixes in userland... People add features to kernel (in this case namespaces) without thinking on performance regression. So poor guys like us have to "fix" things later, in a reasonable way. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/net/scm.h b/include/net/scm.h index 745460f..d456f4c 100644 --- a/include/net/scm.h +++ b/include/net/scm.h @@ -49,7 +49,7 @@ static __inline__ void scm_set_cred(struct scm_cookie *scm, struct pid *pid, const struct cred *cred) { scm->pid = get_pid(pid); - scm->cred = get_cred(cred); + scm->cred = cred ? get_cred(cred) : NULL; cred_to_ucred(pid, cred, &scm->creds); } @@ -73,8 +73,7 @@ static __inline__ void scm_destroy(struct scm_cookie *scm) static __inline__ int scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm) { - scm_set_cred(scm, task_tgid(current), current_cred()); - scm->fp = NULL; + memset(scm, 0, sizeof(*scm)); unix_get_peersec_dgram(sock, scm); if (msg->msg_controllen <= 0) return 0; diff --git a/net/core/scm.c b/net/core/scm.c index 811b53f..ff52ad0 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -173,7 +173,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p) if (err) goto error; - if (pid_vnr(p->pid) != p->creds.pid) { + if (!p->pid || pid_vnr(p->pid) != p->creds.pid) { struct pid *pid; err = -ESRCH; pid = find_get_pid(p->creds.pid); @@ -183,8 +183,9 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p) p->pid = pid; } - if ((p->cred->euid != p->creds.uid) || - (p->cred->egid != p->creds.gid)) { + if (!p->cred || + (p->cred->euid != p->creds.uid) || + (p->cred->egid != p->creds.gid)) { struct cred *cred; err = -ENOMEM; cred = prepare_creds(); @@ -193,7 +194,8 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p) cred->uid = cred->euid = p->creds.uid; cred->gid = cred->egid = p->creds.gid; - put_cred(p->cred); + if (p->cred) + put_cred(p->cred); p->cred = cred; } break; diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index 4330db9..1201b6d 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -1324,10 +1324,9 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock, if (msg->msg_flags&MSG_OOB) return -EOPNOTSUPP; - if (NULL == siocb->scm) { + if (NULL == siocb->scm) siocb->scm = &scm; - memset(&scm, 0, sizeof(scm)); - } + err = scm_send(sock, msg, siocb->scm); if (err < 0) return err; diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index ec68e1c..466fbcc 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1381,8 +1381,10 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds) { int err = 0; + UNIXCB(skb).pid = get_pid(scm->pid); - UNIXCB(skb).cred = get_cred(scm->cred); + if (scm->cred) + UNIXCB(skb).cred = get_cred(scm->cred); UNIXCB(skb).fp = NULL; if (scm->fp && send_fds) err = unix_attach_fds(scm, skb); @@ -1392,6 +1394,24 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen } /* + * Some apps rely on write() giving SCM_CREDENTIALS + * We include credentials if source or destination socket + * asserted SOCK_PASSCRED. + */ +static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock, + const struct sock *other) +{ + if (UNIXCB(skb).cred) + return; + if (test_bit(SOCK_PASSCRED, &sock->flags) || + !other->sk_socket || + test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) { + UNIXCB(skb).pid = get_pid(task_tgid(current)); + UNIXCB(skb).cred = get_current_cred(); + } +} + +/* * Send AF_UNIX data. */ @@ -1538,6 +1558,7 @@ restart: if (sock_flag(other, SOCK_RCVTSTAMP)) __net_timestamp(skb); + maybe_add_creds(skb, sock, other); skb_queue_tail(&other->sk_receive_queue, skb); if (max_level > unix_sk(other)->recursion_level) unix_sk(other)->recursion_level = max_level; @@ -1652,6 +1673,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock, (other->sk_shutdown & RCV_SHUTDOWN)) goto pipe_err_free; + maybe_add_creds(skb, sock, other); skb_queue_tail(&other->sk_receive_queue, skb); if (max_level > unix_sk(other)->recursion_level) unix_sk(other)->recursion_level = max_level;
Since commit 7361c36c5224 (af_unix: Allow credentials to work across user and pid namespaces) af_unix performance dropped a lot. This is because we now take a reference on pid and cred in each write(), and release them in read(), usually done from another process, eventually from another cpu. This triggers false sharing. # Events: 154K cycles # # Overhead Command Shared Object Symbol # ........ ....... .................. ......................... # 10.40% hackbench [kernel.kallsyms] [k] put_pid 8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg 7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg 6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock 4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb 4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns 4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred 2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm 2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count 1.75% hackbench [kernel.kallsyms] [k] fget_light 1.51% hackbench [kernel.kallsyms] [k] __mutex_lock_interruptible_slowpath 1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb This patch includes SCM_CREDENTIALS information in a af_unix message/skb only if requested by the sender, [man 7 unix for details how to include ancillary data using sendmsg() system call] Note: This might break buggy applications that expected SCM_CREDENTIAL from an unaware write() system call, and receiver not using SO_PASSCRED socket option. If SOCK_PASSCRED is set on source or destination socket, we still include credentials for mere write() syscalls. Performance boost in hackbench : more than 50% gain on a 16 thread machine (2 quad-core cpus, 2 threads per core) hackbench 20 thread 2000 4.228 sec instead of 9.102 sec Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/netlink/af_netlink.c net/unix/af_unix.c|diffstat -p1 -w70 include/net/scm.h | 5 ++--- net/core/scm.c | 10 ++++++---- net/netlink/af_netlink.c | 5 ++--- net/unix/af_unix.c | 24 +++++++++++++++++++++++-- 4 files changed, 33 insertions(+), 11 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html