diff mbox

[v2,net-next] af_unix: dont send SCM_CREDENTIALS by default

Message ID 1316447547.2539.34.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 19, 2011, 3:52 p.m. UTC
Since commit 7361c36c5224 (af_unix: Allow credentials to work across
user and pid namespaces) af_unix performance dropped a lot.

This is because we now take a reference on pid and cred in each write(),
and release them in read(), usually done from another process,
eventually from another cpu. This triggers false sharing.

# Events: 154K cycles
#
# Overhead  Command       Shared Object        Symbol
# ........  .......  ..................  .........................
#
    10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
     8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
     7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
     6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
     4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
     4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
     4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
     2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
     2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
     1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
     1.51%  hackbench  [kernel.kallsyms]   [k]
__mutex_lock_interruptible_slowpath
     1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb


This patch includes SCM_CREDENTIALS information in a af_unix message/skb
only if requested by the sender, [man 7 unix for details how to include
ancillary data using sendmsg() system call]

Note: This might break buggy applications that expected SCM_CREDENTIAL
from an unaware write() system call, and receiver not using SO_PASSCRED
socket option.

If SOCK_PASSCRED is set on source or destination socket, we still
include credentials for mere write() syscalls.

Performance boost in hackbench : more than 50% gain on a 16 thread
machine (2 quad-core cpus, 2 threads per core)

hackbench 20 thread 2000

4.228 sec instead of 9.102 sec

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/netlink/af_netlink.c net/unix/af_unix.c|diffstat -p1 -w70
 include/net/scm.h        |    5 ++---
 net/core/scm.c           |   10 ++++++----
 net/netlink/af_netlink.c |    5 ++---
 net/unix/af_unix.c       |   24 +++++++++++++++++++++++--
 4 files changed, 33 insertions(+), 11 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tim Chen Sept. 19, 2011, 9:39 p.m. UTC | #1
On Mon, 2011-09-19 at 17:52 +0200, Eric Dumazet wrote:
> Since commit 7361c36c5224 (af_unix: Allow credentials to work across
> user and pid namespaces) af_unix performance dropped a lot.
> 
> This is because we now take a reference on pid and cred in each write(),
> and release them in read(), usually done from another process,
> eventually from another cpu. This triggers false sharing.
> 
> # Events: 154K cycles
> #
> # Overhead  Command       Shared Object        Symbol
> # ........  .......  ..................  .........................
> #
>     10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
>      8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
>      7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
>      6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
>      4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
>      4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
>      4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
>      2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
>      2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
>      1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
>      1.51%  hackbench  [kernel.kallsyms]   [k]
> __mutex_lock_interruptible_slowpath
>      1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb
> 
> 
> This patch includes SCM_CREDENTIALS information in a af_unix message/skb
> only if requested by the sender, [man 7 unix for details how to include
> ancillary data using sendmsg() system call]
> 
> Note: This might break buggy applications that expected SCM_CREDENTIAL
> from an unaware write() system call, and receiver not using SO_PASSCRED
> socket option.
> 
> If SOCK_PASSCRED is set on source or destination socket, we still
> include credentials for mere write() syscalls.
> 
> Performance boost in hackbench : more than 50% gain on a 16 thread
> machine (2 quad-core cpus, 2 threads per core)
> 
> hackbench 20 thread 2000
> 
> 4.228 sec instead of 9.102 sec
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---

Do we have to worry about the case where peer socket changes its flag
to SOCK_PASSCRED while packets are in flight?  If there isn't such
pathological use case, the patch looks fine to me.

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Valdis Kl ē tnieks Sept. 20, 2011, 2:10 a.m. UTC | #2
On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said:
> Do we have to worry about the case where peer socket changes its flag
> to SOCK_PASSCRED while packets are in flight?  If there isn't such
> pathological use case, the patch looks fine to me.

I wouldn't think so - if you're sending a packet, and retroactively trying to
change the flag and expect it to work, your program is too ugly to live.  After
all, if the scheduler had cut off your timeslice and scheduledthe receiving
process before you set the flag, that packet would be delivered and done with
anyhow, and no amount of wishing will set that flag on an already-delivered
packet.

What *is* worth checking is that we DTRT if a process/thread is doing a send on
one CPU, and another process/thread with a shared file descriptor for that
socket is diddling the flag.  But if we just define it as "atomic op to change
the flag and other observers get whatever value their CPU sees at that
instant", I'm OK with that too.. ;)
Eric Dumazet Sept. 20, 2011, 4:16 a.m. UTC | #3
Le lundi 19 septembre 2011 à 22:10 -0400, Valdis.Kletnieks@vt.edu a
écrit :
> On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said:
> > Do we have to worry about the case where peer socket changes its flag
> > to SOCK_PASSCRED while packets are in flight?  If there isn't such
> > pathological use case, the patch looks fine to me.
> 
> I wouldn't think so - if you're sending a packet, and retroactively trying to
> change the flag and expect it to work, your program is too ugly to live.  After
> all, if the scheduler had cut off your timeslice and scheduledthe receiving
> process before you set the flag, that packet would be delivered and done with
> anyhow, and no amount of wishing will set that flag on an already-delivered
> packet.
> 
> What *is* worth checking is that we DTRT if a process/thread is doing a send on
> one CPU, and another process/thread with a shared file descriptor for that
> socket is diddling the flag.  But if we just define it as "atomic op to change
> the flag and other observers get whatever value their CPU sees at that
> instant", I'm OK with that too.. ;)
> 

Note : The man page does states :

"To receive a struct ucred message the SO_PASSCRED option  must  be
enabled  on  the socket."

But it doesnt say if the SO_PASSCRED option must be enabled before the
sender sends its message, or before receiver attempts to read it.

Once a message is queued on an unix socket, flipping SO_PASSCRED cant
change its content (adding or removing credentials), since sender might
already have disappeared.

So current code includes credentials in all sent messages, just in case
receiver actually fetch credentials.

There are probably programs that assume they can set SO_PASSCRED right
before calling recvmsg(). Are we taking risk to break them, or are we
gentle and provide a sysctl option to ease the transition, I dont
know...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
tim Sept. 22, 2011, 4:15 p.m. UTC | #4
On Tue, 2011-09-20 at 06:16 +0200, Eric Dumazet wrote:
> Le lundi 19 septembre 2011 à 22:10 -0400, Valdis.Kletnieks@vt.edu a
> écrit :
> > On Mon, 19 Sep 2011 14:39:58 PDT, Tim Chen said:
> > > Do we have to worry about the case where peer socket changes its flag
> > > to SOCK_PASSCRED while packets are in flight?  If there isn't such
> > > pathological use case, the patch looks fine to me.
> > 
> > I wouldn't think so - if you're sending a packet, and retroactively trying to
> > change the flag and expect it to work, your program is too ugly to live.  After
> > all, if the scheduler had cut off your timeslice and scheduledthe receiving
> > process before you set the flag, that packet would be delivered and done with
> > anyhow, and no amount of wishing will set that flag on an already-delivered
> > packet.
> > 
> > What *is* worth checking is that we DTRT if a process/thread is doing a send on
> > one CPU, and another process/thread with a shared file descriptor for that
> > socket is diddling the flag.  But if we just define it as "atomic op to change
> > the flag and other observers get whatever value their CPU sees at that
> > instant", I'm OK with that too.. ;)
> > 
> 
> Note : The man page does states :
> 
> "To receive a struct ucred message the SO_PASSCRED option  must  be
> enabled  on  the socket."
> 
> But it doesnt say if the SO_PASSCRED option must be enabled before the
> sender sends its message, or before receiver attempts to read it.
> 
> Once a message is queued on an unix socket, flipping SO_PASSCRED cant
> change its content (adding or removing credentials), since sender might
> already have disappeared.
> 
> So current code includes credentials in all sent messages, just in case
> receiver actually fetch credentials.
> 
> There are probably programs that assume they can set SO_PASSCRED right
> before calling recvmsg(). Are we taking risk to break them, or are we
> gentle and provide a sysctl option to ease the transition, I dont
> know...
> 

Should we reconsider the original approach of reducing the
pid/credential references, with your fixes to correct its flaws in the
streaming msg case?

Tim


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 28, 2011, 5:30 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Sep 2011 17:52:27 +0200

> This patch includes SCM_CREDENTIALS information in a af_unix message/skb
> only if requested by the sender, [man 7 unix for details how to include
> ancillary data using sendmsg() system call]
> 
> Note: This might break buggy applications that expected SCM_CREDENTIAL
> from an unaware write() system call, and receiver not using SO_PASSCRED
> socket option.
> 
> If SOCK_PASSCRED is set on source or destination socket, we still
> include credentials for mere write() syscalls.

I thought a lot about this and I think we should be able to get away
with this trick, so I've added this patch to net-next, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Schmidt Nov. 28, 2011, 1:23 p.m. UTC | #6
On 09/20/2011 06:16 AM, Eric Dumazet wrote:
> Note : The man page does states :
>
> "To receive a struct ucred message the SO_PASSCRED option  must  be
> enabled  on  the socket."
>
> But it doesnt say if the SO_PASSCRED option must be enabled before the
> sender sends its message, or before receiver attempts to read it.
>
> Once a message is queued on an unix socket, flipping SO_PASSCRED cant
> change its content (adding or removing credentials), since sender might
> already have disappeared.
>
> So current code includes credentials in all sent messages, just in case
> receiver actually fetch credentials.
>
> There are probably programs that assume they can set SO_PASSCRED right
> before calling recvmsg(). Are we taking risk to break them, or are we
> gentle and provide a sysctl option to ease the transition, I dont
> know...

Such a case has just appeared:
https://bugzilla.redhat.com/show_bug.cgi?id=757628

systemd allows on-demand socket activation of services. It creates a 
listening socket without the SO_PASSCRED flag. When the first message 
arrives to the socket, systemd spawns the service and passes the 
socket's fd to it. The service sets SO_PASSCRED before actually 
receiving the message.

I can fix that in systemd, but there may be more cases like this.

Michal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 28, 2011, 1:38 p.m. UTC | #7
Le lundi 28 novembre 2011 à 14:23 +0100, Michal Schmidt a écrit :
> On 09/20/2011 06:16 AM, Eric Dumazet wrote:
> > Note : The man page does states :
> >
> > "To receive a struct ucred message the SO_PASSCRED option  must  be
> > enabled  on  the socket."
> >
> > But it doesnt say if the SO_PASSCRED option must be enabled before the
> > sender sends its message, or before receiver attempts to read it.
> >
> > Once a message is queued on an unix socket, flipping SO_PASSCRED cant
> > change its content (adding or removing credentials), since sender might
> > already have disappeared.
> >
> > So current code includes credentials in all sent messages, just in case
> > receiver actually fetch credentials.
> >
> > There are probably programs that assume they can set SO_PASSCRED right
> > before calling recvmsg(). Are we taking risk to break them, or are we
> > gentle and provide a sysctl option to ease the transition, I dont
> > know...
> 
> Such a case has just appeared:
> https://bugzilla.redhat.com/show_bug.cgi?id=757628
> 
> systemd allows on-demand socket activation of services. It creates a 
> listening socket without the SO_PASSCRED flag. When the first message 
> arrives to the socket, systemd spawns the service and passes the 
> socket's fd to it. The service sets SO_PASSCRED before actually 
> receiving the message.
> 
> I can fix that in systemd, but there may be more cases like this.


Yes, we were afraid of this.

Performance drop is really huge and deserves some fixes in userland...

People add features to kernel (in this case namespaces) without thinking
on performance regression. So poor guys like us have to "fix" things
later, in a reasonable way.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/scm.h b/include/net/scm.h
index 745460f..d456f4c 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -49,7 +49,7 @@  static __inline__ void scm_set_cred(struct scm_cookie *scm,
 				    struct pid *pid, const struct cred *cred)
 {
 	scm->pid  = get_pid(pid);
-	scm->cred = get_cred(cred);
+	scm->cred = cred ? get_cred(cred) : NULL;
 	cred_to_ucred(pid, cred, &scm->creds);
 }
 
@@ -73,8 +73,7 @@  static __inline__ void scm_destroy(struct scm_cookie *scm)
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm)
 {
-	scm_set_cred(scm, task_tgid(current), current_cred());
-	scm->fp = NULL;
+	memset(scm, 0, sizeof(*scm));
 	unix_get_peersec_dgram(sock, scm);
 	if (msg->msg_controllen <= 0)
 		return 0;
diff --git a/net/core/scm.c b/net/core/scm.c
index 811b53f..ff52ad0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -173,7 +173,7 @@  int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 			if (err)
 				goto error;
 
-			if (pid_vnr(p->pid) != p->creds.pid) {
+			if (!p->pid || pid_vnr(p->pid) != p->creds.pid) {
 				struct pid *pid;
 				err = -ESRCH;
 				pid = find_get_pid(p->creds.pid);
@@ -183,8 +183,9 @@  int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 				p->pid = pid;
 			}
 
-			if ((p->cred->euid != p->creds.uid) ||
-				(p->cred->egid != p->creds.gid)) {
+			if (!p->cred ||
+			    (p->cred->euid != p->creds.uid) ||
+			    (p->cred->egid != p->creds.gid)) {
 				struct cred *cred;
 				err = -ENOMEM;
 				cred = prepare_creds();
@@ -193,7 +194,8 @@  int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 
 				cred->uid = cred->euid = p->creds.uid;
 				cred->gid = cred->egid = p->creds.gid;
-				put_cred(p->cred);
+				if (p->cred)
+					put_cred(p->cred);
 				p->cred = cred;
 			}
 			break;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4330db9..1201b6d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1324,10 +1324,9 @@  static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (msg->msg_flags&MSG_OOB)
 		return -EOPNOTSUPP;
 
-	if (NULL == siocb->scm) {
+	if (NULL == siocb->scm)
 		siocb->scm = &scm;
-		memset(&scm, 0, sizeof(scm));
-	}
+
 	err = scm_send(sock, msg, siocb->scm);
 	if (err < 0)
 		return err;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index ec68e1c..466fbcc 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1381,8 +1381,10 @@  static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
 	int err = 0;
+
 	UNIXCB(skb).pid  = get_pid(scm->pid);
-	UNIXCB(skb).cred = get_cred(scm->cred);
+	if (scm->cred)
+		UNIXCB(skb).cred = get_cred(scm->cred);
 	UNIXCB(skb).fp = NULL;
 	if (scm->fp && send_fds)
 		err = unix_attach_fds(scm, skb);
@@ -1392,6 +1394,24 @@  static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 }
 
 /*
+ * Some apps rely on write() giving SCM_CREDENTIALS
+ * We include credentials if source or destination socket
+ * asserted SOCK_PASSCRED.
+ */
+static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
+			    const struct sock *other)
+{
+	if (UNIXCB(skb).cred)
+		return;
+	if (test_bit(SOCK_PASSCRED, &sock->flags) ||
+	    !other->sk_socket ||
+	    test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
+		UNIXCB(skb).pid  = get_pid(task_tgid(current));
+		UNIXCB(skb).cred = get_current_cred();
+	}
+}
+
+/*
  *	Send AF_UNIX data.
  */
 
@@ -1538,6 +1558,7 @@  restart:
 
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
+	maybe_add_creds(skb, sock, other);
 	skb_queue_tail(&other->sk_receive_queue, skb);
 	if (max_level > unix_sk(other)->recursion_level)
 		unix_sk(other)->recursion_level = max_level;
@@ -1652,6 +1673,7 @@  static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		    (other->sk_shutdown & RCV_SHUTDOWN))
 			goto pipe_err_free;
 
+		maybe_add_creds(skb, sock, other);
 		skb_queue_tail(&other->sk_receive_queue, skb);
 		if (max_level > unix_sk(other)->recursion_level)
 			unix_sk(other)->recursion_level = max_level;