diff mbox

[4/9] AF_UNIX: find the recipients for multicast messages

Message ID 1290450982-17480-4-git-send-email-alban.crequy@collabora.co.uk
State RFC, archived
Delegated to: David Miller

Commit Message

Alban Crequy Nov. 22, 2010, 6:36 p.m. UTC
unix_find_multicast_recipients() builds an array of recipients. It can either
find the peers of a specific multicast address, or find all the peers of all
multicast groups the sender is part of.

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
---
 net/unix/af_unix.c |  144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 144 insertions(+), 0 deletions(-)

Comments

David Miller Nov. 22, 2010, 7:05 p.m. UTC | #1
From: Alban Crequy <alban.crequy@collabora.co.uk>
Date: Mon, 22 Nov 2010 18:36:17 +0000

> unix_find_multicast_recipients() builds an array of recipients. It can either
> find the peers of a specific multicast address, or find all the peers of all
> multicast group the sender is part of.
> 
> Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>

You really should use RCU to lock this stuff, this way sends run
lockless and have less worries wrt. the memory allocation.  You'll
also only take a spinlock in the write paths which change the
multicast groups, which ought to be rare.

Although to be honest you should optimize the case of small numbers of
recipients, in the same way we optimize small numbers of iovecs on
sends.  Have an on-stack array that holds a small number of entries
and use that if the set fits, otherwise dynamic allocation.
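The fast path David describes is the same pattern sendmsg() uses for small iovec counts (UIO_FASTIOV). A minimal userspace sketch of the idea, with invented names (FAST_ITEMS, item_set), might look like:

```c
#include <stdlib.h>

#define FAST_ITEMS 8  /* assumed small-case threshold, like UIO_FASTIOV */

struct item { void *sk; };

struct item_set {
	struct item stack_items[FAST_ITEMS]; /* used when the set is small */
	struct item *items;                  /* points at stack_items or heap */
	int cnt;
};

/* Use the embedded on-stack storage when 'cnt' fits, otherwise fall
 * back to a dynamic allocation. Returns 0 on success, -1 on failure. */
static int item_set_init(struct item_set *set, int cnt)
{
	if (cnt <= FAST_ITEMS) {
		set->items = set->stack_items;
	} else {
		set->items = malloc(sizeof(struct item) * cnt);
		if (!set->items)
			return -1;
	}
	set->cnt = cnt;
	return 0;
}

/* Free only if we actually fell back to the heap. */
static void item_set_release(struct item_set *set)
{
	if (set->items != set->stack_items)
		free(set->items);
}

static int item_set_on_heap(const struct item_set *set)
{
	return set->items != set->stack_items;
}
```

The common small case then costs no allocation at all; only oversized recipient sets pay for malloc().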
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Nov. 22, 2010, 8:14 p.m. UTC | #2
Alban Crequy <alban.crequy@collabora.co.uk> writes:

>+static DEFINE_SPINLOCK(unix_multicast_lock);

For D-Bus it's probably OK, but I suspect for other usages
the global lock in the multicast fast path is going to hurt
sooner or later.

> +
> +        /* Allocate for the set and hope the number of recipients does not
> +	 * change while the lock is released. If it changes, we have to try
> +	 * again... We allocate a bit more than needed, so if a _few_ members
> +	 * are added in a multicast group meanwhile, we don't always need to
> +	 * try again. */
> +	recipient_cnt += 5;
> +
> +	set = kmalloc(sizeof(struct sock_set)
> +		      + sizeof(struct sock_item) * recipient_cnt,
> +	    GFP_KERNEL);

FWIW for a large number of sockets this will likely run into
memory fragmentation issues. There are various workarounds like
fallback to vmalloc or use something like flex_arrays.


-Andi
Alban Crequy Nov. 23, 2010, 3:03 p.m. UTC | #3
On Mon, 22 Nov 2010 11:05:19 -0800 (PST),
David Miller <davem@davemloft.net> wrote:

> From: Alban Crequy <alban.crequy@collabora.co.uk>
> Date: Mon, 22 Nov 2010 18:36:17 +0000
> 
> > unix_find_multicast_recipients() builds an array of recipients. It
> > can either find the peers of a specific multicast address, or find
> > all the peers of all multicast group the sender is part of.
> > 
> > Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> You really should use RCU to lock this stuff, this way sends run
> lockless and have less worries wrt. the memory allocation.  You'll
> also only take a spinlock in the write paths which change the
> multicast groups, which ought to be rare.

I understand the benefit to use RCU in order to have lockless sends.

But with RCU I will still have worries about the memory allocation:

- I cannot allocate inside a rcu_read_lock()-rcu_read_unlock() block.

- If I iterate locklessly over the multicast group members with 
  hlist_for_each_entry_rcu(), new members can be added, so the
  array can be allocated with the wrong size and I have to try again
  ("goto try_again") when this rare case occurs.

- Another idea would be to avoid the allocation completely by inlining
  unix_find_multicast_recipients() inside unix_dgram_sendmsg() and
  delivering the messages to the recipients while the list is being
  iterated locklessly. But I want to provide atomicity of delivery:
  the message must be delivered with skb_queue_tail() either to all
  the recipients or to none of them in case of interruption or memory
  pressure. I don't see how I can achieve that without iterating
  several times over the list of recipients, hence the allocation and
  the copy into the array. I also want to guarantee the order of
  delivery as described in multicast-unix-sockets.txt, and for this I
  am taking lots of spinlocks anyway. I don't see how to avoid that,
  but I would be happy to be wrong and have a better solution.

> Although to be honest you should optimize the case of small numbers of
> recipients, in the same way we optimize small numbers of iovecs on
> sends.  Have an on-stack array that holds a small number of entries
> and use that if the set fits, otherwise dynamic allocation.

To give an idea of the number of members in a multicast group for the
D-Bus use case, I have 90 D-Bus connections on my session bus:

$ dbus-send --print-reply --dest=org.freedesktop.DBus \
/org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc -l
90

In common cases, there should be only a few real recipients (1 or 2?)
after the socket filters eliminate most of them, but
unix_find_multicast_recipients() will still allocate an array of
about that size.

Eric Dumazet Nov. 23, 2010, 4:08 p.m. UTC | #4
On Tuesday 23 November 2010 at 15:03 +0000, Alban Crequy wrote:
> On Mon, 22 Nov 2010 11:05:19 -0800 (PST),
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Alban Crequy <alban.crequy@collabora.co.uk>
> > Date: Mon, 22 Nov 2010 18:36:17 +0000
> > 
> > > unix_find_multicast_recipients() builds an array of recipients. It
> > > can either find the peers of a specific multicast address, or find
> > > all the peers of all multicast group the sender is part of.
> > > 
> > > Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> > 
> > You really should use RCU to lock this stuff, this way sends run
> > lockless and have less worries wrt. the memory allocation.  You'll
> > also only take a spinlock in the write paths which change the
> > multicast groups, which ought to be rare.
> 
> I understand the benefit to use RCU in order to have lockless sends.
> 
> But with RCU I will still have worries about the memory allocation:
> 
> - I cannot allocate inside a rcu_read_lock()-rcu_read_unlock() block.
> 

That's not true.

The same rules as inside a spin_lock() or write_lock() apply.

We already allocate memory inside rcu_read_lock() in the network stack.

> - If I iterate locklessly over the multicast group members with 
>   hlist_for_each_entry_rcu(), new members can be added, so the
>   array can be allocated with the wrong size and I have to try again
>   ("goto try_again") when this rare case occurs.

You are allowed to allocate memory to add stuff while doing your loop
iteration.

Nothing prevents you from using a chain of items, each item holding up to
128 sockets for example. If one is full, allocate a new item.

We have such a scheme in poll()/select() for example:

fs/select.c, function poll_get_entry()

Use a small embedded struct on the stack, and allocate extra items if the
number of fds is too big.

(If you can't allocate memory to hold pointers, chances are you won't be
able to clone skbs anyway. One skb is about 400 bytes.)

If new members are added to the group while you are iterating the list,
they won't receive a copy of the message.

Or just chain skbs while you clone them, store the socket in skb->sk...
no need for extra memory allocations.
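The chained-item scheme Eric describes (modeled on fs/select.c's poll table) might look roughly like this in userspace; the names are invented, and the embedded head block plays the role of the small on-stack struct:

```c
#include <stdlib.h>

#define BLOCK_ENTRIES 128  /* each chunk holds up to 128 socket pointers */

struct entry_block {
	struct entry_block *next;
	int used;
	void *entries[BLOCK_ENTRIES];
};

struct entry_chain {
	struct entry_block first;   /* embedded head block, e.g. on the stack */
	struct entry_block *cur;
};

static void chain_init(struct entry_chain *c)
{
	c->first.next = NULL;
	c->first.used = 0;
	c->cur = &c->first;
}

/* Append one entry, allocating a fresh block only when the current one
 * is full -- no need to know the total count up front. */
static int chain_add(struct entry_chain *c, void *e)
{
	if (c->cur->used == BLOCK_ENTRIES) {
		struct entry_block *nb = malloc(sizeof(*nb));
		if (!nb)
			return -1;
		nb->next = NULL;
		nb->used = 0;
		c->cur->next = nb;
		c->cur = nb;
	}
	c->cur->entries[c->cur->used++] = e;
	return 0;
}

static int chain_count(const struct entry_chain *c)
{
	const struct entry_block *b;
	int n = 0;

	for (b = &c->first; b; b = b->next)
		n += b->used;
	return n;
}

/* Free everything except the embedded head block. */
static void chain_free(struct entry_chain *c)
{
	struct entry_block *b = c->first.next;

	while (b) {
		struct entry_block *next = b->next;
		free(b);
		b = next;
	}
}
```

This avoids the count-allocate-recheck race entirely: the set grows as the list is walked, so a concurrent membership change never forces a restart for sizing reasons.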

> 
> - Another idea would be to avoid completely the allocation by inlining
>   unix_find_multicast_recipients() inside unix_dgram_sendmsg() and
>   delivering the messages to the recipients as long as the list is
>   being iterated locklessly. But I want to provide atomicity of
>   delivery: the message must be delivered with skb_queue_tail() either
>   to all the recipients or to none of them in case of interruption or
>   memory pressure. I don't see how I can achieve that without
>   iterating several times on the list of recipients, hence the
>   allocation and the copy in the array. I also want to guarantee the
>   order of delivery as described in multicast-unix-sockets.txt and for
>   this, I am taking lots of spinlocks anyway. I don't see how to avoid
>   that, but I would be happy to be wrong and have a better solution.
> 


So if one destination has a full receive queue, you want nobody to receive
the message? That seems a bit risky to me, if someone sends SIGSTOP to
one of your processes...



> 
> To give an idea of the number of members in a multicast group for the
> D-Bus use case, I have 90 D-Bus connections on my session bus:
> 
> $ dbus-send --print-reply --dest=org.freedesktop.DBus \
> /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc -l
> 90
> 
> In common cases, there should be only a few real recipients (1 or 2?)
> after the socket filters eliminate most of them, but
> unix_find_multicast_recipients() will still allocate an array of
> about that size.
> 

I am not sure if doing 90 clones of skb and filtering them one by one is
going to be fast :-(





Eric Dumazet Nov. 23, 2010, 4:56 p.m. UTC | #5
Hmm, just thought about lockless again, and we had the same multicast
problem on UDP as well.

We are forced to hold a lock (to forbid concurrent deletes), or we might
go through one 'about to be removed' socket and abort the iteration in
the middle. Some sockets would not receive a copy of the message.

(With UDP sockets using SLAB_DESTROY_BY_RCU, we could even have multiple
copies sent to some sockets, if the removed socket is re-inserted at the
front of the chain because of instant reuse.)

To have a true lockless path, you would need to restart the full scan if
you notice a delete was done during the iteration, possibly using a
sequence number per chain. That would be expensive, because you have to
undo all the socket accumulation (refcounts) done during the lookup
before restarting the thing.

To avoid starvation, taking the lock on the second iteration would be
good ;)
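That restart-then-lock idea could be sketched like this. This is a userspace toy, not kernel code: seq stands in for a per-chain sequence counter, and the scan pass is a callback so a racing delete can be simulated; all names are invented:

```c
struct scan_state {
	unsigned seq;   /* bumped by (simulated) writers on delete */
	int locked;     /* 1 while the fallback lock is held */
};

/* Run one pass over the chain; if a delete raced with us (seq moved),
 * drop accumulated refs and retry -- but take the lock on the second
 * attempt so a steady stream of deletes cannot starve the sender.
 * Returns the number of attempts made. */
static int scan_with_retry(struct scan_state *st,
			   void (*one_pass)(struct scan_state *))
{
	int attempts = 0;

	for (;;) {
		unsigned start = st->seq;

		attempts++;
		if (attempts == 2)
			st->locked = 1;   /* second try: lock out writers */
		one_pass(st);
		if (st->locked || st->seq == start)
			break;            /* no delete raced with us */
		/* here the real code would undo the refcounts taken */
	}
	if (st->locked)
		st->locked = 0;
	return attempts;
}

/* sample passes for testing: a quiet one, and one that races a delete */
static void pass_quiet(struct scan_state *st) { (void)st; }
static void pass_racing_delete(struct scan_state *st)
{
	if (!st->locked)
		st->seq++;   /* simulate a writer deleting mid-scan */
}
```

With no concurrent delete the scan completes locklessly in one pass; a racing delete costs exactly one retry under the lock.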



Alban Crequy Nov. 23, 2010, 5:47 p.m. UTC | #6
On Tue, 23 Nov 2010 17:08:37 +0100,
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> (...)

Thanks for the explanations.

> On Tuesday 23 November 2010 at 15:03 +0000, Alban Crequy wrote:
> > 
> > - Another idea would be to avoid completely the allocation by
> > inlining unix_find_multicast_recipients() inside
> > unix_dgram_sendmsg() and delivering the messages to the recipients
> > as long as the list is being iterated locklessly. But I want to
> > provide atomicity of delivery: the message must be delivered with
> > skb_queue_tail() either to all the recipients or to none of them in
> > case of interruption or memory pressure. I don't see how I can
> > achieve that without iterating several times on the list of
> > recipients, hence the allocation and the copy in the array. I also
> > want to guarantee the order of delivery as described in
> > multicast-unix-sockets.txt and for this, I am taking lots of
> > spinlocks anyway. I don't see how to avoid that, but I would be
> > happy to be wrong and have a better solution.
> > 
> 
> 
> So if one destination has a full receive queue, you want nobody to
> receive the message? That seems a bit risky to me, if someone sends
> SIGSTOP to one of your processes...

Yes. For the D-Bus usage, I want to have this guarantee. If random
remote procedure calls are lost, it will break applications built on
top of D-Bus with multicast Unix sockets. The current implementation of
D-Bus avoids this problem by having an almost infinite receive queue in
the dbus-daemon process: 1GB. But in the kernel,
/proc/sys/net/unix/max_dgram_qlen is 10 messages by default. Increasing
it a bit will not fix the problem, and increasing it to 1GB is not
reasonable in the kernel.

There are different actions the kernel can take when the queue is full:

1. Block the sender. This is useful for RPC; we don't want random RPCs
   to disappear unnoticed.
2. Drop the message for recipients with a full queue. This could be
   acceptable for slow monitoring tools that don't want to disturb the
   applications.
3. Close the receiving socket as a punishment. At least the problem does
   not go unnoticed, and the user gets some error feedback.

I was thinking of making this configurable when a socket joins a
multicast group, so different multicast group members could behave
differently. The flag UNIX_MREQ_DROP_WHEN_FULL is there for that (but
not fully implemented in the patchset).

It makes things more complex for poll(POLLOUT). Before the buffer
reaches the kernel, the kernel cannot run the socket filters, so it is
not possible to know the exact recipients. So poll(POLLOUT) has to block
as soon as even one receive queue is full (unless that multicast member
has the flag UNIX_MREQ_DROP_WHEN_FULL).

When the peers install socket filters and there are two flows of
messages, from A to B and from C to D, if the receiving queue of D is
full, it will also block the communication from A to B: poll(A, POLLOUT)
will block. This is annoying, but I don't see how to fix it.
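A userspace sketch of the per-member policy described above (the MREQ_DROP_WHEN_FULL constant, its value, and the struct layout are invented stand-ins for the patchset's flag): one pass enforces all-or-nothing for members that did not opt in, a second pass delivers, silently dropping only on opted-in members with full queues.

```c
#define MREQ_DROP_WHEN_FULL 0x04  /* hypothetical value */

struct member {
	unsigned flags;
	int qlen, qmax;   /* current and maximum queue length */
	int delivered;    /* messages this member actually received */
};

/* Returns 0 if the message was delivered to every member that must get
 * it (dropping only on members that opted in), or -1 if a non-dropping
 * member's queue is full, in which case nothing is delivered and the
 * sender should block. */
static int mcast_try_deliver(struct member *m, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (m[i].qlen >= m[i].qmax &&
		    !(m[i].flags & MREQ_DROP_WHEN_FULL))
			return -1;   /* would lose an RPC: block the sender */
	for (i = 0; i < n; i++) {
		if (m[i].qlen >= m[i].qmax)
			continue;    /* opted-in member: silently drop */
		m[i].qlen++;
		m[i].delivered++;
	}
	return 0;
}
```

The two passes are what buys atomicity: the commit phase only starts once every mandatory recipient is known to have room, so a full queue never causes partial delivery.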

> > To give an idea of the number of members in a multicast group for
> > the D-Bus use case, I have 90 D-Bus connections on my session bus:
> > 
> > $ dbus-send --print-reply --dest=org.freedesktop.DBus \
> > /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc
> > -l 90
> > 
> > In common cases, there should be only a few real recipients (1 or
> > 2?) after the socket filters eliminate most of them, but
> > unix_find_multicast_recipients() will still allocate an array of
> > about that size.
> > 
> 
> I am not sure if doing 90 clones of skb and filtering them one by one
> is going to be fast :-(

Yes... I think it can be optimized: run the socket filter first by
calling sk_run_filter() directly, and then call skb_clone() + pskb_trim()
only for the few remaining sockets.
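In userspace terms that ordering looks like the sketch below: run the cheap filter on the shared buffer first, and pay for the copy only on acceptance. The msg struct and filter_fn signature are stand-ins for skb and sk_run_filter(), which returns the number of bytes to keep (0 meaning reject):

```c
#include <stdlib.h>
#include <string.h>

struct msg {
	char *data;
	size_t len;
};

/* A socket filter reduced to a predicate: returns how many bytes to
 * keep, 0 meaning "reject" (the sk_run_filter() convention). */
typedef size_t (*filter_fn)(const char *data, size_t len);

/* Filter first, clone only on acceptance. Returns NULL on reject or
 * allocation failure, so the expensive copy is skipped in the common
 * non-matching case. */
static struct msg *filter_or_clone(const struct msg *orig, filter_fn filter)
{
	struct msg *copy;
	size_t keep = filter(orig->data, orig->len);

	if (keep == 0)
		return NULL;            /* no clone: filter rejected */
	if (keep > orig->len)
		keep = orig->len;
	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	copy->data = malloc(keep);
	if (!copy->data) {
		free(copy);
		return NULL;
	}
	memcpy(copy->data, orig->data, keep); /* clone + trim in one step */
	copy->len = keep;
	return copy;
}

/* sample filters for testing */
static size_t accept_first_3(const char *d, size_t l)
{
	(void)d;
	return l >= 3 ? 3 : l;
}

static size_t reject_all(const char *d, size_t l)
{
	(void)d; (void)l;
	return 0;
}
```

With 90 group members and only one or two real matches, this turns 90 clones into 90 cheap filter runs plus one or two copies.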


David Miller Nov. 23, 2010, 6:39 p.m. UTC | #7
From: Alban Crequy <alban.crequy@collabora.co.uk>
Date: Tue, 23 Nov 2010 17:47:01 +0000

> Le Tue, 23 Nov 2010 17:08:37 +0100,
> Eric Dumazet <eric.dumazet@gmail.com> a écrit :
>> I am not sure if doing 90 clones of skb and filtering them one by one
>> is going to be fast :-(
> 
> Yes... I think it can be optimized. Run the socket filter first by
> calling sk_run_filter() directly and then call skb_clone() + pskb_trim()
> only on the few remaining sockets.

BTW, we have had, and have talked about, the same exact problem with
AF_PACKET socket users such as DHCP.

We clone and push the packet down into the AF_PACKET protocol
code from the pt_type callback, when 99% of the time the socket
filter doesn't match and thus the clone is completely wasted
work.

If we know the socket, or more specifically the filter, early enough,
we could have a special interface like:

	struct sk_buff *skb_filter_or_clone(struct sk_buff *skb, ...)

Which returns a non-NULL cloned SKB if the filter accepts the
packet.

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 2278829..3cc9695 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,15 +114,48 @@ 
 #include <linux/mount.h>
 #include <net/checksum.h>
 #include <linux/security.h>
+#include <linux/sort.h>
 
 static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 static DEFINE_SPINLOCK(unix_table_lock);
+static DEFINE_SPINLOCK(unix_multicast_lock);
 static atomic_long_t unix_nr_socks;
 
 #define unix_sockets_unbound	(&unix_socket_table[UNIX_HASH_SIZE])
 
 #define UNIX_ABSTRACT(sk)	(unix_sk(sk)->addr->hash != UNIX_HASH_SIZE)
 
+struct sock_item {
+	struct sock *s;
+	struct sk_buff *skb;
+	int to_deliver;
+};
+
+struct sock_set {
+	int cnt;
+	struct sock_item items[0];
+};
+
+static void kfree_sock_set(struct sock_set *set)
+{
+	int i;
+	for (i = 0 ; i < set->cnt ; i++)
+		sock_put(set->items[i].s);
+	kfree(set);
+}
+
+static int sock_item_compare(const void *_a, const void *_b)
+{
+	const struct sock_item *a = _a;
+	const struct sock_item *b = _b;
+	if (a->s > b->s)
+		return 1;
+	else if (a->s < b->s)
+		return -1;
+	else
+		return 0;
+}
+
 #ifdef CONFIG_SECURITY_NETWORK
 static void unix_get_secdata(struct scm_cookie *scm, struct sk_buff *skb)
 {
@@ -824,6 +857,117 @@  fail:
 	return NULL;
 }
 
+static int unix_find_multicast_members(struct sock_set *set,
+				       int recipient_cnt,
+				       struct sock *sender,
+				       struct hlist_head *list)
+{
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+	hlist_for_each_entry(node, pos, list,
+			     member_node) {
+		if (set->cnt + 1 > recipient_cnt)
+			return -ENOMEM;
+		if (node->member == unix_sk(sender) &&
+		    !(node->flags & UNIX_MREQ_LOOPBACK))
+			continue;
+
+		sock_hold(&node->member->sk);
+		set->items[set->cnt].s = &node->member->sk;
+		set->items[set->cnt].skb = NULL;
+		set->items[set->cnt].to_deliver = 1;
+		set->cnt++;
+	}
+	return 0;
+}
+
+/* Find the recipients for a message sent by 'sender' to 'addr'. If 'dest' is
+ * NULL, the recipients are peers of all subscribed groups.
+ */
+static struct sock_set *unix_find_multicast_recipients(struct sock *sender,
+						       struct sock *dest,
+						       int *err)
+{
+	struct unix_sock *u = unix_sk(sender);
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+	struct sock_set *set;
+	int recipient_cnt;
+
+	/* We cannot allocate in the spin lock. First, count the recipients */
+try_again:
+	spin_lock(&unix_multicast_lock);
+	if (dest != NULL) {
+		if (unix_sk(dest)->is_mcast_addr) {
+			recipient_cnt = unix_sk(dest)->mcast_members_cnt;
+		} else {
+			recipient_cnt = 1;
+		}
+	} else {
+		recipient_cnt = 0;
+		hlist_for_each_entry(node, pos, &u->mcast_subscriptions,
+				     subscription_node) {
+			recipient_cnt += node->addr->mcast_members_cnt;
+		}
+	}
+	spin_unlock(&unix_multicast_lock);
+
+        /* Allocate for the set and hope the number of recipients does not
+	 * change while the lock is released. If it changes, we have to try
+	 * again... We allocate a bit more than needed, so if a _few_ members
+	 * are added in a multicast group meanwhile, we don't always need to
+	 * try again. */
+	recipient_cnt += 5;
+
+	set = kmalloc(sizeof(struct sock_set)
+		      + sizeof(struct sock_item) * recipient_cnt,
+	    GFP_KERNEL);
+	if (!set) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+	set->cnt = 0;
+
+	spin_lock(&unix_multicast_lock);
+	if (dest && unix_sk(dest)->is_mcast_addr) {
+		/* Message sent to a multicast address */
+		if (unix_find_multicast_members(set, recipient_cnt,
+				sender,
+				&unix_sk(dest)->mcast_members)) {
+			spin_unlock(&unix_multicast_lock);
+			kfree_sock_set(set);
+			goto try_again;
+		}
+	} else if (!dest) {
+		/* Destination not specified, sending to all peers of
+		 * subscribed groups */
+		hlist_for_each_entry(node, pos, &u->mcast_subscriptions,
+				     subscription_node) {
+			if (unix_find_multicast_members(set, recipient_cnt,
+					sender,
+					&node->addr->mcast_members)) {
+				spin_unlock(&unix_multicast_lock);
+				kfree_sock_set(set);
+				goto try_again;
+			}
+		}
+	} else {
+		/* Message sent to a non-multicast address */
+		BUG_ON(recipient_cnt < 1);
+		set->cnt = 1;
+		sock_hold(dest);
+		set->items[0].s = dest;
+		set->items[0].skb = NULL;
+		set->items[0].to_deliver = 1;
+	}
+	spin_unlock(&unix_multicast_lock);
+
+	/* Keep the array ordered to prevent deadlocks on circular waits */
+	sort(set->items, set->cnt, sizeof(struct sock_item),
+	     sock_item_compare, NULL);
+	return set;
+}
+
 
 static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {