diff mbox series

[net] packets: Always register packet sk in the same order

Message ID 20190316134130.28065-1-maxime.chevallier@bootlin.com
State Accepted
Delegated to: David Miller
Headers show
Series [net] packets: Always register packet sk in the same order | expand

Commit Message

Maxime Chevallier March 16, 2019, 1:41 p.m. UTC
When using fanouts with AF_PACKET, the demux functions such as
fanout_demux_cpu will return an index in the fanout socket array, which
corresponds to the selected socket.

The ordering of this array depends on the order the sockets were added
to a given fanout group, so for FANOUT_CPU this means sockets are bound
to cpus in the order they are configured, which is OK.

However, when stopping then restarting the interface these sockets are
bound to, the sockets are reassigned to the fanout group in the reverse
order, due to the fact that they were inserted at the head of the
interface's AF_PACKET socket list.

This means that traffic that was directed to the first socket in the
fanout group is now directed to the last one after an interface restart.

In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
socket that used to receive traffic from the last CPU after an interface
restart.

This commit introduces a helper to add a socket at the tail of a list,
then uses it to register AF_PACKET sockets.

Note that this changes the order in which sockets are listed in /proc and
with sock_diag.

Fixes: dc99f600698d ("packet: Add fanout support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 include/net/sock.h     | 6 ++++++
 net/packet/af_packet.c | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

Comments

Willem de Bruijn March 16, 2019, 6:09 p.m. UTC | #1
On Sat, Mar 16, 2019 at 9:43 AM Maxime Chevallier
<maxime.chevallier@bootlin.com> wrote:
>
> When using fanouts with AF_PACKET, the demux functions such as
> fanout_demux_cpu will return an index in the fanout socket array, which
> corresponds to the selected socket.
>
> The ordering of this array depends on the order the sockets were added
> to a given fanout group, so for FANOUT_CPU this means sockets are bound
> to cpus in the order they are configured, which is OK.
>
> However, when stopping then restarting the interface these sockets are
> bound to, the sockets are reassigned to the fanout group in the reverse
> order, due to the fact that they were inserted at the head of the
> interface's AF_PACKET socket list.
>
> This means that traffic that was directed to the first socket in the
> fanout group is now directed to the last one after an interface restart.
>
> In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
> socket that used to receive traffic from the last CPU after an interface
> restart.
>
> This commit introduces a helper to add a socket at the tail of a list,
> then uses it to register AF_PACKET sockets.
>
> Note that this changes the order in which sockets are listed in /proc and
> with sock_diag.
>
> Fixes: dc99f600698d ("packet: Add fanout support")
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Acked-by: Willem de Bruijn <willemb@google.com>

Note that another consequence of this patch is that insertion on
packet create is now O(N) with the number of active packet sockets,
due to sklist being an hlist.
David Miller March 17, 2019, 1:21 a.m. UTC | #2
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Sat, 16 Mar 2019 14:09:33 -0400

> Note that another consequence of this patch is that insertion on
> packet create is now O(N) with the number of active packet sockets,
> due to sklist being an hlist.

Exploitable...
Willem de Bruijn March 17, 2019, 4:42 p.m. UTC | #3
On Sat, Mar 16, 2019 at 9:21 PM David Miller <davem@davemloft.net> wrote:
>
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Sat, 16 Mar 2019 14:09:33 -0400
>
> > Note that another consequence of this patch is that insertion on
> > packet create is now O(N) with the number of active packet sockets,
> > due to sklist being an hlist.
>
> Exploitable...

With root in userns? The running time is limited by open file rlimit.
This pattern is already used in a sk_add_node_rcu in some way, so
important to be sure.

In practice I see no significant wall clock time difference when
inserting to a fairly standard default limit of 16K. Regardless of
insertion order, running time is dominated by cleanup on process exit
(synchronize_net barriers?). At higher rlimit it does become
problematic.

The packet socket sklist is not easily converted from hlist to a
regular list, due to the use of seq_hlist_next_rcu in packet_seq_ops.
There is no equivalent seq_list_next_rcu. One option might be instead
to leave insertion order as is, but traverse the list in reverse in
packet_notifier on NETDEV_DOWN. That would require an
sk_for_each_reverse_rcu and hlist_for_each_entry_reverse_rcu. These do
not exist, but since an hlist_pprev_rcu does exist, it is probably
feasible. Though not a trivial change.

Another more narrow option may be to work around the ordering in
fanout itself, e.g., record in the socket the initially assigned
location in the fanout array and try to reclaim this spot on
re-insertion.
David Miller March 19, 2019, 12:59 a.m. UTC | #4
From: David Miller <davem@davemloft.net>
Date: Sat, 16 Mar 2019 18:21:19 -0700 (PDT)

> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Sat, 16 Mar 2019 14:09:33 -0400
> 
>> Note that another consequence of this patch is that insertion on
>> packet create is now O(N) with the number of active packet sockets,
>> due to sklist being an hlist.
> 
> Exploitable...

After reading Willem's explanation of how the scope of such a problem
is constrained, I am going to apply this and queue it up for -stable.

Maybe in the long term we can do the hlist reverse RCU thing.
diff mbox series

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 328cb7cb7b0b..8de5ee258b93 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -710,6 +710,12 @@  static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
 		hlist_add_head_rcu(&sk->sk_node, list);
 }
 
+static inline void sk_add_node_tail_rcu(struct sock *sk, struct hlist_head *list)
+{
+	sock_hold(sk);
+	hlist_add_tail_rcu(&sk->sk_node, list);
+}
+
 static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
 	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8376bc1c1508..8754d7c93b84 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3243,7 +3243,7 @@  static int packet_create(struct net *net, struct socket *sock, int protocol,
 	}
 
 	mutex_lock(&net->packet.sklist_lock);
-	sk_add_node_rcu(sk, &net->packet.sklist);
+	sk_add_node_tail_rcu(sk, &net->packet.sklist);
 	mutex_unlock(&net->packet.sklist_lock);
 
 	preempt_disable();