[net-next,1/5] net: add support for noref skb->sk

Message ID 39b55ec9584df7dfd6eb498a3d354cd2e08e3eaa.1505926196.git.pabeni@redhat.com
State Changes Requested, archived
Delegated to: David Miller
Series net: introduce noref sk

Commit Message

Paolo Abeni Sept. 20, 2017, 4:54 p.m. UTC
Noref sk do not carry a socket refcount, are valid
only inside the current RCU section and must be
explicitly cleared before exiting such section.

They will be used in a later patch to allow early demux
without sock refcounting.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 30 ++++++++++++++++++++++++++++++
 net/core/sock.c        |  6 ++++++
 2 files changed, 36 insertions(+)

Comments

Eric Dumazet Sept. 20, 2017, 5:41 p.m. UTC | #1
On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> Noref sk do not carry a socket refcount, are valid
> only inside the current RCU section and must be
> explicitly cleared before exiting such section.
> 
> They will be used in a later patch to allow early demux
> without sock refcounting.




> +/* dummy destructor used by noref sockets */
> +void sock_dummyfree(struct sk_buff *skb)
> +{

BUG();

> +}
> +EXPORT_SYMBOL(sock_dummyfree);
> +


I do not see how you ensure we do not leave RCU section with an skb
destructor pointing to this sock_dummyfree()

This patch series looks quite dangerous to me.

Do we really have real applications using connected UDP sockets and
wanting very high pps throughput ?

I am pretty sure the bottleneck is the sender part.
Paolo Abeni Sept. 21, 2017, 9:14 a.m. UTC | #2
Hi,

Thank you for looking at it!

On Wed, 2017-09-20 at 10:41 -0700, Eric Dumazet wrote:
> On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> > Noref sk do not carry a socket refcount, are valid
> > only inside the current RCU section and must be
> > explicitly cleared before exiting such section.
> > 
> > They will be used in a later patch to allow early demux
> > without sock refcounting.
> 
> 
> 
> 
> > +/* dummy destructor used by noref sockets */
> > +void sock_dummyfree(struct sk_buff *skb)
> > +{
> 
> BUG();
> 
> > +}
> > +EXPORT_SYMBOL(sock_dummyfree);
> > +

We can call sock_dummyfree() in legitimate paths, see below, but we can
add a:

WARN_ON_ONCE(!rcu_read_lock_held());

here and in skb_clear_noref_sk(). That should go a long way toward
catching possible bugs.

> I do not see how you ensure we do not leave RCU section with an skb
> destructor pointing to this sock_dummyfree()
> 
> This patch series looks quite dangerous to me.

The idea is to explicitly clear the sknoref references before leaving
the RCU section. This is much like what we currently do for dst noref,
but here the only place where we get a noref socket is the socket early
demux, so the scope of this change is more limited than what we have
with noref dst_entries.

The relevant code is in the next 2 patches; after the demux we preserve
the sknoref only if the skb has a local destination. The UDP socket
will then set the noref on early demux lookup, and the skb will either:

* land on the corresponding UDP socket; the receive function will steal
the sknoref
* be dropped by some nft/iptables target; the dummy destructor is
called
* be forwarded by some nft/iptables target outside the input path; we
clear the sknoref explicitly in such targets.

Currently there are a handful of places affected, and we can simplify
the code by dropping the early demux result for locally terminated
multicast sockets on a host acting as a multicast router; please see
the comment on the next patch.
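
The three outcomes listed above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: struct sock and
struct sk_buff are cut-down stand-ins, and rcu_depth, dummyfree_calls,
model_free_skb(), model_input() and enum verdict are hypothetical names
that exist only for this illustration, not in the patch series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct sock { int rx_queued; };

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
};

static int rcu_depth;        /* hypothetical model of read-side nesting */
static int dummyfree_calls;  /* counts runs of the dummy destructor */

static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Dummy destructor: there is no refcount to release. */
static void sock_dummyfree(struct sk_buff *skb)
{
	(void)skb;
	dummyfree_calls++;
}

static void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_dummyfree;
}

static bool skb_has_noref_sk(const struct sk_buff *skb)
{
	return skb->destructor == sock_dummyfree;
}

static struct sock *skb_clear_noref_sk(struct sk_buff *skb)
{
	struct sock *ret;

	if (!skb_has_noref_sk(skb))
		return NULL;

	ret = skb->sk;
	skb->sk = NULL;
	skb->destructor = NULL;
	return ret;
}

/* Hypothetical model of freeing an skb: run the destructor, if any. */
static void model_free_skb(struct sk_buff *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->sk = NULL;
	skb->destructor = NULL;
}

enum verdict { DELIVER, DROP, FORWARD };

/* One pass through the input path, entirely inside one RCU section. */
static void model_input(struct sk_buff *skb, struct sock *sk, enum verdict v)
{
	struct sock *owner;

	rcu_read_lock();
	skb_set_noref_sk(skb, sk);              /* early demux lookup */

	switch (v) {
	case DELIVER:                           /* receive path steals the noref sk */
		owner = skb_clear_noref_sk(skb);
		if (owner)
			owner->rx_queued++;
		break;
	case DROP:                              /* dummy destructor runs on free */
		model_free_skb(skb);
		break;
	case FORWARD:                           /* target clears it explicitly */
		skb_clear_noref_sk(skb);
		break;
	}
	rcu_read_unlock();
}
```

In this model every path leaves the RCU section with skb->sk already
cleared, which is the invariant the series relies on.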

> Do we really have real applications using connected UDP sockets and
> wanting very high pps throughput ?

The ultimate goal is to improve the unconnected UDP sockets scenario;
we do actually have use cases for that: DNS servers and VoIP SBCs.

Thanks,

Paolo
Eric Dumazet Sept. 21, 2017, 10:35 a.m. UTC | #3
On Thu, 2017-09-21 at 11:14 +0200, Paolo Abeni wrote:
> Hi,
> 
> Thank you for looking at it!
> 
> On Wed, 2017-09-20 at 10:41 -0700, Eric Dumazet wrote:
> > On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> > > Noref sk do not carry a socket refcount, are valid
> > > only inside the current RCU section and must be
> > > explicitly cleared before exiting such section.
> > > 
> > > They will be used in a later patch to allow early demux
> > > without sock refcounting.
> > 
> > 
> > 
> > 
> > > +/* dummy destructor used by noref sockets */
> > > +void sock_dummyfree(struct sk_buff *skb)
> > > +{
> > 
> > BUG();
> > 
> > > +}
> > > +EXPORT_SYMBOL(sock_dummyfree);
> > > +
> 
> We can call sock_dummyfree() in legitimate paths, see below, but we can
> add a:
> 
> WARN_ON_ONCE(!rcu_read_lock_held());

This won't be enough, see below.

> 
> here and in skb_clear_noref_sk(). That should go a long way toward
> catching possible bugs.
> 
> > I do not see how you ensure we do not leave RCU section with an skb
> > destructor pointing to this sock_dummyfree()
> > 
> > This patch series looks quite dangerous to me.
> 
> The idea is to explicitly clear the sknoref references before leaving
> the RCU section. This is much like what we currently do for dst noref,
> but here the only place where we get a noref socket is the socket early
> demux, so the scope of this change is more limited than what we have
> with noref dst_entries.
> 
> The relevant code is in the next 2 patches; after the demux we preserve
> the sknoref only if the skb has a local destination. The UDP socket
> will then set the noref on early demux lookup, and the skb will either:
> 
> * land on the corresponding UDP socket; the receive function will steal
> the sknoref
> * be dropped by some nft/iptables target; the dummy destructor is
> called
> * be forwarded by some nft/iptables target outside the input path; we
> clear the sknoref explicitly in such targets.
> 
> Currently there are a handful of places affected, and we can simplify
> the code by dropping the early demux result for locally terminated
> multicast sockets on a host acting as a multicast router; please see
> the comment on the next patch.
> 
> > Do we really have real applications using connected UDP sockets and
> > wanting very high pps throughput ?
> 
> The ultimate goal is to improve the unconnected UDP sockets scenario;
> we do actually have use cases for that: DNS servers and VoIP SBCs.

Unconnected UDP traffic does not use refcounting on sk _already_.

And SO_REUSEPORT already allows us to handle all the traffic we want
_already_.


Please take a look at 71563f3414e917c62acd8e0fb0edf8ed6af63e4b

This might tell you why I am so nervous about your changes.

Checking WARN_ON_ONCE(!rcu_read_lock_held());
is not enough.

rcu_read_lock();
skb->destructor = sock_dummyfree;

/* queue the packet into an intermediate queue */
rcu_read_unlock();

....

rcu_read_lock();
...
if (skb->sk && skb->sk->sk_state == ...) // crash
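
The scenario above can be reproduced in a small userspace model, which
may make it easier to see why a lock-held check alone does not catch it.
This is a sketch only: rcu_section, tagged_in, lock_held_check_passes(),
noref_sk_still_valid() and queue_then_dequeue() are hypothetical names
invented for the illustration. The kernel has no per-section identifier
like this; it is used here purely to make the stale reference observable.

```c
#include <stdbool.h>
#include <stddef.h>

struct sock { int sk_state; };

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
	unsigned long tagged_in;  /* model only: section in which sk was set */
};

static int rcu_depth;
static unsigned long rcu_section;  /* bumped on each outermost lock */

static void rcu_read_lock(void)
{
	if (rcu_depth++ == 0)
		rcu_section++;
}

static void rcu_read_unlock(void)      { rcu_depth--; }
static bool rcu_read_lock_held(void)   { return rcu_depth > 0; }

static void sock_dummyfree(struct sk_buff *skb) { (void)skb; }

static void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_dummyfree;
	skb->tagged_in = rcu_section;
}

/* The proposed check: only asks "am I inside *some* RCU section?" */
static bool lock_held_check_passes(const struct sk_buff *skb)
{
	(void)skb;
	return rcu_read_lock_held();
}

/* Model-only check: is this still the section that set the noref sk? */
static bool noref_sk_still_valid(const struct sk_buff *skb)
{
	return rcu_read_lock_held() && skb->tagged_in == rcu_section;
}

struct result { bool lock_check; bool valid; };

static struct result queue_then_dequeue(struct sk_buff *skb, struct sock *sk)
{
	struct result r;

	rcu_read_lock();
	skb_set_noref_sk(skb, sk);
	/* skb is parked on an intermediate queue, noref sk still attached */
	rcu_read_unlock();

	/* a grace period may elapse here and the socket may be freed */

	rcu_read_lock();
	r.lock_check = lock_held_check_passes(skb);  /* passes */
	r.valid = noref_sk_still_valid(skb);         /* but sk is stale */
	rcu_read_unlock();
	return r;
}
```

In the model the lock-held check succeeds in the second section even
though the socket pointer was obtained in the first one, so dereferencing
skb->sk there would be a use-after-free in the real kernel.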

Also you covered IPv4, but really we need to forget about IPv4 and focus
on IPv6 only. And _then_ take care of IPv4 compat.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 72299ef00061..459a5672811d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -922,6 +922,36 @@  static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+void sock_dummyfree(struct sk_buff *skb);
+
+/* only early demux can set noref socks
+ * noref socks do not carry any refcount and must be
+ * cleared before exiting the current RCU section
+ */
+static inline void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
+{
+	skb->sk = sk;
+	skb->destructor = sock_dummyfree;
+}
+
+static inline bool skb_has_noref_sk(struct sk_buff *skb)
+{
+	return skb->destructor == sock_dummyfree;
+}
+
+static inline struct sock *skb_clear_noref_sk(struct sk_buff *skb)
+{
+	struct sock *ret;
+
+	if (!skb_has_noref_sk(skb))
+		return NULL;
+
+	ret = skb->sk;
+	skb->sk = NULL;
+	skb->destructor = NULL;
+	return ret;
+}
+
 /* For mangling skb->pkt_type from user space side from applications
  * such as nft, tc, etc, we only allow a conservative subset of
  * possible pkt_types to be set.
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..3aa4950639bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,6 +1893,12 @@  void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* dummy destructor used by noref sockets */
+void sock_dummyfree(struct sk_buff *skb)
+{
+}
+EXPORT_SYMBOL(sock_dummyfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
 	kuid_t uid;
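
As a rough illustration of how the three helpers added to skbuff.h behave
together, the following userspace sketch copies their logic onto cut-down
stand-in types. The struct definitions and refcounted_free() are
assumptions made for the example; refcounted_free() merely stands for any
destructor other than sock_dummyfree().

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures. */
struct sock { int id; };

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
};

/* Dummy destructor: marks the skb as carrying a noref sk. */
static void sock_dummyfree(struct sk_buff *skb) { (void)skb; }

/* Hypothetical placeholder for a destructor that owns a reference. */
static void refcounted_free(struct sk_buff *skb) { (void)skb; }

static void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_dummyfree;
}

static bool skb_has_noref_sk(const struct sk_buff *skb)
{
	return skb->destructor == sock_dummyfree;
}

/* Returns the noref sk and detaches it, or NULL if there is none. */
static struct sock *skb_clear_noref_sk(struct sk_buff *skb)
{
	struct sock *ret;

	if (!skb_has_noref_sk(skb))
		return NULL;

	ret = skb->sk;
	skb->sk = NULL;
	skb->destructor = NULL;
	return ret;
}
```

Note that the destructor pointer doubles as the "noref" flag, so
skb_clear_noref_sk() leaves an skb that holds a real reference untouched
and returns NULL for it.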