Patchwork [1/5] soreuseport: infrastructure

Submitter Tom Herbert
Date Jan. 14, 2013, 8 p.m.
Message ID <alpine.DEB.2.00.1301141152130.3433@pokey.mtv.corp.google.com>
Permalink /patch/211874/
State Changes Requested
Delegated to: David Miller

Comments

Tom Herbert - Jan. 14, 2013, 8 p.m.
Definitions and macros for implementing soreuseport.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/random.h            |    6 ++++++
 include/net/sock.h                |    5 ++++-
 include/uapi/asm-generic/socket.h |    3 +--
 net/core/sock.c                   |    7 +++++++
 4 files changed, 18 insertions(+), 3 deletions(-)
stephen hemminger - Jan. 15, 2013, 3:53 p.m.
On Mon, 14 Jan 2013 12:00:18 -0800 (PST)
Tom Herbert <therbert@google.com> wrote:

> +/* Pseudo random number generator from numerical recipes. */
> +static inline u32 next_pseudo_random32(u32 seed)
> +{
> +	return seed * 1664525 + 1013904223;
> +}
> +

Don't reimplement a pseudo-random number generator; net_random()
already exists.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Jan. 15, 2013, 4:14 p.m.
On Tue, 2013-01-15 at 07:53 -0800, Stephen Hemminger wrote:
> On Mon, 14 Jan 2013 12:00:18 -0800 (PST)
> Tom Herbert <therbert@google.com> wrote:
> 
> > +/* Pseudo random number generator from numerical recipes. */
> > +static inline u32 next_pseudo_random32(u32 seed)
> > +{
> > +	return seed * 1664525 + 1013904223;
> > +}
> > +
> 
> Don't reimplement a pseudo-random number generator; net_random()
> already exists.

net_random() is way more expensive and not needed in this context.

If you have 32 listeners bound to the same port, we can call this 32
times per SYN message.

The initial seed is random enough (phash = inet_ehashfn(net, daddr,
hnum, saddr, sport)).

Anyway, the whole idea of distributing SYNs using a random generator is
not the best one for a multiqueue NIC, and/or if RPS/RFS is used.

Ideally, we should choose a target given by the current CPU number, in
case SYN messages are spread over all CPUs or a set of CPUs.

(same idea as PACKET_FANOUT_CPU in net/packet/af_packet.c)




Tom Herbert - Jan. 16, 2013, 6:37 p.m.
> Ideally, we should choose a target given by the current CPU number, in
> case SYN messages are spread over all CPUs or a set of CPUs.
>
It is an ideal, but I don't readily see a practical way to do this
given the available information, the fact that the number of sockets
created is up to the application, and the fact that there is no fixed
binding of a socket to a CPU.

Consider the simple "cpu % num" algorithm in packet fanout.  Suppose
there is a 16-CPU system, RX queues for the NIC are processed on CPUs
0, 3, 7, and 11, and the user creates 4 sockets.  In this configuration,
only the first socket would ever be selected!

Tom

> (same idea as PACKET_FANOUT_CPU in net/packet/af_packet.c)
Eric Dumazet - Jan. 16, 2013, 7:13 p.m.
On Wed, 2013-01-16 at 10:37 -0800, Tom Herbert wrote:
> > Ideally, we should choose a target given by the current CPU number, in
> > case SYN messages are spread over all CPUs or a set of CPUs.
> >
> It is an ideal, but I don't readily see a practical way to do this
> given the available information, the fact that the number of sockets
> created is up to the application, and the fact that there is no fixed
> binding of a socket to a CPU.
> 
> Consider the simple "cpu % num" algorithm in packet fanout.  Suppose
> there is a 16-CPU system, RX queues for the NIC are processed on CPUs
> 0, 3, 7, and 11, and the user creates 4 sockets.  In this configuration,
> only the first socket would ever be selected!

Sure, any hand-coded 'optimization' should be done correctly.

On a 16-CPU system, I would create 16 queues if we stick to the simple
"cpu % nr_queues" selection.

If some queues are never used, that's not a big deal, unless you have
crazy spin-polling of the queues.  A blocked thread consumes almost
nothing.

It would be a rather straightforward patch to add a mask capability to
af_packet (a la rps_cpus) if we really wanted 4 queues served by CPUs
0, 3, 7, and 11, but I don't think there is an urgent need.

Another way would be to let userland declare a preferred CPU for the
queue, even if it is not related to process scheduler affinity.

Anyway, this can be addressed later, in followup patches.


Tom Herbert - Jan. 20, 2013, 11:33 p.m.
> It would be pretty neat if the CPU layout could make it to this level, so
> that you could choose queues based on the shared-cache layout.  E.g. if
> cores 0 and 2 share the same L2 cache, then you can be intelligent about
> landing flows on the queues bound to those cores.  Essentially, the info
> that feeds the cpuid utility would be used to make even smarter
> flow-tuning decisions.
>
Indeed.  I think Ben's cpu_rmap code already pretty much does this...

> Cheers,
> -PJ

Patch

diff --git a/include/linux/random.h b/include/linux/random.h
index d984608..347ce55 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -74,4 +74,10 @@  static inline int arch_get_random_int(unsigned int *v)
 }
 #endif
 
+/* Pseudo random number generator from numerical recipes. */
+static inline u32 next_pseudo_random32(u32 seed)
+{
+	return seed * 1664525 + 1013904223;
+}
+
 #endif /* _LINUX_RANDOM_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 182ca99..360b412 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -140,6 +140,7 @@  typedef __u64 __bitwise __addrpair;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
  *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
@@ -179,7 +180,8 @@  struct sock_common {
 
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
+	unsigned char		skc_reuse:4;
+	unsigned char		skc_reuseport:4;
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
@@ -297,6 +299,7 @@  struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_reuseport		__sk_common.skc_reuseport
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 2d32d07..331e322 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -22,8 +22,7 @@ 
 #define SO_PRIORITY	12
 #define SO_LINGER	13
 #define SO_BSDCOMPAT	14
-/* To add :#define SO_REUSEPORT 15 */
-
+#define SO_REUSEPORT	15
 #ifndef SO_PASSCRED /* powerpc only differs in these */
 #define SO_PASSCRED	16
 #define SO_PEERCRED	17
diff --git a/net/core/sock.c b/net/core/sock.c
index bc131d4..0040832 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -665,6 +665,9 @@  int sock_setsockopt(struct socket *sock, int level, int optname,
 	case SO_REUSEADDR:
 		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
 		break;
+	case SO_REUSEPORT:
+		sk->sk_reuseport = valbool;
+		break;
 	case SO_TYPE:
 	case SO_PROTOCOL:
 	case SO_DOMAIN:
@@ -965,6 +968,10 @@  int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = sk->sk_reuse;
 		break;
 
+	case SO_REUSEPORT:
+		v.val = sk->sk_reuseport;
+		break;
+
 	case SO_KEEPALIVE:
 		v.val = sock_flag(sk, SOCK_KEEPOPEN);
 		break;