
[RFC] inetpeer: Support ipv6 addresses.

Message ID 20100327.203120.98894145.davem@davemloft.net
State RFC, archived
Delegated to: David Miller

Commit Message

David Miller March 28, 2010, 3:31 a.m. UTC
inetpeer: Support ipv6 addresses.

Signed-off-by: David S. Miller <davem@davemloft.net>

---

This is a first step based upon the talks we were having the other
week about potentially moving the dst metrics out into the inetpeer
cache.

The next step would be to move the peer pointer into dst_entry (or
some common encapsulator datastructure that ipv4 and ipv6 could
share, inet_dst_entry or something like that), and provide the
necessary rt6_bind_peer() function for ipv6.  A small set of changes
could then add timewait recycling support to ipv6 essentially for
free (commonize code, provide some wrapper for route type specific
code).

The limited dst metric usage in DECnet could then be moved into
the DECnet route entry struct.  This way we don't have to support
DECnet in the inetpeer cache just for the sake of its metrics. :-)

Finally, we can really move the metrics array into the inetpeer
entries.

How does this sound?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Herbert Xu March 28, 2010, 8:22 a.m. UTC | #1
On Sat, Mar 27, 2010 at 08:31:20PM -0700, David Miller wrote:
>
> Finally, we can really move the metrics array into the inetpeer
> entries.

My main question is how do we deal with source-address policy
routing in a host cache?

Thanks,
David Miller March 28, 2010, 12:53 p.m. UTC | #2
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 28 Mar 2010 16:22:50 +0800

> On Sat, Mar 27, 2010 at 08:31:20PM -0700, David Miller wrote:
>>
>> Finally, we can really move the metrics array into the inetpeer
>> entries.
> 
> My main question is how do we deal with source-address policy
> routing in a host cache?

We don't, just like we don't handle fully specified
IPSEC policies deciding upon the route.
Herbert Xu March 28, 2010, 1:11 p.m. UTC | #3
On Sun, Mar 28, 2010 at 05:53:12AM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sun, 28 Mar 2010 16:22:50 +0800
>
> > My main question is how do we deal with source-address policy
> > routing in a host cache?
> 
> We don't, the same like how we don't handle fully specified
> IPSEC policies deciding upon the route.

I thought we did handle source-address policy routing for PMTU
messages at least.  I just checked ip_rt_frag_needed and it does

		if (rth->fl.fl4_dst != daddr ||
		    rth->fl.fl4_src != skeys[i] ||
		    rth->rt_dst != daddr ||
		    rth->rt_src != iph->saddr ||
		    rth->fl.oif != ikeys[k] ||
		    rth->fl.iif != 0 ||
		    dst_metric_locked(&rth->u.dst, RTAX_MTU) ||
		    !net_eq(dev_net(rth->u.dst.dev), net) ||
		    rt_is_expired(rth))
			continue;

Cheers,
David Miller March 28, 2010, 1:40 p.m. UTC | #4
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 28 Mar 2010 21:11:12 +0800

> On Sun, Mar 28, 2010 at 05:53:12AM -0700, David Miller wrote:
>> From: Herbert Xu <herbert@gondor.apana.org.au>
>> Date: Sun, 28 Mar 2010 16:22:50 +0800
>>
>> > My main question is how do we deal with source-address policy
>> > routing in a host cache?
>> 
>> We don't, the same like how we don't handle fully specified
>> IPSEC policies deciding upon the route.
> 
> I thought we did handle source-address policy routing for PMTU
> messages at least.  I just checked ip_rt_frag_needed and it does

Same for all the other metrics at the TCP level.

I guess this is the decision we have to make, what does
"host level" metrics mean for us if we decide to move
things into the inetpeer cache.

Herbert Xu March 28, 2010, 1:59 p.m. UTC | #5
On Sun, Mar 28, 2010 at 06:40:12AM -0700, David Miller wrote:
>
> Same for all the other metrics at the TCP level.

I don't think they are quite the same.  The TCP timestamp is
an attribute of the destination host; it doesn't vary depending
on which route you take to reach the host.  The MTU, on the other
hand, is an attribute of the route that reaches the host.

BTW, it appears that the inetpeer cache doesn't take namespaces
into account.  This means that information could potentially leak
from one namespace into another.  I'm not sure whether that's a
big deal or not but it's something for the namespaces folks to
consider.

Cheers,
Herbert Xu March 28, 2010, 2:32 p.m. UTC | #6
On Sun, Mar 28, 2010 at 09:59:31PM +0800, Herbert Xu wrote:
> 
> I don't think they are quite the same.  The TCP time stamp is
> an attribute of the destination host, it doesn't vary depending
> on which route you take to reach the host.  The MTU on the other
> hand is an attribute of the route that reaches the host.

Here's a convoluted scheme that I just thought up as a compromise
between your wish to expel the metrics and maintaining the policy
routing semantics.

Instead of placing the metrics into the inetpeer, we create a new
global cache for them (let's call it the metrics cache for now).
However, we don't actually populate this cache every time we create
an rt object.  Instead, we only create an entry in this cache
when an event requires it, e.g., when we receive a PMTU message.

In order for this to then propagate to the rt objects, we increment
a genid in the inetpeer cache for the corresponding host.  This
genid is then checked by the rt object every time.  When it is
out of step, the rt object can perform a lookup in the metrics cache
to get the latest data.

Of course once an rt object has a pointer to an entry in the metrics
cache it doesn't need to check the genid anymore, until the metrics
data expires at which point this process can be repeated.

Cheers,
Herbert Xu March 29, 2010, 3:08 p.m. UTC | #7
On Sun, Mar 28, 2010 at 10:32:35PM +0800, Herbert Xu wrote:
> 
> Here's a convoluted scheme that I just thought up as a compromise
> between your wish to expel the metrics and maintaining the policy
> routing semantics.

I think the bigger question is: are we now ready for a per-cpu
route cache?

The route cache is one of the few remaining global structures
that is standing in the way of full multicore scalability.

Or is there another way of solving this?

Cheers,
David Miller March 29, 2010, 10:15 p.m. UTC | #8
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 28 Mar 2010 22:32:35 +0800

> Instead of placing the metrics into the inetpeer, we create a new
> global cache for them (let's call it the metrics cache for now).
> However, we don't actually populate this cache every time we create
> an rt object.  Instead, we only create an entry in this cache
> when an event requires it, e.g., when we receive a PMTU message.
> 
> In order for this to then propagate to the rt objects, we increment
> a genid in the inetpeer cache for the corresponding host.  This
> genid is then checked by the rt object every time.  When it is
> out of step, the rt object can perform a lookup in the metrics cache
> to get the latest data.
> 
> Of course once an rt object has a pointer to an entry in the metrics
> cache it doesn't need to check the genid anymore, until the metrics
> data expires at which point this process can be repeated.

Interesting idea, but there is the issue of how to fill in
new metrics cache entries when these requests come in later.

We'd have to retain a pointer to the routing table fib entry.
This is because the fib entry states what the initial metric
values need to be for cached routes.

So we'd need a pointer to the fib_info in the routing cache entry, and
this pointer would need to grab a reference to the fib_info.  And this
subsequently leads to the question of what to do for route changes
(f.e. hold the fib_info around until all the route cache entries drop
their references, and have a dead state in the fib_info struct that
can be checked; but if we find it dead, what can we even do, as the
route we're working with might be cached in a socket or similar?).

The other option is to relookup the FIB, but we'd then have to
validate that the route cache entry we're working with still matches
precisely, and this lookup alone is going to have non-trivial
cost :-)

It's really depressing how hard it is to untangle the way we have
things currently set up, isn't it? :-)

David Miller March 29, 2010, 10:21 p.m. UTC | #9
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 28 Mar 2010 21:59:31 +0800

> On Sun, Mar 28, 2010 at 06:40:12AM -0700, David Miller wrote:
>>
>> Same for all the other metrics at the TCP level.
> 
> I don't think they are quite the same.  The TCP time stamp is
> an attribute of the destination host, it doesn't vary depending
> on which route you take to reach the host.  The MTU on the other
> hand is an attribute of the route that reaches the host.

It does make a difference, I think.

When we use IPSEC rules on ports and crazy stuff like that,
we end up with cases such as:

1) We're going over a VPN so RTT, RTTVAR, SSTHRESH, CWND, and other
   TCP metrics which are based upon aspects of the path can end up
   being wildly different.

2) Even the end host can be different in some convoluted
   setups.

IPSEC encapsulation can effectively change the entire universe in fact
:-) Also, even considering only case #1 above, that's nearly half of
the metrics which we arguably can't move into something like the
inetpeer cache.

This is basically why I've been resistant in the past to these kinds
of ideas to simplify metric handling, as it has the potential to break
something.

The gains of being able to pull this off are still enticing which
is why this topic keeps getting revisited nonetheless :-)
Herbert Xu March 30, 2010, 12:34 p.m. UTC | #10
On Mon, Mar 29, 2010 at 03:15:39PM -0700, David Miller wrote:
>
> Interesting idea, but there is the issue of how to fill in
> new metrics cache entries when these requests come in later.

You just perform an rt lookup as usual.  That'll give you the
metrics either from the route cache, or from scratch if we get
a cache miss.

> We'd have to retain a pointer to the routing table fib entry.
> This is because the fib entry states what the initial metric
> values need to be for cached routes.

We can keep the static metrics in the dst/rt entry.  For me
shrinking the dst entry isn't such a big deal, but avoiding touching
global state when constructing a new rt entry is the deal-breaker.

But if you're really into dieting then we can have that fib
pointer :)

> So we'd need a pointer to the fib_info in the routing cache entry, and
> this pointer would need to grab a reference to the fib_info.  And this
> subsequently leads to the question of what to do for route changes
> (f.e. hold the fib_info around until all the route cache entries drop
> their references and have a dead state in the fib_info struct that can
> be checked, and if we find it dead what can we even do as the route
> we're working with might be cached in a socket or similar)

AFAIK on route changes we always flush the route cache.

> The other option is to relookup the FIB, but we'd then have to
> validate that the route cache entry we're working with matches
> precisely still, and also this lookup alone going to have non-trivial
> cost :-)
> 
> It's really depressing how hard it is to untangle the way we have
> things currently setup, isn't it. :-)

Yeah it's like the Gordian Knot :)
Eric W. Biederman March 31, 2010, 5:43 a.m. UTC | #11
Herbert Xu <herbert@gondor.apana.org.au> writes:


> BTW, it appears that the inetpeer cache doesn't take namespaces
> into account.  This means that information could potentially leak
> from one namespace into another.  I'm not sure whether that's a
> big deal or not but it's something for the namespaces folks to
> consider.

Bother.  I wrote a patch a while back; I even remember people
commenting on it, but it certainly doesn't look like anything
ever made it in.

I will see if I can make some time to dig that up and post
the patch.

Thanks for noticing,
Eric

Patch

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 87b1df0..2924302 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -13,10 +13,18 @@ 
 #include <linux/spinlock.h>
 #include <asm/atomic.h>
 
+typedef struct {
+	union {
+		__be32		a4;
+		__be32		a6[4];
+	};
+	__u16	family;
+} inet_peer_address_t;
+
 struct inet_peer {
 	/* group together avl_left,avl_right,v4daddr to speedup lookups */
 	struct inet_peer	*avl_left, *avl_right;
-	__be32			v4daddr;	/* peer's address */
+	inet_peer_address_t	daddr;
 	__u32			avl_height;
 	struct list_head	unused;
 	__u32			dtime;		/* the time of last use of not
@@ -31,7 +39,7 @@  struct inet_peer {
 void			inet_initpeers(void) __init;
 
 /* can be called with or without local BH being disabled */
-struct inet_peer	*inet_getpeer(__be32 daddr, int create);
+struct inet_peer	*inet_getpeer(inet_peer_address_t *daddr, int create);
 
 /* can be called from BH context or outside */
 extern void inet_putpeer(struct inet_peer *p);
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 6bcfe52..87066eb 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -19,6 +19,7 @@ 
 #include <linux/net.h>
 #include <net/ip.h>
 #include <net/inetpeer.h>
+#include <net/ipv6.h>
 
 /*
  *  Theory of operations.
@@ -136,6 +137,47 @@  static void unlink_from_unused(struct inet_peer *p)
 	spin_unlock_bh(&inet_peer_unused_lock);
 }
 
+static inline bool inet_peer_addr_equal(inet_peer_address_t *a, inet_peer_address_t *b)
+{
+	if (a->family == b->family) {
+		switch (a->family) {
+		case AF_INET:
+			if (a->a4 == b->a4)
+				return true;
+			break;
+		case AF_INET6:
+			if (!ipv6_addr_cmp((struct in6_addr *)a,
+					   (struct in6_addr *)b))
+				return true;
+			break;
+		default:
+			break;
+		}
+	}
+	return false;
+}
+
+static inline u32 inet_peer_key(inet_peer_address_t *a)
+{
+	u32 key;
+
+	switch (a->family) {
+	case AF_INET:
+		key = (__force __u32) a->a4;
+		break;
+	case AF_INET6:
+		key = ((__force __u32)a->a6[0] ^
+		       (__force __u32)a->a6[1] ^
+		       (__force __u32)a->a6[2] ^
+		       (__force __u32)a->a6[3]);
+		break;
+	default:
+		key = 0;
+		break;
+	}
+	return key;
+}
+
 /*
  * Called with local BH disabled and the pool lock held.
  * _stack is known to be NULL or not at compile time,
@@ -143,15 +185,16 @@  static void unlink_from_unused(struct inet_peer *p)
  */
 #define lookup(_daddr, _stack) 					\
 ({								\
+	u32 key = inet_peer_key(_daddr);			\
 	struct inet_peer *u, **v;				\
 	if (_stack != NULL) {					\
 		stackptr = _stack;				\
 		*stackptr++ = &peer_root;			\
 	}							\
 	for (u = peer_root; u != peer_avl_empty; ) {		\
-		if (_daddr == u->v4daddr)			\
+		if (inet_peer_addr_equal(_daddr, &u->daddr))	\
 			break;					\
-		if ((__force __u32)_daddr < (__force __u32)u->v4daddr)	\
+		if (key < inet_peer_key(&u->daddr))		\
 			v = &u->avl_left;			\
 		else						\
 			v = &u->avl_right;			\
@@ -280,7 +323,7 @@  static void unlink_from_pool(struct inet_peer *p)
 	if (atomic_read(&p->refcnt) == 1) {
 		struct inet_peer **stack[PEER_MAXDEPTH];
 		struct inet_peer ***stackptr, ***delp;
-		if (lookup(p->v4daddr, stack) != p)
+		if (lookup(&p->daddr, stack) != p)
 			BUG();
 		delp = stackptr - 1; /* *delp[0] == p */
 		if (p->avl_left == peer_avl_empty) {
@@ -292,7 +335,7 @@  static void unlink_from_pool(struct inet_peer *p)
 			t = lookup_rightempty(p);
 			BUG_ON(*stackptr[-1] != t);
 			**--stackptr = t->avl_left;
-			/* t is removed, t->v4daddr > x->v4daddr for any
+			/* t is removed, t->daddr > x->daddr for any
 			 * x in p->avl_left subtree.
 			 * Put t in the old place of p. */
 			*delp[0] = t;
@@ -358,7 +401,7 @@  static int cleanup_once(unsigned long ttl)
 }
 
 /* Called with or without local BH being disabled. */
-struct inet_peer *inet_getpeer(__be32 daddr, int create)
+struct inet_peer *inet_getpeer(inet_peer_address_t *daddr, int create)
 {
 	struct inet_peer *p, *n;
 	struct inet_peer **stack[PEER_MAXDEPTH], ***stackptr;
@@ -384,10 +427,11 @@  struct inet_peer *inet_getpeer(__be32 daddr, int create)
 	n = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
 	if (n == NULL)
 		return NULL;
-	n->v4daddr = daddr;
+	n->daddr = *daddr;
 	atomic_set(&n->refcnt, 1);
 	atomic_set(&n->rid, 0);
-	atomic_set(&n->ip_id_count, secure_ip_id(daddr));
+	if (daddr->family == AF_INET)
+		atomic_set(&n->ip_id_count, secure_ip_id(daddr->a4));
 	n->tcp_ts_stamp = 0;
 
 	write_lock_bh(&peer_pool_lock);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index b59430b..681c8b9 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -142,8 +142,14 @@  static void ip4_frag_init(struct inet_frag_queue *q, void *a)
 	qp->saddr = arg->iph->saddr;
 	qp->daddr = arg->iph->daddr;
 	qp->user = arg->user;
-	qp->peer = sysctl_ipfrag_max_dist ?
-		inet_getpeer(arg->iph->saddr, 1) : NULL;
+	if (sysctl_ipfrag_max_dist) {
+		inet_peer_address_t addr;
+
+		addr.a4 = arg->iph->saddr;
+		addr.family = AF_INET;
+		qp->peer = inet_getpeer(&addr, 1);
+	} else
+		qp->peer = NULL;
 }
 
 static __inline__ void ip4_frag_free(struct inet_frag_queue *q)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 32d3961..9334e29 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1287,9 +1287,12 @@  skip_hashing:
 void rt_bind_peer(struct rtable *rt, int create)
 {
 	static DEFINE_SPINLOCK(rt_peer_lock);
+	inet_peer_address_t addr;
 	struct inet_peer *peer;
 
-	peer = inet_getpeer(rt->rt_dst, create);
+	addr.a4 = rt->rt_dst;
+	addr.family = AF_INET;
+	peer = inet_getpeer(&addr, create);
 
 	spin_lock_bh(&rt_peer_lock);
 	if (rt->peer == NULL) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4df5f9..9de6a12 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1350,7 +1350,7 @@  int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		    tcp_death_row.sysctl_tw_recycle &&
 		    (dst = inet_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
-		    peer->v4daddr == saddr) {
+		    peer->daddr.a4 == saddr) {
 			if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
 			    (s32)(peer->tcp_ts - req->ts_recent) >
 							TCP_PAWS_WINDOW) {
@@ -1770,7 +1770,11 @@  int tcp_v4_remember_stamp(struct sock *sk)
 	int release_it = 0;
 
 	if (!rt || rt->rt_dst != inet->inet_daddr) {
-		peer = inet_getpeer(inet->inet_daddr, 1);
+		inet_peer_address_t addr;
+
+		addr.a4 = inet->inet_daddr;
+		addr.family = AF_INET;
+		peer = inet_getpeer(&addr, 1);
 		release_it = 1;
 	} else {
 		if (!rt->peer)
@@ -1795,8 +1799,12 @@  int tcp_v4_remember_stamp(struct sock *sk)
 
 int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw)
 {
-	struct inet_peer *peer = inet_getpeer(tw->tw_daddr, 1);
+	inet_peer_address_t addr;
+	struct inet_peer *peer;
 
+	addr.a4 = tw->tw_daddr;
+	addr.family = AF_INET;
+	peer = inet_getpeer(&addr, 1);
 	if (peer) {
 		const struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);