
[net] mpls: modify RTA_NEWDST netlink attribute to include family

Message ID 1431664722-59539-1-git-send-email-roopa@cumulusnetworks.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Roopa Prabhu May 15, 2015, 4:38 a.m. UTC
From: Roopa Prabhu <roopa@cumulusnetworks.com>

The RTA_NEWDST netlink attribute is used today to carry mpls
labels. This patch encodes the address family in RTA_NEWDST.

RTA_NEWDST, by its name and its use in iproute2, can be
used as a generic new dst, but it is currently used only for
mpls labels, i.e. with family AF_MPLS. Encoding the family in the
attribute will help its reuse in the future.

One use case where a family in RTA_NEWDST becomes necessary
is when we implement the mpls label edge router (LER) function.

This is a uapi change, but RTA_NEWDST has not made it
into any release yet, so I am trying to get this change into
4.1 if acceptable.

(iproute2 patch will follow)

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
Eric, if you have already thought about other ways to represent
labels for the LER function, please let me know. I am looking for suggestions.

 include/uapi/linux/rtnetlink.h |    7 ++-
 net/mpls/af_mpls.c             |  118 +++++++++++++++++++++++++++++++---------
 net/mpls/internal.h            |    5 +-
 3 files changed, 100 insertions(+), 30 deletions(-)

Comments

Eric W. Biederman May 15, 2015, 6:35 a.m. UTC | #1
roopa@cumulusnetworks.com writes:

> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> RTA_NEWDST netlink attribute today is used to carry mpls
> labels. This patch encodes family in RTA_NEWDST.
>
> RTA_NEWDST by its name and its use in iproute2 can be
> used as a generic new dst. But it is currently used only for
> mpls labels ie with family AF_MPLS. Encoding family in the
> attribute will help its reuse in the future.
>
> One usecase where family with RTA_NEWDST becomes necessary
> is when we implement mpls label edge router function.

I don't think this makes any sense.

How do you change the destination address on a packet to a value in
another protocol?  None of IPv4, IPv6, and MPLS support that.

Aka this attribute represents DNAT.


> This is a uapi change but RTA_NEWDST has not made
> into any release yet. so, trying to rush this change into
> 4.1 if acceptable.
>
> (iproute2 patch will follow)
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
> eric, if you had already thought about other ways to represent
> labels for LER function, pls let me know. I am looking for suggestions.

I have to some extent; nothing I am completely pleased with yet, but
enough that I can narrow things down.

I believe you are referring to the case where we have an ipv4 packet
or an ipv6 packet and we are inserting it into an mpls tunnel for the
next step of its travel.  Egress from mpls appears to already be
covered.

The bounding set of challenges looks something like this:
- We might be placing a full routing table into mpls with
  a different mpls tunnel for each different route.
  A full routing table today runs about 1 million routes
  so we need to support inserting into the ballpark of 1 million
  different mpls tunnels.
  As it happens 1 million is also 2^20 or the number of mpls labels.

At 1 million tunnels that rules out using network devices.

Network devices have two basic things that cause scalability problems:
- struct net_device and all of the sysfs and sysctl overheads; these
  are fixable, but they run at about 32K today.
- The accounting of ingress and egress packets.
  It takes a lot of percpu counters to make accounting fast,
  so I think fundamentally we want something without counters.

That led me to look at the kernel xfrm subsystem.  xfrm is a close
match in requirements.  But having to do a second, inefficient lookup,
keyed on more than what we normally use to route a packet, seems
wrong.  Not hooking into the routing tables seems wrong.  The xfrm data
structures themselves seem heavyweight for simple low-cost
encapsulation.


So I think we need to build yet another infrastructure for dealing with
light weight tunnels (not just mpls).

What I would propose would be a new infrastructure for dealing with
simple stateless tunnels.  (AKA tunneling over IP or UDP or MPLS is fine
but tunneling over TCP or otherwise needing smarts to insert a packet
into a tunnel is a no-go).

To support entering these tunnels and egressing from these tunnels we
need a linux-specific number that would represent the tunnel type.
This tunnel type would be a superset of the ipv4/ipv6
protocol numbers that are stored in /etc/protocols and
http://www.iana.org/assignments/protocol-numbers as well as a
superset of the pseudo wire types at
http://www.iana.org/assignments/pwe3-parameters
There are mpls tunnels that are not pseudo wires, and there are
tunnels over ip that are encoded in udp or something else as well.

I believe I would represent this in rtnetlink with a new attribute
RTA_ENCAP.  The current idea in my mind is that RTA_ENCAP would include
the encapsulation type, a set of fixed headers and possibly some nested
attributes (like output device), probably RTA_ENCAP and possibly
RTA_DST.
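
As a rough illustration of what such a nested attribute could look like
on the wire, here is a userspace sketch that packs an encap type and an
output device into one nested TLV.  The attribute numbers and layout
below are invented for the sketch (RTA_ENCAP does not exist yet); only
the netlink TLV padding rules match the real thing:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal netlink TLV plumbing, redeclared here so the sketch is
 * self-contained; the real definitions live in <linux/netlink.h>. */
struct nlattr {
	uint16_t nla_len;	/* length of attribute, header included */
	uint16_t nla_type;
};
#define NLA_ALIGNTO 4
#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
#define NLA_HDRLEN ((int)NLA_ALIGN(sizeof(struct nlattr)))

enum {				/* hypothetical values, NOT uapi */
	RTA_ENCAP_HYP = 100,	/* the proposed nested attribute */
	ENCAP_TYPE_HYP = 1,	/* u16: which tunnel type */
	ENCAP_OIF_HYP = 2,	/* u32: output device ifindex */
};

/* Append one attribute at *off and advance past its padded length. */
static void nla_put(uint8_t *buf, size_t *off, uint16_t type,
		    const void *data, uint16_t len)
{
	struct nlattr *nla = (struct nlattr *)(buf + *off);

	nla->nla_len = NLA_HDRLEN + len;
	nla->nla_type = type;
	memcpy(buf + *off + NLA_HDRLEN, data, len);
	*off += NLA_ALIGN(nla->nla_len);
}

/* Build a nested RTA_ENCAP carrying an encap type and an output device. */
static size_t build_encap(uint8_t *buf, uint16_t encap_type, uint32_t oif)
{
	struct nlattr *outer = (struct nlattr *)buf;
	size_t off = NLA_HDRLEN;	/* leave room for the outer header */

	nla_put(buf, &off, ENCAP_TYPE_HYP, &encap_type, sizeof(encap_type));
	nla_put(buf, &off, ENCAP_OIF_HYP, &oif, sizeof(oif));
	outer->nla_len = off;		/* nested len covers all children */
	outer->nla_type = RTA_ENCAP_HYP;
	return off;
}
```

The nested form matters because each encapsulation type can then define
its own sub-attributes without burning more RTA_* numbers.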

At an implementation level I would hook these to the ipv4 and ipv6
routing tables at the same place as the destination network device,
possibly sharing storage with where we put the destination network
device today.

We should be able to use dst->output to do all of the work and thus be
able to use many if not all of the same hooks as the fast path of xfrm.

We definitely need an encapsulation method because we need to deal with
things like the ttl, mtu and fragmentation, and so we need to propagate
bits algorithmically between the different layers.

There is also the complication that ip over mpls natively vs. ip over
an mpls pseudo wire, while in practice having the same encoding of the
mpls labels, appear to propagate the ttl differently.  In one case the
ttl from the inner packet propagates to the outer packet during
encapsulation and propagates back to the inner packet when
decapsulating, and in the other case the mpls tunnel is treated as a
single hop by the ip layer.
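
A sketch of the two behaviours described above (RFC 3443 calls these
the "uniform" and "pipe" models); the fixed pipe-mode TTL value below
is illustrative, not mandated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ttl_mode { TTL_UNIFORM, TTL_PIPE };

#define MPLS_PIPE_TTL 255	/* hypothetical fixed TTL for pipe mode */

/* TTL to place in the outermost label when encapsulating. */
static uint8_t encap_ttl(enum ttl_mode mode, uint8_t inner_ip_ttl)
{
	if (mode == TTL_UNIFORM)
		return inner_ip_ttl - 1;	/* inner TTL propagates out */
	return MPLS_PIPE_TTL;		/* tunnel looks like one IP hop */
}

/* TTL the inner IP packet ends up with after decapsulation. */
static uint8_t decap_ttl(enum ttl_mode mode, uint8_t inner_ip_ttl,
			 uint8_t outer_mpls_ttl)
{
	/* Uniform: outer TTL propagates back in (never increasing it). */
	if (mode == TTL_UNIFORM && outer_mpls_ttl < inner_ip_ttl)
		return outer_mpls_ttl;
	return inner_ip_ttl;		/* pipe: inner TTL untouched */
}
```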


So I think the right solution is to do the leg work and come up with
an RTA_ENCAP netlink option, and the associated infrastructure.


The cheap hack version of this is to use RTA_FLOW and encode a 32bit
number in the routing table and use a magic device to look up that 32bit
number in the mpls routing table (or possibly an mpls flow table)
and use that to generate the mpls labels.

I don't think we want to add the cheap hack.  I think we want a good
version that can work for all simple well defined tunnel types like
mpls, gre, ipip, vxlan?, etc.


I think we also will want a small layer of indirection in the
implementation of RTA_ENCAP such that we can define a simple
encapsulation separately from defining the route.  For IPv4 there are
in some cases 8 different prefixes for a single destination address in
the general case, and internal to a company's network I suspect the
aggregation level can be much higher.

What such an encapsulation would be is that we would have a tunnel
table with a simple integer index, and RTA_ENCAP would just hold
that index to the tunnel.  The routing table would hold a reference
counted pointer to the tunnel (so no extra lookups are required in the
fast path), and some other bits of netlink would create and destroy the
light-weight encapsulations.
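
A minimal userspace model of that tunnel table, with one reference
counted entry per encapsulation so a route can hold a counted pointer
and the fast path needs no table lookup; all names and sizes here are
hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TUN_TABLE_SIZE 1024	/* illustrative; a real table would scale to ~1M */

struct lw_tunnel {
	unsigned refcnt;	/* table's ref + one per route holding us */
	int encap_type;		/* tunnel-type discriminator */
	unsigned char hdr[16];	/* precomputed encapsulation header bytes */
	unsigned hdr_len;
};

static struct lw_tunnel *tun_table[TUN_TABLE_SIZE];

/* Create a tunnel at a caller-chosen index (what RTA_ENCAP would carry). */
static struct lw_tunnel *tun_create(unsigned idx, int type,
				    const void *hdr, unsigned len)
{
	struct lw_tunnel *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->refcnt = 1;		/* the table's own reference */
	t->encap_type = type;
	memcpy(t->hdr, hdr, len);
	t->hdr_len = len;
	tun_table[idx] = t;
	return t;
}

/* Route installation takes a counted reference; forwarding then just
 * follows the pointer stored in the route. */
static struct lw_tunnel *tun_hold(unsigned idx)
{
	struct lw_tunnel *t = tun_table[idx];

	if (t)
		t->refcnt++;
	return t;
}

static void tun_put(struct lw_tunnel *t)
{
	if (t && --t->refcnt == 0)
		free(t);
}
```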

Anyway that is my brainstorm on how things should look, and I really
don't think extending RTA_NEWDST makes much if any sense at all.
RTA_NEWDST is just DNAT.

Eric


>  include/uapi/linux/rtnetlink.h |    7 ++-
>  net/mpls/af_mpls.c             |  118 +++++++++++++++++++++++++++++++---------
>  net/mpls/internal.h            |    5 +-
>  3 files changed, 100 insertions(+), 30 deletions(-)
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 974db03..79879cb 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -356,8 +356,13 @@ struct rtvia {
>  	__u8			rtvia_addr[0];
>  };
>  
> -/* RTM_CACHEINFO */
> +/* RTA_NEWDST */
> +struct rtnewdst {
> +	__kernel_sa_family_t	family;
> +	__u8	dst[0];
> +};
>  
> +/* RTM_CACHEINFO */
>  struct rta_cacheinfo {
>  	__u32	rta_clntref;
>  	__u32	rta_lastuse;
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 91ed656..6c31108 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -599,18 +599,13 @@ static int nla_put_via(struct sk_buff *skb,
>  	return 0;
>  }
>  
> -int nla_put_labels(struct sk_buff *skb, int attrtype,
> +int nla_put_labels(struct sk_buff *skb, void *addr,
>  		   u8 labels, const u32 label[])
>  {
> -	struct nlattr *nla;
> -	struct mpls_shim_hdr *nla_label;
> +	struct mpls_shim_hdr *nla_label = addr;
>  	bool bos;
>  	int i;
> -	nla = nla_reserve(skb, attrtype, labels*4);
> -	if (!nla)
> -		return -EMSGSIZE;
>  
> -	nla_label = nla_data(nla);
>  	bos = true;
>  	for (i = labels - 1; i >= 0; i--) {
>  		nla_label[i] = mpls_entry_encode(label[i], 0, 0, bos);
> @@ -620,25 +615,45 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
>  	return 0;
>  }
>  
> -int nla_get_labels(const struct nlattr *nla,
> -		   u32 max_labels, u32 *labels, u32 label[])
> +int nla_put_newdst(struct sk_buff *skb, int attrtype, int family,
> +		   u8 labels, const u32 label[])
>  {
> -	unsigned len = nla_len(nla);
> -	unsigned nla_labels;
> -	struct mpls_shim_hdr *nla_label;
> -	bool bos;
> -	int i;
> +	struct nlattr *nla;
> +	struct rtnewdst *newdst;
>  
> -	/* len needs to be an even multiple of 4 (the label size) */
> -	if (len & 3)
> -		return -EINVAL;
> +	nla = nla_reserve(skb, attrtype, 2 + (labels * 4));
> +	if (!nla)
> +		return -EMSGSIZE;
>  
> -	/* Limit the number of new labels allowed */
> -	nla_labels = len/4;
> -	if (nla_labels > max_labels)
> -		return -EINVAL;
> +	newdst = nla_data(nla);
> +	newdst->family = family;
> +
> +	nla_put_labels(skb, &newdst->dst, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_put_newdst);
> +
> +int nla_put_dst(struct sk_buff *skb, int attrtype, u8 labels,
> +		const u32 label[])
> +{
> +	struct nlattr *nla;
> +
> +	nla = nla_reserve(skb, attrtype, labels * 4);
> +	if (!nla)
> +		return -EMSGSIZE;
> +
> +	nla_put_labels(skb, nla_data(nla), labels, label);
> +
> +	return 0;
> +}
> +
> +int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[])
> +{
> +	struct mpls_shim_hdr *nla_label = addr;
> +	bool bos;
> +	int i;
>  
> -	nla_label = nla_data(nla);
>  	bos = true;
>  	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
>  		struct mpls_entry_decoded dec;
> @@ -665,6 +680,54 @@ int nla_get_labels(const struct nlattr *nla,
>  	return 0;
>  }
>  
> +int nla_get_newdst(const struct nlattr *nla, u32 max_labels,
> +		   u32 *labels, u32 label[])
> +{
> +	struct rtnewdst *newdst = nla_data(nla);
> +	unsigned nla_labels;
> +	unsigned len;
> +
> +	if (nla_len(nla) < offsetof(struct rtnewdst, dst))
> +		return -EINVAL;
> +
> +	len = nla_len(nla) - sizeof(struct rtnewdst);
> +
> +	/* len needs to be an even multiple of 4 (the label size) */
> +	if (len & 3)
> +		return -EINVAL;
> +
> +	/* Limit the number of new labels allowed */
> +	nla_labels = len / 4;
> +	if (nla_labels > max_labels)
> +		return -EINVAL;
> +
> +	nla_get_labels(&newdst->dst, nla_labels, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_get_newdst);
> +
> +int nla_get_dst(const struct nlattr *nla,
> +		u32 max_labels, u32 *labels, u32 label[])
> +{
> +	unsigned len = nla_len(nla);
> +	unsigned nla_labels;
> +
> +	/* len needs to be an even multiple of 4 (the label size) */
> +	if (len & 3)
> +		return -EINVAL;
> +
> +	/* Limit the number of new labels allowed */
> +	nla_labels = len / 4;
> +	if (nla_labels > max_labels)
> +		return -EINVAL;
> +
> +	nla_get_labels(nla_data(nla), nla_labels, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_get_dst);
> +
>  static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
>  			       struct mpls_route_config *cfg)
>  {
> @@ -721,7 +784,7 @@ static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
>  			cfg->rc_ifindex = nla_get_u32(nla);
>  			break;
>  		case RTA_NEWDST:
> -			if (nla_get_labels(nla, MAX_NEW_LABELS,
> +			if (nla_get_newdst(nla, MAX_NEW_LABELS,
>  					   &cfg->rc_output_labels,
>  					   cfg->rc_output_label))
>  				goto errout;
> @@ -729,8 +792,8 @@ static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
>  		case RTA_DST:
>  		{
>  			u32 label_count;
> -			if (nla_get_labels(nla, 1, &label_count,
> -					   &cfg->rc_label))
> +			if (nla_get_dst(nla, 1, &label_count,
> +					&cfg->rc_label))
>  				goto errout;
>  
>  			/* The first 16 labels are reserved, and may not be set */
> @@ -831,14 +894,15 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
>  	rtm->rtm_flags = 0;
>  
>  	if (rt->rt_labels &&
> -	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
> +	    nla_put_newdst(skb, RTA_NEWDST, AF_MPLS, rt->rt_labels,
> +			   rt->rt_label))
>  		goto nla_put_failure;
>  	if (nla_put_via(skb, rt->rt_via_table, rt->rt_via, rt->rt_via_alen))
>  		goto nla_put_failure;
>  	dev = rtnl_dereference(rt->rt_dev);
>  	if (dev && nla_put_u32(skb, RTA_OIF, dev->ifindex))
>  		goto nla_put_failure;
> -	if (nla_put_labels(skb, RTA_DST, 1, &label))
> +	if (nla_put_dst(skb, RTA_DST, 1, &label))
>  		goto nla_put_failure;
>  
>  	nlmsg_end(skb, nlh);
> diff --git a/net/mpls/internal.h b/net/mpls/internal.h
> index b064c34..99d7a79 100644
> --- a/net/mpls/internal.h
> +++ b/net/mpls/internal.h
> @@ -49,7 +49,8 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
>  	return result;
>  }
>  
> -int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels, const u32 label[]);
> -int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
> +int nla_put_labels(struct sk_buff *skb, void *addr,  u8 labels,
> +		   const u32 label[]);
> +int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[]);
>  
>  #endif /* MPLS_INTERNAL_H */
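
For reference, the packing that the nla_put_labels()/nla_get_labels()
helpers above implement is the standard MPLS shim entry: 20 bits of
label, 3 traffic-class bits, a bottom-of-stack bit and 8 bits of TTL
per 32-bit word.  A userspace sketch of that encoding (the kernel's
mpls_entry_encode() additionally stores the word big-endian; this
sketch stays in host order):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack one MPLS label stack entry, host byte order for the sketch. */
static uint32_t mpls_encode(uint32_t label, uint8_t tc, bool bos, uint8_t ttl)
{
	return (label & 0xfffff) << 12 |	/* 20-bit label */
	       (uint32_t)(tc & 0x7) << 9 |	/* 3 traffic-class bits */
	       (uint32_t)bos << 8 |		/* bottom-of-stack flag */
	       ttl;				/* 8-bit TTL */
}

static uint32_t mpls_label(uint32_t entry) { return entry >> 12; }
static bool mpls_bos(uint32_t entry) { return (entry >> 8) & 1; }
```

This is why the loops in the patch walk the label array backwards:
only the last (innermost) entry gets the bottom-of-stack bit set.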
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roopa Prabhu May 15, 2015, 6:18 p.m. UTC | #2
On 5/14/15, 11:35 PM, Eric W. Biederman wrote:
> roopa@cumulusnetworks.com writes:
>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> RTA_NEWDST netlink attribute today is used to carry mpls
>> labels. This patch encodes family in RTA_NEWDST.
>>
>> RTA_NEWDST by its name and its use in iproute2 can be
>> used as a generic new dst. But it is currently used only for
>> mpls labels ie with family AF_MPLS. Encoding family in the
>> attribute will help its reuse in the future.
>>
>> One usecase where family with RTA_NEWDST becomes necessary
>> is when we implement mpls label edge router function.
> I don't think this makes any sense.
>
> How do you change the destination address on a packet to a value in
> another protocol?  None of IPv4, IPv6, and MPLS support that.
>
> Aka this attribute represents DNAT.

Thanks for that clarification (some details on what I was trying to do
are at the end of this email).
>
>
>> This is a uapi change but RTA_NEWDST has not made
>> into any release yet. so, trying to rush this change into
>> 4.1 if acceptable.
>>
>> (iproute2 patch will follow)
>>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> eric, if you had already thought about other ways to represent
>> labels for LER function, pls let me know. I am looking for suggestions.
> I have to some extent, nothing I am completely pleased with yet but
> enough that I can narrow things down to some extent.
>
> I believe you are referring to the case where we have an ipv4 packet
> or an ipv6 packet and we are inserting it into an mpls tunnel for the
> next step of it's travel.  Egress from mpls appears to already be
> convered.
yes, correct.
>
> The bounding set of challenges looks something like this:
> - We might be placing a full routing table into mpls with
>    a different mpls tunnel for each different route.
>    A full routing table today runs about 1 million routes
>    so we need to support inserting into the ballpark of 1 million
>    different mpls tunnels.
>    As it happens 1 million is also 2^20 or the number of mpls labels.
>
> At 1 million tunnels that rules out using network devices.
>
> Network devices have two basic things that cause scalability problems.
> - struct netdevice and all of sysfs and sysctl overheads fixable
>    but they run at about 32K today.
> - The accounting of ingress and egress packets.
>    It takes a lot of percpu counters to make accounting fast
>    so I think fundamentally we want something without counters.

Agreed, and we reached the same conclusion: a device is not an option.
> Which lead me to look at the kernel xfrm subsystem.  xfrm is a close
> match in requirements.  But having to do a second inefficient lookup and
> lookup on more than what we normally used to route a packet seems
> wrong. Not hooking into the routing tables seems wrong.  The xfrm data
> structures themselves seem heavy weight for simple low cost
> encapsulation.
I have not looked at the xfrm infrastructure in detail; I will do so.
>
>
> So I think we need to build yet another infrastructure for dealing with
> light weight tunnels (not just mpls).

ok, I was looking for a word to describe tunnels like mpls..., 'light 
weight tunnels'
sounds good.
>
> What I would propose would be a new infrastructure for dealing with
> simple stateless tunnels.  (AKA tunneling over IP or UDP or MPLS is fine
> but tunneling over TCP or otherwise needing smarts to insert a packet
> into a tunnel is a no-go).
>
> To support entering these tunnels and egressing from these tunnels we
> need a number that would represent the tunnel type that is linux
> specific.  This tunnel type would be a superset of the ipv4/ipv6
> protocol number that is stored in /etc/protocol and
> http://www.iana.org/assignments/protocol-numbers As well as being a
> superset the pseudo wire types
> http://www.iana.org/assignments/pwe3-parameters
> There are mpls tunnels that are not pseudo wires and there are
> tunnels over ip that are encoded in udp are something else as well.
>
> I believe I would represent this in rtnetlink with a new attribute
> RTA_ENCAP.  The current idea in my mind is that RTA_ENCAP would include
> the encapsulation type, a set of fixed headers and possibly some nested
> attributes (like output device), probably RTA_ENCAP and possibly
> RTA_DST.
ok..

>
> At an implementation level I would hook these to the ipv4 and ipv6
> routing tables at the same place as the destination network device,
> possibly sharing storage with where we put the destination network
> device today.
>
> We should be able to use dst->output to do all of the work and thus be
> able to use many if not all of the same hooks as the fast path of xfrm.
>
> We definitely need an ecapsulation method because we need to deal with
> things like the ttl, mtu and fragmentation and so we need to propogate
> bits algorithmically between the different layers.
>
> There is also the complication that ip over mpls natively vs ip over an
> mpls pseudo wire while in practice have the same encoding of the mpls
> labels they appear propogate the ttl differently.  In one case the ttl
> from the inner packet propogates to the outer packet during
> encapsulation and propogates to the inner packet when deccapsulating,
> and in the other case the mpls tunnel is treated as a single hop
> by the ip layer.
>
>
> So I think the right solution is to do the leg work and come up with
> an RTA_ENCAP netlink option, and the associated
>
>
> The cheap hack version of this is to use RTA_FLOW and encode a 32bit
> number in the routing table and use a magic device to look up that 32bit
> number in the mpls routing table (or possibly an mpls flow table)
> and use that to generate the mpls labels.
>
> I don't think we want add the cheap hack.  I think we want a good
> version that can work for all simple well defined tunnel types like
> mpls, gre, ipip, vxlan?, etc.
>
>
> I think we also will want a small layer of indirection in the
> implementation of RTA_ENCAP such that we can define a simple
> encapsulation separately from defining the route.  For IPv4 with in some
> cases 8 different prefixes for a single destination address, in the
> general case, and internal to a companies network I suspect the
> aggregation level can be much higher.
>
> What such an encapsulation would be is that we would have a tunnel
> table with simple integer index, and RTA_ENCAP would just hold
> that index to that tunnel.  The routing table would hold a reference
> counted pointer to the tunnel (so no extra lookups required in the fast
> path), and some other bits of netwlink would create and destroy the
> light-weight encapsulations.

OK, thanks for all the thoughts on this. I was not thinking of a
separate tunnel table.
>
> Anyway that is my brainstorm on how things should look, and I really
> don't think extending RTA_NEWDST makes much if any sense at all.
> RTA_NEWDST is just DNAT.
Let me tell you where I was going with RTA_NEWDST: I was completely on
board with all your hints on a separate generic encapsulation layer for
such 'light weight tunnels' in your previous emails on this.  The part
that wasn't clear was the separate tunnel table.

From what I saw, mpls today was the only such light weight tunnel.  And,
to me, RTA_NEWDST was to some extent the RTA_ENCAP you were talking
about.  Clearly I seem to have ignored all the other encapsulation
parameters that may need to go into it :).  I guess in my mind I was
thinking those would be additional attributes, but I agree a new nested
attribute could be a better option.

From IPv4, for example, this looked to me something like adding the
below:

ip route add 10.1.1.0/30 as mpls 200 via inet 10.1.1.2 dev swp1

where the 'mpls 200' goes into RTA_NEWDST.
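
As a sketch, the command above would decode into fields roughly like
these (the struct and the mapping are a mock-up of mine, not kernel
code; the attribute names refer to the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of "ip route add 10.1.1.0/30 as mpls 200
 * via inet 10.1.1.2 dev swp1" -- a userspace illustration only. */
struct route_req {
	uint32_t dst;		/* RTA_DST: 10.1.1.0 */
	uint8_t dst_len;	/* /30 prefix length */
	uint32_t new_label;	/* RTA_NEWDST payload: MPLS label 200 */
	uint16_t new_family;	/* proposed rtnewdst.family: AF_MPLS (28) */
	uint32_t via;		/* nexthop: 10.1.1.2 */
	int oif;		/* RTA_OIF: ifindex of swp1 */
};

static struct route_req make_req(int swp1_ifindex)
{
	struct route_req r;

	r.dst = (10u << 24) | (1 << 16) | (1 << 8) | 0;	/* 10.1.1.0 */
	r.dst_len = 30;
	r.new_label = 200;
	r.new_family = 28;	/* AF_MPLS on Linux */
	r.via = (10u << 24) | (1 << 16) | (1 << 8) | 2;	/* 10.1.1.2 */
	r.oif = swp1_ifindex;
	return r;
}
```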

And from the ipv4 code you look at the encap family and pass the packet
on to the respective output function (I was looking at a possible
abstraction layer here, maybe something like xfrm, covering different
tunnel types like you mention above).

In the hacked-up version of my patch (which I was not going to post if
it looked like a hack anyway), I essentially set dst->output to
mpls_output.

I will see if I can come up with something along the lines of the
RTA_ENCAP you describe above.

Thanks for the details, Eric!  Appreciate it.
Robert Shearman May 19, 2015, 10:15 a.m. UTC | #3
On 15/05/15 07:35, Eric W. Biederman wrote:
> roopa@cumulusnetworks.com writes:
>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> RTA_NEWDST netlink attribute today is used to carry mpls
>> labels. This patch encodes family in RTA_NEWDST.
>>
>> RTA_NEWDST by its name and its use in iproute2 can be
>> used as a generic new dst. But it is currently used only for
>> mpls labels ie with family AF_MPLS. Encoding family in the
>> attribute will help its reuse in the future.
>>
>> One usecase where family with RTA_NEWDST becomes necessary
>> is when we implement mpls label edge router function.
>
> I don't think this makes any sense.
>
> How do you change the destination address on a packet to a value in
> another protocol?  None of IPv4, IPv6, and MPLS support that.
>
> Aka this attribute represents DNAT.
>
>
>> This is a uapi change but RTA_NEWDST has not made
>> into any release yet. so, trying to rush this change into
>> 4.1 if acceptable.
>>
>> (iproute2 patch will follow)
>>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> eric, if you had already thought about other ways to represent
>> labels for LER function, pls let me know. I am looking for suggestions.
>
> I have to some extent, nothing I am completely pleased with yet but
> enough that I can narrow things down to some extent.
>
> I believe you are referring to the case where we have an ipv4 packet
> or an ipv6 packet and we are inserting it into an mpls tunnel for the
> next step of it's travel.  Egress from mpls appears to already be
> convered.
>
> The bounding set of challenges looks something like this:
> - We might be placing a full routing table into mpls with
>    a different mpls tunnel for each different route.
>    A full routing table today runs about 1 million routes
>    so we need to support inserting into the ballpark of 1 million
>    different mpls tunnels.
>    As it happens 1 million is also 2^20 or the number of mpls labels.

I'd like to add a couple of other requirements into the mix:
- Allow for prefix-independent convergence of BGP routes for IGP changes 
(BGP-PIC Core - see informational IETF draft-rtgwg-bgp-pic-02). What 
this means is that if the IGP route for the loopback address of a BGP 
peer router changes then all of the BGP routes recursive via that route 
should converge in a time independent of the number of such BGP routes. 
Whilst it might be desirable to have this happen in a pure IP case for 
the full Internet route table, the use of MPLS-VPNs makes this much more 
of a requirement because it scales the problem up by potentially 
multiplying the ~500k routes by the number of VPNs.
- Ensure the TTL is correctly set in both the IP and MPLS header (i.e. 
avoid a re-switch and TTL decrement)

>
> At 1 million tunnels that rules out using network devices.
>
> Network devices have two basic things that cause scalability problems.
> - struct netdevice and all of sysfs and sysctl overheads fixable
>    but they run at about 32K today.
> - The accounting of ingress and egress packets.
>    It takes a lot of percpu counters to make accounting fast
>    so I think fundamentally we want something without counters.
>
> Which lead me to look at the kernel xfrm subsystem.  xfrm is a close
> match in requirements.  But having to do a second inefficient lookup and
> lookup on more than what we normally used to route a packet seems
> wrong. Not hooking into the routing tables seems wrong.  The xfrm data
> structures themselves seem heavy weight for simple low cost
> encapsulation.
>
>
> So I think we need to build yet another infrastructure for dealing with
> light weight tunnels (not just mpls).
>
> What I would propose would be a new infrastructure for dealing with
> simple stateless tunnels.  (AKA tunneling over IP or UDP or MPLS is fine
> but tunneling over TCP or otherwise needing smarts to insert a packet
> into a tunnel is a no-go).
>
> To support entering these tunnels and egressing from these tunnels we
> need a number that would represent the tunnel type that is linux
> specific.  This tunnel type would be a superset of the ipv4/ipv6
> protocol number that is stored in /etc/protocol and
> http://www.iana.org/assignments/protocol-numbers As well as being a
> superset the pseudo wire types
> http://www.iana.org/assignments/pwe3-parameters
> There are mpls tunnels that are not pseudo wires and there are
> tunnels over ip that are encoded in udp are something else as well.
>
> I believe I would represent this in rtnetlink with a new attribute
> RTA_ENCAP.  The current idea in my mind is that RTA_ENCAP would include
> the encapsulation type, a set of fixed headers and possibly some nested
> attributes (like output device), probably RTA_ENCAP and possibly
> RTA_DST.
>
> At an implementation level I would hook these to the ipv4 and ipv6
> routing tables at the same place as the destination network device,
> possibly sharing storage with where we put the destination network
> device today.
>
> We should be able to use dst->output to do all of the work and thus be
> able to use many if not all of the same hooks as the fast path of xfrm.
>
> We definitely need an ecapsulation method because we need to deal with
> things like the ttl, mtu and fragmentation and so we need to propogate
> bits algorithmically between the different layers.

I really like this idea of an RTA_ENCAP attribute that can specify any
sort of encapsulation that might be useful to perform on a per-route
basis.

While we're brainstorming, I'll throw out another option: have the
output interface be a virtual interface for the encap type, and then
have the RTA_ENCAP data interpreted by that interface based on
skb->dst. Note that the interface could be shared by multiple routes
with differing encap data, but all sharing common parameters. In the
case where there are no parameters to configure, or they're common to
all the routes, there would only need to be one instance of the virtual
interface (for a given namespace).

The encap data for mpls could then store the outgoing labels, interface 
and nexthop. Alternatively, to support PIC as per the above requirement, 
it could store the VPN label and then either a local label allocated for 
the IGP prefix or the recursive nexthop, either of which could then be 
looked up at packet forwarding time to determine the outgoing label, 
interface and nexthop.

Any thoughts on this? The use of the encap-specific virtual interface 
has the advantage of having an object on which parameters like ttl, mtu 
and don't-fragment could be configured and stored, whilst at the same 
time minimising the new infra required.

>
> There is also the complication that ip over mpls natively vs ip over an
> mpls pseudo wire while in practice have the same encoding of the mpls
> labels they appear propogate the ttl differently.  In one case the ttl
> from the inner packet propogates to the outer packet during
> encapsulation and propogates to the inner packet when deccapsulating,
> and in the other case the mpls tunnel is treated as a single hop
> by the ip layer.
>
>
> So I think the right solution is to do the leg work and come up with
> an RTA_ENCAP netlink option, and the associated
>
>
> The cheap hack version of this is to use RTA_FLOW and encode a 32bit
> number in the routing table and use a magic device to look up that 32bit
> number in the mpls routing table (or possibly an mpls flow table)
> and use that to generate the mpls labels.
>
> I don't think we want add the cheap hack.  I think we want a good
> version that can work for all simple well defined tunnel types like
> mpls, gre, ipip, vxlan?, etc.

Agreed.

>
> I think we also will want a small layer of indirection in the
> implementation of RTA_ENCAP such that we can define a simple
> encapsulation separately from defining the route.  For IPv4 with in some
> cases 8 different prefixes for a single destination address, in the
> general case, and internal to a companies network I suspect the
> aggregation level can be much higher.
>
> What such an encapsulation would be is that we would have a tunnel
> table with simple integer index, and RTA_ENCAP would just hold
> that index to that tunnel.  The routing table would hold a reference
> counted pointer to the tunnel (so no extra lookups required in the fast
> path), and some other bits of netwlink would create and destroy the
> light-weight encapsulations.

As long as the layer of indirection is optional I'm OK with that, as it 
might not be worth it for certain types of encaps that don't need to 
store much more data than the size of a pointer on a 64-bit architecture.

>
> Anyway that is my brainstorm on how things should look, and I really
> don't think extending RTA_NEWDST makes much if any sense at all.
> RTA_NEWDST is just DNAT.
>
> Eric

Thanks,
Rob
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_put_newdst);
+
+int nla_put_dst(struct sk_buff *skb, int attrtype, u8 labels,
+		const u32 label[])
+{
+	struct nlattr *nla;
+
+	nla = nla_reserve(skb, attrtype, labels * 4);
+	if (!nla)
+		return -EMSGSIZE;
+
+	nla_put_labels(skb, nla_data(nla), labels, label);
+
+	return 0;
+}
+
+int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[])
+{
+	struct mpls_shim_hdr *nla_label = addr;
+	bool bos;
+	int i;
 
-	nla_label = nla_data(nla);
 	bos = true;
 	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
 		struct mpls_entry_decoded dec;
@@ -665,6 +680,54 @@  int nla_get_labels(const struct nlattr *nla,
 	return 0;
 }
 
+int nla_get_newdst(const struct nlattr *nla, u32 max_labels,
+		   u32 *labels, u32 label[])
+{
+	struct rtnewdst *newdst = nla_data(nla);
+	unsigned nla_labels;
+	unsigned len;
+
+	if (nla_len(nla) < offsetof(struct rtnewdst, dst))
+		return -EINVAL;
+
+	len = nla_len(nla) - sizeof(struct rtnewdst);
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len / 4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_get_labels(&newdst->dst, nla_labels, labels, label);
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_get_newdst);
+
+int nla_get_dst(const struct nlattr *nla,
+		u32 max_labels, u32 *labels, u32 label[])
+{
+	unsigned len = nla_len(nla);
+	unsigned nla_labels;
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len / 4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_get_labels(nla_data(nla), nla_labels, labels, label);
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_get_dst);
+
 static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
 			       struct mpls_route_config *cfg)
 {
@@ -721,7 +784,7 @@  static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
 			cfg->rc_ifindex = nla_get_u32(nla);
 			break;
 		case RTA_NEWDST:
-			if (nla_get_labels(nla, MAX_NEW_LABELS,
+			if (nla_get_newdst(nla, MAX_NEW_LABELS,
 					   &cfg->rc_output_labels,
 					   cfg->rc_output_label))
 				goto errout;
@@ -729,8 +792,8 @@  static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
 		case RTA_DST:
 		{
 			u32 label_count;
-			if (nla_get_labels(nla, 1, &label_count,
-					   &cfg->rc_label))
+			if (nla_get_dst(nla, 1, &label_count,
+					&cfg->rc_label))
 				goto errout;
 
 			/* The first 16 labels are reserved, and may not be set */
@@ -831,14 +894,15 @@  static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	rtm->rtm_flags = 0;
 
 	if (rt->rt_labels &&
-	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
+	    nla_put_newdst(skb, RTA_NEWDST, AF_MPLS, rt->rt_labels,
+			   rt->rt_label))
 		goto nla_put_failure;
 	if (nla_put_via(skb, rt->rt_via_table, rt->rt_via, rt->rt_via_alen))
 		goto nla_put_failure;
 	dev = rtnl_dereference(rt->rt_dev);
 	if (dev && nla_put_u32(skb, RTA_OIF, dev->ifindex))
 		goto nla_put_failure;
-	if (nla_put_labels(skb, RTA_DST, 1, &label))
+	if (nla_put_dst(skb, RTA_DST, 1, &label))
 		goto nla_put_failure;
 
 	nlmsg_end(skb, nlh);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index b064c34..99d7a79 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -49,7 +49,8 @@  static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }
 
-int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels, const u32 label[]);
-int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+int nla_put_labels(struct sk_buff *skb, void *addr,  u8 labels,
+		   const u32 label[]);
+int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[]);
 
 #endif /* MPLS_INTERNAL_H */