Message ID: 1431664722-59539-1-git-send-email-roopa@cumulusnetworks.com
State: Changes Requested, archived
Delegated to: David Miller
roopa@cumulusnetworks.com writes:

> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> RTA_NEWDST netlink attribute today is used to carry mpls
> labels. This patch encodes family in RTA_NEWDST.
>
> RTA_NEWDST by its name and its use in iproute2 can be
> used as a generic new dst. But it is currently used only for
> mpls labels ie with family AF_MPLS. Encoding family in the
> attribute will help its reuse in the future.
>
> One usecase where family with RTA_NEWDST becomes necessary
> is when we implement mpls label edge router function.

I don't think this makes any sense.

How do you change the destination address on a packet to a value in another protocol? None of IPv4, IPv6, and MPLS support that.

Aka this attribute represents DNAT.

> This is a uapi change but RTA_NEWDST has not made
> into any release yet. so, trying to rush this change into
> 4.1 if acceptable.
>
> (iproute2 patch will follow)
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
> eric, if you had already thought about other ways to represent
> labels for LER function, pls let me know. I am looking for suggestions.

I have to some extent, nothing I am completely pleased with yet but enough that I can narrow things down to some extent.

I believe you are referring to the case where we have an ipv4 packet or an ipv6 packet and we are inserting it into an mpls tunnel for the next step of its travel. Egress from mpls appears to already be covered.

The bounding set of challenges looks something like this:
- We might be placing a full routing table into mpls with a different mpls tunnel for each different route. A full routing table today runs about 1 million routes, so we need to support inserting into the ballpark of 1 million different mpls tunnels. As it happens, 1 million is also 2^20, the number of mpls labels.

At 1 million tunnels that rules out using network devices.

Network devices have two basic things that cause scalability problems.
- struct netdevice and all of the sysfs and sysctl overheads: fixable, but they run at about 32K today.
- The accounting of ingress and egress packets. It takes a lot of percpu counters to make accounting fast, so I think fundamentally we want something without counters.

Which led me to look at the kernel xfrm subsystem. xfrm is a close match in requirements. But having to do a second inefficient lookup, and a lookup on more than what we normally use to route a packet, seems wrong. Not hooking into the routing tables seems wrong. The xfrm data structures themselves seem heavy weight for simple low cost encapsulation.

So I think we need to build yet another infrastructure for dealing with light weight tunnels (not just mpls).

What I would propose would be a new infrastructure for dealing with simple stateless tunnels. (AKA tunneling over IP or UDP or MPLS is fine, but tunneling over TCP or otherwise needing smarts to insert a packet into a tunnel is a no-go.)

To support entering these tunnels and egressing from these tunnels we need a number that would represent the tunnel type, one that is linux specific. This tunnel type would be a superset of the ipv4/ipv6 protocol numbers that are stored in /etc/protocols and http://www.iana.org/assignments/protocol-numbers, as well as a superset of the pseudo wire types at http://www.iana.org/assignments/pwe3-parameters. There are mpls tunnels that are not pseudo wires, and there are tunnels over ip that are encoded in udp or something else as well.

I believe I would represent this in rtnetlink with a new attribute, RTA_ENCAP. The current idea in my mind is that RTA_ENCAP would include the encapsulation type, a set of fixed headers and possibly some nested attributes (like output device), probably RTA_ENCAP and possibly RTA_DST.

At an implementation level I would hook these to the ipv4 and ipv6 routing tables at the same place as the destination network device, possibly sharing storage with where we put the destination network device today.
We should be able to use dst->output to do all of the work and thus be able to use many if not all of the same hooks as the fast path of xfrm.

We definitely need an encapsulation method because we need to deal with things like the ttl, mtu and fragmentation, and so we need to propagate bits algorithmically between the different layers.

There is also the complication that ip over mpls natively vs ip over an mpls pseudo wire, while in practice having the same encoding of the mpls labels, appear to propagate the ttl differently. In one case the ttl from the inner packet propagates to the outer packet during encapsulation and propagates to the inner packet when decapsulating, and in the other case the mpls tunnel is treated as a single hop by the ip layer.

So I think the right solution is to do the leg work and come up with an RTA_ENCAP netlink option, and the associated infrastructure.

The cheap hack version of this is to use RTA_FLOW and encode a 32bit number in the routing table and use a magic device to look up that 32bit number in the mpls routing table (or possibly an mpls flow table) and use that to generate the mpls labels.

I don't think we want to add the cheap hack. I think we want a good version that can work for all simple well defined tunnel types like mpls, gre, ipip, vxlan?, etc.

I think we also will want a small layer of indirection in the implementation of RTA_ENCAP such that we can define a simple encapsulation separately from defining the route. For IPv4 there are in some cases 8 different prefixes for a single destination address in the general case, and internal to a company's network I suspect the aggregation level can be much higher.

What such an encapsulation would be is that we would have a tunnel table with a simple integer index, and RTA_ENCAP would just hold that index to that tunnel.
The routing table would hold a reference counted pointer to the tunnel (so no extra lookups required in the fast path), and some other bits of netlink would create and destroy the light-weight encapsulations.

Anyway, that is my brainstorm on how things should look, and I really don't think extending RTA_NEWDST makes much if any sense at all. RTA_NEWDST is just DNAT.

Eric

>  include/uapi/linux/rtnetlink.h |    7 ++-
>  net/mpls/af_mpls.c             |  118 +++++++++++++++++++++++++++++++---------
>  net/mpls/internal.h            |    5 +-
>  3 files changed, 100 insertions(+), 30 deletions(-)
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 974db03..79879cb 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -356,8 +356,13 @@ struct rtvia {
>  	__u8 rtvia_addr[0];
>  };
>
> -/* RTM_CACHEINFO */
> +/* RTA_NEWDST */
> +struct rtnewdst {
> +	__kernel_sa_family_t family;
> +	__u8 dst[0];
> +};
>
> +/* RTM_CACHEINFO */
>  struct rta_cacheinfo {
>  	__u32 rta_clntref;
>  	__u32 rta_lastuse;
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 91ed656..6c31108 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -599,18 +599,13 @@ static int nla_put_via(struct sk_buff *skb,
>  	return 0;
>  }
>
> -int nla_put_labels(struct sk_buff *skb, int attrtype,
> +int nla_put_labels(struct sk_buff *skb, void *addr,
>  		   u8 labels, const u32 label[])
>  {
> -	struct nlattr *nla;
> -	struct mpls_shim_hdr *nla_label;
> +	struct mpls_shim_hdr *nla_label = addr;
>  	bool bos;
>  	int i;
> -	nla = nla_reserve(skb, attrtype, labels*4);
> -	if (!nla)
> -		return -EMSGSIZE;
>
> -	nla_label = nla_data(nla);
>  	bos = true;
>  	for (i = labels - 1; i >= 0; i--) {
>  		nla_label[i] = mpls_entry_encode(label[i], 0, 0, bos);
> @@ -620,25 +615,45 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
>  	return 0;
>  }
>
> -int nla_get_labels(const struct nlattr *nla,
> -		   u32 max_labels, u32 *labels, u32 label[])
> +int nla_put_newdst(struct sk_buff *skb, int attrtype, int family,
> +		   u8 labels, const u32 label[])
>  {
> -	unsigned len = nla_len(nla);
> -	unsigned nla_labels;
> -	struct mpls_shim_hdr *nla_label;
> -	bool bos;
> -	int i;
> +	struct nlattr *nla;
> +	struct rtnewdst *newdst;
>
> -	/* len needs to be an even multiple of 4 (the label size) */
> -	if (len & 3)
> -		return -EINVAL;
> +	nla = nla_reserve(skb, attrtype, 2 + (labels * 4));
> +	if (!nla)
> +		return -EMSGSIZE;
>
> -	/* Limit the number of new labels allowed */
> -	nla_labels = len/4;
> -	if (nla_labels > max_labels)
> -		return -EINVAL;
> +	newdst = nla_data(nla);
> +	newdst->family = family;
> +
> +	nla_put_labels(skb, &newdst->dst, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_put_newdst);
> +
> +int nla_put_dst(struct sk_buff *skb, int attrtype, u8 labels,
> +		const u32 label[])
> +{
> +	struct nlattr *nla;
> +
> +	nla = nla_reserve(skb, attrtype, labels * 4);
> +	if (!nla)
> +		return -EMSGSIZE;
> +
> +	nla_put_labels(skb, nla_data(nla), labels, label);
> +
> +	return 0;
> +}
> +
> +int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[])
> +{
> +	struct mpls_shim_hdr *nla_label = addr;
> +	bool bos;
> +	int i;
>
> -	nla_label = nla_data(nla);
>  	bos = true;
>  	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
>  		struct mpls_entry_decoded dec;
> @@ -665,6 +680,54 @@ int nla_get_labels(const struct nlattr *nla,
>  	return 0;
>  }
>
> +int nla_get_newdst(const struct nlattr *nla, u32 max_labels,
> +		   u32 *labels, u32 label[])
> +{
> +	struct rtnewdst *newdst = nla_data(nla);
> +	unsigned nla_labels;
> +	unsigned len;
> +
> +	if (nla_len(nla) < offsetof(struct rtnewdst, dst))
> +		return -EINVAL;
> +
> +	len = nla_len(nla) - sizeof(struct rtnewdst);
> +
> +	/* len needs to be an even multiple of 4 (the label size) */
> +	if (len & 3)
> +		return -EINVAL;
> +
> +	/* Limit the number of new labels allowed */
> +	nla_labels = len / 4;
> +	if (nla_labels > max_labels)
> +		return -EINVAL;
> +
> +	nla_get_labels(&newdst->dst, nla_labels, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_get_newdst);
> +
> +int nla_get_dst(const struct nlattr *nla,
> +		u32 max_labels, u32 *labels, u32 label[])
> +{
> +	unsigned len = nla_len(nla);
> +	unsigned nla_labels;
> +
> +	/* len needs to be an even multiple of 4 (the label size) */
> +	if (len & 3)
> +		return -EINVAL;
> +
> +	/* Limit the number of new labels allowed */
> +	nla_labels = len / 4;
> +	if (nla_labels > max_labels)
> +		return -EINVAL;
> +
> +	nla_get_labels(nla_data(nla), nla_labels, labels, label);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(nla_get_dst);
> +
>  static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>  			       struct mpls_route_config *cfg)
>  {
> @@ -721,7 +784,7 @@ static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		cfg->rc_ifindex = nla_get_u32(nla);
>  		break;
>  	case RTA_NEWDST:
> -		if (nla_get_labels(nla, MAX_NEW_LABELS,
> +		if (nla_get_newdst(nla, MAX_NEW_LABELS,
>  				   &cfg->rc_output_labels,
>  				   cfg->rc_output_label))
>  			goto errout;
> @@ -729,8 +792,8 @@ static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>  	case RTA_DST:
>  	{
>  		u32 label_count;
> -		if (nla_get_labels(nla, 1, &label_count,
> -				   &cfg->rc_label))
> +		if (nla_get_dst(nla, 1, &label_count,
> +				&cfg->rc_label))
>  			goto errout;
>
>  		/* The first 16 labels are reserved, and may not be set */
> @@ -831,14 +894,15 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
>  	rtm->rtm_flags = 0;
>
>  	if (rt->rt_labels &&
> -	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
> +	    nla_put_newdst(skb, RTA_NEWDST, AF_MPLS, rt->rt_labels,
> +			   rt->rt_label))
>  		goto nla_put_failure;
>  	if (nla_put_via(skb, rt->rt_via_table, rt->rt_via, rt->rt_via_alen))
>  		goto nla_put_failure;
>  	dev = rtnl_dereference(rt->rt_dev);
>  	if (dev && nla_put_u32(skb, RTA_OIF, dev->ifindex))
>  		goto nla_put_failure;
> -	if (nla_put_labels(skb, RTA_DST, 1, &label))
> +	if (nla_put_dst(skb, RTA_DST, 1, &label))
>  		goto nla_put_failure;
>
>  	nlmsg_end(skb, nlh);
> diff --git a/net/mpls/internal.h b/net/mpls/internal.h
> index b064c34..99d7a79 100644
> --- a/net/mpls/internal.h
> +++ b/net/mpls/internal.h
> @@ -49,7 +49,8 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
>  	return result;
>  }
>
> -int nla_put_labels(struct sk_buff *skb, int attrtype, u8 labels, const u32 label[]);
> -int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
> +int nla_put_labels(struct sk_buff *skb, void *addr, u8 labels,
> +		   const u32 label[]);
> +int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[]);
>
>  #endif /* MPLS_INTERNAL_H */

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/14/15, 11:35 PM, Eric W. Biederman wrote: > roopa@cumulusnetworks.com writes: > >> From: Roopa Prabhu <roopa@cumulusnetworks.com> >> >> RTA_NEWDST netlink attribute today is used to carry mpls >> labels. This patch encodes family in RTA_NEWDST. >> >> RTA_NEWDST by its name and its use in iproute2 can be >> used as a generic new dst. But it is currently used only for >> mpls labels ie with family AF_MPLS. Encoding family in the >> attribute will help its reuse in the future. >> >> One usecase where family with RTA_NEWDST becomes necessary >> is when we implement mpls label edge router function. > I don't think this makes any sense. > > How do you change the destination address on a packet to a value in > another protocol? None of IPv4, IPv6, and MPLS support that. > > Aka this attribute represents DNAT. thanks for that clarification (some details on what i was trying to do is at the end of this email). > > >> This is a uapi change but RTA_NEWDST has not made >> into any release yet. so, trying to rush this change into >> 4.1 if acceptable. >> >> (iproute2 patch will follow) >> >> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> >> --- >> eric, if you had already thought about other ways to represent >> labels for LER function, pls let me know. I am looking for suggestions. > I have to some extent, nothing I am completely pleased with yet but > enough that I can narrow things down to some extent. > > I believe you are referring to the case where we have an ipv4 packet > or an ipv6 packet and we are inserting it into an mpls tunnel for the > next step of it's travel. Egress from mpls appears to already be > convered. yes, correct. > > The bounding set of challenges looks something like this: > - We might be placing a full routing table into mpls with > a different mpls tunnel for each different route. > A full routing table today runs about 1 million routes > so we need to support inserting into the ballpark of 1 million > different mpls tunnels. 
> As it happens 1 million is also 2^20 or the number of mpls labels. > > At 1 million tunnels that rules out using network devices. > > Network devices have two basic things that cause scalability problems. > - struct netdevice and all of sysfs and sysctl overheads fixable > but they run at about 32K today. > - The accounting of ingress and egress packets. > It takes a lot of percpu counters to make accounting fast > so I think fundamentally we want something without counters. agreed. And we have the same conclusions. device is not an option. > Which lead me to look at the kernel xfrm subsystem. xfrm is a close > match in requirements. But having to do a second inefficient lookup and > lookup on more than what we normally used to route a packet seems > wrong. Not hooking into the routing tables seems wrong. The xfrm data > structures themselves seem heavy weight for simple low cost > encapsulation. I have not looked at the xfrm infrastructure in detail. will do so. > > > So I think we need to build yet another infrastructure for dealing with > light weight tunnels (not just mpls). ok, I was looking for a word to describe tunnels like mpls..., 'light weight tunnels' sounds good. > > What I would propose would be a new infrastructure for dealing with > simple stateless tunnels. (AKA tunneling over IP or UDP or MPLS is fine > but tunneling over TCP or otherwise needing smarts to insert a packet > into a tunnel is a no-go). > > To support entering these tunnels and egressing from these tunnels we > need a number that would represent the tunnel type that is linux > specific. This tunnel type would be a superset of the ipv4/ipv6 > protocol number that is stored in /etc/protocol and > http://www.iana.org/assignments/protocol-numbers As well as being a > superset the pseudo wire types > http://www.iana.org/assignments/pwe3-parameters > There are mpls tunnels that are not pseudo wires and there are > tunnels over ip that are encoded in udp are something else as well. 
> > I believe I would represent this in rtnetlink with a new attribute > RTA_ENCAP. The current idea in my mind is that RTA_ENCAP would include > the encapsulation type, a set of fixed headers and possibly some nested > attributes (like output device), probably RTA_ENCAP and possibly > RTA_DST. ok.. > > At an implementation level I would hook these to the ipv4 and ipv6 > routing tables at the same place as the destination network device, > possibly sharing storage with where we put the destination network > device today. > > We should be able to use dst->output to do all of the work and thus be > able to use many if not all of the same hooks as the fast path of xfrm. > > We definitely need an ecapsulation method because we need to deal with > things like the ttl, mtu and fragmentation and so we need to propogate > bits algorithmically between the different layers. > > There is also the complication that ip over mpls natively vs ip over an > mpls pseudo wire while in practice have the same encoding of the mpls > labels they appear propogate the ttl differently. In one case the ttl > from the inner packet propogates to the outer packet during > encapsulation and propogates to the inner packet when deccapsulating, > and in the other case the mpls tunnel is treated as a single hop > by the ip layer. > > > So I think the right solution is to do the leg work and come up with > an RTA_ENCAP netlink option, and the associated > > > The cheap hack version of this is to use RTA_FLOW and encode a 32bit > number in the routing table and use a magic device to look up that 32bit > number in the mpls routing table (or possibly an mpls flow table) > and use that to generate the mpls labels. > > I don't think we want add the cheap hack. I think we want a good > version that can work for all simple well defined tunnel types like > mpls, gre, ipip, vxlan?, etc. 
> > I think we also will want a small layer of indirection in the > implementation of RTA_ENCAP such that we can define a simple > encapsulation separately from defining the route. For IPv4 with in some > cases 8 different prefixes for a single destination address, in the > general case, and internal to a companies network I suspect the > aggregation level can be much higher. > > What such an encapsulation would be is that we would have a tunnel > table with simple integer index, and RTA_ENCAP would just hold > that index to that tunnel. The routing table would hold a reference > counted pointer to the tunnel (so no extra lookups required in the fast > path), and some other bits of netwlink would create and destroy the > light-weight encapsulations. ok, thanks for all the thoughts on this. I was not thinking of a separate tunnel table. > > Anyway that is my brainstorm on how things should look, and I really > don't think extending RTA_NEWDST makes much if any sense at all. > RTA_NEWDST is just DNAT. Let me tell you where I was going with RTA_NEWDST: I was completely on board with all your hints on a separate generic encapsulation layer for such "light weight tunnels" in your previous emails on this. The part that wasn't clear was a separate tunnel table. From what I saw, mpls today was the only such light weight tunnel. And, to me RTA_NEWDST was to some extent the RTA_ENCAP you were talking about. Clearly I seem to have ignored all the other encapsulation parameters that may need to go into it :). But, I guess in my mind I was thinking those would be additional attributes. But agreed, a new nested attribute could be a better option. For IPv4, for example, to me this looked something like adding the below: ip route add 10.1.1.0/30 as mpls 200 via inet 10.1.1.2 dev swp1, where the 'mpls 200' goes into RTA_NEWDST. 
And from the ipv4 code you look at the encap family and pass it on to the respective output func (I was looking at a possible abstraction layer here...maybe something like xfrm covering different tunnel types like you mention above). In the hacked up version of my patch (which I was not going to post if it looked like a hack anyway), I essentially set dst->output to mpls_output. I will see if I can come up with something along the lines of the RTA_ENCAP you describe above. Thanks for the details, Eric! Appreciate it.
On 15/05/15 07:35, Eric W. Biederman wrote: > roopa@cumulusnetworks.com writes: > >> From: Roopa Prabhu <roopa@cumulusnetworks.com> >> >> RTA_NEWDST netlink attribute today is used to carry mpls >> labels. This patch encodes family in RTA_NEWDST. >> >> RTA_NEWDST by its name and its use in iproute2 can be >> used as a generic new dst. But it is currently used only for >> mpls labels ie with family AF_MPLS. Encoding family in the >> attribute will help its reuse in the future. >> >> One usecase where family with RTA_NEWDST becomes necessary >> is when we implement mpls label edge router function. > > I don't think this makes any sense. > > How do you change the destination address on a packet to a value in > another protocol? None of IPv4, IPv6, and MPLS support that. > > Aka this attribute represents DNAT. > > >> This is a uapi change but RTA_NEWDST has not made >> into any release yet. so, trying to rush this change into >> 4.1 if acceptable. >> >> (iproute2 patch will follow) >> >> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> >> --- >> eric, if you had already thought about other ways to represent >> labels for LER function, pls let me know. I am looking for suggestions. > > I have to some extent, nothing I am completely pleased with yet but > enough that I can narrow things down to some extent. > > I believe you are referring to the case where we have an ipv4 packet > or an ipv6 packet and we are inserting it into an mpls tunnel for the > next step of it's travel. Egress from mpls appears to already be > convered. > > The bounding set of challenges looks something like this: > - We might be placing a full routing table into mpls with > a different mpls tunnel for each different route. > A full routing table today runs about 1 million routes > so we need to support inserting into the ballpark of 1 million > different mpls tunnels. > As it happens 1 million is also 2^20 or the number of mpls labels. 
I'd like to add a couple of other requirements into the mix: - Allow for prefix-independent convergence of BGP routes for IGP changes (BGP-PIC Core - see informational IETF draft-rtgwg-bgp-pic-02). What this means is that if the IGP route for the loopback address of a BGP peer router changes then all of the BGP routes recursive via that route should converge in a time independent of the number of such BGP routes. Whilst it might be desirable to have this happen in a pure IP case for the full Internet route table, the use of MPLS-VPNs makes this much more of a requirement because it scales the problem up by potentially multiplying the ~500k routes by the number of VPNs. - Ensure the TTL is correctly set in both the IP and MPLS header (i.e. avoid a re-switch and TTL decrement) > > At 1 million tunnels that rules out using network devices. > > Network devices have two basic things that cause scalability problems. > - struct netdevice and all of sysfs and sysctl overheads fixable > but they run at about 32K today. > - The accounting of ingress and egress packets. > It takes a lot of percpu counters to make accounting fast > so I think fundamentally we want something without counters. > > Which lead me to look at the kernel xfrm subsystem. xfrm is a close > match in requirements. But having to do a second inefficient lookup and > lookup on more than what we normally used to route a packet seems > wrong. Not hooking into the routing tables seems wrong. The xfrm data > structures themselves seem heavy weight for simple low cost > encapsulation. > > > So I think we need to build yet another infrastructure for dealing with > light weight tunnels (not just mpls). > > What I would propose would be a new infrastructure for dealing with > simple stateless tunnels. (AKA tunneling over IP or UDP or MPLS is fine > but tunneling over TCP or otherwise needing smarts to insert a packet > into a tunnel is a no-go). 
> > To support entering these tunnels and egressing from these tunnels we > need a number that would represent the tunnel type that is linux > specific. This tunnel type would be a superset of the ipv4/ipv6 > protocol number that is stored in /etc/protocol and > http://www.iana.org/assignments/protocol-numbers As well as being a > superset the pseudo wire types > http://www.iana.org/assignments/pwe3-parameters > There are mpls tunnels that are not pseudo wires and there are > tunnels over ip that are encoded in udp are something else as well. > > I believe I would represent this in rtnetlink with a new attribute > RTA_ENCAP. The current idea in my mind is that RTA_ENCAP would include > the encapsulation type, a set of fixed headers and possibly some nested > attributes (like output device), probably RTA_ENCAP and possibly > RTA_DST. > > At an implementation level I would hook these to the ipv4 and ipv6 > routing tables at the same place as the destination network device, > possibly sharing storage with where we put the destination network > device today. > > We should be able to use dst->output to do all of the work and thus be > able to use many if not all of the same hooks as the fast path of xfrm. > > We definitely need an ecapsulation method because we need to deal with > things like the ttl, mtu and fragmentation and so we need to propogate > bits algorithmically between the different layers. I really like this idea of having an RTA_ENCAP attribute that can specify the encapsulation to be used by any sort of encapsulation that might be useful to perform on a per-route basis. While we're brainstorming, I'll throw out another option: have output interface be a virtual interface for the encap type and then having the RTA_ENCAP data interpreted by that interface based on skb->dst. Note that the interface could be shared by multiple routes with differing encap data, but all sharing common parameters. 
In the case where there are no parameters to configure, or they're common to all the routes, there would only need to be one instance of the virtual interface (for a given namespace). The encap data for mpls could then store the outgoing labels, interface and nexthop. Alternatively, to support PIC as per the above requirement, it could store the VPN label and then either a local label allocated for the IGP prefix or the recursive nexthop, either of which could then be looked up at packet forwarding time to determine the outgoing label, interface and nexthop. Any thoughts on this? The use of the encap-specific virtual interface has the advantage of having an object on which parameters like ttl, mtu and don't-fragment could be configured and stored, whilst at the same time minimising the new infra required. > > There is also the complication that ip over mpls natively vs ip over an > mpls pseudo wire while in practice have the same encoding of the mpls > labels they appear propogate the ttl differently. In one case the ttl > from the inner packet propogates to the outer packet during > encapsulation and propogates to the inner packet when deccapsulating, > and in the other case the mpls tunnel is treated as a single hop > by the ip layer. > > > So I think the right solution is to do the leg work and come up with > an RTA_ENCAP netlink option, and the associated > > > The cheap hack version of this is to use RTA_FLOW and encode a 32bit > number in the routing table and use a magic device to look up that 32bit > number in the mpls routing table (or possibly an mpls flow table) > and use that to generate the mpls labels. > > I don't think we want add the cheap hack. I think we want a good > version that can work for all simple well defined tunnel types like > mpls, gre, ipip, vxlan?, etc. Agreed. 
> > I think we also will want a small layer of indirection in the > implementation of RTA_ENCAP such that we can define a simple > encapsulation separately from defining the route. For IPv4 with in some > cases 8 different prefixes for a single destination address, in the > general case, and internal to a companies network I suspect the > aggregation level can be much higher. > > What such an encapsulation would be is that we would have a tunnel > table with simple integer index, and RTA_ENCAP would just hold > that index to that tunnel. The routing table would hold a reference > counted pointer to the tunnel (so no extra lookups required in the fast > path), and some other bits of netwlink would create and destroy the > light-weight encapsulations. As long as the layer of indirection is optional I'm ok with that, as it might not be worth it for certain types of encaps that don't need to store much more data than the size of a pointer on a 64-bit architecture. > > Anyway that is my brainstorm on how things should look, and I really > don't think extending RTA_NEWDST makes much if any sense at all. > RTA_NEWDST is just DNAT. > > Eric Thanks, Rob
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 974db03..79879cb 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -356,8 +356,13 @@ struct rtvia {
 	__u8 rtvia_addr[0];
 };
 
-/* RTM_CACHEINFO */
+/* RTA_NEWDST */
+struct rtnewdst {
+	__kernel_sa_family_t family;
+	__u8 dst[0];
+};
 
+/* RTM_CACHEINFO */
 struct rta_cacheinfo {
 	__u32 rta_clntref;
 	__u32 rta_lastuse;
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 91ed656..6c31108 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -599,18 +599,13 @@ static int nla_put_via(struct sk_buff *skb,
 	return 0;
 }
 
-int nla_put_labels(struct sk_buff *skb, int attrtype,
+int nla_put_labels(struct sk_buff *skb, void *addr,
 		   u8 labels, const u32 label[])
 {
-	struct nlattr *nla;
-	struct mpls_shim_hdr *nla_label;
+	struct mpls_shim_hdr *nla_label = addr;
 	bool bos;
 	int i;
-	nla = nla_reserve(skb, attrtype, labels*4);
-	if (!nla)
-		return -EMSGSIZE;
 
-	nla_label = nla_data(nla);
 	bos = true;
 	for (i = labels - 1; i >= 0; i--) {
 		nla_label[i] = mpls_entry_encode(label[i], 0, 0, bos);
@@ -620,25 +615,45 @@ int nla_put_labels(struct sk_buff *skb, int attrtype,
 	return 0;
 }
 
-int nla_get_labels(const struct nlattr *nla,
-		   u32 max_labels, u32 *labels, u32 label[])
+int nla_put_newdst(struct sk_buff *skb, int attrtype, int family,
+		   u8 labels, const u32 label[])
 {
-	unsigned len = nla_len(nla);
-	unsigned nla_labels;
-	struct mpls_shim_hdr *nla_label;
-	bool bos;
-	int i;
+	struct nlattr *nla;
+	struct rtnewdst *newdst;
 
-	/* len needs to be an even multiple of 4 (the label size) */
-	if (len & 3)
-		return -EINVAL;
+	nla = nla_reserve(skb, attrtype, 2 + (labels * 4));
+	if (!nla)
+		return -EMSGSIZE;
 
-	/* Limit the number of new labels allowed */
-	nla_labels = len/4;
-	if (nla_labels > max_labels)
-		return -EINVAL;
+	newdst = nla_data(nla);
+	newdst->family = family;
+
+	nla_put_labels(skb, &newdst->dst, labels, label);
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_put_newdst);
+
+int nla_put_dst(struct sk_buff *skb, int attrtype, u8 labels,
+		const u32 label[])
+{
+	struct nlattr *nla;
+
+	nla = nla_reserve(skb, attrtype, labels * 4);
+	if (!nla)
+		return -EMSGSIZE;
+
+	nla_put_labels(skb, nla_data(nla), labels, label);
+
+	return 0;
+}
+
+int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[])
+{
+	struct mpls_shim_hdr *nla_label = addr;
+	bool bos;
+	int i;
 
-	nla_label = nla_data(nla);
 	bos = true;
 	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
 		struct mpls_entry_decoded dec;
@@ -665,6 +680,54 @@ int nla_get_labels(const struct nlattr *nla,
 	return 0;
 }
 
+int nla_get_newdst(const struct nlattr *nla, u32 max_labels,
+		   u32 *labels, u32 label[])
+{
+	struct rtnewdst *newdst = nla_data(nla);
+	unsigned nla_labels;
+	unsigned len;
+
+	if (nla_len(nla) < offsetof(struct rtnewdst, dst))
+		return -EINVAL;
+
+	len = nla_len(nla) - sizeof(struct rtnewdst);
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len / 4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_get_labels(&newdst->dst, nla_labels, labels, label);
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_get_newdst);
+
+int nla_get_dst(const struct nlattr *nla,
+		u32 max_labels, u32 *labels, u32 label[])
+{
+	unsigned len = nla_len(nla);
+	unsigned nla_labels;
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len / 4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_get_labels(nla_data(nla), nla_labels, labels, label);
+
+	return 0;
+}
+EXPORT_SYMBOL(nla_get_dst);
+
 static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 			       struct mpls_route_config *cfg)
 {
@@ -721,7 +784,7 @@ static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		cfg->rc_ifindex = nla_get_u32(nla);
 		break;
 	case RTA_NEWDST:
-		if (nla_get_labels(nla, MAX_NEW_LABELS,
+		if (nla_get_newdst(nla, MAX_NEW_LABELS,
 				   &cfg->rc_output_labels,
 				   cfg->rc_output_label))
 			goto errout;
@@ -729,8 +792,8 @@ static int rtm_to_route_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	case RTA_DST:
 	{
 		u32 label_count;
-		if (nla_get_labels(nla, 1, &label_count,
-				   &cfg->rc_label))
+		if (nla_get_dst(nla, 1, &label_count,
+				&cfg->rc_label))
 			goto errout;
 
 		/* The first 16 labels are reserved, and may not be set */
@@ -831,14 +894,15 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	rtm->rtm_flags = 0;
 
 	if (rt->rt_labels &&
-	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
+	    nla_put_newdst(skb, RTA_NEWDST, AF_MPLS, rt->rt_labels,
+			   rt->rt_label))
 		goto nla_put_failure;
 	if (nla_put_via(skb, rt->rt_via_table, rt->rt_via, rt->rt_via_alen))
 		goto nla_put_failure;
 	dev = rtnl_dereference(rt->rt_dev);
 	if (dev && nla_put_u32(skb, RTA_OIF, dev->ifindex))
 		goto nla_put_failure;
-	if (nla_put_labels(skb, RTA_DST, 1, &label))
+	if (nla_put_dst(skb, RTA_DST, 1, &label))
 		goto nla_put_failure;
 
 	nlmsg_end(skb, nlh);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index b064c34..99d7a79 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -49,7 +49,8 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }
 
-int nla_put_labels(struct sk_buff *skb, int attrtype, u8 labels, const u32 label[]);
-int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+int nla_put_labels(struct sk_buff *skb, void *addr, u8 labels,
+		   const u32 label[]);
+int nla_get_labels(void *addr, u32 nla_labels, u32 *labels, u32 label[]);
 
 #endif /* MPLS_INTERNAL_H */