
[net,v1] fib_rules: interface group matching

Message ID 20160914124025.13417-1-vincent@bernat.im
State Rejected, archived
Delegated to: David Miller

Commit Message

Vincent Bernat Sept. 14, 2016, 12:40 p.m. UTC
When a user wants to assign a routing table to a group of incoming
interfaces, the current solutions are:

 - one IP rule for each interface (scalability problems)
 - use of fwmark and the devgroup matcher (doesn't work with internal
   route lookups, as done for example by RPF)
 - use of VRF devices (more complex)

Each interface can be assigned to a numeric group using IFLA_GROUP. This
commit lets a user reference such a group in an IP rule. Here is an
example of iproute2 output:

    $ ip rule show
    0:      from all lookup local
    32764:  from all iifgroup 2 lookup 2
    32765:  from all iifgroup 1 lookup 1
    32766:  from all lookup main
    32767:  from all lookup default

Signed-off-by: Vincent Bernat <vincent@bernat.im>
---
 include/net/fib_rules.h        |  6 ++++-
 include/uapi/linux/fib_rules.h |  2 ++
 net/core/fib_rules.c           | 57 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 63 insertions(+), 2 deletions(-)

Comments

Vincent Bernat Sept. 14, 2016, 12:43 p.m. UTC | #1
❦ 14 September 2016 14:40 CEST, Vincent Bernat <vincent@bernat.im>:

> Each interface can be assigned to a numeric group using IFLA_GROUP. This
> commit enables a user to reference such a group into an IP rule. Here is
> an example of output of iproute2:
>
>     $ ip rule show
>     0:      from all lookup local
>     32764:  from all iifgroup 2 lookup 2
>     32765:  from all iifgroup 1 lookup 1
>     32766:  from all lookup main
>     32767:  from all lookup default

The patch for iproute2 is available here (not posted inline to avoid
confusing patchwork):
 http://paste.debian.net/821247/
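
For reference, a rough and untested sketch of how a program could install
the equivalent of "from all iifgroup 1 lookup 1" directly over rtnetlink,
independent of the iproute2 patch; FRA_IIFGROUP only exists with this
kernel patch applied, and the group/table numbers are just the ones from
the example above:

/* Sketch: add an IPv4 rule "from all iifgroup 1 lookup 1" via rtnetlink.
 * Error/ACK handling is omitted. */
#include <linux/fib_rules.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Append a u32 attribute at the current end of the netlink message. */
static void add_u32(struct nlmsghdr *nlh, unsigned short type, __u32 value)
{
    struct rtattr *rta;

    rta = (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
    rta->rta_type = type;
    rta->rta_len = RTA_LENGTH(sizeof(value));
    memcpy(RTA_DATA(rta), &value, sizeof(value));
    nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

int main(void)
{
    struct {
        struct nlmsghdr nlh;
        struct fib_rule_hdr frh;
        char attrs[64];
    } req;
    struct sockaddr_nl kernel;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&req, 0, sizeof(req));
    memset(&kernel, 0, sizeof(kernel));
    kernel.nl_family = AF_NETLINK;

    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.frh));
    req.nlh.nlmsg_type = RTM_NEWRULE;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;

    req.frh.family = AF_INET;
    req.frh.action = FR_ACT_TO_TBL;
    req.frh.table = 1;                      /* "lookup 1" */

    add_u32(&req.nlh, FRA_IIFGROUP, 1);     /* "iifgroup 1" */

    if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
        perror("sendto");
        return 1;
    }
    close(fd);
    return 0;
}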
David Ahern Sept. 14, 2016, 2:15 p.m. UTC | #2
On 9/14/16 6:40 AM, Vincent Bernat wrote:
> When a user wants to assign a routing table to a group of incoming
> interfaces, the current solutions are:
> 
>  - one IP rule for each interface (scalability problems)
>  - use of fwmark and devgroup matcher (don't work with internal route
>    lookups, used for example by RPF)
>  - use of VRF devices (more complex)

Why do you believe that? A VRF is a formalized grouping of interfaces that includes an API for locally generated traffic to specify which VRF/group to use. And with the l3mdev rule you only need one rule for all VRFs, regardless of their number, which is the best solution to the scalability problem of adding rules per device/group/VRF.

What use case are you trying to solve?
Vincent Bernat Sept. 14, 2016, 2:25 p.m. UTC | #3
❦ 14 September 2016 16:15 CEST, David Ahern <dsa@cumulusnetworks.com>:

>> When a user wants to assign a routing table to a group of incoming
>> interfaces, the current solutions are:
>> 
>>  - one IP rule for each interface (scalability problems)
>>  - use of fwmark and devgroup matcher (don't work with internal route
>>    lookups, used for example by RPF)
>>  - use of VRF devices (more complex)
>
> Why do you believe that? A VRF is a formalized grouping of interfaces
> that includes an API for locally generated traffic to specify which
> VRF/group to use. And, with the l3mdev rule you only need 1 rule for
> all VRFs regardless of the number which is the best solution to the
> scalability problem of adding rules per device/group/VRF.
>
> What use case are trying to solve?

Local processes have to be made aware of the VRF by binding to the
pseudo-device. Some processes may be tricked by LD_PRELOAD but some
won't (like stuff written in Go). Maybe I should just find a better way
to bind a process to a VRF without its cooperation.
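
For context, the LD_PRELOAD trick amounts to wrapping socket() so that
every new AF_INET/AF_INET6 socket is forced into the VRF with
SO_BINDTODEVICE; here is a rough sketch, where the "vrf-blue" device name
is made up:

/* vrf_preload.c -- sketch of the LD_PRELOAD hack; not production code.
 * Build: gcc -shared -fPIC -o vrf_preload.so vrf_preload.c -ldl
 * Use:   LD_PRELOAD=./vrf_preload.so some-daemon */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/socket.h>

#define VRF_DEV "vrf-blue"   /* made-up VRF device name */

int socket(int domain, int type, int protocol)
{
    static int (*real_socket)(int, int, int);
    int fd;

    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

    fd = real_socket(domain, type, protocol);
    if (fd >= 0 && (domain == AF_INET || domain == AF_INET6))
        /* Bind the new socket to the VRF device.  A Go binary that
         * makes raw syscalls never goes through this libc wrapper,
         * which is the limitation mentioned above. */
        setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                   VRF_DEV, strlen(VRF_DEV) + 1);
    return fd;
}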
David Ahern Sept. 14, 2016, 2:39 p.m. UTC | #4
On 9/14/16 8:25 AM, Vincent Bernat wrote:
>  ❦ 14 septembre 2016 16:15 CEST, David Ahern <dsa@cumulusnetworks.com> :
> 
>>> When a user wants to assign a routing table to a group of incoming
>>> interfaces, the current solutions are:
>>>
>>>  - one IP rule for each interface (scalability problems)
>>>  - use of fwmark and devgroup matcher (don't work with internal route
>>>    lookups, used for example by RPF)
>>>  - use of VRF devices (more complex)
>>
>> Why do you believe that? A VRF is a formalized grouping of interfaces
>> that includes an API for locally generated traffic to specify which
>> VRF/group to use. And, with the l3mdev rule you only need 1 rule for
>> all VRFs regardless of the number which is the best solution to the
>> scalability problem of adding rules per device/group/VRF.
>>
>> What use case are trying to solve?
> 
> Local processes have to be made aware of the VRF by binding to the
> pseudo-device. Some processes may be tricked by LD_PRELOAD but some
> won't (like stuff written in Go). Maybe I should just find a better way
> to bind a process to a VRF without its cooperation.
> 

What API are you using for interface groups? How does an app tell the kernel to use interface group 1 versus group 2?


LD_PRELOAD and overloading socket() is an ad-hoc hack at best, with many holes, as you have found.

We (Cumulus Linux) are using this cgroups patch:
   http://www.mail-archive.com/netdev@vger.kernel.org/msg93408.html

I want something formal like the cgroups patch or even the first idea of adding a default sk_bound_dev_if to the task struct:
    https://github.com/dsahern/linux/commit/b3e5ccc291505c8a503edb20ea2c2b5e86bed96f

Parent-to-child inheritance of the setting is a requirement, as is having the setting applied to all IPv4/IPv6 sockets without any action by the process itself.

Still have some work to do to get a solution into the kernel.
Vincent Bernat Sept. 14, 2016, 3:14 p.m. UTC | #5
❦ 14 September 2016 16:39 CEST, David Ahern <dsa@cumulusnetworks.com>:

>>>> When a user wants to assign a routing table to a group of incoming
>>>> interfaces, the current solutions are:
>>>>
>>>>  - one IP rule for each interface (scalability problems)
>>>>  - use of fwmark and devgroup matcher (don't work with internal route
>>>>    lookups, used for example by RPF)
>>>>  - use of VRF devices (more complex)
>>>
>>> Why do you believe that? A VRF is a formalized grouping of interfaces
>>> that includes an API for locally generated traffic to specify which
>>> VRF/group to use. And, with the l3mdev rule you only need 1 rule for
>>> all VRFs regardless of the number which is the best solution to the
>>> scalability problem of adding rules per device/group/VRF.
>>>
>>> What use case are trying to solve?
>> 
>> Local processes have to be made aware of the VRF by binding to the
>> pseudo-device. Some processes may be tricked by LD_PRELOAD but some
>> won't (like stuff written in Go). Maybe I should just find a better way
>> to bind a process to a VRF without its cooperation.
>
> What API are you using for interface groups? How does an app tell the
> kernel to use interface group 1 versus group 2?

In my testbed, I have only one local application, dnsmasq acting as a
DHCP server. It sends the answer back through the physical interface
(with sendmsg() and ancillary data). So it makes my argument a bit moot,
as the situation is in fact worse without VRF. :-/
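
For the record, the sendmsg() + ancillary data mechanism looks roughly
like the sketch below; fd, dest and ifindex are placeholders and error
handling is omitted:

/* Send a UDP reply out of a specific interface by attaching IP_PKTINFO
 * ancillary data, the same mechanism dnsmasq relies on. */
#define _GNU_SOURCE           /* struct in_pktinfo on older glibc */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t reply_via_ifindex(int fd, const struct sockaddr_in *dest,
                                 const void *payload, size_t len, int ifindex)
{
    char cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
    struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
    struct msghdr msg = {
        .msg_name = (void *)dest,
        .msg_namelen = sizeof(*dest),
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = cbuf,
        .msg_controllen = sizeof(cbuf),
    };
    struct in_pktinfo pktinfo = { .ipi_ifindex = ifindex };
    struct cmsghdr *cmsg;

    memset(cbuf, 0, sizeof(cbuf));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_PKTINFO;
    cmsg->cmsg_len = CMSG_LEN(sizeof(pktinfo));
    memcpy(CMSG_DATA(cmsg), &pktinfo, sizeof(pktinfo));

    return sendmsg(fd, &msg, 0);
}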

My testbed is here (with use of VRF, more recent commits just use plain
ip rules):

 https://github.com/vincentbernat/network-lab/blob/d86e9ed658863ef0f51d7b853d0dc9f8b7427b21/lab-l3-hyperv/setup

I could just give more time to VRF. I also had some concerns over
performance with the way Netfilter integration is done, but I understand
that I could just stay away from POSTROUTING rules which is the only
hook executed twice?
David Ahern Sept. 14, 2016, 3:25 p.m. UTC | #6
On 9/14/16 9:14 AM, Vincent Bernat wrote:
> I could just give more time to VRF. I also had some concerns over
> performance with the way Netfilter integration is done, but I understand
> that I could just stay away from POSTROUTING rules which is the only
> hook executed twice?
> 

With the changes that were committed this past weekend, the VRF code is now set up so that I can set a flag on a per-VRF basis to disable the extra rx and tx processing - i.e., no network taps, no netfilter, no qdisc, etc. That drops the overhead of VRF to ~3%, maybe a bit less. I need to think about the user API a bit more and formalize the patch. Given my other commitments, that probably won't happen until mid-October. But in terms of a building block, the overhead of VRF is continuing to drop.
Vincent Bernat Sept. 14, 2016, 4:01 p.m. UTC | #7
❦ 14 September 2016 17:25 CEST, David Ahern <dsa@cumulusnetworks.com>:

>> I could just give more time to VRF. I also had some concerns over
>> performance with the way Netfilter integration is done, but I understand
>> that I could just stay away from POSTROUTING rules which is the only
>> hook executed twice?

> With the changes that were committed this past weekend, the VRF code
> is now setup where I can set a flag on a per VRF basis to disable the
> extra rx and tx processing - ie., no network taps, no netfilter, no
> qdisc, etc. Drops the overhead of VRF to ~3% maybe a bit less. I need
> to think about the user api a bit more and formalize the patch. Given
> my other commitments that probably won't happen until mid-October. But
> in terms of a building block, the overhead of VRF is continuing to
> drop.

Fine by me. We can drop my patch.

Thanks!

Patch

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 456e4a6006ab..a96b186ccd02 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -28,6 +28,8 @@  struct fib_rule {
 	u32			pref;
 	int			suppress_ifgroup;
 	int			suppress_prefixlen;
+	int			iifgroup;
+	int			oifgroup;
 	char			iifname[IFNAMSIZ];
 	char			oifname[IFNAMSIZ];
 	struct rcu_head		rcu;
@@ -92,7 +94,9 @@  struct fib_rules_ops {
 	[FRA_SUPPRESS_PREFIXLEN] = { .type = NLA_U32 }, \
 	[FRA_SUPPRESS_IFGROUP] = { .type = NLA_U32 }, \
 	[FRA_GOTO]	= { .type = NLA_U32 }, \
-	[FRA_L3MDEV]	= { .type = NLA_U8 }
+	[FRA_L3MDEV]	= { .type = NLA_U8 }, \
+	[FRA_IIFGROUP] = { .type = NLA_U32 }, \
+	[FRA_OIFGROUP] = { .type = NLA_U32 }
 
 static inline void fib_rule_get(struct fib_rule *rule)
 {
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 14404b3ebb89..0bf5a5e94d9a 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -51,6 +51,8 @@  enum {
 	FRA_OIFNAME,
 	FRA_PAD,
 	FRA_L3MDEV,	/* iif or oif is l3mdev goto its table */
+	FRA_IIFGROUP,	/* interface group */
+	FRA_OIFGROUP,
 	__FRA_MAX
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index be4629c344a6..f8ed6ba85c72 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -37,6 +37,9 @@  int fib_default_rule_add(struct fib_rules_ops *ops,
 	r->suppress_prefixlen = -1;
 	r->suppress_ifgroup = -1;
 
+	r->iifgroup = -1;
+	r->oifgroup = -1;
+
 	/* The lock is not required here, the list in unreacheable
 	 * at the moment this function is called */
 	list_add_tail(&r->list, &ops->rules_list);
@@ -193,6 +196,30 @@  static int fib_rule_match(struct fib_rule *rule, struct fib_rules_ops *ops,
 	if (rule->l3mdev && !l3mdev_fib_rule_match(rule->fr_net, fl, arg))
 		goto out;
 
+	if (rule->iifgroup != -1) {
+		struct net_device *dev;
+
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(rule->fr_net, fl->flowi_iif);
+		if (!dev || dev->group != rule->iifgroup) {
+			rcu_read_unlock();
+			goto out;
+		}
+		rcu_read_unlock();
+	}
+
+	if (rule->oifgroup != -1) {
+		struct net_device *dev;
+
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(rule->fr_net, fl->flowi_oif);
+		if (!dev || dev->group != rule->oifgroup) {
+			rcu_read_unlock();
+			goto out;
+		}
+		rcu_read_unlock();
+	}
+
 	ret = ops->match(rule, fl, flags);
 out:
 	return (rule->flags & FIB_RULE_INVERT) ? !ret : ret;
@@ -305,6 +332,12 @@  static int rule_exists(struct fib_rules_ops *ops, struct fib_rule_hdr *frh,
 		if (r->l3mdev != rule->l3mdev)
 			continue;
 
+		if (r->iifgroup != rule->iifgroup)
+			continue;
+
+		if (r->oifgroup != rule->oifgroup)
+			continue;
+
 		if (!ops->compare(r, frh, tb))
 			continue;
 		return 1;
@@ -391,6 +424,16 @@  int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh)
 			goto errout_free;
 	}
 
+	if (tb[FRA_IIFGROUP])
+		rule->iifgroup = nla_get_u32(tb[FRA_IIFGROUP]);
+	else
+		rule->iifgroup = -1;
+
+	if (tb[FRA_OIFGROUP])
+		rule->oifgroup = nla_get_u32(tb[FRA_OIFGROUP]);
+	else
+		rule->oifgroup = -1;
+
 	rule->action = frh->action;
 	rule->flags = frh->flags;
 	rule->table = frh_get_table(frh, tb);
@@ -552,6 +595,14 @@  int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh)
 		    (rule->l3mdev != nla_get_u8(tb[FRA_L3MDEV])))
 			continue;
 
+		if (tb[FRA_IIFGROUP] &&
+		    (rule->iifgroup != nla_get_u32(tb[FRA_IIFGROUP])))
+			continue;
+
+		if (tb[FRA_OIFGROUP] &&
+		    (rule->oifgroup != nla_get_u32(tb[FRA_OIFGROUP])))
+			continue;
+
 		if (!ops->compare(rule, frh, tb))
 			continue;
 
@@ -679,7 +730,11 @@  static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
 	    (rule->tun_id &&
 	     nla_put_be64(skb, FRA_TUN_ID, rule->tun_id, FRA_PAD)) ||
 	    (rule->l3mdev &&
-	     nla_put_u8(skb, FRA_L3MDEV, rule->l3mdev)))
+	     nla_put_u8(skb, FRA_L3MDEV, rule->l3mdev)) ||
+	    ((rule->iifgroup != -1) &&
+	     nla_put_u32(skb, FRA_IIFGROUP, rule->iifgroup)) ||
+	    ((rule->oifgroup != -1) &&
+	     nla_put_u32(skb, FRA_OIFGROUP, rule->oifgroup)))
 		goto nla_put_failure;
 
 	if (rule->suppress_ifgroup != -1) {