
[net-next,3/7] mpls: Add a sysctl to control the size of the mpls label table

Message ID 87egp5tvlz.fsf_-_@x220.int.ebiederm.org
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric W. Biederman March 4, 2015, 1:11 a.m. UTC
This sysctl gives two benefits.  By defaulting the table size to 0,
mpls, even when compiled in and enabled, defaults to not forwarding
any packets.  This prevents unpleasant surprises for users.

The other benefit is that, as mpls labels are allocated locally, a
small dense label table may be used, which saves memory and
is extremely simple and efficient to implement.

This sysctl allows userspace to choose the restrictions on the label
table size userspace applications need to cope with.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/networking/mpls-sysctl.txt |  20 +++++
 include/net/netns/mpls.h                 |   2 +
 net/mpls/af_mpls.c                       | 146 +++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 Documentation/networking/mpls-sysctl.txt
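
The dense-table idea in the commit message can be illustrated outside the
kernel.  The following is a minimal userspace sketch, not the code from this
patch: struct route, struct label_table, table_lookup(), table_set() and
table_resize() are hypothetical names, and the locking, RCU handling and
predefined labels that the real implementation needs are left out.  It only
shows why lookup is a bounds check plus an array index, why a limit of 0
matches nothing, and why shrinking the table drops the routes that no longer
fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct mpls_route. */
struct route {
	unsigned int out_label;
};

/* Hypothetical dense table: slot i holds the route for label i, or NULL. */
struct label_table {
	size_t limit;		/* plays the role of platform_labels */
	struct route **slots;
};

/* Lookup is a bounds check plus an array index; labels >= limit never match. */
static struct route *table_lookup(const struct label_table *t, size_t label)
{
	if (label >= t->limit)
		return NULL;
	return t->slots[label];
}

/* Install a route, replacing any old one; label must be below the limit. */
static void table_set(struct label_table *t, size_t label, struct route *rt)
{
	assert(label < t->limit);
	free(t->slots[label]);
	t->slots[label] = rt;
}

/* Resize to new_limit slots, keeping routes that still fit and freeing the
 * rest.  Returns 0 on success, -1 if allocation fails (table left untouched).
 */
static int table_resize(struct label_table *t, size_t new_limit)
{
	struct route **slots = NULL;
	size_t keep = t->limit < new_limit ? t->limit : new_limit;
	size_t i;

	if (new_limit) {
		slots = calloc(new_limit, sizeof(*slots));
		if (!slots)
			return -1;
	}
	if (keep)
		memcpy(slots, t->slots, keep * sizeof(*slots));
	for (i = new_limit; i < t->limit; i++)
		free(t->slots[i]);	/* free(NULL) is a no-op */
	free(t->slots);
	t->slots = slots;
	t->limit = new_limit;
	return 0;
}
```

Starting from an empty table ({ 0, NULL }), every lookup returns NULL until
table_resize() is called with a non-zero limit, which mirrors the "default 0
means no forwarding" behaviour described above.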

Comments

Vivek Venkatraman March 5, 2015, 9:45 a.m. UTC | #1
On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This sysctl gives two benefits.  By defaulting the table size to 0
> mpls even when compiled in and enabled defaults to not forwarding
> any packets.  This prevents unpleasant surprises for users.
>
> The other benefit is that as mpls labels are allocated locally a dense
> table a small dense label table may be used which saves memory and
> is extremely simple and efficient to implement.
>

The label space is often partitioned into multiple sets in MPLS and
used for different purposes - for example, LSP labels, VPN labels,
Segment labels. This in turn means that the table may no longer be
dense. A sysctl allowing min and max label that spans the sets of
labels may be useful. Or should the ILM be made a hash table?

> This sysctl allows userspace to choose the restrictions on the label
> table size userspace applications need to cope with.
>

Vivek
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 5, 2015, 1:22 p.m. UTC | #2
Vivek Venkatraman <vivek@cumulusnetworks.com> writes:

> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> This sysctl gives two benefits.  By defaulting the table size to 0
>> mpls even when compiled in and enabled defaults to not forwarding
>> any packets.  This prevents unpleasant surprises for users.
>>
>> The other benefit is that as mpls labels are allocated locally a dense
>> table a small dense label table may be used which saves memory and
>> is extremely simple and efficient to implement.
>>
>
> The label space is often partitioned into multiple sets in MPLS and
> used for different purposes - for example, LSP labels, VPN labels,
> Segment labels. This in turn means that the table may no longer be
> dense. A sysctl allowing min and max label that spans the sets of
> labels may be useful. Or should the ILM be made a hash table?

Good question.

These kinds of labels are a local label management problem.

Given how nice it is to have a reasonably dense label space I am not
keen to abandon the notion of having a dense label space, as it makes
the code simple and fast for forwarding mpls packets.

That said, my code is a starting point.  If you have a real world use
case and you can show a better way to deal with it, go for it.
Now is definitely the time to evolve the API.

Eric
Eric W. Biederman March 5, 2015, 2:38 p.m. UTC | #3
ebiederm@xmission.com (Eric W. Biederman) writes:

> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>
>> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> This sysctl gives two benefits.  By defaulting the table size to 0
>>> mpls even when compiled in and enabled defaults to not forwarding
>>> any packets.  This prevents unpleasant surprises for users.
>>>
>>> The other benefit is that as mpls labels are allocated locally a dense
>>> table a small dense label table may be used which saves memory and
>>> is extremely simple and efficient to implement.
>>>
>>
>> The label space is often partitioned into multiple sets in MPLS and
>> used for different purposes - for example, LSP labels, VPN labels,
>> Segment labels. This in turn means that the table may no longer be
>> dense. A sysctl allowing min and max label that spans the sets of
>> labels may be useful. Or should the ILM be made a hash table?
>
> Good question.
>
> These kinds of labels are a local label management problem.
>
> Given how nice it is to have a reasonably dense label space I am not
> keen to abandon the notion of having a dense label space, as it makes
> the code simple and fast for forwarding mpls packets.
>
> That said my code is a starting point.  If you have a real world use
> case and you can show a better way to deal with it.  Go for it.
> Now is definitely time to evolve the API.

A couple more thoughts.  

The rtnetlink interface and my implementation carry a type field, so it
is possible to mark which routing protocol uses an mpls route.

The global routing table is already over 500,000 routes, so the 1 million
forwarding equivalence classes available to mpls with a single label may be
exhausted in the not too distant future, and a dense label space may become
a necessity.

In a similar vein, when I look at top of rack switches and their
hardware forwarding capacity, it looks like they are in the ballpark
of 32K MPLS routes.

All of which says to me that the MPLS label space is limited and it
should be managed as a precious resource.  (A good example of why I
might want to rethink my mpls ingress path).

So while I can see arguments for one use of labels getting one quota of
labels and another use of labels getting another quota, when I look at
the space there are not that many labels, and I don't see how or why it
would make sense to manage the labels explicitly with ranges.

At some point, for MPLS multicast traffic and MPLS source specific
addresses, if we choose to support those, we will need a hash table,
as those addresses are assigned by others, though in that case
we will be limited in the egress set of labels we can use.

So I think MPLS interfaces need to encourage thrifty label use,
which in my mind almost certainly means not manually allocated
label use.

Eric
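
The "1 million" figure in the message above follows from the label stack entry
layout in RFC 3032: a 32-bit entry carries a 20-bit label, 3 traffic-class
bits, a bottom-of-stack bit and an 8-bit TTL, so a single label can take
2^20 = 1048576 values, with 1048575 as the largest (the same upper bound the
sysctl uses).  A small sketch of that arithmetic, with hypothetical helper
names (struct lse, lse_encode(), lse_decode()) and entries handled in host
byte order for simplicity:

```c
#include <assert.h>
#include <stdint.h>

#define LABEL_BITS	20u
#define LABEL_COUNT	(1u << LABEL_BITS)	/* 1048576 possible label values */
#define LABEL_MAX	(LABEL_COUNT - 1u)	/* 1048575, largest label value */

/* Hypothetical decoded view of one label stack entry. */
struct lse {
	uint32_t label;	/* 20 bits */
	uint8_t tc;	/* 3 bits, traffic class */
	uint8_t bos;	/* 1 bit, bottom of stack */
	uint8_t ttl;	/* 8 bits */
};

/* Pack the four fields into one 32-bit entry (host byte order). */
static uint32_t lse_encode(uint32_t label, uint8_t tc, uint8_t bos, uint8_t ttl)
{
	assert(label <= LABEL_MAX);
	return (label << 12) |
	       ((uint32_t)(tc & 0x7) << 9) |
	       ((uint32_t)(bos & 0x1) << 8) |
	       ttl;
}

/* Split a 32-bit entry (host byte order) back into its fields. */
static struct lse lse_decode(uint32_t entry)
{
	struct lse d;

	d.label = entry >> 12;
	d.tc = (entry >> 9) & 0x7;
	d.bos = (entry >> 8) & 0x1;
	d.ttl = entry & 0xff;
	return d;
}
```

On the wire the entry is big-endian, so real code would convert with ntohl()
or an equivalent before decoding.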
Vivek Venkatraman March 5, 2015, 4:49 p.m. UTC | #4
On Thu, Mar 5, 2015 at 6:38 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>>
>>> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>
>>>> This sysctl gives two benefits.  By defaulting the table size to 0
>>>> mpls even when compiled in and enabled defaults to not forwarding
>>>> any packets.  This prevents unpleasant surprises for users.
>>>>
>>>> The other benefit is that as mpls labels are allocated locally a dense
>>>> table a small dense label table may be used which saves memory and
>>>> is extremely simple and efficient to implement.
>>>>
>>>
>>> The label space is often partitioned into multiple sets in MPLS and
>>> used for different purposes - for example, LSP labels, VPN labels,
>>> Segment labels. This in turn means that the table may no longer be
>>> dense. A sysctl allowing min and max label that spans the sets of
>>> labels may be useful. Or should the ILM be made a hash table?
>>
>> Good question.
>>
>> These kinds of labels are a local label management problem.
>>
>> Given how nice it is to have a reasonably dense label space I am not
>> keen to abandon the notion of having a dense label space, as it makes
>> the code simple and fast for forwarding mpls packets.
>>

That's true. I guess this can be considered a local label management issue.

>> That said my code is a starting point.  If you have a real world use
>> case and you can show a better way to deal with it.  Go for it.
>> Now is definitely time to evolve the API.
>
> A couple more thoughts.
>
> The rtnetlink interface and my implementation carries a type field so it
> is possible to mark which routing protocol uses an mpls route.
>
> The global routing table is already over 500,000 routes so the 1 million
> forward equivalanece classes of mpls with a single label may be
> exhausted in the not too distant future so a dense label space may be a
> necessity.
>
> In a similar vein.  When I look at top of rack switches and their
> hardware forwarding capacity it looks like they are in the ballpark
> of 32K MPLS routes.
>
> All of which says to me that the MPLS label space is limited and it
> should be managed as a precious resource.  (A good example of why I
> might want to rethink my mpls ingress path).
>
> So while I can see arguments for one use of labels getting one quota of
> labels and another use of labels getting another quota when I look at
> the space there are not that many labels and I don't see how or why it
> would make sense to manage the labels explicitly with ranges.
>

The ability to impose a label stack and the way a FEC is defined at
the edge is what may help against the exhaustion of the label space.
It is usually for the label stack that multiple ranges of labels are
used. Again, I guess that can be handled by the application and the
dataplane can treat the entire set of labels as a single flat range.

> At some point for MPLS multicast traffic and MPLS source specific
> addresses if we choose to support those we will need a hash table
> as those addresses are assigned by others, though in that case
> we will be limited in our egress set of labels we can use.
>
> So I think MPLS interfaces need to encourage thrifty label use,
> which in my mind almost certainly means not manually allocated
> label use.
>
> Eric
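
The point made in this reply, that the application can keep separate label
sets while the dataplane sees one flat range, could look roughly like the
following in a userspace label manager.  This is only a sketch under
assumptions: struct label_pool, pool_carve() and pool_alloc() are
hypothetical names, the pool sizes are up to the application, and nothing
here is part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool: a contiguous slice [base, base + size) of the flat range. */
struct label_pool {
	size_t base;
	size_t size;
	size_t used;	/* labels handed out so far, densely from base */
};

/* Carve the next slice out of the flat range; *next is the first free label. */
static struct label_pool pool_carve(size_t *next, size_t size)
{
	struct label_pool p = { *next, size, 0 };

	assert(size > 0);
	*next += size;
	return p;
}

/* Hand out the next unused label in the pool, or -1 when it is exhausted. */
static long pool_alloc(struct label_pool *p)
{
	if (p->used >= p->size)
		return -1;
	return (long)(p->base + p->used++);
}
```

If carving starts at label 16 (labels 0 to 15 are reserved by RFC 3032), the
value of next after the last pool_carve() call is the smallest
platform_labels setting that covers every pool, so the kernel table stays as
small and dense as the pools allow.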

Patch

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
new file mode 100644
index 000000000000..639ddf0ece9b
--- /dev/null
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -0,0 +1,20 @@ 
+/proc/sys/net/mpls/* Variables:
+
+platform_labels - INTEGER
+	Number of entries in the platform label table.  It is not
+	possible to configure forwarding for label values equal to or
+	greater than the number of platform labels.
+
+	A dense utilization of the entries in the platform label table
+	is possible and expected as the platform labels are locally
+	allocated.
+
+	If the number of platform label table entries is set to 0 no
+	label will be recognized by the kernel and mpls forwarding
+	will be disabled.
+
+	Reducing this value will remove all label routing entries that
+	no longer fit in the table.
+
+	Possible values: 0 - 1048575
+	Default: 0
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index f90aaf8d4f89..d29203651c01 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -6,10 +6,12 @@ 
 #define __NETNS_MPLS_H__
 
 struct mpls_route;
+struct ctl_table_header;
 
 struct netns_mpls {
 	size_t platform_labels;
 	struct mpls_route __rcu * __rcu *platform_label;
+	struct ctl_table_header *ctl;
 };
 
 #endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 924377736b2a..b097125dfa33 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1,6 +1,7 @@ 
 #include <linux/types.h>
 #include <linux/skbuff.h>
 #include <linux/socket.h>
+#include <linux/sysctl.h>
 #include <linux/net.h>
 #include <linux/module.h>
 #include <linux/if_arp.h>
@@ -31,6 +32,9 @@  struct mpls_route { /* next hop label forwarding entry */
 	u8			rt_via[0];
 };
 
+static int zero = 0;
+static int label_limit = (1 << 20) - 1;
+
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
@@ -273,18 +277,160 @@  static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+static int resize_platform_label_table(struct net *net, size_t limit)
+{
+	size_t size = sizeof(struct mpls_route *) * limit;
+	size_t old_limit;
+	size_t cp_size;
+	struct mpls_route __rcu **labels = NULL, **old;
+	struct mpls_route *rt0 = NULL, *rt2 = NULL;
+	unsigned index;
+
+	if (size) {
+		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (!labels)
+			labels = vzalloc(size);
+
+		if (!labels)
+			goto nolabels;
+	}
+
+	/* In case the predefined labels need to be populated */
+	if (limit > LABEL_IPV4_EXPLICIT_NULL) {
+		struct net_device *lo = net->loopback_dev;
+		rt0 = mpls_rt_alloc(lo->addr_len);
+		if (!rt0)
+			goto nort0;
+		rt0->rt_dev = lo;
+		rt0->rt_protocol = RTPROT_KERNEL;
+		rt0->rt_via_family = AF_PACKET;
+		memcpy(rt0->rt_via, lo->dev_addr, lo->addr_len);
+	}
+	if (limit > LABEL_IPV6_EXPLICIT_NULL) {
+		struct net_device *lo = net->loopback_dev;
+		rt2 = mpls_rt_alloc(lo->addr_len);
+		if (!rt2)
+			goto nort2;
+		rt2->rt_dev = lo;
+		rt2->rt_protocol = RTPROT_KERNEL;
+		rt2->rt_via_family = AF_PACKET;
+		memcpy(rt2->rt_via, lo->dev_addr, lo->addr_len);
+	}
+
+	rtnl_lock();
+	/* Remember the original table */
+	old = net->mpls.platform_label;
+	old_limit = net->mpls.platform_labels;
+
+	/* Free any labels beyond the new table */
+	for (index = limit; index < old_limit; index++)
+		mpls_route_update(net, index, NULL, NULL, NULL);
+
+	/* Copy over the old labels */
+	cp_size = size;
+	if (old_limit < limit)
+		cp_size = old_limit * sizeof(struct mpls_route *);
+
+	memcpy(labels, old, cp_size);
+
+	/* If needed set the predefined labels */
+	if ((old_limit <= LABEL_IPV6_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV6_EXPLICIT_NULL)) {
+		labels[LABEL_IPV6_EXPLICIT_NULL] = rt2;
+		rt2 = NULL;
+	}
+
+	if ((old_limit <= LABEL_IPV4_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV4_EXPLICIT_NULL)) {
+		labels[LABEL_IPV4_EXPLICIT_NULL] = rt0;
+		rt0 = NULL;
+	}
+
+	/* Update the global pointers */
+	net->mpls.platform_labels = limit;
+	net->mpls.platform_label = labels;
+
+	rtnl_unlock();
+
+	mpls_rt_free(rt2);
+	mpls_rt_free(rt0);
+
+	if (old) {
+		synchronize_rcu();
+		kvfree(old);
+	}
+	return 0;
+
+nort2:
+	mpls_rt_free(rt0);
+nort0:
+	kvfree(labels);
+nolabels:
+	return -ENOMEM;
+}
+
+static int mpls_platform_labels(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = table->data;
+	int platform_labels = net->mpls.platform_labels;
+	int ret;
+	struct ctl_table tmp = {
+		.procname	= table->procname,
+		.data		= &platform_labels,
+		.maxlen		= sizeof(int),
+		.mode		= table->mode,
+		.extra1		= &zero,
+		.extra2		= &label_limit,
+	};
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0)
+		ret = resize_platform_label_table(net, platform_labels);
+
+	return ret;
+}
+
+static struct ctl_table mpls_table[] = {
+	{
+		.procname	= "platform_labels",
+		.data		= NULL,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= mpls_platform_labels,
+	},
+	{ }
+};
+
 static int mpls_net_init(struct net *net)
 {
+	struct ctl_table *table;
+
 	net->mpls.platform_labels = 0;
 	net->mpls.platform_label = NULL;
 
+	table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL);
+	if (table == NULL)
+		return -ENOMEM;
+
+	table[0].data = net;
+	net->mpls.ctl = register_net_sysctl(net, "net/mpls", table);
+	if (net->mpls.ctl == NULL)
+		return -ENOMEM;
+
 	return 0;
 }
 
 static void mpls_net_exit(struct net *net)
 {
+	struct ctl_table *table;
 	unsigned int index;
 
+	table = net->mpls.ctl->ctl_table_arg;
+	unregister_net_sysctl_table(net->mpls.ctl);
+	kfree(table);
+
 	/* An rcu grace period haselapsed since there was a device in
 	 * the network namespace (and thus the last in fqlight packet)
 	 * left this network namespace.  This is because