diff mbox

[1/2] ipv4: Improve the scaling of the ARP cache for multicast destinations.

Message ID 50400B68.3060302@aristanetworks.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Bob Gilligan Aug. 31, 2012, 12:55 a.m. UTC
The ARP cache maintains entries for both unicast and multicast IPv4
next-hop destinations.  The MAC addresses for unicast destinations are
determined by running the ARP protocol, but those for multicast
destinations are determined by a simple direct mapping from the
destination IPv4 multicast address.

Currently, the ARP cache maintains one entry for each IPv4 multicast
destination for each interface that has members in that group.  On a
multicast router that is forwarding traffic for many groups via many
interfaces, the number of ARP cache entries for multicast destinations
can become large. It could be as many as: (number of interfaces) *
(number of groups).  Beside using a great deal of memory, these entries
consume space in the ARP cache that could otherwise be occupied by
unicast entries, makeing it more likely that the ARP cache will become
full.

The mapping from multicast IPv4 address to MAC address can just as
easily be done at the time a packet is to be sent.  With this change,
we maintain one ARP cache entry for each interface that has at least
one multicast group member.  All routes to IPv4 multicast destinations
via a particular interface use the same ARP cache entry.  This entry
does not store the MAC address to use.  Instead, packets for multicast
destinations go to a new output function that maps the destination
IPv4 multicast address into the MAC address and forms the MAC header.

Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
---
 net/ipv4/arp.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++----
 net/ipv4/route.c |   14 ++++++++++++--
 2 files changed, 57 insertions(+), 6 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Aug. 31, 2012, 1:06 a.m. UTC | #1
From: Bob Gilligan <gilligan@aristanetworks.com>
Date: Thu, 30 Aug 2012 17:55:04 -0700

> The mapping from multicast IPv4 address to MAC address can just as
> easily be done at the time a packet is to be sent.  With this change,
> we maintain one ARP cache entry for each interface that has at least
> one multicast group member.  All routes to IPv4 multicast destinations
> via a particular interface use the same ARP cache entry.  This entry
> does not store the MAC address to use.  Instead, packets for multicast
> destinations go to a new output function that maps the destination
> IPv4 multicast address into the MAC address and forms the MAC header.

Doing an ARP MC mapping on every packet is much more expensive than
doing a copy of the hard header cache.

I do not believe the memory consumption issue you use to justify this
change is a real issue.

If you are talking to that many multicast groups actively, you do want
that many neighbour cache entries.  This is not different from talking
to nearly every IP address on a local /8 subnet.  You'll have a huge
number of neighbour table entries in that case as well.

If your the actual steady state number of active groups being spoken
to is smaller, you can tune the neighbour cache thresholds to collect
old less used entries more quickly.

And this today is trivial, since routes no longer hold a reference
to neighbour entries.  Therefore any neighbour entry whatsoever can
be immediately reclaimed at any moment.

I'm not fond of these patches, and adding yet more special cases to
the neighbour layer, and therefore will not apply them.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bob Gilligan Aug. 31, 2012, 7:21 p.m. UTC | #2
On 8/30/12 6:06 PM, David Miller wrote:
> From: Bob Gilligan <gilligan@aristanetworks.com>
> Date: Thu, 30 Aug 2012 17:55:04 -0700
> 
>> The mapping from multicast IPv4 address to MAC address can just as
>> easily be done at the time a packet is to be sent.  With this change,
>> we maintain one ARP cache entry for each interface that has at least
>> one multicast group member.  All routes to IPv4 multicast destinations
>> via a particular interface use the same ARP cache entry.  This entry
>> does not store the MAC address to use.  Instead, packets for multicast
>> destinations go to a new output function that maps the destination
>> IPv4 multicast address into the MAC address and forms the MAC header.
> 
> Doing an ARP MC mapping on every packet is much more expensive than
> doing a copy of the hard header cache.
> 
> I do not believe the memory consumption issue you use to justify this
> change is a real issue.
> 
> If you are talking to that many multicast groups actively, you do want
> that many neighbour cache entries.  This is not different from talking
> to nearly every IP address on a local /8 subnet.  You'll have a huge
> number of neighbour table entries in that case as well.
> 
> If your the actual steady state number of active groups being spoken
> to is smaller, you can tune the neighbour cache thresholds to collect
> old less used entries more quickly.
> 
> And this today is trivial, since routes no longer hold a reference
> to neighbour entries.  Therefore any neighbour entry whatsoever can
> be immediately reclaimed at any moment.

The scaling is N-squared: the number of neighbor cache entries
required for your multicast traffic is interfaces * groups.  100
interfaces and 100 groups could generate 10,000 entries. 1,000
interfaces and 1,000 groups could generate a million entries.

But the number of groups is hard to predict: it depends on the
applications in use and the multicast traffic they generate.  So, it
is hard to come up with a "budget" for multicast entries in the
neighbor cache for a multicast router.

If you pick a gc_thresh3 that is less than your working set, you'll
end up thrashing the neighbor cache.  And calls to neigh_forced_gc()
are expensive: It performs a linear search of the entire neighbor
cache.  Also, the calls to neigh_forced_gc() due to a large number of
multicast entries will negatively impact the unicast entries sharing the
neighbor cache: it will free any unreferenced but resolved unicast
entries. Any subsequent packets for those destinations will trigger a
re-ARP.  Unnecessary re-ARPing is generally undesirable in a router.

The user who wants to avoid these problems is left with the
alternative of setting gc_thresh3 to a very large number based on a
worst case estimate of the number of unicast plus multicast entries
required.

Seems just simpler and more efficient to keep the multicast entries
out of the neighbor cache entirely.

Bob.



> 
> I'm not fond of these patches, and adding yet more special cases to
> the neighbour layer, and therefore will not apply them.
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicolas de Pesloüan Sept. 2, 2012, 1:26 p.m. UTC | #3
Le 31/08/2012 21:21, Bob Gilligan a écrit :
> On 8/30/12 6:06 PM, David Miller wrote:
>> From: Bob Gilligan<gilligan@aristanetworks.com>
>> Date: Thu, 30 Aug 2012 17:55:04 -0700
>>
>>> The mapping from multicast IPv4 address to MAC address can just as
>>> easily be done at the time a packet is to be sent.  With this change,
>>> we maintain one ARP cache entry for each interface that has at least
>>> one multicast group member.  All routes to IPv4 multicast destinations
>>> via a particular interface use the same ARP cache entry.  This entry
>>> does not store the MAC address to use.  Instead, packets for multicast
>>> destinations go to a new output function that maps the destination
>>> IPv4 multicast address into the MAC address and forms the MAC header.
>>
>> Doing an ARP MC mapping on every packet is much more expensive than
>> doing a copy of the hard header cache.
>>
>> I do not believe the memory consumption issue you use to justify this
>> change is a real issue.

My two cents:

Why do we need a per interface neighbor cache for multicast? Isn't the target MAC expected to be the 
same for a give target multicast IP address, whatever the egress interface?

Can't we store the target MAC for multicast address on a neighbor cache entry with a fake dev value 
meaning "multicast"?

Can't we detect the fact that the IP destination address is multicast and search only the neighbor 
cache entries that have this fake dev value, instead of the egress interface?

This should reduce memory consumption (not N-squared anymore) without significantly increasing the 
CPU usage.

Do I miss something?

	Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bob Gilligan Sept. 4, 2012, 4:22 a.m. UTC | #4
On 9/2/12 6:26 AM, Nicolas de Pesloüan wrote:
> Le 31/08/2012 21:21, Bob Gilligan a écrit :
>> On 8/30/12 6:06 PM, David Miller wrote:
>>> From: Bob Gilligan<gilligan@aristanetworks.com>
>>> Date: Thu, 30 Aug 2012 17:55:04 -0700
>>>
>>>> The mapping from multicast IPv4 address to MAC address can just as
>>>> easily be done at the time a packet is to be sent.  With this change,
>>>> we maintain one ARP cache entry for each interface that has at least
>>>> one multicast group member.  All routes to IPv4 multicast destinations
>>>> via a particular interface use the same ARP cache entry.  This entry
>>>> does not store the MAC address to use.  Instead, packets for multicast
>>>> destinations go to a new output function that maps the destination
>>>> IPv4 multicast address into the MAC address and forms the MAC header.
>>>
>>> Doing an ARP MC mapping on every packet is much more expensive than
>>> doing a copy of the hard header cache.
>>>
>>> I do not believe the memory consumption issue you use to justify this
>>> change is a real issue.
> 
> My two cents:
> 
> Why do we need a per interface neighbor cache for multicast? Isn't the
> target MAC expected to be the same for a give target multicast IP
> address, whatever the egress interface?
> 
> Can't we store the target MAC for multicast address on a neighbor cache
> entry with a fake dev value meaning "multicast"?
> 
> Can't we detect the fact that the IP destination address is multicast
> and search only the neighbor cache entries that have this fake dev
> value, instead of the egress interface?
> 
> This should reduce memory consumption (not N-squared anymore) without
> significantly increasing the CPU usage.
> 
> Do I miss something?

Sounds like you are suggesting the converse of my proposed patch:  One
"special" Neighbor Cache Entry per multicast group instead of one per
interface.  Unfortunately, I don't think this will work.  Different MAC
types have different frame formats and different multicast address
mappings, so can't share a common neighbor cache entry.

Bob.
> 
>     Nicolas.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: b/net/ipv4/arp.c
===================================================================
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -126,6 +126,7 @@  static int arp_constructor(struct neighb
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb);
 static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb);
 static void parp_redo(struct sk_buff *skb);
+static int arp_multicast_output(struct neighbour *neigh, struct sk_buff *skb);
 
 static const struct neigh_ops arp_generic_ops = {
 	.family =		AF_INET,
@@ -157,6 +158,13 @@  static const struct neigh_ops arp_broken
 	.connected_output =	neigh_compat_output,
 };
 
+static const struct neigh_ops arp_multicast_ops = {
+	.family =		AF_INET,
+	.error_report =		arp_error_report,
+	.output =		arp_multicast_output,
+	.connected_output =	arp_multicast_output,
+};
+
 struct neigh_table arp_tbl = {
 	.family		= AF_INET,
 	.key_len	= 4,
@@ -217,6 +225,38 @@  static u32 arp_hash(const void *pkey,
 	return arp_hashfn(*(u32 *)pkey, dev, *hash_rnd);
 }
 
+
+/*
+ * Output function for IPv4 multicast destinations.  We map the
+ * next-hop address directly into the destination MAC addr here so
+ * that we don't have to store it in the ARP cache entry.  This allows
+ * routes for multiple multicast destinations to share a single ARP
+ * cache entry.
+ */
+static int arp_multicast_output(struct neighbour *neigh, struct sk_buff *skb)
+{
+	int err;
+	struct dst_entry *dst = skb_dst(skb);
+	struct rtable *rt = (struct rtable *)dst;
+	struct net_device *dev = neigh->dev;
+	unsigned char ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
+
+	__skb_pull(skb, skb_network_offset(skb));
+
+	arp_mc_map(rt->rt_gateway, ha, dev, 1);
+
+	err = dev_hard_header(skb, dev, ntohs(skb->protocol), ha, NULL,
+			      skb->len);
+	if (err >= 0)
+		err = dev_queue_xmit(skb);
+	else {
+		err = -EINVAL;
+		kfree_skb(skb);
+	}
+	return err;
+}
+
+
 static int arp_constructor(struct neighbour *neigh)
 {
 	__be32 addr = *(__be32 *)neigh->primary_key;
@@ -287,10 +327,9 @@  static int arp_constructor(struct neighb
 #endif
 		}
 #endif
-		if (neigh->type == RTN_MULTICAST) {
+		if (neigh->type == RTN_MULTICAST)
 			neigh->nud_state = NUD_NOARP;
-			arp_mc_map(addr, neigh->ha, dev, 1);
-		} else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
+		else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
 			neigh->nud_state = NUD_NOARP;
 			memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
 		} else if (neigh->type == RTN_BROADCAST ||
@@ -299,7 +338,9 @@  static int arp_constructor(struct neighb
 			memcpy(neigh->ha, dev->broadcast, dev->addr_len);
 		}
 
-		if (dev->header_ops->cache)
+		if (neigh->type == RTN_MULTICAST)
+			neigh->ops = &arp_multicast_ops;
+		else if (dev->header_ops->cache)
 			neigh->ops = &arp_hh_ops;
 		else
 			neigh->ops = &arp_generic_ops;
Index: b/net/ipv4/route.c
===================================================================
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1114,6 +1114,7 @@  static int slow_chain_length(const struc
 static struct neighbour *ipv4_neigh_lookup(const struct dst_entry *dst, const void *daddr)
 {
 	static const __be32 inaddr_any = 0;
+	static const __be32 inaddr_unspec_group = htonl(INADDR_UNSPEC_GROUP);
 	struct net_device *dev = dst->dev;
 	const __be32 *pkey = daddr;
 	const struct rtable *rt;
@@ -1123,8 +1124,17 @@  static struct neighbour *ipv4_neigh_look
 
 	if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
 		pkey = &inaddr_any;
-	else if (rt->rt_gateway)
-		pkey = (const __be32 *) &rt->rt_gateway;
+	else {
+		if (rt->rt_gateway)
+			pkey = (const __be32 *) &rt->rt_gateway;
+		if (pkey && ipv4_is_multicast(*pkey))
+			/*
+			 * Map all multicast destinations to a single
+			 * address so tht they share a single ARP
+			 * cache entry per interface.
+			 */
+			pkey = &inaddr_unspec_group;
+	}
 
 	n = __ipv4_neigh_lookup(dev, *(__force u32 *)pkey);
 	if (n)